Designing a low-latency Swahili voice agent for telephony deployments

Architecture design for a full-turn IVR voice agent that combines SAUTI TTS and ASR with a language model backend, optimised for sub-500ms response latency on African telephony networks.

Overview

The SAUTI Voice Agent combines our TTS and ASR systems with a language model backend to create a full conversational voice agent for Swahili telephony. This research focuses on the architectural decisions required to meet a sub-500ms response latency target on African telephony infrastructure.

Motivation

Voice-based AI agents have transformative potential for markets where smartphone penetration is limited but feature phone and voice call usage is high. East Africa — with its large Swahili-speaking population — is a prime deployment target, but telephony networks introduce latency, jitter, and codec constraints that typical voice AI architectures don't account for.

Research areas

Latency budget

A natural conversation requires end-to-end response times under 500ms. We decompose this into:

- ASR inference: target < 150ms
- Language model generation: target < 200ms
- TTS synthesis: target < 100ms
- Network + codec overhead: ~50ms
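
The per-stage targets sum to exactly 500ms, so an overrun in any one stage has to be recovered elsewhere. A minimal sketch of a budget check for benchmarking follows; the names `STAGE_BUDGET_MS`, `stage_overruns`, and `headroom_ms` are hypothetical, and the measurements are made-up illustration values, not benchmark results.

```python
from typing import Dict

# Per-stage targets in milliseconds, copied from the budget above.
STAGE_BUDGET_MS: Dict[str, float] = {
    "asr": 150.0,
    "llm": 200.0,
    "tts": 100.0,
    "network_codec": 50.0,
}
END_TO_END_TARGET_MS = 500.0


def stage_overruns(measured_ms: Dict[str, float]) -> Dict[str, float]:
    """Return how far each stage exceeds its target (0.0 if within budget)."""
    return {
        stage: max(0.0, measured_ms[stage] - target)
        for stage, target in STAGE_BUDGET_MS.items()
    }


def headroom_ms(measured_ms: Dict[str, float]) -> float:
    """End-to-end slack in ms; negative means the 500 ms target is missed."""
    return END_TO_END_TARGET_MS - sum(measured_ms[s] for s in STAGE_BUDGET_MS)


# Hypothetical measurements, for illustration only (not benchmark results).
example = {"asr": 140.0, "llm": 230.0, "tts": 90.0, "network_codec": 50.0}
print(stage_overruns(example))
print(headroom_ms(example))
```

With these illustrative numbers the language model stage is 30ms over its target and the turn as a whole misses the 500ms target by 10ms, even though ASR and TTS each finish early.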

Streaming architecture

Traditional request-response architectures, in which each stage waits for the complete output of the previous one, cannot meet these latency targets. We are exploring streaming ASR (processing audio chunks as they arrive), speculative LLM decoding, and incremental TTS synthesis.
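
The data-flow pattern behind a streaming pipeline can be sketched with chained Python generators. Every stage below is a hypothetical placeholder rather than a SAUTI component, and the sketch leaves out end-of-utterance detection and speculative decoding. It shows only that output from the last stage can appear before the first stage has consumed all of its input.

```python
from typing import Iterable, Iterator, List, Tuple


def audio_source(n_chunks: int, log: List[str]) -> Iterator[bytes]:
    """Hypothetical stand-in for audio frames arriving from the call."""
    for i in range(n_chunks):
        log.append(f"audio:{i}")
        yield bytes([i])


def streaming_asr(chunks: Iterable[bytes], log: List[str]) -> Iterator[str]:
    """Placeholder ASR: emits one partial hypothesis per incoming chunk."""
    for chunk in chunks:
        partial = f"w{chunk[0]}"
        log.append(f"asr:{partial}")
        yield partial


def streaming_llm(partials: Iterable[str], log: List[str]) -> Iterator[str]:
    """Placeholder LLM: emits one token as soon as a partial arrives."""
    for partial in partials:
        token = partial.upper()
        log.append(f"llm:{token}")
        yield token


def incremental_tts(tokens: Iterable[str], log: List[str]) -> Iterator[bytes]:
    """Placeholder TTS: 'synthesises' each token without waiting for the rest."""
    for token in tokens:
        log.append(f"tts:{token}")
        yield token.encode("ascii")


def run_pipeline(n_chunks: int) -> Tuple[List[bytes], List[str]]:
    """Chain the stages; each pulls from the previous one chunk at a time."""
    log: List[str] = []
    asr = streaming_asr(audio_source(n_chunks, log), log)
    out = incremental_tts(streaming_llm(asr, log), log)
    return list(out), log


audio_out, event_log = run_pipeline(2)
print(event_log)
```

In the event log, the first `tts:` entry appears before the second `audio:` entry, which is the property a streaming design relies on. A real agent would normally use asynchronous queues between stages so that they also run concurrently; generators are used here only to keep the sketch short.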

Codec handling

African telephony networks predominantly use AMR-NB (8 kHz, narrowband). Our models are trained on 16 kHz wideband audio. We need a robust upsampling/downsampling pipeline that preserves intelligibility across codec transitions.
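
A minimal sketch of the 8 kHz to 16 kHz rate conversion in pure Python follows, assuming a Hamming-windowed sinc low-pass filter with 31 taps. The filter design, the tap count, and all function names are illustrative assumptions, not the pipeline's actual design. AMR-NB encoding and decoding are out of scope here, and upsampling does not restore content above 4 kHz that the narrowband codec has already removed.

```python
import math
from typing import List


def lowpass_taps(num_taps: int = 31, cutoff: float = 0.25) -> List[float]:
    """Hamming-windowed sinc low-pass; cutoff is a fraction of the sample rate."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def fir_filter(signal: List[float], taps: List[float]) -> List[float]:
    """Same-length FIR convolution, zero-padded at the edges, no net delay."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


def upsample_8k_to_16k(samples: List[float]) -> List[float]:
    """Insert a zero after every sample, then low-pass at 4 kHz (gain of 2)."""
    stuffed: List[float] = []
    for s in samples:
        stuffed.extend((s, 0.0))
    return [2.0 * y for y in fir_filter(stuffed, lowpass_taps())]


def downsample_16k_to_8k(samples: List[float]) -> List[float]:
    """Low-pass at 4 kHz to prevent aliasing, then keep every second sample."""
    return fir_filter(samples, lowpass_taps())[::2]


# Round trip on a 1 kHz test tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(100)]
wide = upsample_8k_to_16k(tone)
narrow = downsample_16k_to_8k(wide)
print(len(tone), len(wide), len(narrow))
```

A production pipeline would more likely use an optimised polyphase resampler such as `scipy.signal.resample_poly`. The pure-Python version is meant only to make the two steps, interpolation filtering and anti-alias filtering, explicit.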

Current status

Architecture design phase. We are benchmarking individual component latencies and designing the streaming pipeline. Implementation will begin after SAUTI ASR reaches its target WER.