How we halved Swahili speech recognition errors
Off-the-shelf multilingual ASR gets roughly one Swahili word in four wrong. We fine-tuned on locally sourced Swahili data and cut that error rate in half, to 13.5% WER. Here is what we learned.
Most speech recognition research focuses on English, Mandarin, and a handful of European languages. For Swahili — spoken by over 200 million people — off-the-shelf models get roughly one word in four wrong. That is not good enough for voice agents, transcription services, or real-time translation.
We set out to close that gap.
The problem
Multilingual ASR models support Swahili out of the box, but their accuracy lags far behind high-resource languages. At a 27.2% word error rate, roughly one word in four is wrong, making the output unreliable for any production use case.
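Word error rate counts substitutions, deletions, and insertions against a reference transcript, divided by the number of reference words. A minimal sketch of the computation (the Swahili example sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four gives a 25% WER:
print(wer("habari ya asubuhi rafiki", "habari ya asubuhi ndugu"))  # 0.25
```

Production evaluation toolkits normalize text (casing, punctuation) before scoring, which this sketch omits.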
What we did
We fine-tuned a multilingual speech recognition model specifically for Swahili, using locally sourced Swahili speech data. The focus was on conversational and read speech — the kind of audio our voice agent and translation pipelines need to handle.
The result
| System | Word Error Rate |
|---|---|
| Multilingual baseline (zero-shot) | 27.2% |
| SAUTI ASR v1 (fine-tuned) | 13.5% |
That is a 50% relative reduction in errors. At 13.5% WER, Swahili speech recognition becomes viable for production applications.
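The headline number is easy to verify from the table:

```python
baseline, finetuned = 27.2, 13.5  # WER in percent, from the table above

# Relative error reduction: how much of the baseline's errors were removed.
relative_reduction = (baseline - finetuned) / baseline
print(f"{relative_reduction:.1%}")  # prints "50.4%"
```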
What this unlocks
This is not just an accuracy improvement — it is an enabling capability:
- **Voice agents** can now understand Swahili speakers reliably enough to hold a conversation
- **Transcription services** can process Swahili audio at scale
- **Real-time translation** between English and Kiswahili becomes possible when the ASR stage is accurate enough
What is next
SAUTI ASR v1 powers the speech recognition stage of our voice agent and real-time translation pipelines. Try it in the [Speech to Text playground](/speech-to-text).
We are now working on streaming ASR for real-time applications and expanding to additional African languages.