How we halved Swahili speech recognition errors
Off-the-shelf multilingual ASR gets roughly one Swahili word in four wrong. We fine-tuned on locally sourced Swahili data and cut that error rate in half, to 13.5% WER. Here is what we learned.
Most speech recognition research focuses on English, Mandarin, and a handful of European languages. For Swahili — spoken by over 200 million people — off-the-shelf models get roughly one word in four wrong. That is not good enough for voice agents, transcription services, or real-time translation.
We set out to close that gap.
The problem
Multilingual ASR models support Swahili out of the box, but their accuracy lags far behind high-resource languages. At a 27.2% word error rate, roughly one word in four is wrong, making the output unreliable for any production use case.
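Word error rate counts substitutions, deletions, and insertions against a reference transcript, divided by the number of reference words. A minimal sketch of the computation (the Swahili example sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four gives a 25% WER:
print(wer("habari ya asubuhi rafiki", "habari ya asubuhi ndugu"))  # 0.25
```

Production evaluation toolkits normalize text (casing, punctuation) before scoring, which this sketch omits.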
What we did
We fine-tuned a multilingual speech recognition model specifically for Swahili, using locally sourced Swahili speech data. The focus was on conversational and read speech — the kind of audio our voice agent and translation pipelines need to handle.
The result
| System | Word Error Rate |
|---|---|
| Multilingual baseline (zero-shot) | 27.2% |
| SAUTI ASR v1 (fine-tuned) | 13.5% |
That is a 50% relative reduction in errors. At 13.5% WER, Swahili speech recognition becomes viable for production applications.
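The headline number is easy to verify from the table:

```python
baseline, finetuned = 27.2, 13.5  # WER in percent, from the table above

# Relative error reduction: how much of the baseline's errors were removed.
relative_reduction = (baseline - finetuned) / baseline
print(f"{relative_reduction:.1%}")  # prints "50.4%"
```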
What this unlocks
This is not just an accuracy improvement — it is an enabling capability:
- **Voice agents** can now understand Swahili speakers reliably enough to hold a conversation
- **Transcription services** can process Swahili audio at scale
- **Real-time translation** between English and Kiswahili becomes possible when the ASR stage is accurate enough
What is next
SAUTI ASR v1 powers the speech recognition stage of our voice agent and real-time translation pipelines. Try it in the [Speech to Text playground](/speech-to-text).
We are now working on streaming ASR for real-time applications and expanding to additional African languages.