AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo et al. | April 9, 2026 | arXiv

Key Takeaway

This dataset addresses a major gap: most speech AI is trained on English and a few European languages, leaving African languages behind. AfriVoices-KE provides a foundation for building fair, inclusive speech technology for Kenya.

Summary

AfriVoices-KE is a 3,000-hour multilingual speech dataset covering five Kenyan languages, with recordings from nearly 5,000 native speakers. It combines scripted and spontaneous speech to support speech technology, such as voice assistants and transcription tools, for underrepresented African languages.

Key Terms

multilingual speech corpus, low-resource languages, automatic speech recognition, text-to-speech, signal-to-noise ratio