Google has announced the release of WAXAL, a large open speech dataset designed to accelerate the development of voice technologies for African languages, including Hausa and Yoruba.
The technology company says the initiative targets one of the most persistent challenges facing voice-based artificial intelligence in Sub-Saharan Africa, a region home to more than 2,000 languages that remain largely underrepresented in the datasets used to train modern speech technologies.
“For people across much of the world, talking to devices is second nature,” Google says. “But this convenience disappears when technology doesn’t speak your language.”

WAXAL to make African language speech data ‘openly available’
According to Google, the scarcity of accessible, high-quality speech data has been a major barrier to building practical voice tools for African users. Developed over a three-year period, the WAXAL dataset is intended to help close that gap by making African language speech data openly available to researchers, developers and technology companies.
Google explains that the name WAXAL is derived from the Wolof word for “speak”, underscoring the project’s goal of enabling people to interact with technology in their own languages.
The dataset spans 21 African languages, including Hausa, Yoruba, Acholi and Luganda, and contains more than 11,000 hours of speech data drawn from nearly two million individual recordings.
“This includes approximately 1,250 hours of transcribed speech for automatic speech recognition (ASR) and over 20 hours of studio-quality recordings for text-to-speech (TTS) voice synthesis,” Google says.
Beyond its scale, Google says WAXAL stands out for its collaborative approach, with African institutions and organisations playing a central role in data collection and project execution.
Google describes WAXAL as “a project built by and for the community” and calls the dataset “a collaborative achievement, powered by the expertise of leading African organisations”.
According to Google, Makerere University in Uganda and the University of Ghana led data collection efforts covering a combined 13 languages, while Digital Umuganda in Rwanda coordinated work on five major languages. Media Trust and Loud n Clear produced the studio-quality voice recordings, and the African Institute for Mathematical Sciences contributed multilingual data for future releases.
“This framework ensures our partners retain ownership of the data they collected, while working with us toward the shared goal of making these resources available to the global research community,” Google says.
On data collection methodology, Google notes that the project prioritised natural, everyday language use. Participants were asked to describe images in their native languages, while professional voice actors were recorded in studios to generate high-quality audio suitable for speech synthesis.
“We wanted to capture how people really talk,” the company says.
Google adds that beyond enabling new voice-driven applications, the dataset could also support the digital preservation of African languages.
“We hope WAXAL will not only fuel innovation but also aid in the digital preservation of African languages,” it says.
The full WAXAL dataset has been released under an open licence and is now available on Hugging Face, alongside a research paper detailing the methodology behind its development.























