Translation systems rarely support African languages
A large-scale multimodal corpus for low-resource African languages
Thiomi NLP is an open research initiative advancing artificial intelligence for African languages through datasets, models, and tools like Dr. Lugha.
Our mission is to empower communication, education, and innovation across Africa by making language barriers a thing of the past.
The Problem
African languages remain underrepresented in modern AI systems
Artificial intelligence systems require large datasets to understand language. However, African languages often lack the data resources required to train modern machine learning systems.
Why datasets are critical
Thiomi NLP addresses this challenge by building large-scale datasets and machine learning systems for African languages.
Datasets are critical for translation, speech recognition, and language tools.
Speech recognition tools fail for many African communities
Voice assistants do not understand most African languages
Digital services exclude millions of speakers
Thiomi Dataset
A multimodal research dataset at the center of the initiative
The Thiomi Dataset is a large-scale multimodal dataset designed to support artificial intelligence research for African languages. The dataset combines text and speech data to enable research across multiple language technologies.
Multimodal research areas
The Thiomi Dataset is the initiative's core research contribution and the basis for machine learning models and applications such as Dr. Lugha.
- 601,000+ approved text sentences
- 385,000+ audio recordings
- ~1,500 hours of speech
- 10 African languages
- 100+ community contributors
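As a rough sanity check on these figures, the average recording length can be estimated from the headline numbers. The sketch below treats "385,000+" and "~1,500" as exact values, which is an assumption, so the result is only an approximation.

```python
# Headline figures from the dataset highlights, treated as exact (an assumption).
hours_of_speech = 1_500
audio_recordings = 385_000

# Convert hours to seconds, then divide by the number of recordings.
avg_clip_seconds = hours_of_speech * 3600 / audio_recordings

print(f"Average recording length: about {avg_clip_seconds:.0f} seconds")
```

An average of roughly 14 seconds per recording is consistent with sentence-length utterances rather than long-form speech.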
Languages in the Dataset
Coverage across East and West Africa
The dataset includes languages from East and West Africa. These languages represent hundreds of millions of speakers and multiple linguistic families.
East Africa
- Swahili
- Kikuyu
- Kamba
- Kimeru
- Luo
- Maasai
- Kipsigis
- Somali
West Africa
- Wolof
- Fulani
Dataset Methodology
Community-driven collection across translation and audio workflows
The Thiomi dataset is built using a community-driven collection platform. Two main approaches are used to produce aligned text and audio data suitable for training machine learning models.
Translation-based pipeline
English source sentences are provided
Contributors translate the sentences
Other contributors validate translations
Speakers record audio for the translated text
Audio-first pipeline
Contributors record speech in their native language
Other contributors transcribe the recordings
Transcriptions may be translated into English
Community data collection
These two pipelines produce aligned text and audio data suitable for training machine learning models.
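To make the shape of the aligned data concrete, here is a minimal Python sketch of how a record from either pipeline might be represented. The class, field names, and helper functions are hypothetical illustrations and not the dataset's published schema; dropping unvalidated translations before pairing them with audio is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignedRecord:
    """One aligned text/audio example (illustrative fields, not the real schema)."""
    language: str                  # e.g. "Swahili"
    text: str                      # sentence in the target language
    audio_path: str                # recording of that sentence
    english: Optional[str] = None  # English side, when one exists
    pipeline: str = "translation"  # "translation" or "audio_first"


def from_translation_pipeline(english: str, translation: str, validated: bool,
                              audio_path: str, language: str) -> Optional[AlignedRecord]:
    """English source -> translation -> peer validation -> recording."""
    if not validated:
        # Assumption: only validated translations are paired with audio.
        return None
    return AlignedRecord(language, translation, audio_path, english, "translation")


def from_audio_first_pipeline(audio_path: str, transcription: str, language: str,
                              english: Optional[str] = None) -> AlignedRecord:
    """Recording -> transcription -> optional English translation."""
    return AlignedRecord(language, transcription, audio_path, english, "audio_first")
```

In this sketch both pipelines end in the same record type, so downstream training code would not need to know which workflow produced an example. The English field is optional because the audio-first pipeline does not always produce a translation.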
Quality Assurance
Quality control is built into the collection process
Peer moderation, expert review, and staged checks help keep both text and audio data usable for research.
Assurance Model
Quality is measured, moderated, and reviewed throughout the pipeline
The collection process combines peer moderation by native-language contributors with sampled expert review so that text and audio quality are checked before the corpus is used downstream.
Text approval: 95-98% (primary languages)
Audio approval: 78-86%
Peer moderation by native-language contributors (100% coverage)
Expert review for sampled submissions (10% sampling rate)
Multi-stage validation across text and audio
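The two review layers above can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's actual implementation: the hash-based sampling, the function names, and the submission ID format are all hypothetical. Every submission is assumed to receive a peer decision, and about 10% are additionally flagged for expert review.

```python
import hashlib
from typing import Sequence


def needs_expert_review(submission_id: str, rate: float = 0.10) -> bool:
    """Deterministically flag roughly `rate` of submissions for expert review."""
    digest = hashlib.sha256(submission_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def approval_rate(decisions: Sequence[bool]) -> float:
    """Share of submissions approved by peer moderators."""
    if not decisions:
        raise ValueError("no decisions to summarize")
    return sum(decisions) / len(decisions)
```

Hashing the submission ID, rather than drawing a random number, means the same submission is always either in or out of the expert sample, which makes the review reproducible. A 96% result from `approval_rate` would fall inside the 95-98% text approval range reported above.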
Who Thiomi NLP Is For
A research initiative for builders, institutions, and communities
Thiomi NLP serves developers, startups, NGOs, researchers, and communities.
Developers
Build multilingual applications that support African languages.
Startups
Expand products into African markets.
NGOs
Communicate effectively with multilingual communities.
Researchers
Study and preserve African languages.
Communities
Connect across cultures and languages.
Contact
Research collaboration and developer support
For research collaborations, dataset access, or developer support, contact the Thiomi NLP team.