A large-scale multimodal corpus for low-resource African languages

Thiomi NLP is an open research initiative advancing artificial intelligence for African languages through datasets, models, and tools like Dr. Lugha.

Our mission is to empower communication, education, and innovation across Africa by making language barriers a thing of the past.

The Problem

African languages remain underrepresented in modern AI systems

Artificial intelligence systems require large datasets to understand language. However, African languages often lack the data resources required to train modern machine learning systems.

Why datasets are critical

Thiomi NLP addresses this challenge by building large-scale datasets and machine learning systems for African languages.

Datasets are critical for translation, speech recognition, and other language tools. Without adequate data:

  • Translation systems rarely support African languages
  • Speech recognition tools fail for many African communities
  • Voice assistants do not understand most African languages
  • Digital services exclude millions of speakers

Thiomi Dataset

A multimodal research dataset at the center of the initiative

The Thiomi Dataset is a large-scale multimodal dataset designed to support artificial intelligence research for African languages. The dataset combines text and speech data to enable research across multiple language technologies.

Multimodal research areas

  • Machine Translation
  • Speech Recognition
  • Text-to-Speech

The initiative centers around the Thiomi Dataset as the core research contribution and the basis for machine learning models and applications such as Dr. Lugha.

Dataset highlights

  • 601,000+ approved text sentences
  • 385,000+ audio recordings
  • ~1,500 hours of speech
  • 10 African languages
  • 100+ community contributors
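
As a rough sanity check, the headline figures imply an average recording length. The short Python calculation below uses only the two published numbers; both are approximate ("~" and "+"), so the result is an order-of-magnitude estimate, not a reported statistic.

```python
HOURS_OF_SPEECH = 1_500      # "~1,500 hours of speech" (approximate)
AUDIO_RECORDINGS = 385_000   # "385,000+ audio recordings" (a lower bound)

# Convert hours to seconds, then divide by the number of recordings.
avg_seconds = HOURS_OF_SPEECH * 3600 / AUDIO_RECORDINGS
print(f"about {avg_seconds:.0f} seconds per recording")  # about 14 seconds per recording
```

A clip length in this range is consistent with sentence-level recordings, which is what the collection workflows described in this document would produce.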

Languages in the Dataset

Coverage across East and West Africa

The dataset includes languages from East and West Africa. These languages represent hundreds of millions of speakers and multiple linguistic families.

East Africa

  • Swahili
  • Kikuyu
  • Kamba
  • Kimeru
  • Luo
  • Maasai
  • Kipsigis
  • Somali

West Africa

  • Wolof
  • Fulani

Dataset Methodology

Community-driven collection across translation and audio workflows

The Thiomi dataset is built using a community-driven collection platform. Two main approaches are used to produce aligned text and audio data suitable for training machine learning models.

Translation-based pipeline

  1. English source sentences are provided

  2. Contributors translate the sentences

  3. Other contributors validate translations

  4. Speakers record audio for the translated text
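
The four steps above can be read as a state progression for each sentence. The following is a minimal Python sketch under that reading. The `Stage` enum, the `TranslationItem` class, its field names, and the strict step ordering are assumptions made for illustration; they are not taken from the Thiomi platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    SOURCE = 1      # English source sentence provided
    TRANSLATED = 2  # a contributor has translated it
    VALIDATED = 3   # another contributor has approved the translation
    RECORDED = 4    # a speaker has recorded the translated text


@dataclass
class TranslationItem:
    english: str
    language: str
    translation: Optional[str] = None
    translator: Optional[str] = None
    audio_path: Optional[str] = None
    stage: Stage = Stage.SOURCE

    def translate(self, text: str, contributor: str) -> None:
        if self.stage is not Stage.SOURCE:
            raise ValueError("item has already been translated")
        self.translation = text
        self.translator = contributor
        self.stage = Stage.TRANSLATED

    def validate(self, reviewer: str) -> None:
        if self.stage is not Stage.TRANSLATED:
            raise ValueError("only translated items can be validated")
        # Step 3 says *other* contributors validate, so block self-review.
        if reviewer == self.translator:
            raise ValueError("translators cannot validate their own work")
        self.stage = Stage.VALIDATED

    def record(self, audio_path: str) -> None:
        if self.stage is not Stage.VALIDATED:
            raise ValueError("only validated translations are recorded")
        self.audio_path = audio_path
        self.stage = Stage.RECORDED


# Illustrative walk through all four steps (example values only).
item = TranslationItem(english="Good morning.", language="Swahili")
item.translate("Habari ya asubuhi.", contributor="contributor_a")
item.validate(reviewer="contributor_b")
item.record("clips/example.wav")
print(item.stage.name)
```

Modelling the stages explicitly makes it easy to count how many sentences are waiting at each step, and to keep unvalidated translations out of the recording queue.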

Audio-first pipeline

  1. Contributors record speech in their native language

  2. Other contributors transcribe the recordings

  3. Transcriptions may be translated into English
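
Because the English translation in step 3 is optional, a record from this pipeline can always yield a speech/transcript pair once it is transcribed, but only sometimes a translation pair. A minimal Python sketch of that distinction follows; the class, field, and function names are hypothetical, and the transcript strings are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AudioFirstRecord:
    audio_path: str                   # step 1: native-language recording
    language: str
    transcript: Optional[str] = None  # step 2: added by another contributor
    english: Optional[str] = None     # step 3: optional English translation


def speech_pairs(records: List[AudioFirstRecord]) -> List[Tuple[str, str]]:
    """(audio, transcript) pairs, available once a transcript exists."""
    return [(r.audio_path, r.transcript) for r in records if r.transcript]


def translation_pairs(records: List[AudioFirstRecord]) -> List[Tuple[str, str]]:
    """(transcript, english) pairs, only where step 3 was carried out."""
    return [(r.transcript, r.english) for r in records if r.transcript and r.english]


records = [
    AudioFirstRecord("clips/a.wav", "Kikuyu", transcript="<transcript a>", english="<english a>"),
    AudioFirstRecord("clips/b.wav", "Kikuyu", transcript="<transcript b>"),
    AudioFirstRecord("clips/c.wav", "Kikuyu"),  # recorded, not yet transcribed
]
print(len(speech_pairs(records)), len(translation_pairs(records)))  # 2 1
```

In this sketch, the first list would feed speech recognition or text-to-speech work, and the second would feed machine translation.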

These two pipelines produce aligned text and audio data suitable for training machine learning models.

Quality Assurance

Quality control is built into the collection process

Peer moderation, expert review, and staged checks help keep both text and audio data usable for research.

Assurance Model

Quality is measured, moderated, and reviewed throughout the pipeline

The collection process combines peer moderation by native-language contributors with sampled expert review so that text and audio quality are checked before the corpus is used downstream.

  1. 95-98% text approval rates across primary languages

  2. 78-86% audio approval rates

  3. Peer moderation by native-language contributors (100% coverage)

  4. Expert review for sampled submissions (10% sampling rate)

  5. Multi-stage validation across text and audio
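
To make the two review layers concrete: peer moderation touches every submission, while expert review sees roughly one in ten. The sketch below is one way such a policy could be expressed in Python. The hash-based sampling, the function names, and the submission IDs are assumptions for illustration and do not describe how the Thiomi platform actually selects samples.

```python
import hashlib
from typing import Iterable

EXPERT_SAMPLING_RATE = 0.10  # the 10% sampling rate stated above


def needs_expert_review(submission_id: str, rate: float = EXPERT_SAMPLING_RATE) -> bool:
    """Deterministically select about `rate` of submissions for expert review."""
    digest = hashlib.sha256(submission_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def approval_rate(decisions: Iterable[bool]) -> float:
    """Share of submissions approved by peer moderators."""
    decisions = list(decisions)
    if not decisions:
        raise ValueError("no decisions to summarise")
    return sum(decisions) / len(decisions)


# Hypothetical submission IDs, used only to exercise the sampler.
ids = [f"submission-{i}" for i in range(10_000)]
sampled = sum(needs_expert_review(i) for i in ids)
print(f"expert review share: {sampled / len(ids):.3f}")
print(f"approval rate: {approval_rate([True] * 96 + [False] * 4):.2f}")
```

Hashing the ID instead of drawing a random number means the same submission always gets the same decision, so the expert sample can be reproduced later. That is a design choice of this sketch only.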

Who Thiomi NLP Is For

A research initiative for builders, institutions, and communities

Thiomi NLP is built for developers, researchers, startups, NGOs, and general users.

Developers

Build multilingual applications that support African languages.

Startups

Expand products into African markets.

NGOs

Communicate effectively with multilingual communities.

Researchers

Study and preserve African languages.

Communities

Connect across cultures and languages.

Contact

Research collaboration and developer support

For research collaborations, dataset access, or developer support, contact the Thiomi NLP team.
