Building Speech Technology for Garo: A Low-Resource ASR Breakthrough Using the Vaani Dataset

Summary

Many Indian languages remain technologically underserved despite having large speaker populations. Garo, a Tibeto-Burman language spoken by approximately 1.2 million people across Northeast India and Bangladesh, is one such language, with virtually no digital speech infrastructure. To address this gap, Mwire Labs developed an Automatic Speech Recognition (ASR) system for Garo by fine-tuning the Whisper-small model on speech data from the Vaani dataset. The resulting system achieved a Word Error Rate (WER) of 9.74% and a Character Error Rate (CER) of 3.82%, a 97.5% relative reduction in WER compared with the zero-shot baseline. The model produced perfect transcriptions for more than 60% of test samples and ran about 20× faster than real time, making it practical for deployment in applications such as voice interfaces, language learning tools, and cultural documentation. This work demonstrates how high-quality multilingual speech datasets like Vaani can enable effective ASR systems even for low-resource languages.

The Challenge: Speech Technology for Low-Resource Languages

While modern ASR systems have achieved remarkable success for high-resource languages such as English, the same progress has not extended to many regional languages. Garo illustrates several of the challenges that hinder speech technology development for low-resource languages.

Limited Training Data

ASR systems require large volumes of labeled audio data. However, most low-resource languages have very limited annotated speech datasets, making it difficult to train robust models.

Absence from Large Multilingual Models

Large multilingual ASR models such as Whisper are trained on vast datasets covering many languages. However, Garo is not included in Whisper's original training corpus, leading to extremely poor zero-shot performance.

Linguistic Complexity

Garo presents unique linguistic challenges that make it difficult for standard ASR systems to accurately recognize and transcribe its speech:

It belongs to the Tibeto-Burman language family.
The language is highly agglutinative, meaning words often contain multiple prefixes and suffixes.
Complex morphological structures create long compound words, which increase transcription difficulty.

The Solution: Leveraging the Vaani Dataset for Multilingual Transfer

To overcome these challenges, Mwire Labs leveraged transfer learning from multilingual ASR models combined with targeted fine-tuning using the Vaani dataset. The Vaani dataset, developed by ARTPARK-IISc, contains spontaneous speech collected across 165 districts in India covering 109 languages. Speakers describe images presented to them, producing natural conversational speech that reflects real-world linguistic diversity.

For this project, the following training setup was used:

Whisper-small (244M parameters) was selected as the base model.
The model was fine-tuned on the Garo subset of the Vaani dataset.
Audio samples averaged around 4 seconds in duration and were recorded at a 16 kHz sampling rate.
Standard training techniques were applied: AdamW optimization, learning-rate scheduling, mixed-precision training, and gradient accumulation.
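Two of the techniques listed above, learning-rate scheduling and gradient accumulation, can be illustrated with a small sketch. The source does not state which schedule or which hyperparameters were used, so the linear warmup and decay shape and every number below (peak rate, step counts, batch sizes) are assumptions chosen only for illustration.

```python
def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero.

    This is one common schedule shape; it is an assumption here,
    not the schedule reported for the Garo model.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be greater than warmup_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


# Hypothetical values, for illustration only.
PEAK_LR = 1e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 4000

for step in (0, 250, 500, 2250, 4000):
    print(step, lr_at_step(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS))

# Accumulating gradients over 4 micro-batches of 8 behaves like a batch of 32.
print(effective_batch_size(per_device_batch=8, accumulation_steps=4))
```

Gradient accumulation matters on small GPUs: it lets a model be trained with a larger effective batch than fits in memory at once, at the cost of more forward and backward passes per update.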

This approach allowed the model to transfer knowledge from multilingual speech patterns while adapting specifically to Garo phonology and morphology.

Results: Breaking Barriers — High-Accuracy ASR for Low-Resource Languages

The fine-tuned model significantly outperformed the baseline zero-shot Whisper model across all evaluation metrics.
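Both evaluation metrics are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the model output into the reference, divided by the reference length. The sketch below is a minimal plain-Python version, not the evaluation code used in the project, and the example strings are made-up placeholders rather than Garo text. It shows why a zero-shot score can exceed 100% (inserted words count as errors) and why a single wrong character in a long compound word costs far more in WER than in CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Repeated output adds four extra words against a two-word reference.
print(wer("hello world", "hello hello hello world world world"))  # 2.0, i.e. 200%

# One wrong character in a long word: half the words are wrong,
# but only 1 of 24 characters is.
ref = "one verylongcompoundword"
hyp = "one verylongcompoundwurd"
print(wer(ref, hyp), cer(ref, hyp))
```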

A. Benchmark Performance

The results demonstrate a dramatic improvement over the zero-shot baseline:

Model Comparison: WER and CER (%)

Model	WER (%)	CER (%)
Whisper-small (Zero-Shot)	382.7	203.5
Fine-Tuned Garo ASR	9.74	3.82

97.5% relative improvement in Word Error Rate
98.1% relative improvement in Character Error Rate
Over 60% of test samples were transcribed perfectly (0% WER)
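The relative-improvement figures follow directly from the table values as (baseline - fine-tuned) / baseline. The short check below reproduces them. The exact-match helper shows how a share of perfectly transcribed samples could be computed; its toy inputs are invented for illustration and are not project data.

```python
def relative_improvement(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


def exact_match_rate(pairs):
    """Share of (reference, hypothesis) pairs transcribed with no errors."""
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)


# Values taken from the comparison table.
wer_gain = relative_improvement(382.7, 9.74)
cer_gain = relative_improvement(203.5, 3.82)
print(f"WER: {wer_gain:.1%}")  # WER: 97.5%
print(f"CER: {cer_gain:.1%}")  # CER: 98.1%

# Toy pairs (made up): two of three hypotheses match their reference.
toy_pairs = [("a b", "a b"), ("c d", "c x"), ("e", "e")]
print(exact_match_rate(toy_pairs))
```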

B. Real-Time Performance

The system also demonstrated strong inference efficiency, making it practical for live deployment:

Real-time factor: 0.05, meaning the model transcribes audio about 20× faster than its playback duration.
Suitable for live transcription and interactive voice applications.
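The real-time factor is processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below shows the arithmetic behind the figures above and one way such a measurement could be taken. The `transcribe` callable and the dummy used in its place are hypothetical stand-ins, not the project's inference code.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than playback the system runs."""
    return 1 / rtf


def measure_rtf(transcribe, samples, sample_rate=16_000):
    """Time one call to a transcription function on a list of audio samples."""
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    transcribe(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# At the reported factor of 0.05, a typical 4-second clip
# would need about 0.2 seconds of processing.
print(speedup(0.05))
print(0.05 * 4)

# Hypothetical stand-in for a real model call, used only to exercise the timer.
def dummy_transcribe(samples):
    return ""

four_second_clip = [0.0] * (4 * 16_000)
print(measure_rtf(dummy_transcribe, four_second_clip) < 1.0)
```

In practice the measured factor depends on hardware, batch size, and decoding settings, so it is best reported alongside the device it was measured on.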

Impact: Enabling Digital Access for a Low-Resource Language

The development of a high-accuracy ASR system for Garo opens the door to several practical applications:

Voice-Based Digital Services

Speech-to-text systems can enable voice assistants, accessibility tools, and voice search in the Garo language.

Education and Language Learning

Accurate transcription systems can support language learning tools and literacy programs for Garo speakers.

Cultural Preservation

ASR technology can assist in digitizing oral traditions, folklore, and cultural heritage recordings, preserving them for future generations.

Content Creation

The system can help creators produce media, subtitles, and digital content in the Garo language, increasing representation in the digital ecosystem.

The Vaani Significance

This project reinforces the broader importance of the Vaani dataset in enabling inclusive AI development.

Enabling Technology for Underrepresented Languages

The dataset provides high-quality speech data for languages that otherwise lack digital resources, making it possible to build ASR systems that were previously infeasible.

Accelerating Multilingual AI

By capturing spontaneous speech across 165 districts, Vaani provides models with exposure to accent diversity, dialect variation, and natural conversational patterns.

Supporting Inclusive AI Development

Projects like this demonstrate that inclusive datasets are critical to ensuring that AI technologies serve all linguistic communities, not just globally dominant languages.

Conclusion: A Step Toward Inclusive Speech Technology

The development of a Garo ASR system demonstrates that low-resource languages can achieve high-quality speech recognition through multilingual transfer learning and high-quality datasets. By fine-tuning Whisper-small on the Garo subset of the Vaani dataset, Mwire Labs reduced WER from 382.7% to 9.74%, a major milestone for a language with no prior ASR infrastructure.

This work serves as a replicable blueprint: with a suitable dataset and a transfer learning approach, it should be possible to build practical, high-accuracy speech technology for many of India's other underserved languages.