Vaani Atypical Speech Corpus

Speech technology doesn't work for everyone yet.

The Gap in Speech Technology

Today's automatic speech recognition (ASR) systems are built for standard speech. But many people speak in ways that do not match the assumptions built into these systems.

For them, voice interfaces break. Dictation fails. Assistive tools fall short.

This gap becomes even sharper in multilingual contexts like India.

A New Kind of Speech Dataset

The Vaani Atypical Speech Corpus is an early effort to address this gap. Built under Project Vaani, in collaboration with Project Euphonia (Google), this dataset focuses on atypical speech in Indic languages, which remains largely underrepresented in speech data.

Real-world speech, not lab-controlled recordings
Diverse speech patterns across conditions such as autism, cerebral palsy, Down syndrome, and speech & hearing impairments
No reliance on clinical labels — focused on how people actually speak

For anyone working on ASR robustness, accessibility, or personalization, data of this kind is hard to find elsewhere.

Built in collaboration with Project Euphonia — a Google Research initiative working to make speech recognition work for everyone.

How the Data Was Collected

🎙️

Data Collection

Participants describe images in their own words
~20 recordings per participant
20–40 seconds each
Data collected with Karya

📝

Transcription

Done by people familiar with the speaker
Captures intended meaning, not just verbatim text

✅

Validation

Automated + manual checks
Ensures audio quality
Verifies natural speech and safe content
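
As one illustration of what an automated check could look like, the sketch below tests whether a recording falls inside the 20–40 second range described under Data Collection. It is a hypothetical example, not the project's actual validation pipeline: the function names, the use of WAV files, and the idea of treating the range as a hard threshold are all assumptions made for this sketch.

```python
import io
import wave

# Bounds taken from the "20-40 seconds each" description above.
# Treating them as strict limits is an assumption of this sketch.
MIN_SECONDS = 20.0
MAX_SECONDS = 40.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV given a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_expected_range(seconds: float,
                      low: float = MIN_SECONDS,
                      high: float = MAX_SECONDS) -> bool:
    """True when the duration lies inside the expected range (inclusive)."""
    return low <= seconds <= high


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (used only for the demo)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for length in (5, 30, 55):
        duration = wav_duration_seconds(make_silent_wav(length))
        print(length, "seconds ->", "ok" if in_expected_range(duration) else "flag for review")
```

A check like this only catches recordings that are clearly too short or too long. Judging whether speech is natural and content is safe still needs the manual review listed above.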

Expanding What "Speech Diversity" Means

Project Vaani has focused on capturing linguistic diversity across India.

This dataset expands that vision — from how language varies to how speech itself varies.

It is an early step toward building speech technologies that work not just across languages, but across people.

Explore and Build

Early-stage dataset — small today, but highly relevant for research and experimentation.

Test robustness beyond standard benchmarks
Fine-tune models for real-world inclusivity
Explore personalization and adaptive systems
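
For the robustness testing mentioned above, a common starting point is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn a model's output into the reference transcript, divided by the number of reference words. The sketch below is a minimal, dependency-free version. It assumes whitespace tokenization and applies no text normalization, which is a simplification that may not suit every Indic language or script. The function name and the sample sentences are illustrative and do not come from the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "the boy is flying a kite"
    hypothesis = "the boy flying a kite"
    print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Because the transcripts in this corpus capture intended meaning rather than strictly verbatim text, a WER computed against them is best read as a measure of how well a model recovers what the speaker meant to say.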

Explore the Dataset on Hugging Face

If you are working on similar problems or collecting related data, you can reach us at vaanicontact@gmail.com.