Vaani Atypical Speech Corpus

Speech technology doesn't work for everyone yet.

The Gap in Speech Technology

Today's automatic speech recognition (ASR) systems are built for standard speech. But many people speak in ways these systems do not expect.

For them, voice interfaces break. Dictation fails. Assistive tools fall short.

This gap becomes even sharper in multilingual contexts like India.

A New Kind of Speech Dataset

The Vaani Atypical Speech Corpus is an early effort to address this gap. Built under Project Vaani in collaboration with Project Euphonia (Google), the dataset focuses on atypical speech in Indic languages, an area that remains largely underrepresented.

  • Real-world speech, not lab-controlled recordings
  • Diverse speech patterns across conditions such as autism, cerebral palsy, Down syndrome, and speech & hearing impairments
  • No reliance on clinical labels — focused on how people actually speak

For anyone working on ASR robustness, accessibility, or personalization, this is a dataset that is hard to find elsewhere.

Project Euphonia

Project Euphonia is a Google Research initiative working to make speech recognition work for everyone. This corpus was built in collaboration with it.

How the Data Was Collected

Data Collection

  • Participants describe images in their own words
  • ~20 recordings per participant
  • 20–40 seconds each
  • Data collected with Karya

Transcription

  • Done by people familiar with the speaker
  • Captures intended meaning, not just verbatim text

Validation

  • Automated + manual checks
  • Ensures audio quality
  • Verifies natural speech and safe content
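
The page does not say what the automated checks consist of. As a purely illustrative sketch, one simple check could flag recordings whose length falls outside the 20 to 40 second range mentioned above and route them to manual review. The WAV format, the function names, and the thresholds used as pass/fail bounds are all assumptions, not the project's actual pipeline.

```python
import wave

# Assumed bounds, borrowed from the "20-40 seconds each" figure above.
# The thresholds actually used in validation are not stated in the source.
MIN_SECONDS = 20.0
MAX_SECONDS = 40.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_expected_length(seconds: float,
                       low: float = MIN_SECONDS,
                       high: float = MAX_SECONDS) -> bool:
    """True if a recording's duration falls inside the assumed range."""
    return low <= seconds <= high
```

A recording that fails this check is not necessarily bad. In a setup like this it would simply be a candidate for the manual pass.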

Expanding What "Speech Diversity" Means

Project Vaani has focused on capturing linguistic diversity across India.

This dataset expands that vision — from how language varies to how speech itself varies.

It is an early step toward building speech technologies that work not just across languages, but across people.

Explore and Build

This is an early-stage dataset: small today, but highly relevant for research and experimentation.

  • Test robustness beyond standard benchmarks
  • Fine-tune models for real-world inclusivity
  • Explore personalization and adaptive systems

Explore the Dataset on Hugging Face
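
One common way to test robustness is to compare a model's output against the corpus transcripts using word error rate (WER). The sketch below is a minimal, dependency-free version. The function name and the plain whitespace tokenization are assumptions for illustration; Indic-language text may need script-aware normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current

    return previous[-1] / len(ref_words)
```

Because the transcripts capture intended meaning rather than strictly verbatim text, a score computed this way reflects how well a model recovers what the speaker meant to say, which may differ from WER on conventional benchmarks.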

If you are working on similar problems or collecting related data, you can reach us at vaanicontact@gmail.com.