Explore Digital India's Linguistic Tapestry

Dive into India's rich linguistic tapestry with our platform, where you can filter and download speech data from different states and districts, contributing to language AI advancements for an inclusive digital future.

Access previous versions of the JSON here: V 1.0

Data is now completely hosted on Hugging Face

Guidelines for using the VAANI data

Unlock the full potential of VAANI's diverse speech dataset with these tailored guidelines:

1. Identify Your Target Audience:
Before diving into downloads, pinpoint your intended audience. Whether it's Hindi speakers in urban areas or Kannada speakers in rural regions, understanding your target demographic is key.
2. Explore Region-Specific Data:
Delve into the dataset by selecting your desired region, be it a state or district. Each region offers a unique linguistic landscape, allowing you to tailor your data selection to match the demographics of your target audience.
3. Language-Specific Downloads:
If you want language-specific data, download the data files from all states, then use the metadata included with each file to filter and extract the files for your language. Filtering on the language metadata lets you efficiently collect all files for your target language across districts.
4. Assess Volume and Diversity:
Gain insights into the richness of available data for your selected region. Evaluate the volume and diversity of speech samples to ensure they align with your project requirements and audience preferences.
5. Download Data Files:
Upon completion of the form, download the JSON file containing links to all available data files corresponding to your selected region. This comprehensive file streamlines the download process, enabling easy access to the wealth of speech data VAANI offers.
6. Key note:
Approximately 10% of the VAANI data is transcribed. Untranscribed speech data can be used for self-supervised speech representation learning and for many downstream applications, including automatic speech recognition. Although untranscribed, the speech data can also be used for applications such as language identification and accent classification, given accurate language and accent labels.
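
The filtering and volume check described in guidelines 3 and 4 could be sketched in Python roughly as follows. This is only an illustration: the field names (`url`, `district`, `language`, `duration_sec`) and the sample records are assumptions made up for the example, not the actual VAANI metadata schema, so adapt them to the fields found in the files you download.

```python
import json
from collections import defaultdict

# Made-up sample records. The field names are hypothetical and do NOT
# reflect the real VAANI metadata schema; the URLs are placeholders.
SAMPLE_JSON = """
[
  {"url": "https://example.org/a.wav", "district": "Mysuru",  "language": "Kannada", "duration_sec": 12.5},
  {"url": "https://example.org/b.wav", "district": "Mysuru",  "language": "Kannada", "duration_sec": 7.5},
  {"url": "https://example.org/c.wav", "district": "Mysuru",  "language": "Hindi",   "duration_sec": 10.0},
  {"url": "https://example.org/d.wav", "district": "Dharwad", "language": "Kannada", "duration_sec": 5.0}
]
"""


def filter_by_language(records, language):
    """Keep only records whose (assumed) 'language' field matches, ignoring case."""
    wanted = language.casefold()
    return [r for r in records if str(r.get("language", "")).casefold() == wanted]


def summarize_by_district(records):
    """Count files and sum durations per (assumed) 'district' field."""
    summary = defaultdict(lambda: {"files": 0, "total_sec": 0.0})
    for r in records:
        entry = summary[r.get("district", "unknown")]
        entry["files"] += 1
        entry["total_sec"] += float(r.get("duration_sec", 0.0))
    return dict(summary)


records = json.loads(SAMPLE_JSON)
kannada = filter_by_language(records, "kannada")
per_district = summarize_by_district(kannada)

for district, stats in per_district.items():
    print(f"{district}: {stats['files']} files, {stats['total_sec']:.1f} s")
```

In practice, `records` would be built from the metadata of the files you downloaded rather than from an inline string; the per-district summary is one simple way to judge whether the volume of data for a language matches your project's needs.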