Explore Digital India's Linguistic Tapestry

Dive into India's rich linguistic tapestry with our platform, where you can filter and download speech data from different states and districts, contributing to language AI advancements for an inclusive digital future.


Select and effortlessly download speech data

Use the slicer below to download filtered data.

Choose states or districts

Guidelines for using the VAANI data

Unlock the full potential of VAANI's diverse speech dataset with these tailored guidelines:

  1. Identify Your Target Audience:

    Before diving into downloads, pinpoint your intended audience. Whether it's Hindi speakers in urban areas or Kannada speakers in rural regions, understanding your target demographic is key.

  2. Explore Region-Specific Data:

    Delve into the dataset by selecting your desired region, be it a state or district. Each region offers a unique linguistic landscape, allowing you to tailor your data selection to match the demographics of your target audience.

  3. Language-Specific Downloads:

    If you want language-specific data, download the data files from all states, then use the metadata included with each file to filter and extract the recordings in your target language. Filtering on the language metadata lets you efficiently collect all relevant files for that language across districts.

  4. Assess Volume and Diversity:

    Gain insights into the richness of available data for your selected region. Evaluate the volume and diversity of speech samples to ensure they align with your project requirements and audience preferences.

  5. Download Data Files:

    Upon completion of the form, download the JSON file containing links to all available data files corresponding to your selected region. This comprehensive file streamlines the download process, enabling easy access to the wealth of speech data VAANI offers.

  6. Key note:

    Approximately 10% of the VAANI data is transcribed. The untranscribed speech data can be used for self-supervised speech representation learning and for many applications, including automatic speech recognition. Although untranscribed, it can also be used for applications such as language identification and accent classification, given accurate language and accent labels.
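
Steps 3 to 5 above can be sketched in Python. This is a minimal illustration, not VAANI's official tooling: the layout of the downloaded JSON file and the metadata field names are not documented on this page, so the helper names (`extract_links`, `filter_by_language`, `count_by`) and the `language` and `district` keys are assumptions made for the example. To avoid guessing the JSON layout, the link extractor simply walks the whole structure and collects every string that looks like a URL.

```python
import json
from collections import Counter


def extract_links(manifest_text):
    """Collect every http(s) URL found anywhere in the downloaded JSON file.

    The real file layout is unknown here, so the whole structure is walked
    instead of relying on specific keys.
    """
    found = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node.startswith(("http://", "https://")):
            found.append(node)

    walk(json.loads(manifest_text))
    return found


def filter_by_language(records, language, key="language"):
    """Keep metadata records whose language field matches (case-insensitive).

    `key` is an assumed field name; adjust it to the actual metadata.
    """
    wanted = language.strip().lower()
    return [r for r in records if str(r.get(key, "")).strip().lower() == wanted]


def count_by(records, key):
    """Count records per value of a metadata field, e.g. per district."""
    return Counter(r.get(key, "unknown") for r in records)


if __name__ == "__main__":
    # Made-up sample data, only to show how the helpers fit together.
    sample_manifest = '{"StateA": {"DistrictA": ["https://example.org/part1.tar"]}}'
    sample_metadata = [
        {"language": "Hindi", "district": "DistrictA"},
        {"language": "Kannada", "district": "DistrictB"},
    ]
    print(extract_links(sample_manifest))
    print(count_by(filter_by_language(sample_metadata, "hindi"), "district"))
```

Counting filtered records per district or per language gives a quick view of the volume and diversity available for a selection before committing to a large download.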