Capturing the language landscape for an inclusive digital India

Project Vaani is one of the largest datasets of Indian dialects ever to exist. Upon completion, it will contain more than 150,000 hours of audio across all districts in India.


About Project VAANI

59,08,252

10613.74hr

66,700

45.58 %

54.42 %

92.82 hrs

13

80

Digital India is marching ahead inexorably. Digital interfaces and communications have become critical for access to information, entertainment, economic opportunities and even essential services such as healthcare.

Project Vaani, by IISc, Bangalore and ARTPARK, is capturing the true diversity of India’s spoken languages to propel language AI technologies and content for an inclusive Digital India.

We expect to create data corpora of over 150,000 hours of speech, part of which will be transcribed in local scripts, while ensuring linguistic, educational, urban-rural, age, and gender diversity (among other potential diversity characteristics). These diligently collected and curated datasets of natural speech and text from about 1 million people across all 773 districts of India will be open-sourced. The current version of the data is open-sourced here. Going forward, we hope to open-source the data through platforms like Bhashini (under the National Language Translation Mission, MeitY).

This will boost the development of technologies such as automatic speech recognition (ASR), speech-to-speech translation (SST), and natural language understanding (NLU) that reflect the ground realities of how Indians speak.

Google is funding Project Vaani.

India’s Vernacular Vista

Experience the linguistic diversity of India in a click. Our dataset is intended to be a treasure trove of speech data from across India’s districts. It offers a comprehensive overview of speech data from all districts, emphasizing the language variety in each district and providing a unique glimpse into India’s rich cultural tapestry. Discover the richness of India’s linguistic landscape and delve into the statistics that bring our nation’s linguistic diversity to life.

Want to hear the voices? Click on a state.
Per-state figures (one row per state):

Record Duration (hrs)    Speaker Count    Transcription Duration (hrs)
757.57                   4786             4.93
2942.92                  18358            19.50
1064.02                  6928             6.00
116.09                   712              2.23
276.52                   1768             3.29
1219.72                  7553             21.71
883.69                   5465             11.33
249.05                   1606             0.97
333.28                   2049             3.14
224.22                   1391             1.30
808.48                   5424             5.67
529.36                   3153             9.61
1208.82                  7520             6.65

The colors on the map indicate the recorded duration in hours for each state.
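
As a quick sanity check on the per-state figures, the short Python sketch below totals the three columns. It assumes each row reads as (recorded hours, speaker count, transcribed hours), with both durations given to two decimals; the names `STATE_ROWS` and `totals` are illustrative and not part of any Project Vaani tooling. Under that reading, the recorded-hours total comes to 10613.74, the same figure shown near the top of the page.

```python
# Per-state rows, read as (recorded_hours, speaker_count, transcribed_hours).
# Assumption: durations carry two decimals; state names are not available here.
STATE_ROWS = [
    (757.57, 4786, 4.93),
    (2942.92, 18358, 19.50),
    (1064.02, 6928, 6.00),
    (116.09, 712, 2.23),
    (276.52, 1768, 3.29),
    (1219.72, 7553, 21.71),
    (883.69, 5465, 11.33),
    (249.05, 1606, 0.97),
    (333.28, 2049, 3.14),
    (224.22, 1391, 1.30),
    (808.48, 5424, 5.67),
    (529.36, 3153, 9.61),
    (1208.82, 7520, 6.65),
]


def totals(rows):
    """Return (total recorded hours, total speakers, total transcribed hours)."""
    recorded = round(sum(r[0] for r in rows), 2)
    speakers = sum(r[1] for r in rows)
    transcribed = round(sum(r[2] for r in rows), 2)
    return recorded, speakers, transcribed


if __name__ == "__main__":
    recorded, speakers, transcribed = totals(STATE_ROWS)
    print(f"Recorded hours:    {recorded:.2f}")
    print(f"Speakers:          {speakers}")
    print(f"Transcribed hours: {transcribed:.2f}")
```

The speaker and transcription totals computed this way need not match the headline counters exactly, since those may be rounded or updated on a different schedule.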

Unleash the Power of our Data!

Download the most diverse open-source speech dataset for Indian languages. We encourage the use of this dataset to develop and improve your speech AI technologies and applications for an inclusive Digital India.

Our Team

Dr. Prasanta Kumar Ghosh

Assistant professor, Dept of Electrical Engineering (EE) at Indian Institute of Science (IISc), Bangalore.

Prasanta Kumar Ghosh received his Ph.D. in Electrical Engineering from the University of Southern California (USC), Los Angeles, USA, in 2011. During 2011-2012 he was a researcher at IBM India Research Lab (IRL). He was awarded the INSPIRE faculty fellowship by the Department of Science and Technology (DST), Govt. of India, in 2012. At USC, he received the Center of Excellence in Teaching’s award for excellence in teaching in the EE category for 2010-11. His research interests are in human-centered signal processing with applications to education and health care.

Raghu Dharmaraju

President, ARTPARK

He is a highly experienced innovator with over two decades of experience conceiving and scaling pioneering institutions and innovations. He has launched a portfolio of eight AI innovations, including TRACE-TB, a major national initiative. He has raised $19 million from the Gates Foundation, USAID, and Google.org and has established. Raghu has also launched and scaled award-winning med-tech innovations, including Embrace infant warmers, which have reached approximately 1,000,000 babies via WHO and non-profits, governments, and 300+ private hospitals in 100+ small towns. He has led strategy, product management, and startup operations for a new global business at Corning Environmental Technologies, and has experience in a range of industries including digital health, med-tech, agriculture, financial inclusion, circular economy, and more. Raghu holds a B.Tech. from IIT Madras, an M.S. from the University of Massachusetts, Amherst, and an M.B.A. from Cornell.

Frequently asked questions

If you can’t find what you’re looking for, email us and we will get back to you.

Why is capturing the language landscape of India important?

Capturing India's diverse language landscape is vital for an inclusive Digital India, as only 10% of the population speaks English. Existing language AI models may not meet the linguistic diversity of India, where languages blend continuously. Initiatives like the National Language Translation Mission and Project Vaani aim to collect authentic language data, addressing the limitations of biased language models.

Why is this data shown per district and not per language?

We believe that language in India is like a fabric whose color changes gradually from one end to the other: the language spoken shifts every few kilometers. Following this view, we are collecting a dataset that is representative of each district, which may contain multiple languages. Click on State > District to see which languages we have recorded to date.

Who can use this data?

This dataset is open source and can be used by any individual or organization. Any startup is welcome to use this dataset. Feedback on the dataset is always welcome.