Capturing the language landscape for an inclusive digital India

Project Vaani is one of the largest datasets of Indian dialects ever to exist. Upon completion, it will contain more than 150,000 hours of audio across all districts in India.

About Project VAANI

14434.23 hr

83,174

59

349.53 hr

80

12

129028

43.90 %

56.10 %

88,34,039

Digital India is marching ahead inexorably. Digital interfaces and communications have become critical for access to information, entertainment, economic opportunities and even essential services such as healthcare.

Project Vaani, by IISc, Bangalore and ARTPARK, is capturing the true diversity of India’s spoken languages to propel language AI technologies and content for an inclusive Digital India.

We expect to create data corpora of over 150,000 hours of speech, part of which will be transcribed in local scripts, while ensuring linguistic, educational, urban-rural, age, and gender diversity (among other potential diversity characteristics). These diligently collected and curated datasets of natural speech and text from about 1 million people across all 773 districts of India will be open-sourced. The current version of the data is open-sourced here. Going forward, we hope to open-source the data through platforms like Bhashini (under the National Language Translation Mission, MeitY).

This will boost the development of technologies such as automatic speech recognition (ASR), speech-to-speech translation (SST), and natural language understanding (NLU) that reflect the ground realities of how Indians speak.

Google is funding Project Vaani.

Explore India’s Vernacular Vista with Vaani

Explore the linguistic diversity of India in a click. Our dataset is intended to be a treasure trove of speech data from across India’s districts. It offers a comprehensive overview of speech data from all districts, emphasizing the language variety in each district, providing a unique glimpse into India’s rich cultural tapestry. Discover the richness of India’s linguistic landscape and delve into the statistics that bring our nation’s linguistic diversity to life.

Want to hear the voices? Click on the states.

| State | Record Duration (hrs) | Speaker Count | Transcription Duration (hrs) | Languages (hrs) | Language Count |
|---|---|---|---|---|---|
|  | 1134.82 | 6417 | 38.11 | Telugu(985.09), Hindi(120.58), Urdu(18.75), Tamil(4.15), Bengali(2.91), Marathi(2.31), Kannada(0.59), Bhojpuri(0.20), Santali(0.12), English(0.12), Maithili(0.04) | 11 |
|  | 3697.16 | 21528 | 45.93 | Hindi(2848.09), Maithili(369.27), Bhojpuri(159), Magahi(94.97), Angika(81.83), Bajjika(59.77), Urdu(42.95), Surjapuri(25.91), Bengali(6.49), Marathi(5.15), Kurumali(0.86), Bantar(0.60), Telugu(0.43), Kannada(0.35), Sadri(0.28), Tamil(0.25), Khortha(0.23), Chattisgarhi(0.21), Kortha(0.18), Konkani(0.17), Nepalese(0.14), Marwari(0.04) | 22 |
|  | 1699.93 | 9787 | 42.08 | Hindi(1273.13), Chattisgarhi(325.77), Halbi(36.67), Sadri(22.73), Surgujia(11.28), Kurukh(10.99), Awadhi(5.75), Gondi(3.59), Nagpuri(1.99), Bengali(1.76), Odia(1.56), Dorli(1.36), Marathi(0.91), Bhatri(0.90), Duruwa(0.52), Maithili(0.48), Kudukh(0.28), Kannada(0.12), English(0.09), Agariya(0.03) | 20 |
|  | 162.92 | 895 | 5.71 | Hindi(79.80), Konkani(59.62), Marathi(22.62), Bengali(0.38), Kannada(0.28), Gujarati(0.21) | 6 |
|  | 368.43 | 2206 | 14.78 | Hindi(291.39), Bengali(42), Khortha(25.95), Santali(5.38), Bhojpuri(1.13), Marathi(0.92), Angika(0.63), Magahi(0.46), Maithili(0.36), Kurumali(0.13), English(0.05), Chattisgarhi(0.04) | 12 |
|  | 1688.27 | 9655 | 32.87 | Kannada(1392.75), Hindi(94.10), Telugu(89.63), Tulu(39.57), Marathi(27.85), Urdu(26.02), Bearybashe(6.86), Tamil(5.13), Malayalam(2.31), Bengali(1.90), Bhojpuri(0.52), English(0.49), Lambadi(0.21) | 13 |
|  | 1216.77 | 6920 | 50.19 | Marathi(840.15), Hindi(352.99), Malvani(13.86), Khandeshi(5.62), Bhili(2.76), Maithili(0.57), Kannada(0.25), Bengali(0.24), Chattisgarhi(0.16), Telugu(0.07), Gujarati(0.05), Urdu(0.04) | 12 |
|  | 349.35 | 2014 | 3.23 | Rajasthani(152), Hindi(100.93), Marwari(91.27), Jaipuri(1.84), Shekhawati(0.77), Bengali(0.62), Mewari(0.56), Bagri(0.50), Harauti(0.27), Wagdi(0.25), Gujarati(0.19), English(0.09), Marathi(0.04) | 13 |
|  | 427.36 | 2413 | 12.17 | Telugu(411.86), Hindi(11.86), Urdu(1.73), Bengali(0.70), English(0.61), Lambadi(0.44), Malayalam(0.15) | 7 |
|  | 357.15 | 1997 | 7.31 | Hindi(217.14), Garhwali(115.25), Kumaoni(23.10), Bengali(0.66), Maithili(0.52), Marathi(0.49) | 6 |
|  | 1852.61 | 10858 | 51.91 | Hindi(1576.04), Bhojpuri(238.04), Khari Boli(15.97), Bundeli(8.38), Marathi(5.76), Urdu(3.93), Awadhi(1.91), Bengali(0.74), Badayuni(0.46), Maithili(0.26), Tamil(0.26), Chattisgarhi(0.25), English(0.21), Assamese(0.18), Kannada(0.13), Gujarati(0.09) | 16 |
|  | 1479.47 | 8501 | 45.24 | Bengali(1422.67), Hindi(53.71), Marathi(1.87), Sadri(0.59), Bhojpuri(0.24), Santali(0.22), Rajbangshi(0.17) | 7 |
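
As a quick sanity check, the per-state rows can be totalled and compared with the summary figures near the top of the page. The sketch below assumes each row reads as recorded hours, then speaker count, then transcribed hours; that split and the variable names are illustrative assumptions, not part of the project's tooling.

```python
# Each tuple: (recorded hours, speaker count, transcribed hours), in page order.
# Assumption: the fused numbers in each row split into these three fields.
rows = [
    (1134.82, 6417, 38.11),
    (3697.16, 21528, 45.93),
    (1699.93, 9787, 42.08),
    (162.92, 895, 5.71),
    (368.43, 2206, 14.78),
    (1688.27, 9655, 32.87),
    (1216.77, 6920, 50.19),
    (349.35, 2014, 3.23),
    (427.36, 2413, 12.17),
    (357.15, 1997, 7.31),
    (1852.61, 10858, 51.91),
    (1479.47, 8501, 45.24),
]

# Round the float sums to two decimals, matching the precision used on the page.
total_recorded = round(sum(r[0] for r in rows), 2)
total_speakers = sum(r[1] for r in rows)
total_transcribed = round(sum(r[2] for r in rows), 2)

print(f"recorded hours:    {total_recorded}")
print(f"speakers (summed): {total_speakers}")
print(f"transcribed hours: {total_transcribed}")
```

Under that reading, the recorded and transcribed totals land within 0.01 hr of the 14434.23 hr and 349.53 hr summary figures. The summed speaker count comes out slightly above the 83,174 shown in the summary, which would be expected if that figure counts unique speakers and some were recorded in more than one state (an assumption).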

The colors on the map indicate the recorded duration in hours for each state. Click on the map to go deeper.

Discovering the Linguistic Gems

As we go from district to district, recording speakers in their language of choice, we have come across some not-so-common languages, some of which may not even be specifically listed in the latest census of India.

Malvani

Malvani is often classified as a dialect of Konkani, which is the official language of Goa. However, many speakers identify it as a distinct language due to its unique characteristics and cultural significance. Malvani exhibits distinct vocabulary and pronunciation that set it apart from standard Marathi and Konkani.

"व छोटी मोठी झाडे शोभेसाठी ठेवलेली असत."

Shekhawati

Shekhawati is classified as a dialect of the Rajasthani language, primarily spoken in the Shekhawati region of Rajasthan, which includes the districts of Jhunjhunu, Sikar, and Churu. It has approximately three million speakers and shares many similarities with the Marwari dialect of Rajasthani. It is frequently grouped under the broader Rajasthani languages.

"यो एक पार्क {park} है, जमे एक बुढों आदमी अ पार्क {park} की सफाई कर रियो हैं।"

Unleash the Power of our Data!

Download the most diverse open source speech dataset for Indian languages. We encourage the use of this dataset to develop and improve your speech AI technologies and applications for an inclusive Digital India.

Alternatively, you can access and download the dataset directly from Hugging Face for easy integration into your projects.
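
For readers who want to try the Hugging Face route, here is a minimal sketch that peeks at a few examples using the `datasets` library in streaming mode. The function name `peek_examples` is hypothetical, and the repository id and config name are deliberately left as arguments: take the real values from the dataset's Hugging Face page.

```python
from itertools import islice


def peek_examples(repo_id, config_name=None, split="train", n=3):
    """Return the first n examples of a Hugging Face dataset without a full download."""
    if n <= 0:
        return []
    # Imported lazily so this module still loads where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields examples on demand instead of downloading every file.
    stream = load_dataset(repo_id, name=config_name, split=split, streaming=True)
    return list(islice(stream, n))
```

A call would look like `peek_examples("<org>/<dataset>", config_name="<config>")`, with both placeholders replaced by the values listed on the dataset page. If the dataset is gated, you may need to log in to Hugging Face first.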

Our Team

Dr. Prasanta Kumar Ghosh

Associate professor, Dept of Electrical Engineering (EE) at Indian Institute of Science (IISc), Bangalore.

Prasanta Kumar Ghosh received his Ph.D. in Electrical Engineering from University of Southern California (USC), Los Angeles, USA in 2011. During 2011-2012 he was with IBM India Research Lab (IRL) as a researcher. He was awarded the INSPIRE faculty fellowship from Department of Science and Technology (DST), Govt. of India in 2012. He was awarded Center of Excellence in Teaching’s award for excellence in teaching in the category of EE for the year 2010-11 in USC. His research interests are in human centered signal processing with applications to education and health care.

Raghu Dharmaraju

CEO, ARTPARK

He is a highly experienced innovator with over two decades of experience conceiving and scaling pioneering institutions and innovations. He has launched a portfolio of eight AI innovations, including TRACE-TB, a major national initiative. He has raised $19 million from the Gates Foundation, USAID, and Google.org and has established. Raghu has also launched and scaled award-winning med-tech innovations, including Embrace infant warmers, which have reached approximately 1,000,000 babies via WHO and non-profits, governments, and 300+ private hospitals in 100+ small towns. He has led strategy, product management, and startup operations for a new global business at Corning Environmental Technologies, and has experience in a range of industries including digital health, med-tech, agriculture, financial inclusion, circular economy, and more. Raghu holds a B.Tech. from IIT Madras, an M.S. from the University of Massachusetts, Amherst, and an M.B.A. from Cornell.

Nihar Desai

Program Lead

Nihar comes with a decade of experience spanning strategy, operations, and program management. A seasoned management generalist, he has a knack for understanding technology and is passionate about applying it to create societal impact and extract efficiencies across value chains. Having been an entrepreneur for half a decade, he has led strategy, product, and technology operations at large startups. He is adept at establishing and governing large-scale operations. He completed his B.Tech. at the National Institute of Technology, Surat, and earned his MBA from the Indian School of Business (ISB).

Frequently asked questions

If you can’t find what you’re looking for, email us and we will get back to you.

Why is capturing the language landscape of India important?

Capturing India's diverse language landscape is vital for an inclusive Digital India, as only 10% of the population speaks English. Existing language AI models may not meet the linguistic diversity of India, where languages blend continuously. Initiatives like the National Language Translation Mission and Project Vaani aim to collect authentic language data, addressing the limitations of biased language models.

Why is this data shown per district and not per language?

We believe that language in India is like a fabric whose color changes gradually from one end to the other: language, too, changes every few kilometers. With this in mind, we are collecting a dataset that is representative of each district, which may contain multiple languages. Click on State > District to see which languages we have recorded to date.

Who can use this data?

This dataset is open source and can be used by any individual or organization. Any startup is welcome to use this dataset. Feedback on the dataset is always welcome.