Capturing the language landscape for an inclusive digital India
Project Vaani is one of the largest datasets of Indian dialects ever to exist. Upon completion, it will contain more than 150,000 hours of audio across all districts in India.
About Project VAANI
14434.23 hr
83,174
59
349.53 hr
80
12
129028
43.90 %
56.10 %
88,34,039
Digital India is marching ahead inexorably. Digital interfaces and communications have become critical for access to information, entertainment, economic opportunities and even essential services such as healthcare.
Project Vaani, by IISc, Bangalore and ARTPARK, is capturing the true diversity of India’s spoken languages to propel language AI technologies and content for an inclusive Digital India.
We expect to create data corpora of over 150,000 hours of speech, part of which will be transcribed in local scripts, while ensuring linguistic, educational, urban-rural, age, and gender diversity (among other potential diversity characteristics). These diligently collected and curated datasets of natural speech and text from about 1 million people across all 773 districts of India will be open-sourced. The current version of the data is open-sourced here. Going forward, we hope to release the data through platforms like Bhashini (under the National Language Translation Mission, MeitY).
This will boost the development of technologies such as automatic speech recognition (ASR), speech to speech translation (SST), and natural language understanding (NLU) that reflect the ground realities of how Indians speak.
Google is funding Project Vaani.
Explore India’s Vernacular Vista with Vaani
Explore the linguistic diversity of India in a click. Our dataset is intended to be a treasure trove of speech data from across India’s districts. It offers a comprehensive overview of speech data from all districts, emphasizing the language variety in each district and providing a unique glimpse into India’s rich cultural tapestry. Discover the richness of India’s linguistic landscape and delve into the statistics that bring our nation’s linguistic diversity to life.
| State | Recorded Duration (hrs) | Speaker Count | Transcription Duration (hrs) | Languages | Language Count |
|---|---|---|---|---|---|
| | 1134.82 | 6417 | 38.11 | Bengali, Bhojpuri, English, Hindi, Kannada, Maithili, Marathi, Santali, Tamil, Telugu, Urdu | 11 |
| | 3697.16 | 21528 | 45.93 | Angika, Bajjika, Bantar, Bengali, Bhojpuri, Chattisgarhi, Hindi, Kannada, Khortha, Konkani, Kortha, Kurumali, Magahi, Maithili, Marathi, Marwari, Nepalese, Sadri, Surjapuri, Tamil, Telugu, Urdu | 22 |
| | 1699.93 | 9787 | 42.08 | Agariya, Awadhi, Bengali, Bhatri, Chattisgarhi, Dorli, Duruwa, English, Gondi, Halbi, Hindi, Kannada, Kudukh, Kurukh, Maithili, Marathi, Nagpuri, Odia, Sadri, Surgujia | 20 |
| | 162.92 | 895 | 5.71 | Bengali, Gujarati, Hindi, Kannada, Konkani, Marathi | 6 |
| | 368.43 | 2206 | 14.78 | Angika, Bengali, Bhojpuri, Chattisgarhi, English, Hindi, Khortha, Kurumali, Magahi, Maithili, Marathi, Santali | 12 |
| | 1688.27 | 9655 | 32.87 | Bearybashe, Bengali, Bhojpuri, English, Hindi, Kannada, Lambadi, Malayalam, Marathi, Tamil, Telugu, Tulu, Unknown, Urdu | 14 |
| | 1216.77 | 6920 | 50.19 | Bengali, Bhili, Chattisgarhi, Gujarati, Hindi, Kannada, Khandeshi, Maithili, Malvani, Marathi, Telugu, Urdu | 12 |
| | 349.35 | 2014 | 3.23 | Bagri, Bengali, English, Gujarati, Harauti, Hindi, Jaipuri, Marathi, Marwari, Mewari, Rajasthani, Shekhawati, Wagdi | 13 |
| | 427.36 | 2413 | 12.17 | Bengali, English, Hindi, Lambadi, Malayalam, Telugu, Urdu | 7 |
| | 357.15 | 1997 | 7.31 | Bengali, Garhwali, Hindi, Kumaoni, Maithili, Marathi | 6 |
| | 1852.61 | 10858 | 51.91 | Assamese, Awadhi, Badayuni, Bengali, Bhojpuri, Bundeli, Chattisgarhi, English, Gujarati, Hindi, Kannada, Khari Boli, Maithili, Marathi, Tamil, Urdu | 14 |
| | 1479.47 | 8501 | 45.24 | Bengali, Bhojpuri, Hindi, Marathi, Rajbangshi, Sadri, Santali | 7 |
The colors on the map indicate the recorded duration in hours for each state. Click on the map to go deeper.
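As a rough illustration of how the per-state figures in the table roll up, the sketch below sums the recorded hours, speaker counts, and transcribed hours row by row. The names `STATE_ROWS` and `summarize` are hypothetical and not part of any Vaani tooling; the numbers are copied from the table in the order the rows appear. The speaker total is a plain sum over states, so it may differ from a count of unique speakers.

```python
import math

# (recorded_hours, speaker_count, transcribed_hours), one tuple per table row.
# Values are copied from the table; the list name is an assumption for this sketch.
STATE_ROWS = [
    (1134.82, 6417, 38.11),
    (3697.16, 21528, 45.93),
    (1699.93, 9787, 42.08),
    (162.92, 895, 5.71),
    (368.43, 2206, 14.78),
    (1688.27, 9655, 32.87),
    (1216.77, 6920, 50.19),
    (349.35, 2014, 3.23),
    (427.36, 2413, 12.17),
    (357.15, 1997, 7.31),
    (1852.61, 10858, 51.91),
    (1479.47, 8501, 45.24),
]


def summarize(rows):
    """Return totals across rows plus the share of recorded audio that is transcribed."""
    recorded = math.fsum(row[0] for row in rows)
    speakers = sum(row[1] for row in rows)
    transcribed = math.fsum(row[2] for row in rows)
    return {
        "recorded_hours": recorded,
        "speakers": speakers,
        "transcribed_hours": transcribed,
        "transcribed_share": transcribed / recorded,
    }


summary = summarize(STATE_ROWS)
print(f"Recorded:    {summary['recorded_hours']:.2f} hr")
print(f"Speakers:    {summary['speakers']}")
print(f"Transcribed: {summary['transcribed_hours']:.2f} hr "
      f"({summary['transcribed_share']:.2%} of recorded audio)")
```

The recorded and transcribed totals come out at roughly 14,434 and 349.5 hours, in line with the headline figures near the top of the page, which suggests only a small fraction of the recorded audio has been transcribed so far.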
Our Team
Dr. Prasanta Kumar Ghosh
Associate professor, Dept of Electrical Engineering (EE) at Indian Institute of Science (IISc), Bangalore.
Prasanta Kumar Ghosh received his Ph.D. in Electrical Engineering from the University of Southern California (USC), Los Angeles, USA in 2011. During 2011-2012 he was a researcher at IBM India Research Lab (IRL). He was awarded the INSPIRE faculty fellowship by the Department of Science and Technology (DST), Govt. of India in 2012. At USC, he received the Center for Excellence in Teaching’s award for excellence in teaching in the EE category for 2010-11. His research interests are in human-centered signal processing with applications to education and health care.
Raghu Dharmaraju
CEO Artpark
He is a highly experienced innovator with over two decades of experience conceiving and scaling pioneering institutions and innovations. He has launched a portfolio of eight AI innovations, including TRACE-TB, a major national initiative. He has raised $19 million from the Gates Foundation, USAID, and Google.org. Raghu has also launched and scaled award-winning med-tech innovations, including Embrace infant warmers, which have reached approximately 1,000,000 babies via WHO and non-profits, governments, and 300+ private hospitals in 100+ small towns. He has led strategy, product management, and startup operations for a new global business at Corning Environmental Technologies, and has experience in a range of industries including digital health, med-tech, agriculture, financial inclusion, circular economy, and more. Raghu holds a B.Tech. from IIT Madras, an M.S. from the University of Massachusetts, Amherst, and an M.B.A. from Cornell.
Nihar Desai
Program Lead
Nihar comes with a decade of experience spanning strategy, operations and program management. A seasoned management generalist, he has a knack for understanding technology and is passionate about applying it to create societal impact and extract efficiencies across value chains. Having been an entrepreneur for half a decade, he has led strategy, product and technology operations at large startups. He is adept at establishing and governing large-scale operations. He earned his B.Tech from the National Institute of Technology, Surat and his MBA from the Indian School of Business (ISB).
Frequently asked questions
If you can’t find what you’re looking for, email us and we will get back to you.
Why is capturing the language landscape of India important?
Capturing India's diverse language landscape is vital for an inclusive Digital India, as only 10% of the population speaks English. Existing language AI models may not meet the linguistic diversity of India, where languages blend continuously. Initiatives like the National Language Translation Mission and Project Vaani aim to collect authentic language data, addressing the limitations of biased language models.
Why is this data shown per district and not per language?
We believe that language in India is more like a fabric whose color changes gradually as we move across it. Similarly, language changes every few kilometers. With this school of thought, we are collecting a dataset that is representative of each district, which may contain multiple languages. Click on State>District to see which languages we have recorded to date.
Who can use this data?
This dataset is open source and can be used by any individual or organization. Any startup is welcome to use this dataset. Feedback on the dataset is always welcome.