Capturing the language landscape for an inclusive digital India

Project Vaani is one of the largest datasets of Indian dialects ever to exist. Upon completion, it will contain more than 150,000 hours of audio across all districts in India.


About Project VAANI

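Total Duration: 31255.10 hr

Total Speakers: 1,56,534

Total Languages: 109

Transcription Duration: 2043.39 hr

Districts Covered: 165

Total Images: 288429

Male Audio: 45.57 %

Female Audio: 54.37 %

Total Files: 2,20,34,051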

Digital India is marching ahead inexorably. Digital interfaces and communications have become critical for access to information, entertainment, economic opportunities and even essential services such as healthcare.

Project Vaani, by IISc, Bangalore and ARTPARK, is capturing the true diversity of India’s spoken languages to propel language AI technologies and content for an inclusive Digital India.

We expect to create data corpora of over 150,000 hours of speech, part of which will be transcribed in local scripts, while ensuring linguistic, educational, urban-rural, age, and gender diversity (among other potential diversity characteristics). These diligently collected and curated datasets of natural speech and text from about 1 million people across all 773 districts of India will be open-sourced. The current version of the data is open-sourced here. Going forward, we hope to open-source the data through platforms like Bhashini (under the National Language Translation Mission, MeitY).

This will boost the development of technologies such as automatic speech recognition (ASR), speech-to-speech translation (SST), and natural language understanding (NLU) that reflect the ground realities of how Indians speak.

Google is funding Project Vaani.

Explore India’s Vernacular Vista with Vaani

Explore the linguistic diversity of India in a click. Our dataset is intended to be a treasure trove of speech data from across India’s districts. It offers a comprehensive overview of speech data from all districts, emphasizing the language variety in each district, providing a unique glimpse into India’s rich cultural tapestry. Discover the richness of India’s linguistic landscape and delve into the statistics that bring our nation’s linguistic diversity to life.

Want to hear the voices? Click on a state.

Record Duration (hrs)	Speaker Count	Transcription Duration (hrs)	Languages (hrs)	Language Count
1583.19	7957	105.72	Telugu(1392.82), Hindi(153.19), Urdu(19.59), English(8.19), Bengali(3.33), Marathi(2.56), Tamil(2.51), Kannada(0.71), Maithili(0.18), Vagavedu(0.06), Chittoor(0.05)	11
649.07	3171	52.47	Hindi(476.72), Wancho(121.40), English(30.44), Idu mishmi(9.95), Nyishi(4.86), Assamese(2.04), Tagin(1.99), Bengali(0.54), Galo(0.51), NissiDafla(0.34), Nepali(0.27)	11
751.42	4230	60.01	Assamese(354.21), Bengali(193.28), Karbi(85.69), Hindi(68.67), English(48.41), Sylheti(0.61), Odia(0.22), Sadri(0.20), Nepali(0.13)	9
5040.22	25189	252.77	Hindi(3993.07), Maithili(466.82), Bhojpuri(194.55), Magahi(101.18), Angika(100.25), Bajjika(84.76), Urdu(49.22), Surjapuri(25.78), Bengali(8.64), Marathi(5.48), English(4.32), Bihari(1.70), Kurmali(0.96), Thethi(0.73), Santali(0.73), Meitei(0.53), Khortha(0.52), Chhattisgarhi(0.27), Konkani(0.25), Bhatri(0.25), Kudmali(0.17), Marwari(0.04)	22
206.74	1002	21.33	Hindi(187.66), Punjabi(11.58), English(6.79), Chhattisgarhi(0.71)	4
2297.15	11154	112.54	Hindi(1753.38), Chhattisgarhi(407.12), Halbi(54.70), Sadri(26.07), Surgujia(16.95), Kurukh(11.41), Gondi(9.58), Awadhi(5.71), Bengali(3.58), Oriya(1.78), Dorli(1.32), Bhatri(1.21), Marathi(1.21), Nagpuri(0.89), English(0.60), Maithili(0.60), Duruwa(0.50), Kudukh(0.33), Agariya(0.16), Odia(0.06)	20
205.15	972	10.42	Hindi(201.57), Punjabi(2.07), English(0.81), Odia(0.48), Bengali(0.22)	5
187.76	919	14.00	Hindi(97.21), Konkani(67.64), Marathi(21.89), Bengali(0.59), Kannada(0.32), English(0.11)	6
387.19	2014	31.39	Gujarati(292.34), Hindi(83.29), English(9.04), Marathi(2.12), Rajasthani(0.24), Bengali(0.16)	6
596.48	2795	45.60	Hindi(554.21), Haryanvi(38.29), English(3.81), Punjabi(0.17)	4
201.51	1020	17.14	Hindi(196.02), English(3.19), Pahadi(1.25), Punjabi(0.99), Kashmiri(0.06)	5
110.93	510	0.00	Urdu(45.84), Kashmiri(32.80), Hindi(25.56), English(6.74)	4
1310.84	6242	108.97	Hindi(1209.26), Bengali(41.74), Khortha(25.58), Magadhi(6.68), Santali(6.09), Bhojpuri(5.49), English(3.60), KhorthKhotta(3.40), Magahi(3.12), Khorth(2.87), Angika(0.81), Marathi(0.70), Maithili(0.46), MagadhiMagahi(0.44), Kurmali(0.23), Sadri(0.21), Chhattisgarhi(0.14)	17
2638.69	13127	163.20	Kannada(2243.39), Hindi(123.55), Telugu(104.64), Urdu(42.55), Tulu(39.60), Marathi(36.64), English(15.77), Tamil(14.21), Bearybashe(6.89), Bengali(3.14), Malayalam(2.65), Punjabi(2.63), Odia(1.52), Assamese(0.75), Bhojpuri(0.30), Lambani(0.23), Sumi(0.17), Beary bashe(0.06)	18
353.47	1637	42.08	Malayalam(349.24), English(3.51), Tamil(0.56), Hindi(0.09), Paniya(0.07)	5
806.88	4049	63.92	Hindi(781.07), Nimadi(16.12), Sindhi(3.51), Bagheli(2.68), Malvi(1.38), English(0.65), Urdu(0.57), Bhili(0.25), Haryanvi(0.24), Bundeli(0.21), Dogari(0.20)	11
1886.68	8988	104.13	Marathi(1043.48), Hindi(799.21), Malvani(15.87), English(10.09), Khandeshi(5.66), Gujarati(4.41), Bhili(2.81), Powari(1.50), Urdu(1.21), Kannada(0.76), Kashmiri(0.73), Bengali(0.54), Maithili(0.22), Chhattisgarhi(0.19)	14
217.78	979	3.38	Manipuri(206.19), Hindi(3.35), Tangkhul(2.15), English(2.05), Nepali(1.59), Rongmei(1.54), Bengali(0.54), Meitei(0.20), Liangmai(0.17)	9
478.07	2663	47.77	Garo(471.01), Hindi(4.05), Bengali(0.79), Nagamese(0.77), English(0.63), Hajong(0.40), Chakhesang(0.25), Assamese(0.17)	8
220.00	1030	23.94	Mizo(201.50), Nagamese(6.97), English(5.51), Hindi(4.16), Bengali(0.49), Thadou(0.45), Telugu(0.39), Vaiphei(0.25), Manipuri(0.25), Mara(0.05)	10
430.94	2671	31.60	Nagamese(267.61), English(97.00), Lotha(22.10), Sumi(9.59), Angami(8.45), Hindi(5.52), Ao(4.69), Chakhesang(4.47), Rengma(4.36), Rongmei(2.40), Tenyidie(1.32), Yimchunger(1.18), Sangtam(0.97), Zeme(0.53), Liangmei(0.38), Phom(0.23), Assamese(0.10), Kuki(0.03)	18
741.35	3637	69.52	Odia(589.29), Hindi(84.29), Sambalpuri(35.29), Bengali(14.51), Desia(7.38), English(6.15), Koya(2.64), Telugu(0.78), Bhatri(0.25), Bagheli(0.24), Sirmauri(0.18), Awadhi(0.15), Bagri(0.13), Baghati(0.08)	14
622.50	2841	67.71	Hindi(387.80), Punjabi(219.07), English(14.80), Sindhi(0.46), Telugu(0.22), Dogri(0.14)	6
948.34	4643	69.33	Hindi(590.32), Rajasthani(189.32), Marwari(154.88), Marwadi(4.55), Jaipuri(2.31), English(2.30), Shekhawati(1.08), Bengali(0.85), Bagri(0.63), Wagdi(0.54), Mewati(0.35), Harauti(0.34), Mewari(0.30), Gujarati(0.27), Punjabi(0.18), Marathi(0.13)	16
216.25	948	18.12	Nepali(182.12), English(17.77), Lepcha(7.58), Hindi(5.95), Sikkimese(2.54), Limbu(0.23), Bengali(0.06)	7
834.14	4568	23.64	Tamil(833.09), English(1.06)	2
877.69	4196	60.58	Telugu(840.25), Hindi(20.89), English(12.04), Urdu(2.67), Bengali(0.76), Tamil(0.47), Lambani(0.43), Malayalam(0.17)	8
676.32	3760	68.60	Chakma(485.75), Bengali(135.10), Kokborok(43.59), Hindi(10.36), English(1.52)	5
414.47	2065	24.95	Hindi(247.51), Garhwali(136.80), Kumaoni(28.06), Bengali(0.83), Maithili(0.73), Marathi(0.55)	6
3031.79	15607	185.37	Hindi(2683.08), Bhojpuri(290.65), Khariboli(15.95), Urdu(11.46), Bundeli(10.10), Marathi(7.49), Khari boli(4.15), English(4.13), Awadhi(2.99), Bengali(0.72), Maithili(0.35), Chhattisgarhi(0.27), Kannada(0.25), Badayuni(0.09), Gujarati(0.09)	15
2332.07	11971	143.19	Bengali(1967.53), Nepali(236.08), Hindi(117.47), English(7.35), Marathi(2.52), Bhojpuri(0.55), Santali(0.22), Rajbanshi(0.19), Sadri(0.09), Rajbangshi(0.05)	10
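For readers who want to work with these figures programmatically, the following is a minimal sketch of how one row of the table could be parsed. It assumes the rows are tab-separated exactly as shown and that each language entry follows the "Name(hours)" pattern; the function names `parse_languages` and `parse_row` are illustrative and not part of any Project Vaani tooling.

```python
import re

# Matches one "Name(hours)" entry, e.g. "Idu mishmi(9.95)".
ENTRY = re.compile(r"\s*([^,()]+?)\s*\(([\d.]+)\)")


def parse_languages(field: str) -> dict[str, float]:
    """Turn 'Hindi(187.66), Punjabi(11.58)' into {'Hindi': 187.66, ...}."""
    return {name: float(hours) for name, hours in ENTRY.findall(field)}


def parse_row(line: str) -> dict:
    """Split one tab-separated table row into typed fields (assumed layout)."""
    duration, speakers, transcribed, languages, count = line.split("\t")
    return {
        "record_hours": float(duration),
        "speakers": int(speakers),
        "transcribed_hours": float(transcribed),
        "languages": parse_languages(languages),
        "language_count": int(count),
    }


# The fifth data row of the table, copied verbatim.
row = parse_row(
    "206.74\t1002\t21.33\t"
    "Hindi(187.66), Punjabi(11.58), English(6.79), Chhattisgarhi(0.71)\t4"
)

# Language with the most recorded hours, and the sum across all languages.
top = max(row["languages"], key=row["languages"].get)
total = sum(row["languages"].values())
print(top, round(total, 2), len(row["languages"]) == row["language_count"])
```

For this row, the per-language hours add up to the record duration of 206.74 hours, and the number of parsed entries matches the Language Count column. Note that some names appear under more than one spelling in the table (for example "Khariboli" and "Khari boli"), so any aggregation across rows would need its own normalization step.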

The colors on the map indicate the recorded duration in hours for each state. Click on the map to go deeper.

Discovering the Linguistic Gems

As we go from district to district, collecting speech in each speaker's language of choice, we have come across some not-so-common languages. Some of them may not even be specifically listed in the latest census of India.

Tulu

Tulu is a Dravidian language spoken primarily in the coastal regions of Karnataka and Kerala. It has a rich oral tradition and distinct phonetic features that set it apart from other Dravidian languages, reflecting the unique cultural identity of the Tulu-speaking community.

"ಮಾತಲ ಐನ್ ಉಲಯಿ ಉಪ್ಪುಂಡು ಉಂಡು ಪಂಡ್ ದು ನೆಟ್ಟ್ ತೋಜು ಪೋಪುಂಡು ಮಸ್ತ್ ಜನ ಉಪ್ಪುವೆರ್ ನೆಟ್ಟ್."

Bearybashe

Bearybashe, also known as Beary or Byari, is a language spoken by the Beary community in coastal Karnataka. It incorporates elements from Kannada, Tulu, and Urdu, making it a unique linguistic blend that reflects the diverse cultural influences in the area.

"ಇದೊರ್‌ ತಾಲ್ಲೂಕ್‌ ಆಪಿಸ್‌{office} ಆಯಿಟಿಕ್ಕ್‌ರ್‌ ನಂಗ ಎಂದ್ರೆ ರೇಷನ್‌{ration} ಕಾರ್ಡ್‌{card} ಎಲ್ಲ ಆಧಾರ್‌{Adhar} ಕಾರ್ಡ್‌{card} ಎಲ್ಲ ಆಕನೆಂಗ್ ಇವ್ಡೆಗೇ ಪೋಂಡೆ."

Unleash the Power of our Data!

Our comprehensive multimodal dataset for Indian languages is now exclusively available on Hugging Face! Access and download the dataset directly for seamless integration and ease of use in your AI projects.

The data from Project Vaani is available under the CC-BY-4.0 license.

Our Team

Dr. Prasanta Kumar Ghosh

Associate Professor, Department of Electrical Engineering (EE), Indian Institute of Science (IISc), Bangalore.

Prasanta Kumar Ghosh received his Ph.D. in Electrical Engineering from the University of Southern California (USC), Los Angeles, USA in 2011. During 2011-2012 he was a researcher with the IBM India Research Lab (IRL). He was awarded the INSPIRE faculty fellowship from the Department of Science and Technology (DST), Govt. of India in 2012. At USC, he received the Center of Excellence in Teaching’s award for excellence in teaching in the EE category for the year 2010-11. His research interests are in human-centered signal processing with applications to education and health care.

Raghu Dharmaraju

CEO, ARTPARK

He is a highly experienced innovator with over two decades of experience conceiving and scaling pioneering institutions and innovations. He has launched a portfolio of eight AI innovations, including TRACE-TB, a major national initiative. He has raised $19 million from the Gates Foundation, USAID, and Google.org. Raghu has also launched and scaled award-winning med-tech innovations, including Embrace infant warmers, which have reached approximately 1,000,000 babies via WHO and non-profits, governments, and 300+ private hospitals in 100+ small towns. He has led strategy, product management, and startup operations for a new global business at Corning Environmental Technologies, and has experience in a range of industries including digital health, med-tech, agriculture, financial inclusion, circular economy, and more. Raghu holds a B.Tech. from IIT Madras, an M.S. from the University of Massachusetts, Amherst, and an M.B.A. from Cornell.

Nihar Desai

Program Lead

Nihar comes with a decade of experience spanning strategy, operations, and program management. A seasoned management generalist, he has a knack for understanding technology and is passionate about applying it to create societal impact and extract efficiencies across value chains. Having been an entrepreneur for half a decade, he has led strategy, product, and technology operations at large startups. He is adept at establishing and governing large-scale operations. He completed his B.Tech. at the National Institute of Technology, Surat and earned his MBA from the Indian School of Business (ISB).

Vaani in the News

Stay updated with Project Vaani’s media appearances. Read about our mission, our work, and our impact as covered by the press.

Frequently asked questions

If you can’t find what you’re looking for, email us and we will get back to you.

Why is capturing the language landscape of India important?

Capturing India's diverse language landscape is vital for an inclusive Digital India, as only 10% of the population speaks English. Existing language AI models may not reflect the linguistic diversity of India, where languages blend continuously. Initiatives like the National Language Translation Mission and Project Vaani aim to collect authentic language data, addressing the limitations of biased language models.

Why is this data shown per district and not per language?

We believe that language in India is like a fabric whose color changes gradually from one end to the other: the way people speak shifts every few kilometers. Following this view, we are collecting a dataset that is representative of each district, which may contain multiple languages. Click on State > District to see which languages we have recorded to date.

Who can use this data?

This dataset is open source and can be used by any individual or organization. Any startup is welcome to use this dataset. Feedback on the dataset is always welcome.
