Please suggest any other resources you may be aware of. Raise an issue to add more resources to the catalog. Put the proposed entry in the following format:
[Wikipedia Dumps](https://dumps.wikimedia.org/)
Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource.
- Major Indic Language NLP Repositories
- Text Corpora
- Speech Corpora
- OCR Corpora
- Multimodal Corpora
- Models
- Libraries
- Technology Development for Indian Languages (TDIL)
- Center for Indian Language Technology (CFILT)
- Language Technologies Research Center (LTRC)
- Linguistic Data Consortium For Indian Languages (LDCIL)
- University of Hyderabad - Sanskrit NLP
- Wikipedia Dumps
- LDCIL Monolingual Corpus
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- WMT CommonCrawl Corpus
- WMT NEWS Crawl
- Janmabhumi Malayalam Corpus
- Leipzig Corpus
- Sanskrit Monolingual and Sandhi-split Corpus
- IndoWordNet
- IIIT-Hyderabad Word Similarity Database: 7 Indian languages
- Facebook Hindi Analogy Dataset
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- WikiAnn NER Corpus (Noisy)
- a-mma NER data
- Indian Language Corpora Initiative: Available on TDIL portal on request
- IIT Bombay English-Hindi Parallel Corpus
- OPUS corpus
- WAT 2018 Parallel Corpus
- EILMT Corpus
- Joshua Decoder Corpus
- TED Parallel Corpus
- Charles University English-Hindi Parallel Corpus
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus
- Charles University English-Urdu Religious Parallel Corpus
- WikiMatrix Corpus
- FLORES dataset
- BrahmiNet Corpus: 110 language pairs
- Xlit-Crowd: Hindi-English Transliteration Corpus
- Xlit-IITB-Par: Hindi-English Transliteration Corpus
- XNLI corpus: Hindi and Urdu test sets, plus training sets machine-translated from English MultiNLI
- IIT Bombay movie review datasets for Hindi and Marathi
- IIT Patna movie review datasets for Hindi
- IIIT-H LTRC Multi-domain dataset for Telugu
- Indian Language Corpora Initiative
- Universal Dependencies
- Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
- IIIT Hyderabad Hindi Treebank
- Universal Dependencies
- Universal Dependencies Hindi Treebank
- Universal Dependencies Urdu Treebank
- Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati
- IIT Madras TTS database
- BABEL Speech Corpus: includes some Indian languages
- English-Hindi Visual Genome: Images captioned in both English and Hindi.
- Shata-Anuvaadak: 110 language pairs
- LTRC Vanee
- Indic NLP Library: Python library for various Indian language NLP tasks such as tokenization, sentence splitting, normalization, script conversion, and transliteration
- pyiwn: Python Interface to IndoWordNet
- [Indic-OCR](https://indic-ocr.github.io/): OCR for Indic Scripts