View the project on Hugging Face here.
Project Baraat is an open-source initiative to leverage the power of LLMs for Indic-NLP tasks. We aim to build continually pre-trained, task-specific language models in a Mixture of Experts (MoE) setup. We plan to build a multilingual and cross-lingual LLM that is:
1) Pre-trained on a large text corpus drawn from varied sources of knowledge, including crawled Wikipedia articles, textbooks, news, social media sites, and magazines.
2) Fine-tuned on different downstream tasks. We first train a 7B LLaMa-2 model on a text corpus in the target language and save it as a base model. The following downstream tasks will be incorporated in the fine-tuning process (a routing sketch follows the note below):
- Machine Translation
- Mathematical and Logical Reasoning
- Question Answering
- Instruct Fine-Tuning
Note
This list is subject to change, and a few tasks may be added over time.
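As a rough illustration of how task-specific experts could be combined, here is a minimal PyTorch sketch of a gating layer that routes a pooled prompt embedding to one of the fine-tuned experts. The expert names, hidden size, and classifier are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of routing a prompt to task-specific experts (illustrative only).
import torch
import torch.nn as nn

class TaskRouter(nn.Module):
    """Scores a pooled prompt embedding against the available task experts."""
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        # Softmax over experts; the argmax picks which fine-tuned model answers.
        return torch.softmax(self.gate(pooled_embedding), dim=-1)

# Hypothetical expert set mirroring the downstream tasks listed above.
experts = ["translation", "math_reasoning", "question_answering", "instruct"]
router = TaskRouter(hidden_size=4096, num_experts=len(experts))

pooled = torch.randn(1, 4096)  # stand-in for a pooled prompt embedding
chosen = experts[router(pooled).argmax(dim=-1).item()]
print(f"Route prompt to the '{chosen}' expert")
```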
| Model Tutorial | Notebook Link |
|---|---|
| Baraat-hindi-experts | |
Project Baraat is dedicated to making indigenous (regional) languages more accessible. With a focus on the rich linguistic diversity of India, the project aims to break language barriers and promote inclusivity through technology.
| Model Name | Description | Dataset Link |
|---|---|---|
| Baraat-hindi-pretrained | Base model pre-trained on a diverse collection of datasets: • IndicCorp: A multilingual corpus covering 9 major Indic languages for various NLP tasks. • Hindi Wikipedia Articles (172K): A dataset containing 172,000 Hindi Wikipedia articles. • Hindi Corpus from Leipzig University: A Hindi corpus provided by the University of Leipzig. • Animals: A Visual Encyclopedia: An encyclopedia of general animal sentences. • Augmented rows using Bing AI to include worldly knowledge such as fruits, vegetables, and animals. | Link |
| Baraat-kannada-pretrained | Base model pre-trained on a diverse collection of datasets: • IndicCorp: A multilingual corpus covering 9 major Indic languages for various NLP tasks. • Kannada Corpus from Leipzig University: A Kannada corpus provided by the University of Leipzig. | Link |
- Tokenizers for Indian Languages: Robust tokenization tools tailored for the unique structures of regional Indian languages.
- Fine-tuned Language Models: Leveraging the power of Large Language Models (LLMs) fine-tuned for Indian languages to understand and generate text with high accuracy.
- Open Source Collaboration: We believe in the collective power of the community to drive innovation and inclusivity. 🤝
- High Quality Datasets: Take a look at our suite of cleaned datasets ready for your own downstream training purposes.
Our mission is to promote the spirit of building accessible models in native languages, fostering a world where technology speaks everyone's language. 🗣️
- ✅ Prepare and set up dataset
- ✅ Prepare and set up tokenizers
- ✅ Start pre-training
- ✅ Fine-tune models
- ✅ Implement gating mechanism
- ✅ Implement MoE
- ✅ Simple demo
Foundational model: LLaMa-2 7B
P.S. The project is still in its early stages and this is a Proof of Concept implementation for Hindi.
Demo video: Baraat.Small.Demo.mp4
- The model is sensitive to the prompts passed to it, a behaviour prevalent in a wide variety of LLMs today. We aim to train our suite of models for longer, with evaluation steps.
- The project is being worked on actively and is currently undergoing an update. All utility files are provided in the source directory.
In the future, we aim to expand Project Baraat's capabilities beyond text to include support for images and audio, enabling multimodal learning techniques.
We plan to develop a pipeline for dataset cleaning, leveraging small models like stabilityai/stablelm-zephyr-3b or microsoft/phi-2 for automated data cleaning processes.
We intend to introduce an additional fine-tuning step to enhance the model's reasoning ability, integrating techniques for logical reasoning and inference using datasets like meta-math/MetaMathQA or microsoft/orca-math-word-problems-200k. We plan to release translated versions of these datasets to facilitate research in mathematical reasoning and question answering across diverse linguistic communities.
We welcome open-source contributions! Whether you're a coder, a linguist, or just someone passionate about language accessibility, there's a place for you in Project Baraat. Here's how you can get involved:
- Star and Fork: Give us a star ⭐ on GitHub and fork the repository to start contributing.
- Issue Tracker: Report bugs or suggest new features by creating an issue.
- Pull Requests: Submit your pull requests with new features, bug fixes, or documentation enhancements.
Check out our CONTRIBUTING.md for more detailed guidelines.
We partition sentences from datasets into chunks with a predetermined maximum word count. Merging consecutive sentences into longer rows gives the model more content per example, improving the efficacy of continual pretraining. This can be applied to any dataset to combine sentences and produce a new dataset with more content per row.
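The sketch below illustrates the chunking idea under simple assumptions; the function name and word limit are illustrative, not the project's actual utilities.

```python
# Minimal sketch: merge consecutive sentences until a maximum word count is
# reached, yielding longer rows for continual pretraining.
def chunk_sentences(sentences, max_words=256):
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk if adding this sentence would exceed the limit.
        if count + len(words) > max_words and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

rows = ["पहला वाक्य।", "दूसरा वाक्य।", "तीसरा वाक्य।"]
print(chunk_sentences(rows, max_words=8))
```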
A token counting mechanism has been integrated that can quantify the number of tokens in any given dataset for any given tokenizer. This serves as a fundamental tool for analyzing token distributions and understanding vocabulary sizes across datasets. We built it by modifying Sayak Paul's count-tokens-hf-datasets project; Google Cloud is no longer required, and the entire process can be performed locally.
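As an illustration, local token counting can be done with the Hugging Face datasets and transformers libraries. The tokenizer checkpoint and data file below are placeholders; this is a sketch rather than the project's exact utility.

```python
# Sketch: count tokens locally for any dataset and tokenizer (placeholder paths).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # swap in your tokenizer
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]  # swap in your corpus

def count_tokens(batch):
    # Tokenize a batch of rows and record the length of each token sequence.
    return {"num_tokens": [len(ids) for ids in tokenizer(batch["text"])["input_ids"]]}

counted = dataset.map(count_tokens, batched=True)
print("Total tokens:", sum(counted["num_tokens"]))
```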
We also visualize token distributions within individual sentences of datasets. Additionally, a binning process has been implemented to enhance the interpretability of token distribution patterns. These enhancements provide valuable insights into the structural characteristics of textual data, benefiting both researchers and practitioners.
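A simple way to bin and plot those per-row counts, assuming the `counted` dataset from the sketch above, could look like this (matplotlib shown for illustration):

```python
# Sketch: bin per-row token counts into a histogram for quick inspection.
import matplotlib.pyplot as plt

lengths = counted["num_tokens"]
plt.hist(lengths, bins=20)
plt.xlabel("Tokens per row")
plt.ylabel("Number of rows")
plt.title("Token distribution across the dataset")
plt.savefig("token_distribution.png")
```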
Project Baraat is released under the MIT License.
If you like Project Baraat, please consider starring the repository and sharing it with your network!
Made with ❤️ by Team Baraat,
Akash Kamalesh, Anirudh Lakhotia, and Tanistha Hota, PES University, Bengaluru.