
Semblance Halo LSTM Model

Description

This project is a lightweight language model based on an LSTM architecture, built for educational purposes.
It serves as an exploration into the fundamentals of recurrent neural networks and their application in processing sequential data.
The RNN model with the given architecture has a total of 4,245,264 trainable parameters. This includes parameters from the embedding layer, the LSTM layer, and the linear decoder layer.
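
As a rough illustration, an embedding → LSTM → linear decoder stack of this kind can be expressed in PyTorch as below. The vocabulary size, embedding size, hidden size, and layer count are assumptions for the sketch, not the exact values used in train_halo.py.

import torch
import torch.nn as nn

class HaloLSTM(nn.Module):
    # Embedding -> LSTM -> linear decoder, as described above.
    # vocab_size, emb_dim, hidden_dim, and num_layers are placeholder values.
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embedding(tokens)          # (batch, seq_len, emb_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.decoder(out)            # (batch, seq_len, vocab_size)
        return logits, hidden

model = HaloLSTM()
total = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {total:,}")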

Installation

To get started with this project:

git clone https://github.com/Jasiel-Stark8/semblance_halo.git
cd semblance_halo
pip install -r requirements.txt

Usage

After installation, the model can be trained using the following command:

python train_halo.py
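
train_halo.py holds the actual training code; purely as a hedged sketch, a next-token training loop for a model like the one above typically looks like this (the `model` and `train_batches` names, epoch count, and learning rate are assumptions, not the repository's exact settings).

import torch
import torch.nn as nn

# Assumed: `model` is a HaloLSTM-style module and `train_batches` yields
# (inputs, targets) integer tensors of shape (batch, seq_len); both are placeholders.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for inputs, targets in train_batches:
        optimizer.zero_grad()
        logits, _ = model(inputs)  # (batch, seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")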

Dataset

The dataset used for training the Semblance Halo model is a synthetic text corpus generated using a GPT-based language model. The dataset is designed to simulate a natural language processing task and contains text relevant to COVID-19 discussions, representing a range of perspectives on the pandemic.

Composition

The dataset has been updated and now includes approximately 327,235 words (2,710,711 characters) of clean text, free of special characters except for basic punctuation and numerical values. This text serves as the training data for our LSTM model.
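
For reference, counts like these can be reproduced with a small script; the path below is a placeholder, since the dataset itself is not distributed with the repository.

# Rough word/character counts for a plain-text corpus.
# "data/train.txt" is a placeholder path, not the actual dataset file.
from pathlib import Path

text = Path("data/train.txt").read_text(encoding="utf-8")
print(f"words: {len(text.split()):,}")
print(f"characters: {len(text):,}")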

  • Synthetic data is very useful! GPT-3.5-turbo-16k generated a good portion of the data. Not only did it reduce preprocessing time, but with high-quality data and strictly controlled output from GPT, model hallucination was cut significantly. This approach has already been proven by Microsoft's Orca-2 model, whose training data was generated by GPT-4.
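
The exact generation prompts are not part of this repository; purely as an illustration, synthetic text of this kind can be requested from the OpenAI API roughly as follows. The prompt, output file name, and append-only loop are made up for the example and are not the actual generation script.

# Illustrative only: not the actual script used to build this dataset.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = ("Write a short, clean paragraph discussing COVID-19 from the "
          "perspective of a healthcare worker. Use only basic punctuation.")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": prompt}],
)

with open("synthetic_corpus.txt", "a", encoding="utf-8") as f:
    f.write(resp.choices[0].message.content + "\n")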

A separate validation set of 18,612 words (150,886 characters) is used to evaluate the model's performance. The validation set is generated using the same method as the training data to ensure consistency.

Preprocessing

Both the training and validation datasets underwent preprocessing steps which included tokenization, numericalization, and batching. The tokenization process converts text into individual word tokens, numericalization maps each token to a unique integer, and batching groups the data into subsets for model training and evaluation.
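
In outline, those three steps could look like the following. Whitespace tokenization and the (batch, seq_len) block shape are simplifying assumptions for the sketch; the repository code may differ.

import torch

def preprocess(text, seq_len=32, batch_size=64):
    # Tokenization: split the corpus into word tokens.
    tokens = text.split()
    # Numericalization: map each unique token to an integer id.
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = torch.tensor([vocab[tok] for tok in tokens])
    # Batching: trim to a whole number of (batch_size, seq_len) blocks.
    n = (len(ids) // (batch_size * seq_len)) * batch_size * seq_len
    batches = ids[:n].view(-1, batch_size, seq_len)
    return batches, vocab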

Availability

Part of the dataset was curated specifically for this project and is not publicly available. It was generated in a controlled environment to facilitate the development and testing of the LSTM model in a manner that respects privacy and ethical considerations.

For more information on the generation and preprocessing of the dataset, please refer to the code documentation within the project repository.

I am currently, and excitedly, experimenting with the model, and I hope to find solid ground and a personal breakthrough on how to interact with it. Results show it is working well, but that needs to be pushed further with interaction.

Contributing

Contributions to improve the project are welcome. Please follow these steps:

  1. Fork the repo
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Credits

This project was inspired by the work on LSTM networks by Hochreiter & Schmidhuber (1997).
Microsoft - whose proven concept of using LLMs to generate synthetic training data (Orca-2) is borrowed here.

Fun comments from the Author

This is my first step and I have officially joined the ML train! Haha, I did this for fun and to enjoy the learning process. What I thought would take a month took just 24 hours. Yes, if you're an expert, don't belittle me: I have less than a year of experience in tech and I was able to build this. Yippee! I kinda learn better by doing.

Any opportunity to be part of a greater team would be a boost and would reinforce my learning.

Progress Report:

I have currently increased the dataset to 8 GB of CSV data on COVID-19, spanning many scenarios and demographics. The data quantity should be somewhere in excess of 10 to 50 million entries.
I am working on adding this data so the model can learn through that pipeline (a sketch of one way to handle it appears below).
The issue I faced was how to run my model on the CUDA GPU on Kaggle (fixed 5 minutes later, haha).
The current updated file is on Kaggle and I will update this repo tonight; any contributions are welcome.
Let's learn together! :)
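
On the GPU and large-CSV points above, the usual PyTorch pattern is to pick the device explicitly and stream the CSV in chunks rather than loading 8 GB at once. This is only a sketch: the file name "covid_data.csv" and the "text" column are placeholders, not the actual dataset layout.

import torch
import pandas as pd

# Run on Kaggle's CUDA GPU when available (the issue mentioned above).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("training on:", device)

# Stream a large CSV in chunks instead of loading it all into memory.
# "covid_data.csv" and the "text" column are placeholder names.
for chunk in pd.read_csv("covid_data.csv", chunksize=100_000):
    texts = chunk["text"].astype(str).tolist()
    # ...tokenize, numericalize, batch, move tensors to `device`, and train...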

License

Distributed under the AGPL-3.0 License. See LICENSE for more information.
