Finetuned GPT-2 on recent public, anonymous Reddit conversations to capture genuine public sentiment about the events that have been unfolding in Pakistan over the past year.
Hosted project link here. Since this runs on a CPU on Hugging Face's free hosting tier, it takes about 500 seconds to respond. Alternatively, you can run it on your own GPU-enabled PC.
Download the model into the `model` directory from here.
Run `python sample.py` to generate text. You can change the prompt by changing the `prompt` variable value in `sample.py`.
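For reference, here is a minimal sketch of the sampling step using the Hugging Face transformers API; the actual `sample.py` follows nanoGPT's sampling code, and the checkpoint path and generation parameters below are assumptions.

```python
# Minimal sketch of prompt-based sampling with the finetuned weights, using
# the Hugging Face transformers API for illustration. The actual sample.py
# follows nanoGPT's sampling code; the "model/" path and the generation
# parameters below are assumptions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("model/")  # finetuned checkpoint

prompt = "The situation in Pakistan"  # edit this, like the prompt variable in sample.py
inputs = tokenizer(prompt, return_tensors="pt")

# Sample with temperature and top-k so outputs stay varied but coherent.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_k=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```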
Run `python app.py` to run the Gradio app locally.
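Below is a rough sketch of what a Gradio wrapper along the lines of `app.py` could look like, assuming the finetuned weights are available in Hugging Face format under `model/`; the actual app may differ.

```python
# Rough sketch of a Gradio wrapper along the lines of app.py, assuming the
# finetuned weights are available in Hugging Face format under "model/".
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="model/")

def generate_text(prompt: str) -> str:
    # Run sampling and return only the generated string.
    result = generator(prompt, max_new_tokens=200, do_sample=True)
    return result[0]["generated_text"]

demo = gr.Interface(
    fn=generate_text,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Textbox(label="Generated text"),
    title="GPT-2 finetuned on Pakistan-related Reddit conversations",
)

if __name__ == "__main__":
    demo.launch()  # serves the app on localhost
```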
Data was collected from the Pakistan, AskMiddleEast, and WorldNews Reddit communities, covering last year up to 25th May 2023. For more details on dataset generation, check out the code in the `dataset_generation` directory.
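As an illustration of the collection step, here is a hypothetical sketch using PRAW (the Python Reddit API wrapper); the real logic lives in `dataset_generation`, and the credentials, post limit, and output format below are all assumptions.

```python
# Hypothetical sketch of the collection step using PRAW (the Python Reddit
# API wrapper). The real logic lives in dataset_generation; the credentials,
# post limit, and plain-text output format here are all assumptions.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-dataset-builder",
)

texts = []
for name in ["pakistan", "AskMiddleEast", "worldnews"]:
    for post in reddit.subreddit(name).top(time_filter="year", limit=1000):
        texts.append(post.title)
        post.comments.replace_more(limit=0)  # flatten "load more comments" stubs
        texts.extend(comment.body for comment in post.comments.list())

# Dump everything into one corpus file for finetuning.
with open("input.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))
```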
To understand how finetuning works, please refer to Andrej Karpathy's nanoGPT project.
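To give a feel for how a nanoGPT-style finetune is configured, here is an illustrative config sketch; the values and dataset name are assumptions, not the settings used for this project.

```python
# Illustrative nanoGPT-style finetuning config (the values and dataset name
# are assumptions, not the settings used for this project). nanoGPT loads
# pretrained GPT-2 weights via init_from and keeps training on the new corpus.
out_dir = "out-reddit-pakistan"
init_from = "gpt2"              # start from pretrained GPT-2 weights

dataset = "reddit_pakistan"     # assumed name of the prepared dataset directory
batch_size = 4
gradient_accumulation_steps = 8
max_iters = 2000

learning_rate = 3e-5            # small LR: adapt the model, don't overwrite pretraining
decay_lr = False
```

With nanoGPT, such a file would be passed to the trainer as `python train.py config/finetune_reddit.py` (the config filename here is made up).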
The training and validation loss curves are attached (blue = validation loss).
When you test it, you will notice that the model "hallucinates" a lot, which is a genuine problem with LLMs in general. Moving from GPT-2 to GPT-3/4 would significantly improve the quality of the generated text.
Finetuning on an even larger dataset may also help; in that regard, data could be collected from a larger number of Reddit communities, and perhaps from Twitter as well. Furthermore, in the pretraining phase, GPT-2 could be trained on the latest articles on the Internet to better understand the context of what people are talking about.
In addition, the GPT-2 medium and large variants could also be tried, but I had limited GPU resources and kept running out of VRAM.