- Ilias Merigh
- Nicolas Filimonov
- Thai-Nam Hoang
Introduction • Setup • Project structure • Gathering data • Notebooks • Best submissions • Contact information
This project aims to classify tweets as negative (-1) or non-negative (1) using a blend of machine learning and deep learning techniques. We implemented several methods to represent tweets (TF-IDF, GloVe embeddings). We employed traditional machine learning models, such as Stochastic Gradient Descent (SGD) and Logistic Regression, for their efficiency in text classification. To capture the nuanced context of language, we integrated the Gated Recurrent Unit (GRU), a neural network adept at processing sequential data. Additionally, we used transformer-based models, BERT and its variant RoBERTa, to better capture complex language patterns in tweets. This combination of diverse approaches provides a comprehensive and accurate sentiment analysis of Twitter data.
For more details, see report.pdf.
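For a sense of what the TF-IDF + Logistic Regression baseline looks like in code, here is a minimal scikit-learn sketch. The toy tweets and hyperparameters below are illustrative placeholders, not the project's actual configuration; the real pipelines live in the notebooks.

```python
# Illustrative TF-IDF + Logistic Regression baseline (placeholder data, not the
# project's real training pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["i love this <3", "worst day ever"]   # toy examples
labels = [1, -1]                                # 1 = non-negative, -1 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, labels)
print(clf.predict(["such a great day"]))
```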
Make sure to install conda on your machine first. Then, run the following commands to create the environment and install packages:
conda env create -f environment.yml
conda activate tweet-sentiment
Alternatively, you can install the packages manually:
pip install -r requirements.txt
This structure was used during the project's development, and we recommend keeping it, since the file paths in the code assume this layout.
- data: contains the raw data and the preprocessed data.
- models: contains the GRU and BERT models, which inherit from the Model abstract class (a hypothetical sketch of this interface is shown after this list).
- notebooks: contains the notebooks used for data exploration and model training.
- submissions: contains the submissions to AIcrowd.
- utility: contains decorators, file paths and resources for preprocessing the tweets.
- weights: contains the saved model weights.
- preprocessing.py: contains the preprocessing pipeline.
- run.py: runs the model that yields the best submission on AIcrowd.
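The Model abstract class referenced above defines the interface shared by the GRU and BERT models. The sketch below is hypothetical: the method names are illustrative, not the repository's exact API.

```python
# Hypothetical sketch of the Model base class; method names are illustrative,
# not the repository's exact API.
from abc import ABC, abstractmethod

class Model(ABC):
    """Common interface shared by the GRU and BERT models."""

    @abstractmethod
    def load_data(self, train_path: str, test_path: str) -> None:
        """Load and tokenize the preprocessed tweets."""

    @abstractmethod
    def train(self) -> None:
        """Fit the model on the training tweets."""

    @abstractmethod
    def predict(self) -> list:
        """Return one label (-1 or 1) per test tweet."""
```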
Download the raw data from the AIcrowd site, extract it and put it into the data folder. The structure should be as follows:
├── data
│ ├── train_pos.txt
│ ├── train_neg.txt
│ ├── train_pos_full.txt
│ ├── train_neg_full.txt
│ └── test_data.txt
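Each of these files stores one tweet per line, so they can be read with plain Python. A minimal sketch, using the label convention (-1 negative, 1 non-negative) from the introduction:

```python
# Read the raw tweet files: one tweet per line; positives labelled 1, negatives -1.
from pathlib import Path

def load_tweets(path):
    return Path(path).read_text(encoding="utf-8").splitlines()

pos = load_tweets("data/train_pos_full.txt")
neg = load_tweets("data/train_neg_full.txt")
tweets = pos + neg
labels = [1] * len(pos) + [-1] * len(neg)
```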
We used GloVe embeddings from Stanford NLP. You can download the archive from their website, extract it and use glove.twitter.27B.100d.txt, or use this link to download the file directly without extracting. Afterward, put it into the data folder.
Required space: 974 MB for glove.twitter.27B.100d.txt
├── data
│ ├── glove.twitter.27B.100d.txt
│ ├── test_data.txt
│ ├── train_neg.txt
│ ├── train_neg_full.txt
│ ├── train_pos.txt
│ └── train_pos_full.txt
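A minimal sketch of how these pre-trained vectors can be loaded into a Python dictionary, assuming the file sits at data/glove.twitter.27B.100d.txt as shown above (this is a generic loading routine, not necessarily the one used in the notebooks):

```python
# Load the 100-dimensional GloVe Twitter vectors into a {word: vector} dictionary.
import numpy as np

embeddings = {}
with open("data/glove.twitter.27B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = np.asarray(values, dtype=np.float32)

print(embeddings["hello"].shape)  # (100,)
```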
We used Word2Vec embeddings from Google. You can download them from Kaggle, extract the archive and put it into the data folder.
├── data
│ ├── preprocessed
│ ...
│ ├── glove.twitter.27B.100d.txt
│ ├── GoogleNews-vectors-negative300.bin
│ └── train_pos_full.txt
Required space: 3.64 GB
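The binary Word2Vec file can be read with gensim, for example as sketched below (one common way to load it; the notebooks may do this differently):

```python
# Load the pre-trained 300-dimensional GoogleNews Word2Vec vectors (binary format).
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "data/GoogleNews-vectors-negative300.bin", binary=True
)
print(w2v["hello"].shape)  # (300,)
```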
Download the preprocessed data from this link and put it into the data folder.
├── data
│ ├── preprocessed
│ │ ├── bert
│ │ │ ├── test.csv
│ │ │ ├── train.csv
│ │ ├── gru
│ │ │ ├── test.csv
│ │ │ ├── train.csv
│ │ ├── ml
│ │ │ ├── test.csv
│ │ │ ├── train.csv
│ ├── glove.twitter.27B.100d.txt
│ ...
│ └── train_pos_full.txt
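Each preprocessed split is a plain CSV, so it can be inspected with pandas, as in this small sketch (column names are not listed here, so the snippet only loads the frames):

```python
# Load the preprocessed splits for one of the model families (here: bert).
import pandas as pd

train = pd.read_csv("data/preprocessed/bert/train.csv")
test = pd.read_csv("data/preprocessed/bert/test.csv")
print(train.shape, test.shape)
```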
Weights are required to generate predictions from the models without retraining. You can download the weights from this link.
├── weights
│ ├── bert
│ │ ├── config.json
│ │ └── tf_model.h5
│ ├── bert-large
│ │ ├── config.json
│ │ └── tf_model.h5
│ ├── gru
│ │ ├── config.json
│ │ └── model.keras
│ └── README.md
Required space: 2.14 GB for all weights
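The bert and bert-large folders follow the Hugging Face TensorFlow layout (config.json plus tf_model.h5), and the gru folder stores a Keras model, so they could in principle be loaded as sketched below. This is only an illustration of what the files contain; run.py already handles loading, and the exact model classes used in the repository may differ.

```python
# Sketch: load the saved weights directly (run.py already does this when passed -w).
# The exact model classes are assumptions based on the file layout above.
import tensorflow as tf
from transformers import TFBertForSequenceClassification

bert = TFBertForSequenceClassification.from_pretrained("weights/bert")  # config.json + tf_model.h5
gru = tf.keras.models.load_model("weights/gru/model.keras")             # saved Keras model
```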
run.py is the main script to load the weights and run the model. You can run it with the following command:
python3 run.py -w
This will run the pretrained model and load the best weights for it. You can also run the model without loading the weights by running:
python3 run.py
More detailed help can be found by running:
python3 run.py -h
Submissions will be saved in the submissions/bert folder, under the name submission_YYYY-MM-DD_HH:MM:SS.csv.
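For orientation, the -w flag and the timestamped submission name can be handled with standard-library tools roughly as sketched below; this is an assumption about how run.py is organised, not its actual source:

```python
# Assumed shape of run.py's argument handling and submission naming; illustrative only.
import argparse
from datetime import datetime

parser = argparse.ArgumentParser(description="Run the best model on the test tweets.")
parser.add_argument("-w", "--weights", action="store_true",
                    help="load the best pretrained weights before predicting")
args = parser.parse_args()

timestamp = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
submission_path = f"submissions/bert/submission_{timestamp}.csv"
```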
The notebooks are used for data exploration and model training. They are located in the notebooks folder. These notebooks are well documented and provide relevant details about the steps we followed to complete this project. They are structured as follows:
- model_BERT.ipynb: contains the BERT model training.
- model_GRU.ipynb: contains the GRU model training.
- model_logistic_regression.ipynb: contains the logistic regression model training.
- model_RoBERTa.ipynb: contains the RoBERTa model training.
- model_SGD.ipynb: contains the SGD model training.
- preprocessing.ipynb: contains the preprocessing pipeline.
- preprocessing_exploration.ipynb: contains the data exploration and preprocessing pipeline.
Our best submission on AIcrowd was a BERT-based model using the bert-large-uncased variant. After downloading the weights and loading the files for generating predictions, it takes roughly an hour to run on a typical laptop.
The best submission can be reproduced by running the command in Step 6. If any problems occur, the submission file is available at /submissions/bert/test_prediction_BERT_large.csv.
For help or issues using this repo, please submit a GitHub issue. For personal communication related to this project, please contact Ilias Merigh [email protected], Nicolas Filimonov [email protected] or Thai-Nam Hoang [email protected].