This repository contains the code developed for an experiment researching fine-grained sentiment analysis models for Norwegian text. Our thesis documenting the experiments performed with this code can be found here. The slide deck used to present this thesis can be found here.
Using the IMN (He et al., 2019) and RACL (Chen et al., 2020) models as baselines, we synthesize our novel `FgFlex` model for easy experimentation on attention relations between the subtasks of such models.
For reproducibility purposes, we outline the system requirements along with package dependencies needed to run this code.
Our Python runtime is 3.7.4. Both Windows 10 and Linux Red Hat 8.5 operating systems were used during development and testing, ensuring cross-platform functionality.
For faster training times, we recommend the use of a GPU node (especially for the larger models).
Specifically, we made use of CUDA interfaces, which are enabled automatically if CUDA is detected through PyTorch.
The machine learning framework we used for development was PyTorch.
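To check whether PyTorch detects a CUDA device, a quick sanity check:

```python
import torch

# True means CUDA was detected and training will run on the GPU.
print(torch.cuda.is_available())
```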
Specific versions of all the modules we used are listed below and in our `requirements.txt` file for easy pip installation (see the command after the table).
| pip module | version |
|---|---|
| torch | 1.7.1 |
| numpy | 1.18.1 |
| transformers | 4.15.0 |
| nltk | 3.6.7 |
| pytest | 6.1.1 |
| pandas | 1.3.5 |
| sklearn | 1.0.2 |
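All pinned versions can be installed in one step, assuming the command is run from the directory containing `requirements.txt`:

```
pip install -r requirements.txt
```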
There are three main ways to use this code: preprocessing, training single models, and studying hyperparameter configurations.
To begin the preprocessing step, make sure the NoReC$_{fine}$ data is downloaded to the same directory level as this cloned repository.
The output path the data will be written to can be configured in `src/config.py`.
Once the raw data is downloaded, the file `src/preprocess.py` can be run directly from the command line:
```
cd src/
python preprocess.py
```
This code restructures the NoReC$_{fine}$ data into the same IMN format used by both baselines.
To train a single model with our best configurations, call the `src/train.py` file.
This file expects IMN data to be stored at the location specified in `src/config.py`.
Note: if you want to test this architecture on English data, you should also update `BERT_PATH` in `src/config.py` to ensure an English BERT model is used to generate embeddings instead of the default Norwegian model, NorBERT2.
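For orientation, the relevant entries might look like the sketch below. Only `BERT_PATH` is named in this README; the other variable name and all values are hypothetical placeholders:

```python
# src/config.py -- illustrative sketch. Only BERT_PATH is confirmed
# by this README; the other name and the values are hypothetical.

# Hypothetical: where preprocess.py writes the IMN-formatted data
# and where train.py expects to find it.
DATA_PATH = "../data/imn/"

# Embedding model, NorBERT2 by default. Point this at an English BERT
# (a local path or Hugging Face model id) to run on English data.
BERT_PATH = "path/to/norbert2"
```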
Again, simply call the train file from the command line:
```
python train.py
```
This will train a single instance of the `FgFlex` model from scratch, except for the pre-trained NorBERT2 embeddings.
The final state of the model will be saved for later use at `../checkpoints/<model-name>.pt`.
The model name needs to be configured in the top lines of the `train.py` file in order to store different variations of this best model.
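Once saved, a checkpoint can be reloaded with standard PyTorch calls. The following is a minimal sketch that assumes `train.py` saves the full model object via `torch.save`; if it saves a `state_dict` instead, instantiate the model first and use `load_state_dict`:

```python
import torch

# Reload a trained FgFlex model for evaluation or inference.
# Replace <model-name> with the name configured in train.py.
model = torch.load("../checkpoints/<model-name>.pt", map_location="cpu")
model.eval()  # disable dropout etc. for deterministic predictions
```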
Our hand-made `Study` class can be used to test different hyperparameter configurations.
In addition to the preprocessed data and correctly specified BERT paths, a `Study` requires a JSON configuration stored in the `studies/` directory (a hypothetical example is sketched at the end of this section).
For example, to re-run the study on layers for the `FgFlex` model, you would call:
```
python study.py fgflex/layers.json
```
The `.json` extension is not necessary, but is included here for consistency.
Note that the `studies/` directory prefix is not needed.
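For illustration, a study configuration file could look something like the sketch below. The actual schema is defined by the `Study` class, so every key and value here is a hypothetical example rather than the repository's real format:

```json
{
  "model": "fgflex",
  "parameter": "layers",
  "values": [1, 2, 3, 4]
}
```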