Ivpe1975/XLM-R-sentiment-analysis

__   __ _     ___  ___      ______   _____            _   _                      _      ___              _           _     
\ \ / /| |    |  \/  |      | ___ \ /  ___|          | | (_)                    | |    / _ \            | |         (_)    
 \ V / | |    | .  . |______| |_/ / \ `--.  ___ _ __ | |_ _ _ __ ___   ___ _ __ | |_  / /_\ \_ __   __ _| |_   _ ___ _ ___ 
 /   \ | |    | |\/| |______|    /   `--. \/ _ \ '_ \| __| | '_ ` _ \ / _ \ '_ \| __| |  _  | '_ \ / _` | | | | / __| / __|
/ /^\ \| |____| |  | |      | |\ \  /\__/ /  __/ | | | |_| | | | | | |  __/ | | | |_  | | | | | | | (_| | | |_| \__ \ \__ \
\/   \/\_____/\_|  |_/      \_| \_| \____/ \___|_| |_|\__|_|_| |_| |_|\___|_| |_|\__| \_| |_/_| |_|\__,_|_|\__, |___/_|___/
                                                                                                            __/ |          
                                                                                                           |___/           

Instructions to reproduce the results:

  1. Start by downloading the Amazon reviews dataset from AWS (https://registry.opendata.aws/amazon-reviews-ml). It was too big to host in this GitHub repo. Place the json folder in the same folder as the scripts. We have created the /json/dev/ folder in this repo to show where the files should go.
  2. You can reproduce the baseline model by running the baseline.py script. This produces a baseline_metrics.txt file containing the F1-scores and the accuracies.
  3. Running any of the roberta_xx.py scripts produces a model fine-tuned on language xx and a results_xx.txt file containing all the F1-scores and accuracies for the fine-tuned model. The roberta_xx.job files are provided for running the scripts on the HPC cluster.
  4. The lang2vec notebook contains all of the calculations for the mean R^2 values of the distance types. Note that the F1-scores are entered into the notebook manually from the results_xx.txt files by defining the results vector.
  5. The analysis notebook contains the pipeline for our quantitative and qualitative analysis. Before running it, you will need to run roberta_de_test.py to get the wrong_ids, y_true and y_pred vectors. This Python script relies on a saved model, so you will need to have run roberta_de.py first. In this case, the analysis is of a sample model (fine-tuned on German, predicting English). To get confusion matrices for the other languages/models, adjust the roberta_de_test.py script to whichever target and fine-tuned languages you want.
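
As a rough illustration of step 5, the sketch below shows how a confusion matrix and a macro-averaged F1-score can be derived from a pair of y_true and y_pred vectors. It is not the code from the analysis notebook. The function names, the integer label encoding and the toy vectors are assumptions made for this example, and the list of mismatching positions is only a stand-in for wrong_ids, whose exact format is not described here.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix whose rows are true labels and columns are predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def macro_f1(matrix):
    """Unweighted mean of the per-class F1-scores of a confusion matrix."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # column total minus hits
        fn = sum(matrix[i]) - tp                       # row total minus hits
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy vectors for illustration only (assumed encoding: one integer per class).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
labels = [0, 1, 2]

cm = confusion_matrix(y_true, y_pred, labels)
# Positions where the prediction differs from the true label.
wrong_positions = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

for row in cm:
    print(row)
print("macro F1:", round(macro_f1(cm), 4))
print("misclassified positions:", wrong_positions)
```

To inspect another language pair, the same functions could be fed the vectors produced after adjusting roberta_de_test.py as described in step 5.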

The results themselves are available in the PDF document in the root folder.
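
For readers who want to see what a mean R^2 calculation like the one in step 4 might look like, here is a minimal sketch. It assumes that, for one distance type, each fine-tuned model contributes one simple linear regression of F1-score on language distance across target languages, and that the resulting R^2 values are then averaged. That reading is an assumption, as are the function names and the toy numbers, which are made up and are not results from this project. The lang2vec notebook remains the reference for the actual procedure.

```python
def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x (the squared Pearson correlation)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant input explains no variance
    return sxy ** 2 / (sxx * syy)


def mean_r_squared(pairs):
    """Average R^2 over (distances, scores) pairs, one pair per fine-tuned model."""
    values = [r_squared(distances, scores) for distances, scores in pairs]
    return sum(values) / len(values)


# Made-up numbers for illustration only. Each pair holds the distances from one
# (hypothetical) fine-tuning language to four target languages and the
# matching scores, playing the role of the manually entered results vector.
toy_pairs = [
    ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]),  # perfectly linear
    ([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]),  # noisier relationship
]

print("mean R^2:", round(mean_r_squared(toy_pairs), 4))
```

Repeating the call once per distance type would give one mean R^2 per type, which is the kind of quantity step 4 refers to.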
