Commit b09201d: Commit all project's files
Eyal Arviv committed May 3, 2021
Showing 56 changed files with 184,925 additions and 0 deletions.
57 changes: 57 additions & 0 deletions README.md
# From Individuals to Communities: Community-Aware Language Modeling for the Detection of Hate Speech

To avoid environment issues, you can create a new conda env using the following line:

`conda create --name <env> --file requirements.txt`

This project is divided into two main segments:
1. Hate Speech Detection - under the module `detection`
2. Hate Networks - under the module `hate_networks`


## Hate Speech Detection
This is the main module of the thesis. It contains both the post-level models (PLMs) and the user-level models (ULMs) for detecting hate speech.

The execution of this module is configured in the file `config/detection_config.py`, where you can control which dataset is used for a specific run.

There are four entry points to this module.
To execute each of the following experiments, run the following from the root dir of the project:

`python detection/experiments/{experiment_file_name}.py`

The entry points reside under the `experiments` directory, in the following files:
* Post-level experiments
  * `post_level__experiment.py` - run this file to execute a single experiment of one model on one dataset, as configured in `config/detection_config.py`.
    To run this file, set the parameter `multiple_experiments` to `False` under `post_level_execution_config` in `config/detection_config.py`.
  * `post_level__multiple_experiments.py` - run this file to execute multiple experiments of several models and compare them.
    To run this file, set the parameter `multiple_experiments` to `True` under `post_level_execution_config` in `config/detection_config.py`.

**Important note**: to run the BertFineTuning model, make sure you are using a GPU.
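As a hypothetical illustration (not the project's actual code), an entry point can fail fast when the `multiple_experiments` flag in the config disagrees with the file being executed:

```python
# Hypothetical sketch, not the project's actual code: how an entry point can
# fail fast when the configured flag disagrees with the file being executed.
def check_experiment_mode(config: dict, multiple: bool) -> None:
    if config["multiple_experiments"] != multiple:
        raise ValueError(
            "Set post_level_execution_config['multiple_experiments'] to "
            f"{multiple} before running this entry point."
        )

check_experiment_mode({"multiple_experiments": False}, multiple=False)  # passes silently
```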

* User-level experiments
  * `user_level__experiments.py` - run this file to execute the user-level experiment.
    It loads the PLM according to the `post_level_execution_config` config and predicts, for every post by every user in the given data, the probability that it contains hate speech.
    It then runs the FFNN on the streams of posts by each user, their followees, and their followers, together with network features.
    * **Important note**: the hate networks module must be run on the desired dataset before this code, as it uses some of that module's output.
  * `user_level__threshold_models.py` - run this file to execute the threshold-based ULMs.
    * **Important note**: `user_level__experiments.py` must be run on the desired dataset before this code, as it uses some of that experiment's output.

* The outputs of the models' executions will be saved under `detection/outputs/{data_name}_{model_name}`.


## Hate Networks
In this module we build networks of hate-promoting users from social-network data that captures engagement between users (e.g., mentions and retweets).
Using this module you can create an unsupervised segmentation of users into communities, based on the chosen configuration.
The unsupervised methods are topic models (LDA/NMF) and Word2Vec modeling based on the users' texts.
Using these methods you can color the users in the reconstructed user network.
You can also color the users using the predictions of the user classifier from the `detection` module.
To do so, set the param `plot_supervised_networks` to `True` in `general_conf` under the `hate_networks_config.py` file, and use the `user_pred` param under `path_conf` for each dataset.

Important notes:
* The entry point for this module is the function `main()` in the file `main.py`. To execute the code from the root dir of the project, run:

`python hate_networks/main.py`

* The execution of this code is configured using the configuration dict under the `config` module in the file `hate_networks_config.py`.

* The outputs of the constructed network and the relevant files that are created with it will be saved under `hate_networks/outputs/{data_name}_networks`.
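The output location described above can be sketched as a small helper (illustrative only; the directory names follow the `{data_name}_networks` pattern quoted above):

```python
import os

# Illustrative helper, not project code: expand the documented output pattern
# hate_networks/outputs/{data_name}_networks for a given dataset name.
def networks_output_dir(data_name: str, root: str = "hate_networks/outputs") -> str:
    return os.path.join(root, f"{data_name}_networks")

print(networks_output_dir("echo"))  # hate_networks/outputs/echo_networks
```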
106 changes: 106 additions & 0 deletions config/data_config.py
from datetime import datetime, date, timedelta
general_conf = {
"ignore_retweets": True,
"only_english": True,
"processes_number": 30,
# "data_to_process": "covid", # not in use, sent as a flag when running the script
# possible values: 'covid', 'antisemitism', 'racial_slurs', 'all_datasets'
"dataset_type": "twitter"
}

trending_topic_conf = {
"p_num": 15, # number of processes for parallel execution
"latest_date": datetime.today() - timedelta(days=1), #datetime(2020, 6, 4),
# "chunk_size": 5, # number of days to consider in one chunk of data
"chunks_back_num": 3, # number of chunks to consider in total (including the last chunk)
"window_slide_size": 1, # size of rolling window to move to the next chunk
# "unigram_threshold": 200, # min unigram threshold for topic count
# "bigram_threshold": 300, # min bigram threshold for topic count
# "emoji_threshold": 50,
"factor": 3, # the relative growth of the topic's popularity
"factor_power": 0.1,
"user_limit": False, # to consider specific users' tweets
"ignore_retweets": True, # whether or not to ignore retweets while finding the relevant trending unigrams/bigrams
"ignore_punct": True,
"only_english": True,
"topn_to_save": 50
}
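The `factor` parameter suggests a growth-ratio criterion for trending topics. A minimal, hypothetical sketch (the exact formula, and how `factor_power` smooths it, are assumptions not shown in this config):

```python
# Hypothetical sketch of the trending criterion implied by `factor`: a term is
# trending when its count in the current chunk grows by at least `factor`
# relative to the previous chunk. The exact formula used by the project
# (including the role of `factor_power`) is not shown in this config file.
def is_trending(curr_count: int, prev_count: int, factor: float = 3.0) -> bool:
    return curr_count >= factor * max(prev_count, 1)

assert is_trending(90, 20)      # 90 >= 3 * 20
assert not is_trending(40, 20)  # 40 <  3 * 20
```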

path_confs = {
"covid": {
"root_path": "/data/work/data/covid/",
"raw_data": "/data/work/data/covid/data/",
"pickled_data": "/data/work/data/covid/processed_data/pickled_data/",
"models": "/data/work/data/covid/processed_data/models/",
"output_trending_topic_dir": f"/data/work/data/covid/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/covid/ts/"
},
"antisemitism": {
"root_path": "/data/work/data/hate_speech/antisemitism/",
"raw_data": "/data/work/data/hate_speech/antisemitism/data/",
"pickled_data": "/data/work/data/hate_speech/antisemitism/processed_data/pickled_data/",
"models": "/data/work/data/hate_speech/antisemitism/processed_data/models/",
"output_trending_topic_dir": f"/data/work/data/hate_speech/antisemitism/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/hate_speech/antisemitism/ts/"
},
"racial_slurs": {
"root_path": "/data/work/data/hate_speech/racial_slurs/",
"raw_data": "/data/work/data/hate_speech/racial_slurs/data/",
"pickled_data": "/data/work/data/hate_speech/racial_slurs/processed_data/pickled_data/",
"models": "/data/work/data/hate_speech/racial_slurs/processed_data/models/",
"output_trending_topic_dir": f"/data/work/data/hate_speech/racial_slurs/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/hate_speech/racial_slurs/ts/"
},
"all_datasets": {
"root_path": "/data/work/data/hate_speech/all_datasets/",
"pickled_data": "/data/work/data/hate_speech/all_datasets/pickled_data/",
"models": "/data/work/data/hate_speech/all_datasets/models/",
"output_trending_topic_dir": f"/data/work/data/hate_speech/all_datasets/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/hate_speech/all_datasets/ts/"
},
"gab": {
"root_path": "/data/work/data/hate_speech/gab/",
"raw_data": "/data/work/data/hate_speech/gab/data/",
"pickled_data": "/data/work/data/hate_speech/gab/processed_data/pickled_data/",
"models": "/data/work/data/hate_speech/gab/processed_data/models/",
"ts": "/data/work/data/hate_speech/gab/ts/",
"output_trending_topic_dir": "/data/work/data/hate_speech/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
}
}
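For concreteness, here is how the repeated filename f-strings above expand for an example date (the `XXXXXChunkSize` placeholder is kept verbatim, since `chunk_size` is commented out in `trending_topic_conf`):

```python
import os
from datetime import datetime

# Worked example of the output_trending_topic_fn f-strings above.
chunks_back_num = 3
latest_date = datetime(2021, 5, 2)  # example; the config uses today - 1 day

fn = (f"{chunks_back_num}ChunksBack_"
      f"XXXXXChunkSize_"
      f"{latest_date.strftime('%Y-%m-%d')}LastDate.tsv")
path = os.path.join("/data/work/data/covid/trending_topics/", fn)
print(path)  # /data/work/data/covid/trending_topics/3ChunksBack_XXXXXChunkSize_2021-05-02LastDate.tsv
```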

models_config = {
"word_embedding": {
"cbow":{
"embedding_size": 300,
"window_size": 11,
"min_count": 3
},
"skipgram":{
"embedding_size": 300,
"window_size": 11,
"min_count": 3
},
"fasttext":{
"embedding_size": 300,
"window_size": 11,
"min_count": 3,
"min_n": 3, # character n-gram
"max_n": 6
}
}
}
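Assuming these settings feed a gensim-style trainer (an assumption; the training code is not part of this config file), the entries can be mapped to keyword arguments like so. In gensim's `Word2Vec`, `sg=0` selects CBOW and `sg=1` selects skip-gram:

```python
# Sketch under an assumption: the embeddings are trained with a gensim-style
# API. The mapping below is illustrative, not the project's actual code.
def to_w2v_kwargs(conf: dict, algorithm: str) -> dict:
    c = conf["word_embedding"][algorithm]
    return {
        "vector_size": c["embedding_size"],
        "window": c["window_size"],
        "min_count": c["min_count"],
        "sg": 1 if algorithm == "skipgram" else 0,  # gensim: 0 = CBOW, 1 = skip-gram
    }

example_conf = {"word_embedding": {"cbow": {"embedding_size": 300, "window_size": 11, "min_count": 3}}}
print(to_w2v_kwargs(example_conf, "cbow"))
```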
199 changes: 199 additions & 0 deletions config/detection_config.py
import os

# post level execution config
post_level_execution_config = {
"multiple_experiments": False,
# set multiple_experiments to False when running post_level__experiment.py file.
# set multiple_experiments to True when running the file post_level__multiple_experiments.py.
"data": {
"dataset": "echo_2", # possible values: ["echo_2", "gab", "waseem_2", "waseem_3", "davidson_2", "davidson_3"]
'test_size': 0.2
},
"train_on_all_data": False,
"keep_all_data": True,
"omit_echo": False, # relevant only if keep_all_data is set to False. if omit_echo is True, we keep only posts without echo, and vice versa.
"model": "models.FeedForwardNN", # choose the model to run
# possible model values: ["AttentionLSTM", "CNN_LSTM", "BertFineTuning", "FeedForwardNN", "MyLogisticRegression",
# "MyCatboost", "MyLightGBM", "MyXGBoost"]
"bert_conf": { # relevant only for "BertFineTuning" model
# possible values:
# 'bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased'
# 'roberta-base', 'roberta-large', 'roberta-large-mnli'
# 'xlnet-base-cased', 'xlnet-large-cased'
# 'distilroberta-base', 'distilbert-base-uncased', 'distilbert-base-uncased-distilled-squad', 'distilbert-base-cased', 'distilbert-base-cased-distilled-squad'

# the best BERT variants are 'bert-base-uncased' and 'distilbert-base-uncased'
'model_type': "distilbert-base-uncased",
'use_masking': True,
'use_token_types': False
},
"preprocessing": {
"type": 'nn', # one of 'nn', 'bert', 'tfidf'
"output_path": "detection/outputs",
"max_features": 10000 # applicable only for non-bert models
},
"kwargs": {
"model_name": "",
"max_seq_len": 128,
"emb_size": 300,
"epochs": 20,
"fine_tune": True,
"validation_split": 0.2,
"model_api": "functional",
"paths": {
"train_output": "detection/outputs/",
"model_output": "detection/outputs/"
}
},
"evaluation": {
"metrics": [
"evaluation_metrics.ConfusionMatrix",
"evaluation_metrics.ROC",
"evaluation_metrics.PrecisionRecallCurve"
],
"output_path": "detection/outputs/"
}
}

# user level execution config
user_level_execution_config = {
"trained_data": "echo_2",
"inference_data": "echo_2"
}

# configs specific to posts with/wo the echo sign
echo_data_conf = {
"with_rt": True,
"only_en": False # if False -> with_rt must be True
}

if echo_data_conf["only_en"]:
    echo_suffix = "en_"
else:
    echo_suffix = "all_lang_"
if echo_data_conf["with_rt"]:
    echo_suffix += "with_rt"
else:
    echo_suffix += "no_rt"
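The suffix logic can be restated as a standalone helper, which makes the four possible values easy to enumerate (illustrative only, not project code):

```python
# Standalone restatement of the echo_suffix logic, covering all four settings.
def echo_suffix_for(only_en: bool, with_rt: bool) -> str:
    suffix = "en_" if only_en else "all_lang_"
    suffix += "with_rt" if with_rt else "no_rt"
    return suffix

print(echo_suffix_for(only_en=False, with_rt=True))  # all_lang_with_rt (the defaults above)
```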


# things to add to the execution config
if post_level_execution_config["multiple_experiments"]:
    post_level_execution_config["kwargs"]["model_name"] = "multiple_experiments"
else:
    post_level_execution_config["kwargs"]["model_name"] = post_level_execution_config["model"].split(".")[-1]

# additional paths config (corresponding to the output directory that is set according to the model_name param)
post_level_execution_config["preprocessing"]["output_path"] = os.path.join(post_level_execution_config["preprocessing"]["output_path"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "preprocessing")
post_level_execution_config["kwargs"]["paths"]["model_output"] = os.path.join(post_level_execution_config["kwargs"]["paths"]["model_output"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "saved_model")
post_level_execution_config["kwargs"]["paths"]["train_output"] = os.path.join(post_level_execution_config["kwargs"]["paths"]["train_output"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "training")
post_level_execution_config["evaluation"]["output_path"] = os.path.join(post_level_execution_config["evaluation"]["output_path"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "evaluation")
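As a worked example of the path construction above, using the defaults in this file (`dataset = "echo_2"`, `model = "models.FeedForwardNN"`):

```python
import os

# Worked example of the output-path construction, with this file's defaults.
dataset = "echo_2"
model_name = "models.FeedForwardNN".split(".")[-1]  # -> "FeedForwardNN"

eval_path = os.path.join("detection/outputs/", dataset, model_name, "evaluation")
print(eval_path)  # detection/outputs/echo_2/FeedForwardNN/evaluation
```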


if 'bert' in post_level_execution_config['model'].lower():
    post_level_execution_config["preprocessing"]["bert_conf"] = post_level_execution_config["bert_conf"]
    post_level_execution_config["kwargs"]["bert_conf"] = post_level_execution_config["bert_conf"]
else:
    post_level_execution_config["preprocessing"]["bert_conf"] = None
    post_level_execution_config["kwargs"]["bert_conf"] = None


# Data path configs
post_level_conf = {
"davidson_2": {
"data_path": "data/twitter/hate-speech-and-offensive-language/davidson_2_labels_no_offensive.tsv", # davidson_2_labels.tsv
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["neither", "hate-speech/offensive-language"]

},
"davidson_3": {
"data_path": "data/twitter/hate-speech-and-offensive-language/davidson_3_labels.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["neither", "offensive-language", "hate-speech"]
},
"waseem_2":{
"data_path": "data/twitter/hate_speech_naacl/mkr_posts_annotations_2_label.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["none", "sexism/racism"]
},
"waseem_3":{
"data_path": "data/twitter/hate_speech_naacl/mkr_posts_annotations_3_label.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["none", "sexism", "racism"]
},
"echo_2": {
"data_path": "data/post_level/echo_posts_2_labels.tsv", # alternatives: all_annotations_2_labels.tsv, echo_tweets_2_labels.tsv
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["neutral-responsive", "hate speech"]
},
"echo_3": {
"data_path": "data/post_level/echo_posts_3_labels.tsv",
"unique_column": "tweet_id",
"text_column": "text",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["neutral", "hate speech", "responsive"]
},
"gab": {
"data_path": "data/post_level/gab_posts_2_labels.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["Not-HS", "HS"]
},
"combined": {
"data_path": "data/post_level/combined_post_data_2_labels_no_offensive_davidson.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["Not-HS", "HS"]
},
}
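A minimal, stdlib-only sketch of reading one of these TSV datasets through its config entry (the project's actual loader is not shown here and may well use pandas instead):

```python
import csv
import io

# Illustrative loader, not project code: read (text, label) pairs from a TSV
# dataset using the column names declared in its post_level_conf entry.
def load_posts(tsv_text: str, conf: dict) -> list:
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row[conf["text_column"]], int(row[conf["label_column"]]))
            for row in reader]

sample = "text\tlabel\nhello world\t0\n"
conf = {"text_column": "text", "label_column": "label"}
print(load_posts(sample, conf))  # [('hello world', 0)]
```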

user_level_conf = {
"echo_2": {
"data_path": "data/user_level/echo_users_2_labels.tsv",
"following_fn": "data_users_mention_edges_df.tsv",
"user_unique_column": "user_id",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["neutral-responsive", "hate speech"],
"posts_per_user_path": f"hate_networks/outputs/echo_networks/pickled_data/corpora_list_per_user.pkl"
},
"echo_3": {
"data_path": "data/twitter/echo/echo_users_3_labels.tsv",
"user_unique_column": "user_id",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["neutral", "hate speech", "responsive"]
},
"gab": {
"data_path": "data/gab/gab_users_2_labels.tsv",
"following_fn": "labeled_users_followers.tsv",
"user_unique_column": "user_id",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["Not-HM", "HM"],
"posts_per_user_path": "hate_networks/gab_networks/pickled_data/corpora_list_per_user.pkl"
}
}
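The paired `labels` / `labels_interpretation` lists invite a small lookup helper for turning numeric predictions back into readable labels (illustrative, not project code):

```python
# Illustrative helper: map a numeric prediction back to its readable
# interpretation using the paired lists defined in the configs above.
def interpret(conf: dict, label: int) -> str:
    return dict(zip(conf["labels"], conf["labels_interpretation"]))[label]

echo_2 = {"labels": [0, 1], "labels_interpretation": ["neutral-responsive", "hate speech"]}
print(interpret(echo_2, 1))  # hate speech
```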

# user_level_conf["echo_2"]["posts_per_user_path"] = f"hate_networks/echo_networks/pickled_data/corpora_list_per_user_{echo_suffix}.pkl"