Commit b09201d: Commit all project's files
Eyal Arviv committed May 3, 2021
Showing 56 changed files with 184,925 additions and 0 deletions.
57 changes: 57 additions & 0 deletions README.md
# From Individuals to Communities: Community-Aware Language Modeling for the Detection of Hate Speech

To avoid environment issues, you can create a new conda env using the following line:

`conda create --name <env> --file requirements.txt`

This project is divided into two main segments:
1. Hate Speech Detection - under the module `detection`
2. Hate Networks - under the module `hate_networks`


## Hate Speech Detection
This is the main module of the thesis. It contains both the post-level models (PLMs) and the user-level models (ULMs) for detecting hate speech.

The execution of this module is configured in the file `config/detection_config.py`, where you can control which dataset is used for a specific run.

There are four entry points to this module.
To execute each of the following experiments, run the following from the root dir of the project:

`python detection/experiments/{experiment_file_name}.py`

The entry points reside under the `experiments` directory, in the following files:
* Post-level experiments
  * `post_level__experiment.py` - run this file to execute a single experiment of one model on one dataset, as configured in `config/detection_config.py`.
    To run this file, set the parameter `multiple_experiments` to `False` under `post_level_execution_config` in `config/detection_config.py`.
  * `post_level__multiple_experiments.py` - run this file to execute multiple experiments of several models and compare them.
    To run this file, set the parameter `multiple_experiments` to `True` under `post_level_execution_config` in `config/detection_config.py`.

**Important note**: to run the BertFineTuning model, make sure you are using a GPU.
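As a hypothetical illustration (not the project's actual code), an entry point can fail fast when the `multiple_experiments` flag in the config disagrees with the file being executed:

```python
# Hypothetical sketch, not the project's actual code: how an entry point can
# fail fast when the configured flag disagrees with the file being executed.
def check_experiment_mode(config: dict, multiple: bool) -> None:
    if config["multiple_experiments"] != multiple:
        raise ValueError(
            "Set post_level_execution_config['multiple_experiments'] to "
            f"{multiple} before running this entry point."
        )

check_experiment_mode({"multiple_experiments": False}, multiple=False)  # passes silently
```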

* User-level experiments
  * `user_level__experiments.py` - run this file to execute the user-level experiment.
    It loads the PLM according to the `post_level_execution_config` config and predicts, for every post by every user in the given data, the probability that it contains hate speech.
    It then runs the FFNN on the streams of posts by each user, their followees, and their followers, together with network features.
    * **Important note**: the hate networks module must be run on the desired dataset before this code, as it uses some of that module's output.
  * `user_level__threshold_models.py` - run this file to execute the threshold-based ULMs.
    * **Important note**: `user_level__experiments.py` must be run on the desired dataset before this code, as it uses some of that experiment's output.

* The outputs of the models' executions will be saved under `detection/outputs/{data_name}_{model_name}`.


## Hate Networks
In this module we build networks of hate-promoting users from social-network data that captures engagement between users (e.g., mentions and retweets).
Using this module you can create an unsupervised segmentation of users into communities, based on the chosen configuration.
The unsupervised methods are topic models (LDA/NMF) and Word2Vec modeling based on the users' texts.
Using these methods you can color the users in the reconstructed user network.
You can also color the users using the predictions of the user classifier from the `detection` module.
To do so, set the param `plot_supervised_networks` to `True` in `general_conf` under the `hate_networks_config.py` file, and use the `user_pred` param under `path_conf` for each dataset.

Important notes:
* The entry point for this module is the function `main()` in the file `main.py`. To execute the code from the root dir of the project, run:

`python hate_networks/main.py`

* The execution of this code is configured using the configuration dict under the `config` module in the file `hate_networks_config.py`.

* The outputs of the constructed network and the relevant files that are created with it will be saved under `hate_networks/outputs/{data_name}_networks`.
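The output location described above can be sketched as a small helper (illustrative only; the directory names follow the `{data_name}_networks` pattern quoted above):

```python
import os

# Illustrative helper, not project code: expand the documented output pattern
# hate_networks/outputs/{data_name}_networks for a given dataset name.
def networks_output_dir(data_name: str, root: str = "hate_networks/outputs") -> str:
    return os.path.join(root, f"{data_name}_networks")

print(networks_output_dir("echo"))  # hate_networks/outputs/echo_networks
```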
106 changes: 106 additions & 0 deletions config/data_config.py
from datetime import datetime, date, timedelta
general_conf = {
"ignore_retweets": True,
"only_english": True,
"processes_number": 30,
# "data_to_process": "covid", # not in use, sent as a flag when running the script
# possible values: 'covid', 'antisemitism', 'racial_slurs', 'all_datasets'
"dataset_type": "twitter"
}

trending_topic_conf = {
"p_num": 15, # number of processes for parallel execution
"latest_date": datetime.today() - timedelta(days=1), #datetime(2020, 6, 4),
# "chunk_size": 5, # number of days to consider in one chunk of data
"chunks_back_num": 3, # number of chunks to consider in total (including the last chunk)
"window_slide_size": 1, # size of rolling window to move to the next chunk
# "unigram_threshold": 200, # min unigram threshold for topic count
# "bigram_threshold": 300, # min bigram threshold for topic count
# "emoji_threshold": 50,
"factor": 3, # the relative growth of the topic's popularity
"factor_power": 0.1,
"user_limit": False, # to consider specific users' tweets
"ignore_retweets": True, # whether or not to ignore retweets while finding the relevant trending unigrams/bigrams
"ignore_punct": True,
"only_english": True,
"topn_to_save": 50
}
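The `factor` parameter suggests a growth-ratio criterion for trending topics. A minimal, hypothetical sketch (the exact formula, and how `factor_power` smooths it, are assumptions not shown in this config):

```python
# Hypothetical sketch of the trending criterion implied by `factor`: a term is
# trending when its count in the current chunk grows by at least `factor`
# relative to the previous chunk. The exact formula used by the project
# (including the role of `factor_power`) is not shown in this config file.
def is_trending(curr_count: int, prev_count: int, factor: float = 3.0) -> bool:
    return curr_count >= factor * max(prev_count, 1)

assert is_trending(90, 20)      # 90 >= 3 * 20
assert not is_trending(40, 20)  # 40 <  3 * 20
```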

path_confs = {
"covid": {
"root_path": "/data/work/data/covid/",
"raw_data": "/data/work/data/covid/data/",
"pickled_data": "/data/work/data/covid/processed_data/pickled_data/",
"models": "/data/work/data/covid/processed_data/models/",
"output_trending_topic_dir": f"/data/work/data/covid/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/covid/ts/"
},
"antisemitism": {
"root_path": "/data/work/data/hate_speech/antisemitism/",
"raw_data": "/data/work/data/hate_speech/antisemitism/data/",
"pickled_data": "/data/work/data/hate_speech/antisemitism/processed_data/pickled_data/",
"models": "/data/work/data/hate_speech/antisemitism/processed_data/models/",
"output_trending_topic_dir": f"/data/work/data/hate_speech/antisemitism/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/hate_speech/antisemitism/ts/"
},
"racial_slurs": {
"root_path": "/data/work/data/hate_speech/racial_slurs/",
"raw_data": "/data/work/data/hate_speech/racial_slurs/data/",
"pickled_data": "/data/work/data/hate_speech/racial_slurs/processed_data/pickled_data/",
"models": "/data/work/data/hate_speech/racial_slurs/processed_data/models/",
"output_trending_topic_dir": f"/data/work/data/hate_speech/racial_slurs/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/hate_speech/racial_slurs/ts/"
},
"all_datasets": {
"root_path": "/data/work/data/hate_speech/all_datasets/",
"pickled_data": "/data/work/data/hate_speech/all_datasets/pickled_data/",
"models": "/data/work/data/hate_speech/all_datasets/models/",
"output_trending_topic_dir": f"/data/work/data/hate_speech/all_datasets/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
"ts": "/data/work/data/hate_speech/all_datasets/ts/"
},
"gab": {
"root_path": "/data/work/data/hate_speech/gab/",
"raw_data": "/data/work/data/hate_speech/gab/data/",
"pickled_data": "/data/work/data/hate_speech/gab/processed_data/pickled_data/",
"models": "/data/work/data/hate_speech/gab/processed_data/models/",
"ts": "/data/work/data/hate_speech/gab/ts/",
"output_trending_topic_dir": "/data/work/data/hate_speech/trending_topics/",
"output_trending_topic_fn": f"{trending_topic_conf['chunks_back_num']}ChunksBack_"
f"XXXXXChunkSize_"
f"{trending_topic_conf['latest_date'].strftime('%Y-%m-%d')}LastDate.tsv",
}
}
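For concreteness, here is how the repeated filename f-strings above expand for an example date (the `XXXXXChunkSize` placeholder is kept verbatim, since `chunk_size` is commented out in `trending_topic_conf`):

```python
import os
from datetime import datetime

# Worked example of the output_trending_topic_fn f-strings above.
chunks_back_num = 3
latest_date = datetime(2021, 5, 2)  # example; the config uses today - 1 day

fn = (f"{chunks_back_num}ChunksBack_"
      f"XXXXXChunkSize_"
      f"{latest_date.strftime('%Y-%m-%d')}LastDate.tsv")
path = os.path.join("/data/work/data/covid/trending_topics/", fn)
print(path)  # /data/work/data/covid/trending_topics/3ChunksBack_XXXXXChunkSize_2021-05-02LastDate.tsv
```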

models_config = {
"word_embedding": {
"cbow":{
"embedding_size": 300,
"window_size": 11,
"min_count": 3
},
"skipgram":{
"embedding_size": 300,
"window_size": 11,
"min_count": 3
},
"fasttext":{
"embedding_size": 300,
"window_size": 11,
"min_count": 3,
"min_n": 3, # character n-gram
"max_n": 6
}
}
}
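Assuming these settings feed a gensim-style trainer (an assumption; the training code is not part of this config file), the entries can be mapped to keyword arguments like so. In gensim's `Word2Vec`, `sg=0` selects CBOW and `sg=1` selects skip-gram:

```python
# Sketch under an assumption: the embeddings are trained with a gensim-style
# API. The mapping below is illustrative, not the project's actual code.
def to_w2v_kwargs(conf: dict, algorithm: str) -> dict:
    c = conf["word_embedding"][algorithm]
    return {
        "vector_size": c["embedding_size"],
        "window": c["window_size"],
        "min_count": c["min_count"],
        "sg": 1 if algorithm == "skipgram" else 0,  # gensim: 0 = CBOW, 1 = skip-gram
    }

example_conf = {"word_embedding": {"cbow": {"embedding_size": 300, "window_size": 11, "min_count": 3}}}
print(to_w2v_kwargs(example_conf, "cbow"))
```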
199 changes: 199 additions & 0 deletions config/detection_config.py
import os

# post level execution config
post_level_execution_config = {
"multiple_experiments": False,
# set multiple_experiments to False when running post_level__experiment.py file.
# set multiple_experiments to True when running the file post_level__multiple_experiments.py.
"data": {
"dataset": "echo_2", # possible values: ["echo_2", "gab", "waseem_2", "waseem_3", "davidson_2", "davidson_3"]
'test_size': 0.2
},
"train_on_all_data": False,
"keep_all_data": True,
"omit_echo": False, # relevant only if keep_all_data is set to False. if omit_echo is True, we keep only posts without echo, and vice versa.
"model": "models.FeedForwardNN", # choose the model to run
# possible model values: ["AttentionLSTM", "CNN_LSTM", "BertFineTuning", "FeedForwardNN", "MyLogisticRegression",
# "MyCatboost", "MyLightGBM", "MyXGBoost"]
"bert_conf": { # relevant only for "BertFineTuning" model
# possible values:
# 'bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased'
# 'roberta-base', 'roberta-large', 'roberta-large-mnli'
# 'xlnet-base-cased', 'xlnet-large-cased'
# 'distilroberta-base', 'distilbert-base-uncased', 'distilbert-base-uncased-distilled-squad', 'distilbert-base-cased', 'distilbert-base-cased-distilled-squad'

# the best BERT variants are 'bert-base-uncased' and 'distilbert-base-uncased'
'model_type': "distilbert-base-uncased",
'use_masking': True,
'use_token_types': False
},
"preprocessing": {
"type": 'nn', # one of 'nn', 'bert', 'tfidf'
"output_path": "detection/outputs",
"max_features": 10000 # applicable only for non-bert models
},
"kwargs": {
"model_name": "",
"max_seq_len": 128,
"emb_size": 300,
"epochs": 20,
"fine_tune": True,
"validation_split": 0.2,
"model_api": "functional",
"paths": {
"train_output": "detection/outputs/",
"model_output": "detection/outputs/"
}
},
"evaluation": {
"metrics": [
"evaluation_metrics.ConfusionMatrix",
"evaluation_metrics.ROC",
"evaluation_metrics.PrecisionRecallCurve"
],
"output_path": "detection/outputs/"
}
}

# user level execution config
user_level_execution_config = {
"trained_data": "echo_2",
"inference_data": "echo_2"
}

# configs specific to posts with/wo the echo sign
echo_data_conf = {
"with_rt": True,
"only_en": False # if False -> with_rt must be True
}

if echo_data_conf["only_en"]:
    echo_suffix = "en_"
else:
    echo_suffix = "all_lang_"
if echo_data_conf["with_rt"]:
    echo_suffix += "with_rt"
else:
    echo_suffix += "no_rt"
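The suffix logic can be restated as a standalone helper, which makes the four possible values easy to enumerate (illustrative only, not project code):

```python
# Standalone restatement of the echo_suffix logic, covering all four settings.
def echo_suffix_for(only_en: bool, with_rt: bool) -> str:
    suffix = "en_" if only_en else "all_lang_"
    suffix += "with_rt" if with_rt else "no_rt"
    return suffix

print(echo_suffix_for(only_en=False, with_rt=True))  # all_lang_with_rt (the defaults above)
```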


# things to add to the execution config
if post_level_execution_config["multiple_experiments"]:
    post_level_execution_config["kwargs"]["model_name"] = "multiple_experiments"
else:
    post_level_execution_config["kwargs"]["model_name"] = post_level_execution_config["model"].split(".")[-1]

# additional paths config (corresponding to the output directory that is set according to the model_name param)
post_level_execution_config["preprocessing"]["output_path"] = os.path.join(post_level_execution_config["preprocessing"]["output_path"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "preprocessing")
post_level_execution_config["kwargs"]["paths"]["model_output"] = os.path.join(post_level_execution_config["kwargs"]["paths"]["model_output"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "saved_model")
post_level_execution_config["kwargs"]["paths"]["train_output"] = os.path.join(post_level_execution_config["kwargs"]["paths"]["train_output"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "training")
post_level_execution_config["evaluation"]["output_path"] = os.path.join(post_level_execution_config["evaluation"]["output_path"],
post_level_execution_config["data"]["dataset"],
post_level_execution_config["kwargs"]["model_name"], "evaluation")
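As a worked example of the path construction above, using the defaults in this file (`dataset = "echo_2"`, `model = "models.FeedForwardNN"`):

```python
import os

# Worked example of the output-path construction, with this file's defaults.
dataset = "echo_2"
model_name = "models.FeedForwardNN".split(".")[-1]  # -> "FeedForwardNN"

eval_path = os.path.join("detection/outputs/", dataset, model_name, "evaluation")
print(eval_path)  # detection/outputs/echo_2/FeedForwardNN/evaluation
```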


if 'bert' in post_level_execution_config['model'].lower():
    post_level_execution_config["preprocessing"]["bert_conf"] = post_level_execution_config["bert_conf"]
    post_level_execution_config["kwargs"]["bert_conf"] = post_level_execution_config["bert_conf"]
else:
    post_level_execution_config["preprocessing"]["bert_conf"] = None
    post_level_execution_config["kwargs"]["bert_conf"] = None


# Data path configs
post_level_conf = {
"davidson_2": {
"data_path": "data/twitter/hate-speech-and-offensive-language/davidson_2_labels_no_offensive.tsv", # davidson_2_labels.tsv
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["neither", "hate-speech/offensive-language"]

},
"davidson_3": {
"data_path": "data/twitter/hate-speech-and-offensive-language/davidson_3_labels.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["neither", "offensive-language", "hate-speech"]
},
"waseem_2":{
"data_path": "data/twitter/hate_speech_naacl/mkr_posts_annotations_2_label.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["none", "sexism/racism"]
},
"waseem_3":{
"data_path": "data/twitter/hate_speech_naacl/mkr_posts_annotations_3_label.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["none", "sexism", "racism"]
},
"echo_2": {
"data_path": "data/post_level/echo_posts_2_labels.tsv", # alternatives: all_annotations_2_labels.tsv, echo_tweets_2_labels.tsv
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["neutral-responsive", "hate speech"]
},
"echo_3": {
"data_path": "data/post_level/echo_posts_3_labels.tsv",
"unique_column": "tweet_id",
"text_column": "text",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["neutral", "hate speech", "responsive"]
},
"gab": {
"data_path": "data/post_level/gab_posts_2_labels.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["Not-HS", "HS"]
},
"combined": {
"data_path": "data/post_level/combined_post_data_2_labels_no_offensive_davidson.tsv",
"text_column": "text",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["Not-HS", "HS"]
},
}
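A minimal, stdlib-only sketch of reading one of these TSV datasets through its config entry (the project's actual loader is not shown here and may well use pandas instead):

```python
import csv
import io

# Illustrative loader, not project code: read (text, label) pairs from a TSV
# dataset using the column names declared in its post_level_conf entry.
def load_posts(tsv_text: str, conf: dict) -> list:
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row[conf["text_column"]], int(row[conf["label_column"]]))
            for row in reader]

sample = "text\tlabel\nhello world\t0\n"
conf = {"text_column": "text", "label_column": "label"}
print(load_posts(sample, conf))  # [('hello world', 0)]
```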

user_level_conf = {
"echo_2": {
"data_path": "data/user_level/echo_users_2_labels.tsv",
"following_fn": "data_users_mention_edges_df.tsv",
"user_unique_column": "user_id",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["neutral-responsive", "hate speech"],
"posts_per_user_path": f"hate_networks/outputs/echo_networks/pickled_data/corpora_list_per_user.pkl"
},
"echo_3": {
"data_path": "data/twitter/echo/echo_users_3_labels.tsv",
"user_unique_column": "user_id",
"label_column": "label",
"labels": [0, 1, 2],
"labels_interpretation": ["neutral", "hate speech", "responsive"]
},
"gab": {
"data_path": "data/gab/gab_users_2_labels.tsv",
"following_fn": "labeled_users_followers.tsv",
"user_unique_column": "user_id",
"label_column": "label",
"labels": [0, 1],
"labels_interpretation": ["Not-HM", "HM"],
"posts_per_user_path": "hate_networks/gab_networks/pickled_data/corpora_list_per_user.pkl"
}
}
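The paired `labels` / `labels_interpretation` lists invite a small lookup helper for turning numeric predictions back into readable labels (illustrative, not project code):

```python
# Illustrative helper: map a numeric prediction back to its readable
# interpretation using the paired lists defined in the configs above.
def interpret(conf: dict, label: int) -> str:
    return dict(zip(conf["labels"], conf["labels_interpretation"]))[label]

echo_2 = {"labels": [0, 1], "labels_interpretation": ["neutral-responsive", "hate speech"]}
print(interpret(echo_2, 1))  # hate speech
```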

# user_level_conf["echo_2"]["posts_per_user_path"] = f"hate_networks/echo_networks/pickled_data/corpora_list_per_user_{echo_suffix}.pkl"