GitHub - tetrahydrofuran/hate_speech_classifier: Project 3 for Metis bootcamp. Objective was NLP analysis of Tweets to classify as "hate speech," "offensive," or "neither." Primary approach was tf-idf vectorization of n-grams of text and part-of-speech up to size 3. Attempted convolutional neural network on tf-idf vectors, without improvement.

towardsdatascience.com/into-a-textual-heart-of-darkness-39b3895ce21e

Hate Speech Classifier

This project was created as part of the Metis data science bootcamp.

The goal of this project was to build a classifier that separates hate speech from other offensive Twitter comments.

Repository Description

The data .csv file can be found at this repository. Place it in the data folder, or point the pd.read_csv call in main.py to the file's location.
Run main.py to generate non-deep learning models, according to the configuration settings defined at the top of the script.
Run keras-cnn.py to train a convolutional neural network with 2 layers of convolution and pooling.

Repository Structure

bin
- processing: Contains methods to process and normalize text, including a part-of-speech classifier
- modeling: Contains methods to generate non-deep learning models
- main.py: Entry point for non-deep learning methods. Configurable settings include:
  - bool to perform PCA or not before analysis, and int number of components
  - bool to convert the 3-class problem to a binary classification problem
  - bool to force performing text normalization
  - bool to determine what type of model to generate
  - str description of run for metrics dataframe identification
  - float between 0 and 1 to define test size
  - int to serve as random seed for train-test splits for reproducibility
- keras-cnn.py: Convolutional neural network to perform classification
data: Dump of data, including processed dataframes, word dictionary to serve as reference, etc.
- models: Generated serialized models
  - v1-nonstratified: Nonstratified sampling
  - v2-stratified: Stratified sampling
  - v3-binary_stratified-class02: Stratified sampling with class 1 and class 0 combined
  - v4-pca_stratified-50components: Stratified sampling with PCA to 50 components
  - v5-pca_stratified-250components: Stratified sampling with PCA to 250 components
- normalize-checkpoints: Will be generated during text normalization steps, containing serialized dataframes before and after most time-consuming steps
docs: Documentation associated with the project
img: Generated images of model results
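
As an illustration of the configurable settings listed for main.py, the sketch below groups them into a single object. This is not the layout used in main.py itself: the class name, field names, and default values are all hypothetical, and the bool that selects the model type is left out because its meaning is not spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical grouping of the settings described for main.py."""

    use_pca: bool = False          # perform PCA before analysis?
    n_components: int = 50         # number of PCA components (only used with use_pca)
    binary: bool = False           # convert the 3-class problem to a binary one
    force_normalize: bool = False  # force the text normalization step to run
    description: str = "baseline"  # label identifying the run in the metrics dataframe
    test_size: float = 0.2         # fraction of the data held out for testing
    random_seed: int = 42          # seed for reproducible train-test splits

    def __post_init__(self):
        # Reject values that would make the train-test split or PCA meaningless.
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be strictly between 0 and 1")
        if self.use_pca and self.n_components < 1:
            raise ValueError("n_components must be a positive integer")


# Example: settings resembling the v5 model folder described above.
config = RunConfig(
    use_pca=True,
    n_components=250,
    description="v5-pca_stratified-250components",
)
```

Keeping the settings in one validated object means a mistyped test size fails at startup rather than partway through a long normalization run.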

Dependencies

sklearn
pandas
numpy
nltk
keras
h5py

If I Could Do It Over Again

Save the TfidfVectorizer() objects fitted to my data
Design with class imbalance in mind from the beginning
Much better repository structure and organization
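
The first point could be handled roughly as follows. This is a minimal sketch, not code from this repository: the toy corpus and the filename tfidf_vectorizer.pkl are made up for illustration, only word n-grams are shown (the part-of-speech n-grams are omitted), and pickle is just one of several ways to serialize a fitted scikit-learn object.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus; the real project works on normalized tweets.
train_texts = [
    "this is a harmless example tweet",
    "another harmless example about the weather",
    "an example tweet with different words",
]

# Word n-grams up to size 3, as in the approach described for this project.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
train_matrix = vectorizer.fit_transform(train_texts)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical filename; a temporary directory keeps the sketch self-contained.
    path = Path(tmp_dir) / "tfidf_vectorizer.pkl"

    # Serialize the fitted object so its vocabulary and idf weights are kept.
    with open(path, "wb") as fh:
        pickle.dump(vectorizer, fh)

    # Later (for example in another session), restore it instead of refitting.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# New text is mapped into the same feature space the models were trained on.
new_matrix = restored.transform(["a new example tweet"])
```

Without the saved vectorizer, a serialized model cannot score new tweets, because refitting on different data produces a different vocabulary and column order.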
