This project was created as part of the Metis data science bootcamp.
The goal was to build a classifier that separates hate speech from other offensive Twitter comments.
- The data .csv can be found at this repository. Place it within the data folder, or redirect the call to `pd.read_csv` in `main.py` to the appropriate location.
- Run `main.py` to generate non-deep learning models, according to the configuration settings defined at the top of the script.
- Run `keras-cnn.py` to train a convolutional neural network with 2 layers of convolution and pooling.
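
As a rough illustration of redirecting the `pd.read_csv` call, the sketch below wraps it in a small helper that takes the path as an argument. The helper name `load_data` and the file name `tweets.csv` are assumptions made for this example; neither is taken from `main.py`.

```python
from pathlib import Path

import pandas as pd

# Assumed default location. The real file name is not stated in this README,
# so "tweets.csv" is only a placeholder.
DEFAULT_DATA_PATH = Path("data") / "tweets.csv"


def load_data(path=DEFAULT_DATA_PATH):
    """Read the data .csv from `path`, failing early with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the data .csv at {path}; move the file or pass another path"
        )
    return pd.read_csv(path)
```

Passing a different path, for example `load_data("/some/other/place/file.csv")`, avoids editing the script each time the data moves.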
Repository structure:

- `bin`
  - `processing`: Contains methods to process and normalize text, including part-of-speech classifier
  - `modeling`: Contains methods to generate non-deep learning models
  - `main.py`: Entry point for non-deep learning methods. Configurable settings include:
    - `bool` to perform PCA or not before analysis, and `int` number of components
    - `bool` to convert 3-classes to binary classification problem
    - `bool` to force performing text normalization
    - `bool` to determine what type of model to generate
    - `str` description of run for metrics dataframe identification
    - `float` between 0 and 1 to define test size
    - `int` to serve as random seed for train-test splits for reproducibility
  - `keras-cnn.py`: Convolutional neural network to perform classification
- `data`: Dump of data, including processed dataframes, word dictionary to serve as reference, etc.
  - `models`: Generated serialized models
    - `v1-nonstratified`: Nonstratified sampling
    - `v2-stratified`: Stratified sampling
    - `v3-binary_stratified-class02`: Stratified sampling with class 1 and class 0 combined
    - `v4-pca_stratified-50components`: Stratified sampling with PCA to 50 components
    - `v5-pca_stratified-250components`: Stratified sampling with PCA to 250 components
  - `normalize-checkpoints`: Will be generated during text normalization steps, containing serialized dataframes before and after most time-consuming steps
- `docs`: Documentation associated with the project
  - `img`: Generated images of model results
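
The configurable settings listed for `main.py` could be grouped roughly as in the sketch below. It is illustrative only: the class name `RunConfig`, every field name, the default values, and the helper `to_binary` are assumptions made for this example and are not copied from the script. The binary mapping follows the description of the `v3` models (class 0 and class 1 combined).

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Hypothetical container mirroring the settings described for main.py."""

    use_pca: bool = False          # perform PCA before analysis
    n_components: int = 50         # number of PCA components (used only with use_pca)
    binary: bool = False           # convert the 3-class problem to a binary one
    force_normalize: bool = False  # force text normalization to run
    model_flag: bool = False       # selects the type of model (semantics not stated here)
    description: str = "baseline"  # run label for the metrics dataframe
    test_size: float = 0.2         # fraction of rows held out for testing
    random_seed: int = 42          # seed for reproducible train-test splits

    def __post_init__(self):
        # Reject values outside the ranges the settings list describes.
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be between 0 and 1")
        if self.use_pca and self.n_components < 1:
            raise ValueError("n_components must be positive when use_pca is set")


def to_binary(labels, merged=(0, 1)):
    """Map every label in `merged` to 0 and all remaining labels to 1."""
    return [0 if label in merged else 1 for label in labels]
```

A run resembling the `v5` models might then be described as `RunConfig(use_pca=True, n_components=250, description="pca-250")`.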
sklearn
pandas
numpy
nltk
keras
h5py
- Save the `TfidfVectorizer()` objects fitted to my data
- Design with class imbalance in mind from the beginning
- Much better repository structure and organization
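
For the first item in this list, one possible approach is to pickle the fitted vectorizer next to the serialized models so that new text can later be transformed with the same vocabulary and IDF weights. This is a minimal sketch using the standard `pickle` module; the function names and the file path are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer


def fit_and_save_vectorizer(texts, path):
    """Fit a TfidfVectorizer on `texts` and serialize it to `path`."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(texts)
    with open(path, "wb") as fh:
        pickle.dump(vectorizer, fh)
    return vectorizer


def load_vectorizer(path):
    """Load a previously fitted vectorizer from `path`."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

A reloaded vectorizer should be used with the same scikit-learn version that produced it, and pickle files should only be loaded from trusted sources.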