Project 3 for Metis bootcamp. Objective was NLP analysis of Tweets to classify as "hate speech," "offensive," or "neither." Primary approach was tf-idf vectorization of n-grams of text and part-of-speech up to size 3. Attempted convolutional neural network on tf-idf vectors, without improvement.

tetrahydrofuran/hate_speech_classifier

Hate Speech Classifier

This project was created as part of the Metis data science bootcamp.

The goal of this project was to build a classifier that separates hate speech from other offensive Twitter comments.

Repository Description

  • The data .csv can be found at this repository. Place it in the data folder, or point the pd.read_csv call in main.py to wherever you saved it.
  • Run main.py to generate the non-deep-learning models, according to the configuration settings defined at the top of the script.
  • Run keras-cnn.py to train a convolutional neural network with 2 layers of convolution and pooling.
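
The project description mentions tf-idf vectorization of n-grams up to size 3. As a minimal sketch of that idea (not the code in main.py), scikit-learn's TfidfVectorizer can be configured with ngram_range=(1, 3). The toy tweets below are invented for illustration, and only the text n-grams are shown, not the part-of-speech n-grams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for normalized tweets (assumption: the real input
# comes from the data .csv after text normalization).
tweets = [
    "this is a friendly tweet",
    "this is a rude tweet",
    "nothing to see here",
]

# ngram_range=(1, 3) produces unigrams, bigrams, and trigrams.
# Note: the default tokenizer drops single-character tokens such as "a".
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
features = vectorizer.fit_transform(tweets)

# vocabulary_ maps each n-gram to its column index in the feature matrix.
vocabulary = vectorizer.vocabulary_
print(features.shape[0], "tweets,", len(vocabulary), "n-gram features")
```

The resulting sparse matrix has one row per tweet and one column per n-gram, which is the kind of input a scikit-learn classifier accepts.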

Repository Structure

  • bin
    • processing: Contains methods to process and normalize text, including a part-of-speech classifier
    • modeling: Contains methods to generate non-deep-learning models
    • main.py: Entry point for non-deep-learning methods. Configurable settings include:
      • bool to perform PCA or not before analysis, and int number of components
      • bool to convert the 3-class problem to a binary classification problem
      • bool to force performing text normalization
      • bool to determine what type of model to generate
      • str description of run for metrics dataframe identification
      • float between 0 and 1 to define test size
      • int to serve as random seed for train-test splits for reproducibility
    • keras-cnn.py: Convolutional neural network to perform classification
  • data: Dump of data, including processed dataframes, word dictionary to serve as reference, etc.
    • models: Generated serialized models
      • v1-nonstratified: Nonstratified sampling
      • v2-stratified: Stratified sampling
      • v3-binary_stratified-class02: Stratified sampling with class 1 and class 0 combined
      • v4-pca_stratified-50components: Stratified sampling with PCA to 50 components
      • v5-pca_stratified-250components: Stratified sampling with PCA to 250 components
    • normalize-checkpoints: Generated during text normalization; contains serialized dataframes saved before and after the most time-consuming steps
  • docs: Documentation associated with the project
  • img: Generated images of model results
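
The main.py settings listed above (PCA toggle and component count, binary conversion, test size, random seed) could be wired together roughly as follows. This is a sketch under stated assumptions: RunConfig, its field names, and prepare are hypothetical and are not taken from main.py, and the data is random rather than real tf-idf features.

```python
from dataclasses import dataclass

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split


@dataclass
class RunConfig:
    # All field names are hypothetical; main.py may name these differently.
    use_pca: bool = False
    pca_components: int = 50
    binary: bool = False
    test_size: float = 0.25
    random_seed: int = 0
    description: str = "example run"


def prepare(X, y, cfg):
    """Stratified split, optionally merging classes and applying PCA."""
    if cfg.binary:
        # Merge class 1 into class 0, as described for the v3 models.
        y = np.where(y == 1, 0, y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg.test_size,
        random_state=cfg.random_seed, stratify=y,
    )
    if cfg.use_pca:
        pca = PCA(n_components=cfg.pca_components,
                  random_state=cfg.random_seed)
        X_train = pca.fit_transform(X_train)  # fit on training data only
        X_test = pca.transform(X_test)
    return X_train, X_test, y_train, y_test


# Random dense data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.random((120, 20))
y = np.repeat([0, 1, 2], 40)

cfg = RunConfig(use_pca=True, pca_components=5, binary=True)
X_train, X_test, y_train, y_test = prepare(X, y, cfg)
print(X_train.shape, X_test.shape)
```

Fitting PCA on the training split only keeps information from the test set out of the projection. Real tf-idf output is sparse, so it may need to be densified (or reduced with a sparse-friendly method) before a step like this.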

Dependencies

  • sklearn
  • pandas
  • numpy
  • nltk
  • keras
  • h5py

If I Could Do It Over Again

  • Save the TfidfVectorizer() objects fitted to my data
  • Design with class imbalance in mind from the beginning
  • Much better repository structure and organization
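
The first two points can be illustrated with a short sketch: serializing a fitted TfidfVectorizer so new text can be transformed into the same feature space later, and weighting classes to offset imbalance. The toy tweets, the labels, and the choice of LogisticRegression are assumptions for illustration, not the models this project trained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented, imbalanced toy data: three "neither" (2), one "offensive" (1).
tweets = [
    "have a great day everyone",
    "what a lovely morning",
    "see you at the game tonight",
    "that movie was garbage",
]
labels = [2, 2, 2, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)

# Serialize the fitted vectorizer. In practice this would be written to a
# file next to the serialized models; bytes are used here to stay in memory.
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

# class_weight="balanced" reweights classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, labels)

# The restored vectorizer maps new text onto the same columns as training.
new_X = restored.transform(["have a good day"])
prediction = model.predict(new_X)
```

Without the saved vectorizer, a serialized model cannot score new tweets, because the column order of the n-gram vocabulary is lost.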
