This repository contains scripts for cleaning and preprocessing text data. They remove emojis, strip unwanted sentences or patterns from the text, and filter out non-English entries.
- preprocessing.py: Functions for text cleaning and preprocessing.
- pre-processing.ipynb: A Jupyter Notebook demonstrating how to use the preprocessing functions on a dataset.
- data/: A directory containing sample data files for testing the preprocessing functions.
- emotion_feature.ipynb: A Jupyter Notebook demonstrating how to extract emotion features from text data.
- sentiment_feature.ipynb: A Jupyter Notebook demonstrating how to extract sentiment features from text data.
- vent_analysis.ipynb: A Jupyter Notebook demonstrating how to perform vent analysis on text data.
- Emoji Removal: Clean text data by removing emojis.
- Text Processing: Modify text by removing unwanted sections.
- Language Filtering: Detect and remove non-English text entries.
- Import the Functions: Import the required functions from preprocessing.py.
- Load Your Data: Read your dataset into a pandas DataFrame.
- Apply Preprocessing:
  - Remove emojis from your text data.
  - Process the text to remove specific sentences or patterns.
  - Filter out non-English text entries.
- Save the Cleaned Data: Export the processed DataFrame to a new file.
- Python 3.6 or higher
- pandas
- emoji
- langdetect
Install the required packages using:
pip install pandas emoji langdetect
This project is licensed under the MIT License.