This project applies Natural Language Processing (NLP) techniques to process and analyze textual data. The main objective is to extract meaningful patterns and insights from the text using a range of NLP methodologies. The key components of the analysis are:
- N-gram token analysis
- Logistic Regression classification
- Model performance evaluation with confusion matrices and classification reports
- Error analysis and discussion for model predictions
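The first two components above can be sketched as a single scikit-learn pipeline. This is a minimal, illustrative example, not the project's actual code: the toy corpus, labels, and category names below are made up, and the real notebook trains on the 20 Newsgroups data instead.

```python
# Sketch of the n-gram + Logistic Regression workflow (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder documents and labels standing in for the real dataset.
docs = [
    "the spacecraft launch was delayed",
    "orbital mechanics and rocket engines",
    "the goalie saved the final shot",
    "hockey playoffs start next week",
]
labels = ["space", "space", "sport", "sport"]

# ngram_range=(1, 2) makes the vectorizer emit both unigram and bigram tokens,
# which is one common way to set up the n-gram token analysis.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(docs, labels)
print(model.predict(["rocket launch next week"]))
```

The pipeline keeps vectorization and classification in one object, so the same `fit`/`predict` interface works on raw strings.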
The datasets used in this project are derived from the 20 Newsgroups dataset, which is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups.
Before running this project, ensure you have the following dependencies installed:
- Python 3.x
- NumPy
- scikit-learn
- NLTK
- Matplotlib
- Seaborn
To run the analysis, execute the Jupyter Notebook V2.3_Karan_patel_Task.ipynb. Follow the steps in the notebook, which cover data preprocessing, model training, prediction, and evaluation, and run all cells in order.
The results of the project include various metrics such as accuracy, precision, recall, and F1 score, as well as visualizations like confusion matrices and graphs illustrating model performance.
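The reported metrics can be computed with scikit-learn's standard evaluation helpers. The labels below are invented placeholders, used only to show how accuracy, the confusion matrix, and the per-class report are produced:

```python
# Illustrative evaluation step with made-up true/predicted labels.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = ["space", "space", "sport", "sport", "space", "sport"]
y_pred = ["space", "sport", "sport", "sport", "space", "sport"]

# Fraction of predictions that match the true labels.
print("Accuracy:", accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes, in the given order.
print(confusion_matrix(y_true, y_pred, labels=["space", "sport"]))

# Per-class precision, recall, and F1 score.
print(classification_report(y_true, y_pred))
```

The confusion matrix is also what the notebook's heatmap visualizations (via Matplotlib/Seaborn) are drawn from.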
- The 20 Newsgroups dataset source
- Contributors to the scikit-learn, NLTK, and related libraries.