This project implements a machine learning model that classifies SMS messages as either "spam" or "ham" (non-spam). It uses natural language processing (NLP) techniques to turn message text into features and a Multinomial Naive Bayes classifier to label each message.
The SMS spam dataset used in this project contains text messages labeled as "spam" or "ham". It was sourced from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset.
- Data Preprocessing:
  - Text cleaning: lowercasing and punctuation removal.
  - Tokenization and removal of stopwords.
  - Stemming using the Porter stemming algorithm.
- Feature Engineering:
  - TF-IDF vectorization: converting text data into numerical features.
- Model Selection and Training: Used a Multinomial Naive Bayes classifier for its suitability in text classification tasks. Trained the model on a labeled dataset.
- Model Evaluation: Evaluated performance using metrics such as precision, recall, and F1-score. Visualized results with a confusion matrix.
- Hyperparameter Tuning: Used GridSearchCV to optimize model parameters such as alpha for better accuracy.
- Deployment and Usage: Saved the trained model and TF-IDF vectorizer for future predictions on new SMS messages.
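
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy messages, the `clean_text` helper, the `alpha` grid, and the `f1_macro` scoring choice are all assumptions, and scikit-learn's built-in English stop-word list stands in for whichever list the notebook uses. Porter stemming is left out to keep the dependencies to scikit-learn alone.

```python
import string

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages standing in for spam.csv (not real project data).
texts = [
    "WIN a FREE prize now!!! Call 0800 to claim",
    "Free entry: text WIN to claim your cash prize",
    "URGENT! You have won a free holiday, call now",
    "Claim your free cash reward today, reply WIN",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home tonight",
    "Can you send me the notes from class?",
    "See you at the station around six",
]
labels = ["spam"] * 4 + ["ham"] * 4


def clean_text(message):
    """Lowercase, strip punctuation, and drop common English stop words."""
    lowered = message.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


# TF-IDF features feed directly into the Multinomial Naive Bayes classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_text)),
    ("nb", MultinomialNB()),
])

# Grid search over the smoothing parameter alpha (grid values are assumptions).
search = GridSearchCV(
    pipeline,
    {"nb__alpha": [0.1, 0.5, 1.0]},
    cv=2,
    scoring="f1_macro",
)
search.fit(texts, labels)

# For brevity this scores the training messages themselves; a real evaluation
# would use a held-out test split.
predictions = search.predict(texts)
print("best alpha:", search.best_params_["nb__alpha"])
print(classification_report(labels, predictions))
print(confusion_matrix(labels, predictions, labels=["ham", "spam"]))
```

Wrapping the vectorizer and classifier in a single `Pipeline` means the vectorizer is refit on the training portion of each cross-validation fold, so the held-out fold does not leak into the TF-IDF vocabulary during the search.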
- Z_Rock_ML_Internship_Project_1.ipynb: Jupyter notebook containing the entire project code and detailed explanations.
- spam.csv: Dataset used for training and testing the model.
- requirements.txt: List of Python dependencies required to run the project.
- spam_detection_model.pkl: Serialized Multinomial Naive Bayes classifier trained to classify SMS messages as 'spam' or 'ham'.
- tfidf_vectorizer.pkl: Serialized TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer used to transform text data into numerical features for SMS spam detection.
Clone the repository using the following commands:
- git clone https://github.com/milap573/Zrock-internship-2024-Project1.git
- cd Zrock-internship-2024-Project1
Install the required dependencies:
- pip install -r requirements.txt
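
Once the dependencies are installed, the saved model and vectorizer can be reloaded to classify new messages. The sketch below shows one possible round trip. It assumes the `.pkl` files were written with Python's standard `pickle` module (the notebook may use `joblib` instead), and it trains a throwaway model on hypothetical toy messages in a temporary directory so that it runs without the repository's files. The `classify` helper is illustrative and not part of the project.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the real artifacts are trained on spam.csv.
texts = [
    "Win a free prize now, call to claim",
    "Free cash reward, reply WIN today",
    "Are we still meeting for lunch?",
    "I will call you when I get home",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # File names mirror the ones shipped in the repository.
    model_path = Path(tmp) / "spam_detection_model.pkl"
    vectorizer_path = Path(tmp) / "tfidf_vectorizer.pkl"
    model_path.write_bytes(pickle.dumps(model))
    vectorizer_path.write_bytes(pickle.dumps(vectorizer))

    # Reload both artifacts, as a separate prediction script would.
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_vectorizer = pickle.loads(vectorizer_path.read_bytes())


def classify(message):
    """Return the predicted label ('spam' or 'ham') for one message."""
    features = loaded_vectorizer.transform([message])
    return loaded_model.predict(features)[0]


print(classify("Free prize, call now"))
```

New messages should go through the same cleaning steps that were applied before training, otherwise the vectorizer sees tokens it was never fit on. Pickle files can execute arbitrary code when loaded, so only unpickle files from a source you trust.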