Spam Detection Using Machine Learning 🚀

Overview

This project implements a spam detection system using machine learning techniques, specifically the Naive Bayes classifier. It analyzes text messages and classifies them as "spam" or "ham" (non-spam). The dataset used is a CSV file containing labeled messages.

Installation 🛠️

To set up the project, ensure you have Python installed, then install the required libraries using:

pip install pandas scikit-learn nltk

Download NLTK Resources

Run the following lines in Python to download necessary NLTK resources:

import nltk
nltk.download('stopwords')
nltk.download('punkt')

Usage 💻

Place the spam.csv dataset in the project directory.
Run the script:
```
python main.py
```
The model will train and evaluate itself, printing the accuracy and classification report.

Logic and Mathematics 📊

How It Works

Data Loading: The dataset is loaded using pandas.
Data Preprocessing: Text messages are cleaned and prepared for analysis:
- Lowercasing and splitting into words.
- Removing stopwords and non-alphanumeric characters.
Feature Extraction: The CountVectorizer converts the processed text into a matrix of token counts, making it suitable for machine learning algorithms.
Model Training: The Multinomial Naive Bayes model is trained on the processed data.
Evaluation: The model's performance is assessed using accuracy and a detailed classification report.

Naive Bayes Classifier

Assumption: Naive Bayes assumes that the presence of a particular feature in a class is independent of the presence of any other feature. This simplification is why it’s termed "naive."
Mathematics: The classifier uses Bayes’ theorem to calculate the probability of a message being spam or ham based on its features.

Import Statements 📦

Here's a breakdown of the important imports in the script:

pandas: For data manipulation and analysis.
sklearn.model_selection.train_test_split: To split the dataset into training and testing sets, ensuring model validation.
sklearn.feature_extraction.text.CountVectorizer: To convert text data into numerical form (bag of words model).
sklearn.naive_bayes.MultinomialNB: The classifier used for the spam detection task.
sklearn.metrics: For measuring the performance of the model.
nltk: A library for natural language processing.
nltk.corpus.stopwords: Provides a list of common words to exclude from analysis.

Contributions 🤝

Main Contributor: Prayush Adhikari - Developed the spam detection model and organized the code.
Collaborators: Contributions from the community are welcome! Feel free to suggest improvements, report bugs, or add features.

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

Thank you for checking out this project! If you have any questions or suggestions, feel free to reach out. Happy coding! 😊

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Spam Detection Using Machine Learning 🚀

Overview

Table of Contents

Installation 🛠️

Download NLTK Resources

Usage 💻

Logic and Mathematics 📊

How It Works

Naive Bayes Classifier

Import Statements 📦

Contributions 🤝

License 📄

Files

README.md

Latest commit

History

README.md

File metadata and controls

Spam Detection Using Machine Learning 🚀

Overview

Table of Contents

Installation 🛠️

Download NLTK Resources

Usage 💻

Logic and Mathematics 📊

How It Works

Naive Bayes Classifier

Import Statements 📦

Contributions 🤝

License 📄