This project focuses on network security by leveraging machine learning and data science techniques to detect phishing attacks. The project involves data ingestion, validation, transformation, and model training to build a robust phishing detection system.
DATASET: The dataset used in this project is the Phishing Websites Dataset from Kaggle. The dataset contains 30 features and 11,055 samples.
- Project Overview
- Directory Structure
- Prerequisites
- Installation
- Usage
- Configuration
- Data Ingestion
- Data Validation
- Data Transformation
- Model Training
- API Endpoints
- Syncing Artifacts to S3
- Docker
- GitHub Actions Workflow
- Results
NetworkSecurity/
├── artifacts/
├── data_schema/
├── network_security/
│ ├── components/
│ ├── entity/
│ ├── exception/
│ ├── logging/
│ ├── utils/
├── saved_models/
├── templates/
├── venv/
├── .gitignore
├── app.py
├── main.py
├── requirements.txt
├── Dockerfile
├── .github/
│ └── workflows/
│ └── network-security-workflow.yml
└── README.md
- Python 3.7 or higher
- MongoDB
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/Zoro-chi/NetworkSecurity.git
  cd NetworkSecurity
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure you have the necessary environment variables set up. You can use a .env file for this purpose (a sketch of loading them follows the Usage steps below).
- Run the main script:

  ```bash
  python main.py
  ```

- Run the FastAPI application:

  ```bash
  python app.py
  ```
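As a reference for the environment-variable step above, here is a minimal sketch of loading a .env file with python-dotenv; the variable name `MONGODB_URL` is an assumption and may differ from the names the project actually uses.

```python
# Minimal sketch of reading configuration from a .env file with python-dotenv.
# The variable name MONGODB_URL is an assumption; match it to the project's code.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment

MONGODB_URL = os.getenv("MONGODB_URL")
if MONGODB_URL is None:
    raise RuntimeError("MONGODB_URL is not set; add it to your .env file")
```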
Configuration settings are defined in the `network_security/constants/training_pipeline.py` file. You can adjust the settings for data ingestion, validation, transformation, and model training as needed.
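For illustration only, the file might define constants along these lines; apart from `S3_TRAINING_BUCKET_NAME`, which is referenced by the S3 sync code further below, the names and values here are hypothetical placeholders, not the repository's actual settings.

```python
# Illustrative sketch of constants in network_security/constants/training_pipeline.py.
# Except for S3_TRAINING_BUCKET_NAME (used by the S3 sync code below), all names
# and values are hypothetical placeholders.
TARGET_COLUMN: str = "Result"                       # hypothetical label column
PIPELINE_NAME: str = "NetworkSecurity"              # hypothetical pipeline name
ARTIFACT_DIR: str = "artifacts"                     # matches the artifacts/ directory
DATA_INGESTION_TRAIN_TEST_SPLIT_RATIO: float = 0.2  # hypothetical split ratio
S3_TRAINING_BUCKET_NAME: str = "your-s3-bucket"     # placeholder bucket name
```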
The data ingestion component reads data from a MongoDB collection and exports it as a DataFrame. The data is then split into training and testing sets.
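A minimal sketch of that flow, assuming pymongo, pandas, and scikit-learn; the database and collection names, split ratio, and output paths are placeholders rather than the component's actual values.

```python
# Sketch: export a MongoDB collection as a DataFrame and split it into
# train/test sets. Database/collection names and paths are placeholders.
import os

import pandas as pd
from pymongo import MongoClient
from sklearn.model_selection import train_test_split

client = MongoClient(os.getenv("MONGODB_URL"))
collection = client["network_security"]["phishing_data"]  # placeholder names

df = pd.DataFrame(list(collection.find()))
df = df.drop(columns=["_id"], errors="ignore")  # drop MongoDB's internal id

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
os.makedirs("artifacts", exist_ok=True)
train_df.to_csv("artifacts/train.csv", index=False)
test_df.to_csv("artifacts/test.csv", index=False)
```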
The data validation component checks the number of columns and detects data drift between the training and testing datasets. It generates a drift report that is saved in the artifacts directory.
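A common way to implement such a drift check is a per-column two-sample Kolmogorov–Smirnov test; the sketch below assumes that approach, with a 0.05 threshold and a YAML report path that are assumptions rather than the project's exact choices.

```python
# Sketch: per-column drift check between train and test sets using the
# two-sample KS test. Threshold and report path are assumptions.
import pandas as pd
import yaml
from scipy.stats import ks_2samp

def detect_drift(train_df: pd.DataFrame, test_df: pd.DataFrame,
                 threshold: float = 0.05) -> dict:
    report = {}
    for column in train_df.columns:
        result = ks_2samp(train_df[column], test_df[column])
        report[column] = {
            "p_value": float(result.pvalue),
            "drift_detected": bool(result.pvalue < threshold),
        }
    return report

train_df = pd.read_csv("artifacts/train.csv")
test_df = pd.read_csv("artifacts/test.csv")
assert len(train_df.columns) == len(test_df.columns)  # column-count check
with open("artifacts/drift_report.yaml", "w") as f:
    yaml.dump(detect_drift(train_df, test_df), f)
```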
The data transformation component handles missing values using KNN imputation and transforms the data into a suitable format for model training.
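A minimal sketch using scikit-learn's KNNImputer; the number of neighbors, the `Result` label column name, and the output paths are assumptions.

```python
# Sketch: impute missing values with KNN and save arrays for model training.
# n_neighbors, the "Result" label name, and output paths are assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

train_df = pd.read_csv("artifacts/train.csv")
test_df = pd.read_csv("artifacts/test.csv")

X_train, y_train = train_df.drop(columns=["Result"]), train_df["Result"]
X_test, y_test = test_df.drop(columns=["Result"]), test_df["Result"]

# Fit the imputer on the training data only, then apply it to both splits.
imputer = KNNImputer(n_neighbors=3)
X_train_t = imputer.fit_transform(X_train)
X_test_t = imputer.transform(X_test)

np.save("artifacts/train_features.npy", X_train_t)
np.save("artifacts/train_labels.npy", y_train.to_numpy())
np.save("artifacts/test_features.npy", X_test_t)
np.save("artifacts/test_labels.npy", y_test.to_numpy())
```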
The model training component trains multiple machine learning models using the transformed data, evaluates them, and selects the best model based on performance metrics. The trained model is saved in the artifacts directory, and the training process is tracked using MLflow.
- Model Training: Multiple machine learning models are trained using the transformed data. The models include:
  - Random Forest
  - Decision Tree
  - Gradient Boosting
  - Logistic Regression
  - AdaBoost
- Model Evaluation: The trained models are evaluated using metrics such as F1 score, recall, and precision. The best model is selected based on these metrics (see the sketch after this list).
- Model Tracking with MLflow: The training process and metrics are tracked using MLflow and DagsHub. This includes logging the metrics and saving the trained model.
- Model Saving: The best model is saved to the specified file path for future use.
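A condensed sketch of the selection loop described above, assuming scikit-learn estimators and MLflow's scikit-learn flavor; hyperparameter tuning and the DagsHub configuration are omitted, and the file paths follow the earlier sketches rather than the project's actual layout.

```python
# Sketch: fit the candidate models, keep the one with the best F1 score,
# and log metrics plus the model to MLflow. Hyperparameter search is omitted.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

X_train = np.load("artifacts/train_features.npy")
y_train = np.load("artifacts/train_labels.npy")
X_test = np.load("artifacts/test_features.npy")
y_test = np.load("artifacts/test_labels.npy")

models = {
    "RandomForest": RandomForestClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "GradientBoosting": GradientBoostingClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
}

best_name, best_model, best_f1 = None, None, -1.0
for name, model in models.items():
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    if score > best_f1:
        best_name, best_model, best_f1 = name, model, score

with mlflow.start_run():
    preds = best_model.predict(X_test)
    mlflow.log_param("best_model", best_name)
    mlflow.log_metric("f1_score", f1_score(y_test, preds))
    mlflow.log_metric("precision", precision_score(y_test, preds))
    mlflow.log_metric("recall", recall_score(y_test, preds))
    mlflow.sklearn.log_model(best_model, "model")
```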
The project includes a FastAPI application with the following endpoints:
- `GET /`: Redirects to the API documentation.
- `GET /train`: Runs the training pipeline.
- `POST /predict`: Accepts a CSV file and returns predictions.

Example requests:

- Train the model:

  ```bash
  curl -X GET "http://localhost:8000/train"
  ```

- Make predictions using the model:

  ```bash
  curl -X POST "http://localhost:8000/predict" -F "file=@path_to_your_csv_file"
  ```
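For orientation, a condensed sketch of how routes like these can be wired up in FastAPI; the real app.py may structure things differently, and the saved-model path and the `/train` behaviour shown here are placeholders.

```python
# Condensed sketch of the API surface described above. The saved-model path
# and the /train body are placeholders; see app.py for the real implementation.
import joblib
import pandas as pd
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import RedirectResponse

app = FastAPI()

@app.get("/")
async def index():
    # Redirect visitors to the auto-generated Swagger documentation.
    return RedirectResponse(url="/docs")

@app.get("/train")
async def train():
    # Placeholder: the real endpoint kicks off the full training pipeline.
    return {"status": "training pipeline started"}

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    df = pd.read_csv(file.file)
    model = joblib.load("saved_models/model.pkl")  # placeholder path
    df["prediction"] = model.predict(df)
    return df.to_dict(orient="records")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```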
The project includes functionality to sync local artifacts and saved models to an S3 bucket:
```python
def sync_artifacts_dir_to_s3(self):
    try:
        # Mirror the local artifact directory to s3://<bucket>/artifacts/<timestamp>
        aws_bucket_url = f"s3://{training_pipeline.S3_TRAINING_BUCKET_NAME}/artifacts/{self.training_pipeline_config.timestamp}"
        S3Sync().sync_folder_to_s3(
            folder=self.training_pipeline_config.artifact_dir,
            aws_bucket_url=aws_bucket_url,
        )
    except Exception as e:
        raise NetworkSecurityException(e, sys)

def sync_saved_model_dir_to_s3(self):
    try:
        # Mirror the saved-model directory to s3://<bucket>/final_model/<timestamp>
        aws_bucket_url = f"s3://{training_pipeline.S3_TRAINING_BUCKET_NAME}/final_model/{self.training_pipeline_config.timestamp}"
        S3Sync().sync_folder_to_s3(
            folder=self.training_pipeline_config.model_dir,
            aws_bucket_url=aws_bucket_url,
        )
    except Exception as e:
        raise NetworkSecurityException(e, sys)
```
```python
import os

class S3Sync:
    # Thin wrapper around the AWS CLI; requires `aws` to be installed and configured.
    def sync_folder_to_s3(self, folder: str, aws_bucket_url: str):
        command = f"aws s3 sync {folder} {aws_bucket_url}"
        os.system(command)

    def sync_folder_from_s3(self, aws_bucket_url: str, folder: str):
        command = f"aws s3 sync {aws_bucket_url} {folder}"
        os.system(command)
```
The project uses Docker to containerize the application. The Docker image is built and pushed to Amazon ECR (Elastic Container Registry).
- Build the Docker image:

  ```bash
  docker build -t network-security:latest .
  ```

- Run the Docker container:

  ```bash
  docker run -p 8000:8000 network-security:latest
  ```
The project includes a GitHub Actions workflow for continuous integration, continuous delivery, and continuous deployment. The workflow builds the Docker image, pushes it to Amazon ECR, and syncs the artifacts to S3; after that, it deploys the FastAPI application to an EC2 instance. The workflow is defined in the .github/workflows/network-security-workflow.yml file and is triggered on push events to the main branch. You can customize the workflow to suit your requirements.
The results of the data ingestion, validation, transformation, and model training processes are saved in the artifacts directory. You can review the generated reports and trained models.