This project focuses on network security by leveraging machine learning and data science techniques to detect phishing attacks. The project involves data ingestion, validation, transformation, and model training to build a robust phishing detection system.
DATASET: The dataset used in this project is the Phishing Websites Dataset from Kaggle. The dataset contains 30 features and 11,055 samples.
- Project Overview
- Directory Structure
- Prerequisites
- Installation
- Usage
- Configuration
- Data Ingestion
- Data Validation
- Data Transformation
- Model Training
- API Endpoints
- Syncing Artifacts to S3
- Docker
- GitHub Actions Workflow
- Results
NetworkSecurity/
├── artifacts/
├── data_schema/
├── network_security/
│ ├── components/
│ ├── entity/
│ ├── exception/
│ ├── logging/
│ ├── utils/
├── saved_models/
├── templates/
├── venv/
├── .gitignore
├── app.py
├── main.py
├── requirements.txt
├── Dockerfile
├── .github/
│ └── workflows/
│ └── network-security-workflow.yml
└── README.md
- Python 3.7 or higher
- MongoDB
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/Zoro-chi/NetworkSecurity.git
  cd NetworkSecurity
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure you have the necessary environment variables set up. You can use a .env file for this purpose (a sketch of loading them follows the Usage steps below).
- Run the main script:

  ```bash
  python main.py
  ```

- Run the FastAPI application:

  ```bash
  python app.py
  ```
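As a reference for the environment-variable step above, here is a minimal sketch of loading a .env file with python-dotenv; the variable name `MONGODB_URL` is an assumption and may differ from the names the project actually uses.

```python
# Minimal sketch of reading configuration from a .env file with python-dotenv.
# The variable name MONGODB_URL is an assumption; match it to the project's code.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment

MONGODB_URL = os.getenv("MONGODB_URL")
if MONGODB_URL is None:
    raise RuntimeError("MONGODB_URL is not set; add it to your .env file")
```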
Configuration settings are defined in the `network_security/constants/training_pipeline.py` file. You can adjust the settings for data ingestion, validation, transformation, and model training as needed.
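For illustration only, the file might define constants along these lines; apart from `S3_TRAINING_BUCKET_NAME`, which is referenced by the S3 sync code further below, the names and values here are hypothetical placeholders, not the repository's actual settings.

```python
# Illustrative sketch of constants in network_security/constants/training_pipeline.py.
# Except for S3_TRAINING_BUCKET_NAME (used by the S3 sync code below), all names
# and values are hypothetical placeholders.
TARGET_COLUMN: str = "Result"                       # hypothetical label column
PIPELINE_NAME: str = "NetworkSecurity"              # hypothetical pipeline name
ARTIFACT_DIR: str = "artifacts"                     # matches the artifacts/ directory
DATA_INGESTION_TRAIN_TEST_SPLIT_RATIO: float = 0.2  # hypothetical split ratio
S3_TRAINING_BUCKET_NAME: str = "your-s3-bucket"     # placeholder bucket name
```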
The data ingestion component reads data from a MongoDB collection and exports it as a DataFrame. The data is then split into training and testing sets.
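A minimal sketch of that flow, assuming pymongo, pandas, and scikit-learn; the database and collection names, split ratio, and output paths are placeholders rather than the component's actual values.

```python
# Sketch: export a MongoDB collection as a DataFrame and split it into
# train/test sets. Database/collection names and paths are placeholders.
import os

import pandas as pd
from pymongo import MongoClient
from sklearn.model_selection import train_test_split

client = MongoClient(os.getenv("MONGODB_URL"))
collection = client["network_security"]["phishing_data"]  # placeholder names

df = pd.DataFrame(list(collection.find()))
df = df.drop(columns=["_id"], errors="ignore")  # drop MongoDB's internal id

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
os.makedirs("artifacts", exist_ok=True)
train_df.to_csv("artifacts/train.csv", index=False)
test_df.to_csv("artifacts/test.csv", index=False)
```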
The data validation component checks the number of columns and detects data drift between the training and testing datasets. It generates a drift report that is saved in the artifacts directory.
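A common way to implement such a drift check is a per-column two-sample Kolmogorov–Smirnov test; the sketch below assumes that approach, with a 0.05 threshold and a YAML report path that are assumptions rather than the project's exact choices.

```python
# Sketch: per-column drift check between train and test sets using the
# two-sample KS test. Threshold and report path are assumptions.
import pandas as pd
import yaml
from scipy.stats import ks_2samp

def detect_drift(train_df: pd.DataFrame, test_df: pd.DataFrame,
                 threshold: float = 0.05) -> dict:
    report = {}
    for column in train_df.columns:
        result = ks_2samp(train_df[column], test_df[column])
        report[column] = {
            "p_value": float(result.pvalue),
            "drift_detected": bool(result.pvalue < threshold),
        }
    return report

train_df = pd.read_csv("artifacts/train.csv")
test_df = pd.read_csv("artifacts/test.csv")
assert len(train_df.columns) == len(test_df.columns)  # column-count check
with open("artifacts/drift_report.yaml", "w") as f:
    yaml.dump(detect_drift(train_df, test_df), f)
```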
The data transformation component handles missing values using KNN imputation and transforms the data into a suitable format for model training.
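A minimal sketch using scikit-learn's KNNImputer; the number of neighbors, the `Result` label column name, and the output paths are assumptions.

```python
# Sketch: impute missing values with KNN and save arrays for model training.
# n_neighbors, the "Result" label name, and output paths are assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

train_df = pd.read_csv("artifacts/train.csv")
test_df = pd.read_csv("artifacts/test.csv")

X_train, y_train = train_df.drop(columns=["Result"]), train_df["Result"]
X_test, y_test = test_df.drop(columns=["Result"]), test_df["Result"]

# Fit the imputer on the training data only, then apply it to both splits.
imputer = KNNImputer(n_neighbors=3)
X_train_t = imputer.fit_transform(X_train)
X_test_t = imputer.transform(X_test)

np.save("artifacts/train_features.npy", X_train_t)
np.save("artifacts/train_labels.npy", y_train.to_numpy())
np.save("artifacts/test_features.npy", X_test_t)
np.save("artifacts/test_labels.npy", y_test.to_numpy())
```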
The model training component trains multiple machine learning models using the transformed data, evaluates them, and selects the best model based on performance metrics. The trained model is saved in the artifacts directory, and the training process is tracked using MLflow.
- Model Training: Multiple machine learning models are trained using the transformed data. The models include:
  - Random Forest
  - Decision Tree
  - Gradient Boosting
  - Logistic Regression
  - AdaBoost
- Model Evaluation: The trained models are evaluated using metrics such as F1 score, recall, and precision. The best model is selected based on these metrics (see the sketch after this list).
- Model Tracking with MLflow: The training process and metrics are tracked using MLflow and DagsHub. This includes logging the metrics and saving the trained model.
- Model Saving: The best model is saved to the specified file path for future use.
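A condensed sketch of the selection loop described above, assuming scikit-learn estimators and MLflow's scikit-learn flavor; hyperparameter tuning and the DagsHub configuration are omitted, and the file paths follow the earlier sketches rather than the project's actual layout.

```python
# Sketch: fit the candidate models, keep the one with the best F1 score,
# and log metrics plus the model to MLflow. Hyperparameter search is omitted.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

X_train = np.load("artifacts/train_features.npy")
y_train = np.load("artifacts/train_labels.npy")
X_test = np.load("artifacts/test_features.npy")
y_test = np.load("artifacts/test_labels.npy")

models = {
    "RandomForest": RandomForestClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "GradientBoosting": GradientBoostingClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
}

best_name, best_model, best_f1 = None, None, -1.0
for name, model in models.items():
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    if score > best_f1:
        best_name, best_model, best_f1 = name, model, score

with mlflow.start_run():
    preds = best_model.predict(X_test)
    mlflow.log_param("best_model", best_name)
    mlflow.log_metric("f1_score", f1_score(y_test, preds))
    mlflow.log_metric("precision", precision_score(y_test, preds))
    mlflow.log_metric("recall", recall_score(y_test, preds))
    mlflow.sklearn.log_model(best_model, "model")
```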
The project includes a FastAPI application with the following endpoints:
- `GET /`: Redirects to the API documentation.
- `GET /train`: Runs the training pipeline.
- `POST /predict`: Accepts a CSV file and returns predictions.

Example requests:

- Train the model:

  ```bash
  curl -X GET "http://localhost:8000/train"
  ```

- Make predictions using the model:

  ```bash
  curl -X POST "http://localhost:8000/predict" -F "file=@path_to_your_csv_file"
  ```
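For orientation, a condensed sketch of how routes like these can be wired up in FastAPI; the real app.py may structure things differently, and the saved-model path and the `/train` behaviour shown here are placeholders.

```python
# Condensed sketch of the API surface described above. The saved-model path
# and the /train body are placeholders; see app.py for the real implementation.
import joblib
import pandas as pd
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import RedirectResponse

app = FastAPI()

@app.get("/")
async def index():
    # Redirect visitors to the auto-generated Swagger documentation.
    return RedirectResponse(url="/docs")

@app.get("/train")
async def train():
    # Placeholder: the real endpoint kicks off the full training pipeline.
    return {"status": "training pipeline started"}

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    df = pd.read_csv(file.file)
    model = joblib.load("saved_models/model.pkl")  # placeholder path
    df["prediction"] = model.predict(df)
    return df.to_dict(orient="records")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```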
The project includes functionality to sync local artifacts and saved models to an S3 bucket:
```python
def sync_artifacts_dir_to_s3(self):
    try:
        # Mirror the local artifact directory to s3://<bucket>/artifacts/<timestamp>
        aws_bucket_url = f"s3://{training_pipeline.S3_TRAINING_BUCKET_NAME}/artifacts/{self.training_pipeline_config.timestamp}"
        S3Sync().sync_folder_to_s3(
            folder=self.training_pipeline_config.artifact_dir,
            aws_bucket_url=aws_bucket_url,
        )
    except Exception as e:
        raise NetworkSecurityException(e, sys)

def sync_saved_model_dir_to_s3(self):
    try:
        # Mirror the saved-model directory to s3://<bucket>/final_model/<timestamp>
        aws_bucket_url = f"s3://{training_pipeline.S3_TRAINING_BUCKET_NAME}/final_model/{self.training_pipeline_config.timestamp}"
        S3Sync().sync_folder_to_s3(
            folder=self.training_pipeline_config.model_dir,
            aws_bucket_url=aws_bucket_url,
        )
    except Exception as e:
        raise NetworkSecurityException(e, sys)
```
```python
import os

class S3Sync:
    # Thin wrapper around the AWS CLI; requires `aws` to be installed and configured.
    def sync_folder_to_s3(self, folder: str, aws_bucket_url: str):
        command = f"aws s3 sync {folder} {aws_bucket_url}"
        os.system(command)

    def sync_folder_from_s3(self, aws_bucket_url: str, folder: str):
        command = f"aws s3 sync {aws_bucket_url} {folder}"
        os.system(command)
```
The project uses Docker to containerize the application. The Docker image is built and pushed to Amazon ECR (Elastic Container Registry).
- Build the Docker image:

  ```bash
  docker build -t network-security:latest .
  ```

- Run the Docker container:

  ```bash
  docker run -p 8000:8000 network-security:latest
  ```
The project includes a GitHub Actions workflow for continuous integration, continuous delivery, and continuous deployment. The workflow builds the Docker image, pushes it to Amazon ECR, and syncs the artifacts to S3; after that, it deploys the FastAPI application to an EC2 instance. The workflow is defined in the .github/workflows/network-security-workflow.yml file and is triggered on push events to the main branch. You can customize the workflow to suit your requirements.
The results of the data ingestion, validation, transformation, and model training processes are saved in the artifacts directory. You can review the generated reports and trained models.