Malware, meaning programs designed to damage, disrupt, or gain unauthorized access to computer systems, poses a significant and constant threat to cybersecurity. Malware creation techniques evolve quickly, which makes traditional detection approaches increasingly insufficient on their own. Artificial Intelligence (AI) offers a promising way to automate and improve malware detection through machine learning and deep learning models.
This project explores the application of AI models to classify and detect malware, offering a modern approach to bolster cybersecurity defenses.
- Understand the Dataset: Analyze a technical dataset containing features of executable files to identify patterns and relevant insights.
- Build and Evaluate AI Models: Design and evaluate multiple models, including:
  - Decision Trees
  - Random Forests
  - An unsupervised clustering model (KMeans)
- Performance Comparison: Compare the performance of models using evaluation metrics such as:
  - Accuracy
  - Confusion Matrix
  - Precision, Recall, and F1-Score
- Deep Learning Integration: Train and test a simple deep learning model to assess its effectiveness in malware detection.
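
To make the evaluation metrics listed above concrete, here is a minimal sketch of how accuracy, precision, recall, and F1-score follow from the four cells of a binary confusion matrix. The function name `classification_metrics` and the example counts are hypothetical; they are not taken from this project's code or results. Malware is assumed to be the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative, and
    true-negative counts, with "malware" assumed to be the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the files flagged as malware, how many really were malware.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the actual malware files, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can look strong when one class dominates the dataset, which is why precision and recall are reported alongside it.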
- Dataset Analysis:
  - Preprocessed the dataset by scaling the features to ensure uniformity and improve model performance.
  - Conducted feature selection to enhance clustering accuracy for unsupervised models.
- Model Development:
  - Implemented supervised learning models (Decision Tree and Random Forest) to classify malware vs. legitimate files.
  - Built an unsupervised clustering model (KMeans with 2 clusters) to group data points without prior labels.
  - Trained a deep learning model using TensorFlow/Keras with a fully connected neural network architecture.
- Model Evaluation:
  - Evaluated models on a test set using metrics such as accuracy, confusion matrix, precision, recall, and F1-score.
  - Performed 5-fold cross-validation to assess robustness and generalization.
- Performance Comparison:
  - Compared supervised models to determine the most effective approach for malware detection.
  - Assessed the clustering effectiveness of KMeans using Adjusted Rand Index (ARI).
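
The supervised part of the workflow above could look roughly like the following sketch. It is not the repository's `main.py`: the data is a synthetic stand-in generated with `make_classification`, and the split ratio, `random_state`, and `n_estimators` values are assumptions chosen for illustration. The scores it prints will therefore not match the results reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the executable-file features (assumption: the real
# script would load MalwareDataset.csv; its column names are not shown here).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in models.items():
    # Scaling sits inside the pipeline so that each cross-validation fold
    # fits its own scaler on that fold's training portion only.
    pipeline = make_pipeline(StandardScaler(), clf)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # 5-fold cross-validation on the full dataset, scored by accuracy.
    cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

    results[name] = {
        "test_accuracy": accuracy_score(y_test, y_pred),
        "cv_mean_accuracy": cv_scores.mean(),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name)
    print(classification_report(y_test, y_pred, digits=4))
```

Comparing `test_accuracy` with `cv_mean_accuracy` for each model is one way to check that a single train/test split is not giving an overly optimistic picture.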
- Decision Tree:
  - Accuracy: 98.79%
  - Cross-Validation Average Accuracy: 98.65%
- Random Forest:
  - Accuracy: 99.19%
  - Cross-Validation Average Accuracy: 99.04%
- KMeans Clustering:
  - Adjusted Rand Index (ARI): 0.4811
- Deep Learning Model:
  - Achieved high accuracy with potential for further optimization.
- Best Performing Model: Random Forest achieved the best overall performance, with the highest test accuracy (99.19%) and the highest cross-validation average accuracy (99.04%).
- Consistency: Both Decision Tree and Random Forest showed consistent results between simple data splits and cross-validation, indicating robustness.
- KMeans Effectiveness: KMeans clustering improved with feature selection, but with an ARI of 0.4811 it was less effective than the supervised models at separating malware from legitimate files.
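
As a rough illustration of how KMeans can be scored against known labels, the sketch below clusters scaled features into two groups and computes the Adjusted Rand Index, once on all features and once on a selected subset. The data is synthetic, and `SelectKBest` with `f_classif` is an assumed selection method (the feature selection used in this project is not specified here); note that this selector uses the labels, so only the clustering step itself is unsupervised. The helper name `kmeans_ari` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 informative features out of 20.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)


def kmeans_ari(features, labels):
    """Cluster into 2 groups and score agreement with the true labels."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    cluster_ids = km.fit_predict(features)
    # ARI is invariant to how clusters are numbered, so no mapping from
    # cluster id to "malware"/"legitimate" is needed before scoring.
    return adjusted_rand_score(labels, cluster_ids)


# ARI using every feature.
ari_all = kmeans_ari(X_scaled, y)

# ARI after keeping the 4 features with the highest ANOVA F-score.
X_selected = SelectKBest(f_classif, k=4).fit_transform(X_scaled, y)
ari_selected = kmeans_ari(X_selected, y)

print(f"ARI (all features):      {ari_all:.4f}")
print(f"ARI (selected features): {ari_selected:.4f}")
```

An ARI close to 1 means the clusters line up with the true classes, while a value near 0 means the grouping is about as good as random assignment.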
- Clone this repository:

      git clone https://github.com/yourusername/malware-detection-ai.git
      cd malware-detection-ai

- Ensure the dataset is available: The dataset (MalwareDataset.csv) is included in the repository. Ensure it is in the same directory as the script (or in a data folder if specified in the code).
- Install required Python packages:

      pip install -r requirements.txt

- Run the models:

      python main.py

- Optimize the deep learning model architecture for better performance.
- Experiment with other clustering algorithms to improve unsupervised model results.
- Incorporate additional features to enhance model accuracy and robustness.