A machine learning and deep learning project for detecting malware using Decision Trees, Random Forests, and KMeans clustering, with model evaluation and comparison metrics.

LMeriem/Malware-Detection-Using-AI

Malware Detection Using AI and Machine Learning

Problem Statement

Malware, meaning programs designed to damage, disrupt, or gain unauthorized access to computer systems, poses a significant and constant threat to cybersecurity. The rapid evolution of malware creation techniques has rendered traditional detection approaches insufficient. Artificial Intelligence (AI) offers a promising alternative: machine learning and deep learning models can automate and improve malware detection.

This project explores the application of AI models to classify and detect malware, offering a modern approach to bolster cybersecurity defenses.

Objectives

  1. Understand the Dataset: Analyze a technical dataset containing features of executable files to identify patterns and relevant insights.
  2. Build and Evaluate AI Models: Design and evaluate multiple models, both supervised classifiers and an unsupervised baseline, including:
    • Decision Trees
    • Random Forests
    • An unsupervised clustering model (KMeans)
  3. Performance Comparison: Compare the performance of models using evaluation metrics such as:
    • Accuracy
    • Confusion Matrix
    • Precision, Recall, and F1-Score
  4. Deep Learning Integration: Train and test a simple deep learning model to assess its effectiveness in malware detection.
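
The evaluation metrics listed in objective 3 all derive from the four cells of a binary confusion matrix. As a minimal sketch (the `binary_metrics` helper and the toy labels are hypothetical and not taken from this repository's code), they can be computed like this, treating label `1` as malware:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(
        accuracy=(tp + tn) / len(pairs),
        precision=precision,
        recall=recall,
        f1=f1,
    )


# Toy labels (illustrative only): 4 malware samples, 6 legitimate samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

In a malware setting, recall reflects how many malicious files are caught, while precision reflects how many flagged files are actually malicious, so accuracy alone can hide an imbalance between the two.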

Solution Approach

  1. Dataset Analysis:

    • Preprocessed the dataset by scaling the features to ensure uniformity and improve model performance.
    • Conducted feature selection to enhance clustering accuracy for unsupervised models.
  2. Model Development:

    • Implemented supervised learning models (Decision Tree and Random Forest) to classify malware vs. legitimate files.
    • Built an unsupervised clustering model (KMeans with 2 clusters) to group data points without prior labels.
    • Trained a deep learning model using TensorFlow/Keras with a fully connected neural network architecture.
  3. Model Evaluation:

    • Evaluated models on a test set using metrics like accuracy, confusion matrix, precision, recall, and F1-score.
    • Performed 5-fold cross-validation to ensure robustness and generalization.
  4. Performance Comparison:

    • Compared supervised models to determine the most effective approach for malware detection.
    • Assessed the clustering effectiveness of KMeans using Adjusted Rand Index (ARI).
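
The supervised part of this approach can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the repository's `main.py`: the data is synthetic (generated with `make_classification` as a stand-in for `MalwareDataset.csv`), the hyperparameters are library defaults, and placing the scaler inside a pipeline is a choice made here so that each cross-validation fold is scaled using only its own training portion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic features stand in for the real executable-file features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    class_sep=1.5, random_state=42,
)

# Hold out a stratified test set for the single-split accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in models.items():
    # Scaling lives in the pipeline, so it is re-fit on each training fold.
    pipeline = make_pipeline(StandardScaler(), clf)
    pipeline.fit(X_train, y_train)
    test_acc = pipeline.score(X_test, y_test)
    # 5-fold cross-validation over the full dataset, averaged.
    cv_acc = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()
    results[name] = (test_acc, cv_acc)
    print(f"{name}: test accuracy={test_acc:.4f}, 5-fold CV mean={cv_acc:.4f}")
```

Tree-based models are largely insensitive to feature scaling, so the scaler mainly matters for the KMeans and neural network steps. It is kept here only to mirror the preprocessing described above.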

Results

Supervised Models

  • Decision Tree:

    • Accuracy: 98.79%
    • Cross-Validation Average Accuracy: 98.65%
  • Random Forest:

    • Accuracy: 99.19%
    • Cross-Validation Average Accuracy: 99.04%

Unsupervised Model

  • KMeans Clustering:
    • Adjusted Rand Index (ARI): 0.4811
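
For context on this score: ARI is 1.0 when the clusters match the true labels exactly and close to 0 when the assignment is no better than chance, so 0.4811 indicates partial agreement. The sketch below shows how such a score can be obtained with scikit-learn. It assumes synthetic, well-separated data from `make_blobs` rather than the project's dataset, so the value it produces says nothing about the result reported above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Assumption: two synthetic, well-separated groups stand in for
# malware vs. legitimate files.
X, y_true = make_blobs(
    n_samples=400,
    centers=[[-5.0, -5.0], [5.0, 5.0]],
    cluster_std=1.0,
    random_state=0,
)

# KMeans is distance-based, so features are scaled first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# The true labels are used only for scoring, never for fitting.
ari = adjusted_rand_score(y_true, cluster_ids)

# ARI ignores how clusters are numbered: swapping ids 0 and 1 gives the same score.
ari_flipped = adjusted_rand_score(y_true, 1 - cluster_ids)

print(f"ARI: {ari:.4f} (with cluster ids swapped: {ari_flipped:.4f})")
```

Because cluster ids are arbitrary, ARI is a more suitable score here than plain accuracy, which would depend on which cluster happens to be numbered 0.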

Deep Learning Model

  • Achieved high accuracy with potential for further optimization.

Conclusion

  1. Best Performing Model: Random Forest achieved the best overall performance, with the highest test accuracy (99.19%) and the highest cross-validation average (99.04%).
  2. Consistency: Decision Tree and Random Forest each scored within 0.15 percentage points between the single train/test split and 5-fold cross-validation, indicating robustness.
  3. KMeans Effectiveness: Feature selection improved KMeans clustering, but with an ARI of 0.4811 it separated malware from legitimate files far less reliably than the supervised models.

How to Use

  1. Clone this repository:

    git clone https://github.com/LMeriem/Malware-Detection-Using-AI.git
    cd Malware-Detection-Using-AI
  2. Ensure the dataset is available: The dataset (MalwareDataset.csv) is included in the repository. Ensure it is in the same directory as the script (or in a data folder if specified in the code).

  3. Install required Python packages:

    pip install -r requirements.txt
  4. Run the models:

    python main.py

Future Improvements

  • Optimize the deep learning model architecture for better performance.
  • Experiment with other clustering algorithms to improve unsupervised model results.
  • Incorporate additional features to enhance model accuracy and robustness.
