Merge pull request #29 from nv-morpheus/branch-23.01
[RELEASE] morpheus-experimental v23.01
dagardner-nv authored Jan 27, 2023
2 parents 9e591b3 + 94c9e51 commit 06e6e79
Showing 21 changed files with 37,273 additions and 0 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,6 @@
# morpheus-experimental 23.01.00 (27 Jan 2023)

## 🚀 New Features

- Created Docker build process based on the process ([#28](https://github.com/nv-morpheus/morpheus-experimental/pull/28)) [@shawn-davis](https://github.com/shawn-davis)
- merge asset_clust ([#26](https://github.com/nv-morpheus/morpheus-experimental/pull/26)) [@avinashvem](https://github.com/avinashvem)
25 changes: 25 additions & 0 deletions README.md
@@ -22,6 +22,28 @@ git clone https://github.com/nv-morpheus/morpheus-experimental $MORPHEUS_EXPERIMENTAL_ROOT
cd $MORPHEUS_EXPERIMENTAL_ROOT
```

### Build Morpheus Experimental Container

To assist in building a Morpheus Experimental container, several scripts have been provided in the `./docker` directory. To build the "release" container, run the following:

```bash
./docker/build_container.sh
```

This will create an image named `nvcr.io/nvidia/morpheus/mor_exp:${MORPHEUS_EXPERIMENTAL_VERSION}-runtime` where `$MORPHEUS_EXPERIMENTAL_VERSION` is replaced by the output of `git describe --tags --abbrev=0`.

To run the built "release" container, use the following:

```bash
./docker/run_container.sh
```

You can specify a different Docker image and tag by passing the script the `DOCKER_IMAGE_NAME` and `DOCKER_IMAGE_TAG` variables, respectively. For example, to run version `v22.09.00a` use the following:

```bash
DOCKER_IMAGE_TAG="v22.09.00a-runtime" ./docker/run_container.sh
```

### Prototype Specific Requirements
To get started with a specific prototype, additional requirements must be installed into your environment. Each prototype directory contains its own `requirements.txt` file.

@@ -47,6 +69,9 @@ This technique syntactically groups system log messages and finds group representatives
## [Detection of Anomalous authentication using Relational Graph Neural Network (RGCN)](/anomalous-auth-detection)
This model shows an application of a graph neural network for anomalous authentication detection in an Azure AD sign-on heterogeneous graph. The Azure AD sign-on dataset includes four types of nodes used for modeling: authentication, user, device, and service application. A relational graph neural network (RGCN) is used to identify anomalous authentications from the Azure AD sign-on input.

## [Asset Clustering using Windows Event Logs](/asset-clustering)
This model is a clustering algorithm that assigns each host present in the dataset to a cluster, based on features aggregated and derived from the Windows Event Logs of that particular host.

# Repo Structure
Each prototype has its own directory that contains everything belonging to the specific prototype. Directories can include the following subfolders and documentation:

70 changes: 70 additions & 0 deletions asset-clustering/README.md
@@ -0,0 +1,70 @@
## Asset Clustering using Windows Event Logs

## Use Case
Cluster assets into various groups based on Windows Event Logs data.

### Version
1.0

### Model Overview
The model is a clustering algorithm that assigns each host present in the dataset to a cluster, based on features aggregated and derived from the Windows Event Logs of that particular host.

### Model Architecture
There are two clustering algorithms available:
- DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise
- KMeans

Input features to the model are derived from the Windows Event Logs by aggregating various facets of the logon events, such as the type of logon event and the number of usernames associated with a host.
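As a rough illustration of this aggregation step, the sketch below builds one feature row per host with pandas. The column names (`LogHost`, `EventID`, `UserName`) and the specific features are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Toy Windows Event Log records; real logs carry many more fields.
events = pd.DataFrame({
    "LogHost":  ["host1", "host1", "host2", "host2", "host2"],
    "EventID":  [4624, 4625, 4624, 4624, 4672],
    "UserName": ["alice", "alice", "bob", "carol", "bob"],
})

# One feature row per host: total events, distinct usernames, and a
# count of failed-logon events (EventID 4625).
features = (
    events.groupby("LogHost")
          .agg(total_events=("EventID", "size"),
               distinct_users=("UserName", "nunique"),
               failed_logons=("EventID", lambda s: (s == 4625).sum()))
          .reset_index()
)
print(features)
```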

### Requirements
An environment based on __[RAPIDS](https://rapids.ai/pip.html)__ is required to run the scripts and the Python notebook provided. On top of that, the additional requirements can be installed into the environment via the supplementary requirements file provided:

```bash
pip install -r requirements.txt
```

### Training

#### Training data
In this project we use the publicly available __[**Unified Host and Network Data Set**](https://csr.lanl.gov/data/2017/)__ [1] from the Advanced Research in Cyber Systems group at Los Alamos National Laboratory (LANL) to demonstrate the various aspects involved in clustering the assets in a given network.
The LANL dataset consists of netflow and Windows Event Log (wls) files for 90 days. This project focuses solely on the Windows Event Log files, which follow the naming convention wls_day-01.bz2, wls_day-02.bz2, ..., wls_day-90.bz2. The training data uses the first ten days, i.e. wls_day-01.bz2, ..., wls_day-10.bz2. Note that, for purposes of scale and quick reproducibility, we experiment with only the first ten days of data; more data can easily be used by changing the input file suffix. Refer to experiment.ipynb for more details. These ten days of data are pre-processed and the features are aggregated. The resulting dataset contains 14044 hosts and is saved in `datasets/host_agg_data_day-01_day-10.csv`.
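Assuming the wls files are bzip2-compressed JSON-lines records (one event per line), a single day can be loaded with a sketch like the following; the project's actual preprocessing may handle the schema differently.

```python
import pandas as pd

# Load one day of Windows Event Log records (one JSON object per line).
day01 = pd.read_json("wls_day-01.bz2", lines=True, compression="bz2")
print(day01.shape)
```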


#### Training parameters
The following parameters are used for training the DBSCAN algorithm:
- $\epsilon=0.0005$
- *Manhattan distance* as the metric i.e. Minkowski distance with $p=1$.
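The project itself targets a RAPIDS environment, but an illustrative scikit-learn equivalent of these parameter choices might look like this (random stand-in data, not the real feature matrix):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
X = rng.random((100, 8))  # stand-in for the per-host feature matrix

# eps = 0.0005 with Manhattan distance (Minkowski distance with p = 1),
# matching the training parameters listed above.
labels = DBSCAN(eps=0.0005, metric="manhattan").fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found = {n_clusters}")
```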


#### Model accuracy
- Clusters found: 9 (+1 cluster for the noisy samples)
- Silhouette score: 0.975
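A score like the one above can be computed from the fitted labels; continuing the sketch from the previous section (noise points labeled `-1` by DBSCAN are typically excluded first):

```python
from sklearn.metrics import silhouette_score

mask = labels != -1  # drop DBSCAN noise points
if len(set(labels[mask])) > 1:  # silhouette needs at least two clusters
    score = silhouette_score(X[mask], labels[mask], metric="manhattan")
    print(f"Silhouette score = {score:.3f}")
```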

#### Training script

To train the model, run the following script from the working directory:
```bash
cd ${MORPHEUS_EXPERIMENTAL_ROOT}/asset-clustering/training-tuning-inference
# Run training script and save models
python train.py --model dbscan
```
This saves the trained model files under the `../models` directory. The inference script can then load these models for future inference runs.
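The save/load round trip might resemble the following sketch; `joblib` and the `dbscan.pkl` filename are assumptions for illustration, since the actual scripts may serialize the models differently.

```python
import os
import joblib
from sklearn.cluster import DBSCAN

os.makedirs("models", exist_ok=True)
model = DBSCAN(eps=0.0005, metric="manhattan")  # train.py would fit this first
joblib.dump(model, "models/dbscan.pkl")         # saving side (hypothetical path)
loaded = joblib.load("models/dbscan.pkl")       # loading side in the inference script
```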

### Inference Input

```bash
python inference.py --model dbscan
```
When the above command is executed, DBSCAN clustering is performed on the Windows Event Log data from days 11 to 15. This data is pre-processed and aggregated into a validation dataset, found at `datasets/host_agg_data_day-11_day-15.csv`, which contains a total of 12606 hosts. Inference with the KMeans clustering model can be run similarly:
```bash
python inference.py --model kmeans
```


### Inference Output
Clustering of the 12606 hosts is performed and the size of each cluster is printed to stdout.
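A minimal sketch of producing such per-cluster counts from predicted labels (toy labels here, not the script's actual output code):

```python
import pandas as pd

labels = [0, 0, 1, 2, 1, 0, -1]  # toy cluster assignments; -1 = DBSCAN noise
counts = pd.Series(labels, name="cluster").value_counts().sort_index()
print(counts)
```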

### Ethical considerations
N/A

### References
[1] M. Turcotte, A. Kent and C. Hash, “Unified Host and Network Data Set”, in Data Science for Cyber-Security, November 2018, pp. 1-22.
