Commit

Update readme
tamnvhust1 committed Mar 12, 2020
1 parent e6b6cab commit f9d37f6
Showing 3 changed files with 53 additions and 34 deletions.
37 changes: 15 additions & 22 deletions README.md

```diff
@@ -1,35 +1,28 @@
 # MalNet - Detect malware using Convolutional Neural Networks
 By Tam Nguyen Van
 # Introduction
-The repository contains all source code for training and evaluating malware detection with **MalNet**.
+Malware detection using Convolutional Neural Networks.
 # Requirements
-1. Python 3.6
-2. Keras (2.0.8)
-3. Tensorflow (1.2.0)
+1. Python >=3
+2. Keras (>=2.0.8)
+3. Tensorflow (>=1.15)
 # Installation
 1. Clone the repository to your local machine.
 `git clone https://github.com/tamnguyenvan/malnet`
-2. Install all requirements (using **virtualenv** is recommended). Note: it only works on Python 3.6 (maybe 3.5, but I haven't tested).
-   - `pip install tensorflow==1.2.0` (CPU only) or `pip install tensorflow-gpu==1.2.0` (GPU)
+2. Install all requirements (using **virtualenv** is recommended).
+   - `pip install tensorflow==1.15` (CPU only) or `pip install tensorflow-gpu==1.15` (GPU)
    - `pip install -r requirements.txt`
-3. Make a data directory. For example, make a directory called **data** in the project root. The data can be found [here](https://drive.google.com/drive/folders/1zUXAb7JnwOiBtfBheQI6LDFu4EG_XZ-_). After downloading, extract it and put all data files into the data directory.
-# Training
-If you have completed the installation step correctly, you are almost done. We just need to run `python train.py` to train with the default parameters.
-Some options:
-- `--model` Use a specific model. For now, only `malnet`, `et` and `rt` are available.
-- `--batch-size` Set the batch size to fit your memory. Default is 32.
-- `--epochs` Number of epochs to train. Default is 5.
+3. Download the Ember dataset [here](https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2). You can go to their [home page](https://github.com/endgameinc/ember) for more details. Extract it wherever you like.
+4. Extract features by running `python create_data.py --data_dir PATH_TO_DATA_DIR`. See `create_data.py` for the details. After that, some `.dat` files should be created in the same directory.
+# Training model
+Almost done: just run `python train.py --data_dir PATH_TO_DATA_DIR` for training. Show the help to see additional options.
+
 Please see the source code for more details.
-# Evaluate
-The training script already includes an evaluation step, but we also provide another script for evaluating independently. After training, the model is saved in **result/checkpoint**. We can evaluate this, or use my pretrained model, which can be found [here](https://drive.google.com/file/d/1zD99s0L9l1eVPmSo9o6c3WgkZrpa2e2o). The directory must contain 3 files:
-- `model.h5` Model weights.
-- `model.json` Model graph.
-- `scaler.pkl` Pickled binary file containing the preprocessing scaler object.
+# Evaluate model
+In case you want to regenerate the validation result, run `python eval.py --data_dir PATH_TO_DATA_DIR --model_path MODEL_PATH --scaler_path SCALER_PATH`. Again, show the help to see the options.
 
-In order to evaluate, just run `python src/eval.py`.
-# Deployment
-In this section, we will try to use the model to predict real-world samples. We provide a script for this in **src/test.py**, so all we need to do is run `python src/test.py --input [path/to/sample/file]`.
+# Deploy
+Let's have some fun. We will try the pretrained model on real PE files. Download a PE file, then run `python test.py --input_file INPUT_FILE --model_path MODEL_PATH`.
 
 # Contact
 Tam Nguyen Van ([email protected])
 Any questions can be left as issues in this repository. You are all welcome.
```
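The removed Evaluate section above names the three files the model directory must contain (`model.h5`, `model.json`, `scaler.pkl`). As an illustrative aside, a stdlib-only check of such a directory could look like the sketch below; `missing_model_files` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path

# Files the README's Evaluate section says the model directory must contain.
REQUIRED_FILES = ("model.h5", "model.json", "scaler.pkl")

def missing_model_files(model_dir):
    """Return the required files absent from model_dir (empty list means OK)."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

For example, `missing_model_files("result/checkpoint")` would return an empty list once training has saved all three files there.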
12 changes: 6 additions & 6 deletions malnet/create_data.py

```diff
@@ -1,8 +1,8 @@
-"""
-Author: tamnv
-Description: This script will extract raw data from EMBER
-json files, then write into 4 files: X_train.dat, X_test.dat,
-y_train.dat and y_test.dat
-"""
+"""This script helps to extract raw data from EMBER json files.
+It will store the features into 4 files: X_train.dat, X_test.dat, y_train.dat
+and y_test.dat. You can limit the number of samples using the option `scale`.
+Usage: python create_data.py --data_dir DATA_DIR --scale SCALE
+"""
 
 import argparse
@@ -14,7 +14,7 @@
 def parse_arguments(argv):
     """Parse command line arguments."""
     parser = argparse.ArgumentParser()
-    parser.add_argument('--data-dir', dest='data_dir', type=str, default='data',
+    parser.add_argument('--data_dir', dest='data_dir', type=str, default='data',
                         help='Path to data directory.')
     parser.add_argument('--scale', dest='scale', type=float, default=1.,
                         help='Scale of training/test dataset.')
```
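The hunk above renames the flag from `--data-dir` to `--data_dir`; because `dest='data_dir'` was already set, the parsed attribute name stays the same and only the command line changes. A self-contained sketch of that parser, assumed to mirror the diff and trimmed to the two arguments shown:

```python
import argparse

def parse_arguments(argv):
    """Parse command line arguments (mirrors the diffed parse_arguments)."""
    parser = argparse.ArgumentParser()
    # dest='data_dir' keeps args.data_dir working regardless of the flag spelling.
    parser.add_argument('--data_dir', dest='data_dir', type=str, default='data',
                        help='Path to data directory.')
    parser.add_argument('--scale', dest='scale', type=float, default=1.,
                        help='Scale of training/test dataset.')
    return parser.parse_args(argv)

args = parse_arguments(['--data_dir', '/tmp/ember', '--scale', '0.5'])
# args.data_dir == '/tmp/ember' and args.scale == 0.5
```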
38 changes: 32 additions & 6 deletions requirements.txt

```diff
@@ -1,8 +1,34 @@
-keras==2.0.8
+absl-py==0.9.0
+astor==0.8.1
+cycler==0.10.0
+gast==0.2.2
+google-pasta==0.1.8
+grpcio==1.27.2
+h5py==2.10.0
+joblib==0.14.1
+Keras==2.0.8
+Keras-Applications==1.0.8
+Keras-Preprocessing==1.1.0
+kiwisolver==1.1.0
+lief==0.10.1
+Markdown==3.2.1
 matplotlib==2.2.2
-scipy
-scikit-learn
-pandas
+numpy==1.18.1
+opt-einsum==3.2.0
+pandas==1.0.1
+pkg-resources==0.0.0
+protobuf==3.11.3
+pyparsing==2.4.6
+python-dateutil==2.8.1
+pytz==2019.3
+PyYAML==5.3
+scikit-learn==0.22.2.post1
+scipy==1.4.1
+six==1.14.0
+tensorboard==1.15.0
+tensorflow-estimator==1.15.1
+tensorflow-gpu==1.15.0
+termcolor==1.1.0
 tqdm==4.23.4
-h5py
-lief
+Werkzeug==1.0.0
+wrapt==1.12.1
```
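The updated requirements file pins every new dependency with `==`, whereas the removed lines (`scipy`, `scikit-learn`, `pandas`, `h5py`, `lief`) were unpinned. As an illustrative aside, not part of the repository, splitting such pins takes only stdlib string handling; real requirements syntax (extras, markers, version ranges) is richer than this sketch:

```python
def parse_pins(lines):
    """Split 'pkg==version' pins into (name, version) pairs.

    Unpinned lines get version None; blanks and comments are skipped.
    """
    pins = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, _, version = line.partition('==')
        pins.append((name, version or None))
    return pins

print(parse_pins(["Keras==2.0.8", "tensorflow-gpu==1.15.0", "scipy"]))
# [('Keras', '2.0.8'), ('tensorflow-gpu', '1.15.0'), ('scipy', None)]
```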
