Commit e643e5f

Moved pipeline code into its own folder
1 parent 85c9773 commit e643e5f

61 files changed, +1285 -1261 lines changed

.gitignore

+4-4
@@ -1,12 +1,12 @@
 # Models
-models/*
+**/models/*
 
 # Datasets
-java_files/*
-text_arff/*
+**/java_files/*
+**/text_arff/*
 
 # Weka files
-weka_files/*
+**/weka_files/*
 
 # Keep readme files
 !**/README.md
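
The change from `models/*` to `**/models/*` matters because a `.gitignore` pattern with a slash in the middle is matched relative to the directory holding the `.gitignore` file, while a leading `**/` matches in every directory. The sketch below is a deliberately simplified model of just these two pattern shapes. The `is_ignored` helper and the example paths are hypothetical, and this is not how Git itself implements matching. It illustrates why the old rules would presumably stop applying once the folders sit under a subdirectory such as `pipeline/`.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, pattern: str) -> bool:
    """Simplified model of two .gitignore pattern shapes (not a full matcher).

    - "dir/*"     is anchored at the repository root.
    - "**/dir/*"  may match starting at any directory depth.
    A match on a leading portion of the path counts, because ignoring a
    directory entry also ignores everything beneath it.
    """
    parts = path.split("/")
    if pattern.startswith("**/"):
        tail = pattern[3:].split("/")
        # Try to align the pattern at every possible depth.
        starts = range(len(parts) - len(tail) + 1)
    else:
        tail = pattern.split("/")
        # Anchored: only the start of the path is considered.
        starts = range(1) if len(parts) >= len(tail) else range(0)
    return any(
        all(fnmatchcase(p, t) for p, t in zip(parts[s:s + len(tail)], tail))
        for s in starts
    )


if __name__ == "__main__":
    moved = "pipeline/models/checkpoint/model.bin"  # hypothetical path
    for pattern in ("models/*", "**/models/*"):
        print(pattern, "->", is_ignored(moved, pattern))
```

For real repositories, `git check-ignore -v <path>` reports which rule (if any) ignores a given path.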
Binary files not shown (11 files).

README.md

+25-4
@@ -1,4 +1,4 @@
-# Obsucated code2vec: Improving Generalisation by Hiding Information
+# Obsucated code2vec: Reducing Model Bias by Hiding Information
 
 ![Overall project view](img/overall.png)
 
@@ -11,18 +11,37 @@ All of the model-related code (`common.py`, `model.py`, `PathContextReader.py`)
 All models/datasets are on the paper google drive folder
 https://drive.google.com/drive/u/1/folders/1CXgSXKf292BTlryASui2kBvYvJSvFnWN
 
+## Requirements
+- Java 8+
+- Python 3
+
+## Usage - Obfuscator
+These steps should all be run from within the `java-obfuscator/` directory.
+1. Locate a folder of `.java` files (e.g., from the [code2seq](https://github.com/tech-srl/code2seq) repository)
+2. Alter the input and output directories in `obfs-script.sh`, as well as the number of threads of your machine. If you're running this on a particularly large folder (e.g., millions of files) then you may need to increase the `NUM_PARTITIONS` to 3 or 4, otherwise memory issues can occur, grinding the obfuscator to a near halt.
+3. Run `obfs-script.sh` i.e. `$ source obfs-script.sh`
+
+This will result in a new obfuscated folder of `.java` files, that can be used to train a new obfuscated code2vec model (or any model that performs learning from source code for that matter).
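
The memory advice in step 2 of the obfuscator usage can be pictured as follows. How `obfs-script.sh` actually uses `NUM_PARTITIONS` is not shown in this commit, so this is only a sketch, assuming the input files are split into that many batches and processed one batch at a time. The `partition` helper and the file names are hypothetical.

```python
from typing import List


def partition(files: List[str], num_partitions: int) -> List[List[str]]:
    """Split files into num_partitions contiguous batches of near-equal size."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    base, extra = divmod(len(files), num_partitions)
    batches, start = [], 0
    for i in range(num_partitions):
        # The first `extra` batches take one additional file each.
        size = base + (1 if i < extra else 0)
        batches.append(files[start:start + size])
        start += size
    return batches


if __name__ == "__main__":
    files = [f"File{i}.java" for i in range(10)]  # placeholder file names
    for n in (1, 3, 4):
        sizes = [len(batch) for batch in partition(files, n)]
        print(f"NUM_PARTITIONS={n}: batch sizes {sizes}")
```

Under this assumption, a larger partition count means fewer files held in memory at once, at the cost of more passes.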
+
 ## Usage - Dataset Pipeline
 
 ![Dataset Pipeline View](img/pipeline.png)
 
+These steps will convert a dataset of `.java` files into a numerical form (`.arff` by default), that can then be used with any standard WEKA classifier.
+
+These steps should all be run from within the `pipeline/` directory of this repository.
 To run the dataset pipeline and create class-level embeddings for a dataset of Java files:
+1. `cd pipeline`
+2. `pip install -r requirements.txt`
 1. Download a `.java` dataset (from the datasets supplied or your own) and put in the `java_files/` directory
 2. Download a code2vec model checkpoint and put the checkpoint folder in the `models/` directory
-3. Change the paths and definitions in `model_defs.py` and number of models in `create_datasets.sh` to match your setup
-4. Run `create_datasets.sh`. This will loop through each model and create class-level embeddings for the supplied datasets. The resulting datasets will be in `.arff` format in the `weka_files/` folder
+3. Change the paths and definitions in `model_defs.py` and number of models in `scripts/create_datasets.sh` to match your setup
+4. Run `create_datasets.sh` (`source scripts/create_datasets.sh`). This will loop through each model and create class-level embeddings for the supplied datasets. The resulting datasets will be in `.arff` format in the `weka_files/` folder.
+
+You can now perform class-level classification on the dataset using any off-the-shelf classifier.
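
For readers unfamiliar with the format, here is a minimal sketch of what an `.arff` file holding class-level embeddings could look like. The `to_arff` helper, the `dim_i` attribute names, and the sample labels are assumptions for illustration; the pipeline's actual writer and attribute naming are not shown in this commit. Labels are assumed to contain no spaces or commas, which would otherwise need quoting.

```python
from typing import Sequence, Tuple


def to_arff(relation: str, rows: Sequence[Tuple[Sequence[float], str]]) -> str:
    """Render (embedding, label) rows as ARFF text with a nominal class attribute."""
    if not rows:
        raise ValueError("need at least one row")
    dim = len(rows[0][0])
    labels = sorted({label for _, label in rows})
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per embedding dimension.
    lines += [f"@attribute dim_{i} numeric" for i in range(dim)]
    # The class attribute lists every label seen in the data.
    lines.append("@attribute class {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for vector, label in rows:
        if len(vector) != dim:
            raise ValueError("all embeddings must have the same length")
        lines.append(",".join(f"{v:g}" for v in vector) + "," + label)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two toy 3-dimensional "embeddings" with made-up labels.
    sample = [([0.5, -1.0, 0.25], "author_a"), ([0.0, 2.0, 1.5], "author_b")]
    print(to_arff("toy_embeddings", sample))
```

A file in this shape can be opened in WEKA, with the last attribute selected as the class.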
 
 ### Config
-By default the pipeline will use the full range of values for each parameter, which creates a huge number of resulting `.arff` datasets (>1000). To reduce the number of these, remove (or comment out) some of the items in the arrays in `reduction_methods.py` and `selection_methods.py` (at the end of the file). Our experiments showed that the `SelectAll` selection method and `NoReduction` reduction method performed best in most cases so you may want to keep only these.
+By default the pipeline will use the full range of values for each parameter, which creates a huge number of resulting `.arff` datasets (>1000). To reduce the number of these, remove (or comment out) some of the items in the arrays in `reduction_methods.py` and `selection_methods.py` (at the end of the file). Our experiments showed that the `SelectAll` selection method and `NoReduction` reduction method performed best in most cases so you may want to just keep these.
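
The effect of trimming those arrays can be seen with a small count. This sketch assumes the pipeline emits one dataset per combination of parameter values, which the README implies but this commit does not show. Only `SelectAll` and `NoReduction` are names from the README; every other entry and the `count_datasets` helper are placeholders.

```python
from itertools import product

# Placeholder parameter lists; only SelectAll and NoReduction come from the README.
models = ["model_a", "model_b"]
selection_methods = ["SelectAll", "PlaceholderSelectionB", "PlaceholderSelectionC"]
reduction_methods = ["NoReduction", "PlaceholderReductionB", "PlaceholderReductionC"]


def count_datasets(*parameter_lists):
    """Count the combinations produced by crossing every parameter list."""
    return sum(1 for _ in product(*parameter_lists))


full = count_datasets(models, selection_methods, reduction_methods)
trimmed = count_datasets(models, ["SelectAll"], ["NoReduction"])

print(f"full grid: {full} datasets")        # 2 * 3 * 3 = 18
print(f"trimmed grid: {trimmed} datasets")  # 2 * 1 * 1 = 2
```

Because the count is a product, removing entries from any one list shrinks the total multiplicatively.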
 
 ## Datasets
 
@@ -52,6 +71,8 @@ The `.java` files are all [available for download](https://drive.google.com/driv
 
 13 categories, 1062 instances
 
+This dataset was collected using the [github-scraper](https://github.com/basedrhys/github-scraper) python tool, which makes it easy to download specific types of files from github repos (`.java` files in this case).
+
 [Google Drive Link](https://drive.google.com/open?id=1IC0Nxeew73p9yvfhKcKH-6mxW8nHGyfn)
 
 [Embedding Visualisation](http://projector.tensorflow.org/?config=https://gist.githubusercontent.com/basedrhys/36fcd8653f2d759a8f1b03e56502a58e/raw/7d2ddef1c219d4fad7a49cc2c978d1ff4e25e5f1/author_config.json)

java-tool.jar

-15.5 MB
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,23 +1,23 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<projectDescription>
-<name>JavaExtractor</name>
-<comment></comment>
-<projects>
-</projects>
-<buildSpec>
-<buildCommand>
-<name>org.eclipse.jdt.core.javabuilder</name>
-<arguments>
-</arguments>
-</buildCommand>
-<buildCommand>
-<name>org.eclipse.m2e.core.maven2Builder</name>
-<arguments>
-</arguments>
-</buildCommand>
-</buildSpec>
-<natures>
-<nature>org.eclipse.jdt.core.javanature</nature>
-<nature>org.eclipse.m2e.core.maven2Nature</nature>
-</natures>
-</projectDescription>
+<?xml version="1.0" encoding="UTF-8"?>
+<projectDescription>
+<name>JavaExtractor</name>
+<comment></comment>
+<projects>
+</projects>
+<buildSpec>
+<buildCommand>
+<name>org.eclipse.jdt.core.javabuilder</name>
+<arguments>
+</arguments>
+</buildCommand>
+<buildCommand>
+<name>org.eclipse.m2e.core.maven2Builder</name>
+<arguments>
+</arguments>
+</buildCommand>
+</buildSpec>
+<natures>
+<nature>org.eclipse.jdt.core.javanature</nature>
+<nature>org.eclipse.m2e.core.maven2Nature</nature>
+</natures>
+</projectDescription>
@@ -1,3 +1,3 @@
-eclipse.preferences.version=1
-encoding//src/main/java=UTF-8
-encoding/<project>=UTF-8
+eclipse.preferences.version=1
+encoding//src/main/java=UTF-8
+encoding/<project>=UTF-8
@@ -1,16 +1,16 @@
-eclipse.preferences.version=1
-org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled
-org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.8
-org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve
-org.eclipse.jdt.core.compiler.compliance=1.8
-org.eclipse.jdt.core.compiler.debug.lineNumber=generate
-org.eclipse.jdt.core.compiler.debug.localVariable=generate
-org.eclipse.jdt.core.compiler.debug.sourceFile=generate
-org.eclipse.jdt.core.compiler.problem.assertIdentifier=error
-org.eclipse.jdt.core.compiler.problem.enablePreviewFeatures=disabled
-org.eclipse.jdt.core.compiler.problem.enumIdentifier=error
-org.eclipse.jdt.core.compiler.problem.forbiddenReference=warning
-org.eclipse.jdt.core.compiler.problem.reportPreviewFeatures=ignore
-org.eclipse.jdt.core.compiler.processAnnotations=disabled
-org.eclipse.jdt.core.compiler.release=disabled
-org.eclipse.jdt.core.compiler.source=1.8
+eclipse.preferences.version=1
+org.eclipse.jdt.core.compiler.codegen.inlineJsrBytecode=enabled
+org.eclipse.jdt.core.compiler.codegen.targetPlatform=1.8
+org.eclipse.jdt.core.compiler.codegen.unusedLocal=preserve
+org.eclipse.jdt.core.compiler.compliance=1.8
+org.eclipse.jdt.core.compiler.debug.lineNumber=generate
+org.eclipse.jdt.core.compiler.debug.localVariable=generate
+org.eclipse.jdt.core.compiler.debug.sourceFile=generate
+org.eclipse.jdt.core.compiler.problem.assertIdentifier=error
+org.eclipse.jdt.core.compiler.problem.enablePreviewFeatures=disabled
+org.eclipse.jdt.core.compiler.problem.enumIdentifier=error
+org.eclipse.jdt.core.compiler.problem.forbiddenReference=warning
+org.eclipse.jdt.core.compiler.problem.reportPreviewFeatures=ignore
+org.eclipse.jdt.core.compiler.processAnnotations=disabled
+org.eclipse.jdt.core.compiler.release=disabled
+org.eclipse.jdt.core.compiler.source=1.8
@@ -1,4 +1,4 @@
-activeProfiles=
-eclipse.preferences.version=1
-resolveWorkspaceProjects=true
-version=1
+activeProfiles=
+eclipse.preferences.version=1
+resolveWorkspaceProjects=true
+version=1
@@ -1,75 +1,75 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
-<modelVersion>4.0.0</modelVersion>
-<groupId>JavaExtractor</groupId>
-<artifactId>JavaExtractor</artifactId>
-<name>JPredict</name>
-<version>0.0.1-SNAPSHOT</version>
-<url>http://maven.apache.org</url>
-<build>
-<plugins>
-<plugin>
-<artifactId>maven-compiler-plugin</artifactId>
-<version>3.2</version>
-<configuration>
-<source>1.8</source>
-<target>1.8</target>
-<excludes>
-<exclude>Test.java</exclude>
-</excludes>
-</configuration>
-</plugin>
-<plugin>
-<artifactId>maven-shade-plugin</artifactId>
-<version>2.1</version>
-<executions>
-<execution>
-<phase>package</phase>
-<goals>
-<goal>shade</goal>
-</goals>
-<configuration>
-<transformers>
-<transformer
-implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
-</transformer>
-</transformers>
-</configuration>
-</execution>
-</executions>
-</plugin>
-</plugins>
-</build>
-<dependencies>
-<dependency>
-<groupId>com.github.javaparser</groupId>
-<artifactId>javaparser-core</artifactId>
-<version>3.0.0-alpha.4</version>
-</dependency>
-<dependency>
-<groupId>commons-io</groupId>
-<artifactId>commons-io</artifactId>
-<version>1.3.2</version>
-<scope>compile</scope>
-</dependency>
-<dependency>
-<groupId>com.fasterxml.jackson.core</groupId>
-<artifactId>jackson-databind</artifactId>
-<version>2.9.10.1</version>
-</dependency>
-<dependency>
-<groupId>args4j</groupId>
-<artifactId>args4j</artifactId>
-<version>2.33</version>
-</dependency>
-<dependency>
-<groupId>org.apache.commons</groupId>
-<artifactId>commons-lang3</artifactId>
-<version>3.5</version>
-</dependency>
-</dependencies>
-<properties>
-<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-</properties>
-</project>
-
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+<modelVersion>4.0.0</modelVersion>
+<groupId>JavaExtractor</groupId>
+<artifactId>JavaExtractor</artifactId>
+<name>JPredict</name>
+<version>0.0.1-SNAPSHOT</version>
+<url>http://maven.apache.org</url>
+<build>
+<plugins>
+<plugin>
+<artifactId>maven-compiler-plugin</artifactId>
+<version>3.2</version>
+<configuration>
+<source>1.8</source>
+<target>1.8</target>
+<excludes>
+<exclude>Test.java</exclude>
+</excludes>
+</configuration>
+</plugin>
+<plugin>
+<artifactId>maven-shade-plugin</artifactId>
+<version>2.1</version>
+<executions>
+<execution>
+<phase>package</phase>
+<goals>
+<goal>shade</goal>
+</goals>
+<configuration>
+<transformers>
+<transformer
+implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
+</transformer>
+</transformers>
+</configuration>
+</execution>
+</executions>
+</plugin>
+</plugins>
+</build>
+<dependencies>
+<dependency>
+<groupId>com.github.javaparser</groupId>
+<artifactId>javaparser-core</artifactId>
+<version>3.0.0-alpha.4</version>
+</dependency>
+<dependency>
+<groupId>commons-io</groupId>
+<artifactId>commons-io</artifactId>
+<version>1.3.2</version>
+<scope>compile</scope>
+</dependency>
+<dependency>
+<groupId>com.fasterxml.jackson.core</groupId>
+<artifactId>jackson-databind</artifactId>
+<version>2.9.10.1</version>
+</dependency>
+<dependency>
+<groupId>args4j</groupId>
+<artifactId>args4j</artifactId>
+<version>2.33</version>
+</dependency>
+<dependency>
+<groupId>org.apache.commons</groupId>
+<artifactId>commons-lang3</artifactId>
+<version>3.5</version>
+</dependency>
+</dependencies>
+<properties>
+<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+</properties>
+</project>
+
