Dataset: Java(hu)

The authors of A Transformer-based Approach for Source Code Summarization shared their code and dataset. Their repository provides the original, runnable source code for the Java dataset, so we can generate ASTs with Tree-Sitter.

The original code in the Python dataset, however, is not runnable. One option is to recover runnable Python code from the raw data.


Step 1

Download the pre-processed and raw (java_hu) dataset.

bash dataset/java_hu/download.sh

Step 2

Flatten the dataset by moving the code, code_tokens, docstring, and docstring_tokens attributes to ~/java_hu/flatten/*.

python -m dataset.java_hu.flatten
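
To illustrate what flattening means here, the sketch below splits a JSON-lines file into one output file per attribute. It is not the code of dataset.java_hu.flatten: the input format (one JSON object per line), the `flatten` helper, and the `<mode>.<attribute>` file naming are all assumptions made for the example.

```python
import json
import os

# Attributes named in Step 2; each one gets its own output file.
FIELDS = ("code", "code_tokens", "docstring", "docstring_tokens")


def flatten(src_path, dst_dir, mode):
    """Split a JSON-lines file into one file per attribute.

    Hypothetical helper: writes <dst_dir>/<mode>.<attribute>, one JSON value
    per line, and returns the number of records processed.
    """
    os.makedirs(dst_dir, exist_ok=True)
    writers = {
        field: open(os.path.join(dst_dir, f"{mode}.{field}"), "w", encoding="utf-8")
        for field in FIELDS
    }
    count = 0
    try:
        with open(src_path, encoding="utf-8") as reader:
            for line in reader:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                for field in FIELDS:
                    # Missing attributes are written as null so lines stay aligned.
                    value = json.dumps(record.get(field), ensure_ascii=False)
                    writers[field].write(value + "\n")
                count += 1
    finally:
        for writer in writers.values():
            writer.close()
    return count
```

Keeping line i of every output file tied to the same record is what lets later steps read a single attribute without parsing the others.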

Step 3

Generate the raw/bin data with multiprocessing. Before generating the datasets, please make sure the config file is set correctly.

# code_tokens/docstring_tokens
python -m dataset.java_hu.preprocess
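
As a rough picture of how a multi-process pass over a flattened file can work, the sketch below splits a file into line-aligned byte ranges and lets each worker handle one range. The internals of dataset.java_hu.preprocess are not shown in this README, so `find_offsets`, `count_tokens`, and `count_tokens_parallel` are hypothetical names, and whitespace token counting stands in for the real per-line work.

```python
import os
from multiprocessing import Pool


def find_offsets(path, num_chunks):
    """Return num_chunks + 1 byte offsets that split the file on line boundaries."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(size * i // num_chunks)
            f.readline()  # advance to the start of the next full line
            offsets.append(f.tell())
    offsets.append(size)
    return offsets


def count_tokens(job):
    """Worker: count whitespace-separated tokens in the byte range [start, end)."""
    path, start, end = job
    total = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            total += len(line.split())
    return total


def count_tokens_parallel(path, num_workers=4):
    """Fan the byte ranges out to a process pool and sum the partial counts."""
    offsets = find_offsets(path, num_workers)
    jobs = [(path, offsets[i], offsets[i + 1]) for i in range(num_workers)]
    with Pool(num_workers) as pool:
        return sum(pool.map(count_tokens, jobs))
```

Call `count_tokens_parallel` from under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module. Each worker opens the file itself and reads only its own range, so no line is processed twice.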