HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks
This repository contains the public release of the code for our accepted paper: HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks.
The required packages are listed in the requirement.txt file.
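A minimal setup sketch (assuming pip; whether you need a CUDA-specific PyTorch build depends on your machine):

```bash
pip install -r requirement.txt
```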
The com.test file is the ground-truth output file.
We recommend running this project on a CUDA-capable GPU.
Because the dataset and trained model files are very large, they are hosted externally; download links are given below.
The reproducibility package has four parts:
- the code found in this repository
- the fully processed data (as a pkl file), which can be downloaded HERE
- the model file, which can be downloaded HERE (to load this model on a CPU, add `map_location=torch.device('cpu')` when loading the model file; see the sketch after this list)
- the raw dataset of this project, which can be found HERE. comment.json is the target documentation file; relatedcode is a dictionary storing the corresponding code lists used by the HAConvGNN model; function.json contains the code sequences, each joined by '\n' from its code list. Kaggle3741 contains all the notebooks we crawled from Kaggle.
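A minimal loading sketch (the file paths below are assumptions; adjust them to wherever you placed the downloads):

```python
import json

import torch

# Load the trained model. On a CPU-only machine, map_location is required
# because the checkpoint was saved on a GPU; on a GPU machine it can be dropped.
model = torch.load('./modelout/HAConvGNN.h5', map_location=torch.device('cpu'))

# Peek at the raw dataset files described in the list above.
with open('comment.json') as f:
    comments = json.load(f)    # target documentation
with open('function.json') as f:
    functions = json.load(f)   # code sequences, each joined by '\n' from its code list
```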
This code uses PyTorch v1.5.0.
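A quick environment sanity check (a generic sketch, nothing specific to this repository):

```python
import torch

print(torch.__version__)          # this code was developed against 1.5.0
print(torch.cuda.is_available())  # True if a usable CUDA GPU is present (recommended)
```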
Inspired by Wang et al. (2021), we utilize top-voted and well-documented Kaggle notebooks to construct the notebookCDG dataset.
We collected the top 10% most highly voted notebooks from the 20 most popular competitions on Kaggle (e.g., Titanic). We checked the data policy of each of the 20 competitions; none of them has copyright issues. We also contacted the Kaggle administrators to make sure our data collection complies with the platform's policy.
In total, we collected 3,944 notebooks as raw data. After preprocessing, the final dataset contains 2,476 of these 3,944 notebooks, with 28,625 code–documentation pairs. The overall code-to-markdown ratio is 2.2195.
This repository implements the HAConvGNN model.
To run prediction with the trained model from the paper, download the processed data and the model file from the reproducibility package and run predict.py. predict.py takes the path to the model file as a positional argument and writes the prediction file to ./modelout/predictions:
```bash
python3 predict.py {path to model} --gpu 0 --data {path to data download}
```
For example:
```bash
python3 predict.py ./modelout/HAConvGNN.h5 --gpu 0 --data ./final_data
```
As a reference, running the prediction on one Tesla T4 GPU takes about 1 hour.
To train a new model, run train.py with the gpu and data options set:
```bash
python3 train.py --gpu 0 --data ./final_data
```
| Models | ROUGE-1 P | ROUGE-1 R | ROUGE-1 F1 | ROUGE-2 P | ROUGE-2 R | ROUGE-2 F1 | ROUGE-L P | ROUGE-L R | ROUGE-L F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Baselines** | | | | | | | | | |
| code2seq | 11.45 | 8.46 | 8.23 | 1.67 | 1.11 | 1.11 | 13.13 | 10.28 | 10.24 |
| graph2seq | 13.21 | 9.87 | 9.51 | 2.86 | 1.99 | 2.03 | 14.46 | 11.40 | 11.18 |
| **Our Model & Ablation Study** | | | | | | | | | |
| HAConvGNN (our model) | 22.87 | 16.92 | 16.58 | 6.72 | 4.86 | 4.97 | 24.03 | 18.60 | 18.54 |
| HAConvGNN with low-level attention, without high-level attention, with uniform attention | 20.66 | 15.65 | 14.91 | 4.74 | 3.92 | 3.80 | 21.84 | 17.27 | 16.81 |
| HAConvGNN with low-level attention, without high-level attention, without uniform attention | 19.57 | 14.59 | 14.23 | 4.87 | 3.56 | 3.63 | 20.83 | 16.24 | 16.12 |
| HAConvGNN without low-level attention, without high-level attention, with uniform attention | 11.39 | 7.73 | 7.82 | 1.58 | 1.06 | 1.08 | 13.13 | 9.47 | 9.82 |
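For reference, the same P/R/F1 metrics can be computed with the `rouge` pip package (a sketch only; whether the paper's evaluation script used this exact package is an assumption, and the example strings are made up):

```python
from rouge import Rouge  # pip install rouge

hypothesis = 'load the titanic training data'       # a model prediction
reference = 'load the training data from csv file'  # the ground-truth documentation

# get_scores returns {'rouge-1': {'r', 'p', 'f'}, 'rouge-2': ..., 'rouge-l': ...}
scores = Rouge().get_scores(hypothesis, reference, avg=True)
print(scores['rouge-l'])
```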
Our work is published in the Findings of EMNLP 2021. You can cite:
```bibtex
@misc{liu2021haconvgnn,
  title={HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks},
  author={Xuye Liu and Dakuo Wang and April Wang and Yufang Hou and Lingfei Wu},
  year={2021},
  eprint={2104.01002},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}
```