Steps to preprocess dataset

Hlib edited this page Feb 19, 2019 · 1 revision

The dataset was downloaded using this link.

PATH_TO_RAW_ALLAMANIS_DATASET="$HOME/raw_datasets/allamanis"

Number of projects in the dataset:

ls "$PATH_TO_RAW_ALLAMANIS_DATASET/all" | wc -l
14436

Number of files in the dataset:

find "$PATH_TO_RAW_ALLAMANIS_DATASET/all" -name "*.java" | wc -l
2130311

Size of the dataset on disk:

du -sh "$PATH_TO_RAW_ALLAMANIS_DATASET/all"
17G	$PATH_TO_RAW_ALLAMANIS_DATASET/all

Copying dataset

cp -r "$PATH_TO_RAW_ALLAMANIS_DATASET/all" "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all"

Removing duplicated files
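
The command used for this step is not recorded on this page. One possible approach, sketched here under the assumption that "duplicated" means byte-identical content, is to hash every .java file and delete all but the first file seen for each hash. The function name remove_duplicate_java_files is hypothetical, and the sketch relies on GNU find, sort, xargs and sha256sum.

```shell
# Hypothetical helper (not the author's script): delete every .java file
# whose content hash was already seen. Assumes GNU tools and file names
# without newlines or backslashes.
remove_duplicate_java_files() (
    root="$1"
    find "$root" -type f -name "*.java" -print0 \
        | sort -z \
        | xargs -0 -r sha256sum \
        | awk 'seen[$1]++ { print substr($0, 67) }' \
        | while IFS= read -r path; do
              # sha256sum prints 64 hex digits plus 2 separator characters,
              # so the file name starts at column 67.
              rm -- "$path"
          done
)
```

It would be called as remove_duplicate_java_files "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all". Because the paths are sorted before hashing, the copy that survives is the one whose path sorts first.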

Checking the number of files in the resulting set

find "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all" -name "*.java" | wc -l
1675434 

Number of files removed 454877 (21.35%)
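
The figure above can be recomputed from the two file counts reported on this page. The helper name removed_stats is only for illustration.

```shell
# Recompute the number and share of removed files from the counts
# reported on this page (total before and after deduplication).
removed_stats() {
    awk -v total="$1" -v kept="$2" 'BEGIN {
        removed = total - kept
        printf "%d (%.2f%%)\n", removed, 100 * removed / total
    }'
}

removed_stats 2130311 1675434
```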

Split the files into the training/test/valid sets

./split "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/" 70 15 15
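
The split script itself is not shown on this page. As a rough sketch of what a project-level split could look like, assuming that whole project directories are shuffled and moved into train/, test/ and valid/ subdirectories, something like the following would work. The function name split_projects and the rounding rule (test and valid shares rounded down, the remainder going to train) are assumptions; shuf from GNU coreutils is required.

```shell
# Hypothetical sketch, not the author's ./split script.
# Usage: split_projects ROOT TRAIN_PCT TEST_PCT VALID_PCT
split_projects() (
    root="$1"; train_pct="$2"; test_pct="$3"; valid_pct="$4"
    [ $(( train_pct + test_pct + valid_pct )) -eq 100 ] || {
        echo "percentages must sum to 100" >&2; exit 1
    }
    cd "$root" || exit 1
    mkdir -p train test valid
    list=$(mktemp)
    # Every top-level directory except the three targets is one project.
    find . -mindepth 1 -maxdepth 1 -type d \
        ! -name train ! -name test ! -name valid | shuf > "$list"
    total=$(wc -l < "$list")
    n_test=$(( total * test_pct / 100 ))    # rounded down
    n_valid=$(( total * valid_pct / 100 ))  # rounded down
    i=0
    while IFS= read -r project; do
        if [ "$i" -lt "$n_test" ]; then dest=test
        elif [ "$i" -lt $(( n_test + n_valid )) ]; then dest=valid
        else dest=train   # everything left over goes to train
        fi
        mv -- "$project" "$dest/"
        i=$(( i + 1 ))
    done < "$list"
    rm -f "$list"
)
```

With 14436 projects and a 70/15/15 split, this rounding gives 2165 projects each for test and valid and 10106 for train. Moving whole projects rather than single files keeps all files of a project in the same set.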

Checking number of projects in training/test/validation sets

ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/" | wc -l
10106
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/test/" | wc -l
2165
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/valid/" | wc -l
2165

Checking all 3 sets contain different projects:

ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/" > all_projects.tmp
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/test/" >> all_projects.tmp
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/valid/" >> all_projects.tmp

sort -u all_projects.tmp | wc -l && rm all_projects.tmp
14436 <-- equals the total number of projects, so no project appears in more than one set
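
If the count did not match, it would help to see which projects are shared. A small sketch, assuming the train/test/valid layout used above (the function name overlapping_projects is hypothetical), prints every project name that occurs in more than one set:

```shell
# Print project names that appear in more than one of train/test/valid.
# Names are unique within a set, so any repeated name means an overlap.
overlapping_projects() (
    root="$1"
    for set in train test valid; do
        ls "$root/$set"
    done | sort | uniq -d
)
```

An empty output from overlapping_projects "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all" means the three sets are disjoint.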

Put projects into chunks for easier management of dataset fractions

cd $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/
chunkify
cd $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/test/
chunkify
cd $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/valid/
chunkify

Remove files that are symlinks:

rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/247_code_1/NXT/src/uk/ac/ed/inf/sdp2012/group7/control/ConstantsReuse.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/206_lightscript/.attic2/code/examples/readeval/Yolan.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmSymmetricKeyParams.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmSymCaAttestation.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmRsaKeyParams.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmAsymCaContents.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmSymmetricKey.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/idResponse.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmUtils.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmPubKey.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmKeyParams.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmIdentityProof.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/HisSetup.java
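
The list above is specific to this copy of the dataset. A more general sketch, assuming that any .java entry that is a symbolic link should go, lets find locate and delete them. The function names are hypothetical.

```shell
# List .java entries that are symbolic links (find does not follow them).
list_java_symlinks() {
    find "$1" -type l -name "*.java"
}

# Remove the symlinks themselves; the files they point to are untouched.
remove_java_symlinks() {
    find "$1" -type l -name "*.java" -exec rm -- {} +
}
```

Running list_java_symlinks "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all" first makes it possible to review the candidates before deleting anything.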

Parse projects

python logrec/dataprep/parse_projects nodup_all

Find non-English files:

/home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects-train.txt
/home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects-test.txt
/home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects-valid.txt
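
How these lists were produced is not recorded on this page. As a crude stand-in only, the sketch below flags every .java file that contains a byte outside printable ASCII and whitespace. This is a different and much coarser technique than real language detection: it misses non-English text written in plain ASCII and flags English files that merely contain, say, an accented author name. The function name list_non_ascii_java_files is hypothetical.

```shell
# Crude heuristic, not the author's method: print .java files (paths
# relative to ROOT, e.g. train/...) that contain non-ASCII or control
# bytes. LC_ALL=C makes grep treat the input as single bytes.
list_non_ascii_java_files() (
    cd "$1" || exit 1
    find train test valid -type f -name "*.java" \
        -exec env LC_ALL=C grep -l '[^[:print:][:space:]]' {} +
)
```

The output uses paths that start with train/, test/ or valid/, which is the same shape as the entries counted with grep "^test/" further down the page.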

Number of files before removing non-English files (train/test/valid): 1186856/238371/250194

Copy the dataset and remove non-English files

cp -r ~/raw_datasets/allamanis/nodup_all ~/raw_datasets/allamanis/nodup_en_only
cat /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt | xargs -I{} rm "/home/lv71161/hlibbabii/raw_datasets/allamanis/nodup_en_only/{}"

Number of files after removing non-English files (train/test/valid): 1175186/235356/248094

Check if the number of removed files matches the number of files listed in noneng-projects.txt

grep -c "^test/" /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt
3015
grep -c "^valid/" /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt
2100
grep -c "^train/" /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt
11670

OK!
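
The comparison can be spelled out as arithmetic on the numbers reported on this page: for each set, the file count before removal minus the count after removal should equal the number of listed files. The helper name check is only for illustration.

```shell
# Usage: check NAME BEFORE AFTER LISTED
# Compares (BEFORE - AFTER) with the number of listed non-English files.
check() {
    if [ $(( $2 - $3 )) -eq "$4" ]; then
        echo "$1: OK"
    else
        echo "$1: MISMATCH"
    fi
}

check train 1186856 1175186 11670
check test   238371  235356  3015
check valid  250194  248094  2100
```

All three lines report OK, since 1186856 - 1175186 = 11670, 238371 - 235356 = 3015 and 250194 - 248094 = 2100.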