Steps to preprocess dataset
The dataset was downloaded using this link.
PATH_TO_RAW_ALLAMANIS_DATASET="$HOME/raw_datasets/allamanis"
Number of projects in the dataset:
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/all" | wc -l
14436
Number of files in the dataset:
find "$PATH_TO_RAW_ALLAMANIS_DATASET/all" -name "*.java" | wc -l
2130311
Size of the dataset:
du -sh "$PATH_TO_RAW_ALLAMANIS_DATASET/all"
17G $PATH_TO_RAW_ALLAMANIS_DATASET/all
Copying dataset
cp -r "$PATH_TO_RAW_ALLAMANIS_DATASET/all" "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all"
Removing duplicated files
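The command used for this step is not recorded here. The following is a minimal sketch of one way to do it, assuming that "duplicated" means byte-identical content and that it does not matter which copy survives. `remove_duplicates` is a hypothetical helper, not the tool that produced the numbers below, and it assumes GNU coreutils and paths without newlines or backslashes.

```shell
# Hypothetical helper: delete *.java files whose SHA-256 hash was already seen.
# Within each group of identical files, the path that sorts first is kept.
remove_duplicates() {
    find "$1" -type f -name '*.java' -exec sha256sum {} + |
        sort |
        # Print the path of every second and later occurrence of a hash.
        awk 'seen[$1]++ { sub(/^[^ ]+ [ *]/, ""); print }' |
        while IFS= read -r path; do
            rm -- "$path"
        done
}

# Usage (destructive, run it on the copy only):
# remove_duplicates "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all"
```

Hashing every file once and sorting the hashes avoids comparing files pairwise, which would not be practical for about two million files.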
Checking the number of files in the resulting set
find "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all" -name "*.java" | wc -l
1675434
Number of files removed: 454877 (21.35%)
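The figure above follows directly from the two file counts. A short sketch of the arithmetic, where the variable names are illustrative only:

```shell
before=2130311   # *.java files in all/
after=1675434    # *.java files in nodup_all/ after deduplication
removed=$((before - after))
# awk is used for the percentage because shell arithmetic is integer-only.
pct=$(awk -v r="$removed" -v b="$before" 'BEGIN { printf "%.2f", 100 * r / b }')
echo "removed $removed files ($pct%)"
```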
Split the projects into the training/test/validation sets
./split "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/" 70 15 15
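The `./split` script itself is not shown here. As a rough illustration of a project-level split, here is a sketch under stated assumptions: every top-level directory is one project, the three numbers are train/test/valid percentages, and train receives whatever is left after rounding down. `split_projects` is a hypothetical name, and the sketch relies on GNU `find` and `shuf`.

```shell
# Hypothetical project-level split (not the ./split script used above).
# Usage: split_projects ROOT TRAIN_PCT TEST_PCT VALID_PCT
# Assumes project paths contain no newlines.
split_projects() (
    root=$1 train_pct=$2 test_pct=$3 valid_pct=$4
    [ $((train_pct + test_pct + valid_pct)) -eq 100 ] || exit 1

    list=$(mktemp)
    # Every top-level directory is one project; shuffle to avoid ordering bias.
    find "$root" -mindepth 1 -maxdepth 1 -type d \
        ! -name train ! -name test ! -name valid | shuf > "$list"

    total=$(wc -l < "$list")
    n_test=$((total * test_pct / 100))     # integer division rounds down
    n_valid=$((total * valid_pct / 100))

    mkdir -p "$root/train" "$root/test" "$root/valid"
    i=0
    while IFS= read -r project; do
        i=$((i + 1))
        if [ "$i" -le "$n_test" ]; then
            dest=test
        elif [ "$i" -le $((n_test + n_valid)) ]; then
            dest=valid
        else
            dest=train                     # train receives the remainder
        fi
        mv -- "$project" "$root/$dest/"
    done < "$list"
    rm -f "$list"
)
```

Moving whole projects rather than individual files keeps all files of a project in the same set. With 14436 projects this rounding gives 2165 test, 2165 valid and 10106 train projects, but whether `./split` rounds the same way is an assumption.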
Checking the number of projects in the training/test/validation sets
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/" | wc -l
10106
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/test/" | wc -l
2165
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/valid/" | wc -l
2165
Checking that all 3 sets contain different projects:
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/" > all_projects.tmp
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/test/" >> all_projects.tmp
ls "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/valid/" >> all_projects.tmp
sort -u all_projects.tmp | wc -l && rm all_projects.tmp
14436 <-- equals 10106 + 2165 + 2165, so no project appears in more than one set
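The unique count only shows that the sets are disjoint, not which projects would be at fault if they were not. A small sketch that prints the overlapping names directly, where `overlapping_projects` is a hypothetical helper:

```shell
# Print every project name that occurs in more than one of the given
# directories. Empty output means the sets are disjoint.
overlapping_projects() {
    for dir in "$@"; do
        ls -- "$dir"
    done | sort | uniq -d
}

# Usage:
# root="$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all"
# overlapping_projects "$root/train" "$root/test" "$root/valid"
```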
Put projects into chunks for easier management of dataset fractions
cd $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/
chunkify
cd $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/test/
chunkify
cd $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/valid/
chunkify
Remove files that are symlinks:
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/247_code_1/NXT/src/uk/ac/ed/inf/sdp2012/group7/control/ConstantsReuse.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/206_lightscript/.attic2/code/examples/readeval/Yolan.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmSymmetricKeyParams.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmSymCaAttestation.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmRsaKeyParams.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmAsymCaContents.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmSymmetricKey.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/idResponse.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmUtils.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmPubKey.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmKeyParams.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/TpmIdentityProof.java
rm $PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train/830_OpenAttestation/Source/HisPrivacyCAWebServices2/src/gov/niarl/his/privacyca/HisSetup.java
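How the list above was obtained is not recorded. One way to locate such files is `find` with `-type l`, sketched below. `list_symlinked_java` is a hypothetical helper name.

```shell
# List *.java entries under a directory that are symbolic links rather than
# regular files.
list_symlinked_java() {
    find "$1" -type l -name '*.java'
}

# Usage: review the list first, then delete the links with find's -delete
# (supported by GNU and BSD find).
# list_symlinked_java "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train"
# find "$PATH_TO_RAW_ALLAMANIS_DATASET/nodup_all/train" -type l -name '*.java' -delete
```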
Parse projects
python logrec/dataprep/parse_projects nodup_all
Find non-English files:
/home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects-train.txt
/home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects-test.txt
/home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects-valid.txt
Number of files before removing non-English files (train/test/valid): 1186856/238371/250194
Copy the dataset and remove non-English files
cp -r ~/raw_datasets/allamanis/nodup_all ~/raw_datasets/allamanis/nodup_en_only
cat /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt | xargs -I{} rm "/home/lv71161/hlibbabii/raw_datasets/allamanis/nodup_en_only/{}"
Number of files after removing non-English files (train/test/valid): 1175186/235356/248094
Check that the number of removed files matches the number of files listed in noneng-projects.txt
grep -c "^test/" /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt
3015
grep -c "^valid/" /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt
2100
grep -c "^train/" /home/lv71161/hlibbabii/prep_datasets/v2/noneng-projects.txt
11670
OK!
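The check passes because, for each set, the file count from before the removal minus the file count from after it equals the number of listed entries. A short sketch of that comparison, where `check_removed` is a hypothetical helper:

```shell
# Succeeds when (files before) - (files after) equals the number of entries
# listed for removal.
check_removed() {
    [ $(( $1 - $2 )) -eq "$3" ]
}

check_removed 1186856 1175186 11670 && echo "train OK"
check_removed 238371 235356 3015 && echo "test OK"
check_removed 250194 248094 2100 && echo "valid OK"
```

All three lines print OK for the numbers reported above.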