- Linux
- Python 3.10
- PyTorch 2.2.0+cu121
Our final approach merges the test-set predictions of the GCN model, the Xgboost machine learning model, and the fine-tuned LLM (ChatGLM).
| Method | WAUC |
| --- | --- |
| GCN | 0.7687 (test) |
| Xgboost (before using oagbert) | 0.7993 (test) |
| (ChatGLM(title5000_venue2500) + ChatGLM(title10000)) / 2 | 0.78790 (test) |
| (GCN * 0.1 + Xgboost * 0.9) * 0.6 + (ChatGLM(title5000_venue2500) + ChatGLM(title10000)) / 2 * 0.4 | 0.8131 (test) |
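For concreteness, the final weighted merge in the last row can be written out as a few lines of Python (a minimal sketch; the function and variable names are illustrative, not from the repo):

```python
# Minimal sketch of the final ensemble formula from the table above.
# Each *_score is one model's predicted score for the same paper.
def final_score(gcn_score, xgb_score, llm_title_venue_score, llm_title_score):
    tree_part = gcn_score * 0.1 + xgb_score * 0.9              # GCN/Xgboost blend
    llm_part = (llm_title_venue_score + llm_title_score) / 2   # average of two ChatGLM runs
    return tree_part * 0.6 + llm_part * 0.4                    # final weighted merge
```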
Clone this repo:

```bash
git clone https://github.com/virtualanimal/kdd2024race1rank4.git
cd kdd2024race1rank4
```
For Xgboost:

```bash
pip install -r Xgboost/requirements.txt
```

For GCN:

```bash
pip install -r GCN/requirements.txt
```

For ChatGLM:

```bash
pip install -r ChatGLM/requirements.txt
```
The dataset can be downloaded from BaiduPan (password: gvza), Aliyun, or DropBox. Unzip the dataset and put the files into the `dataset/` directory.
- DataProcess
Before training, we normalize the paper info and extract features from it (title, abstract, author name, author org, keywords, venue, and year). These features are also used by the GCN method, which means you should first put the dataset in the right place and run Xgboost's first three commands. You can check the details in Xgboost's README file and generate the features (including scores and embeddings) by following Xgboost steps 1 to 3.
This process may take a long time, so please be patient. If needed, we will upload our processed files later.
Alternatively, you can download the processed files from dataset and put them into `dataset/` according to the File Structure below. (Due to Google Drive storage limitations, we are unable to upload the GCN training data; if you only need to run inference, the data provided here is sufficient.)
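As a rough illustration of what the normalization step does (a hypothetical sketch assuming simple text cleaning; the actual rules live in `Xgboost/norm_data.py` and may differ):

```python
import re

# Hypothetical illustration of paper-info normalization; the actual
# rules are in Xgboost/norm_data.py and may differ.
def norm_text(s: str) -> str:
    s = (s or "").lower()                  # case-fold
    s = re.sub(r"[^\w\s]", " ", s)         # drop punctuation
    return re.sub(r"\s+", " ", s).strip()  # collapse whitespace

paper = {"title": "Graph  Neural Networks!", "venue": "KDD, 2024"}
normed = {k: norm_text(v) for k, v in paper.items()}
```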
- Pretrained Model Preparation
We use three word-embedding models: bge-small-en-v1.5, [sci-bert](https://huggingface.co/allenai/scibert_scivocab_uncased), and oag-bert (oag-bert can be downloaded with the cogdl tool, listed in Xgboost's requirements.txt). Download the pretrained models and put them in the `model/` directory (except oag-bert), and remember to rename the downloaded model directories to `bge-small-en-v1.5` and `scibert`.
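oag-bert itself does not need to go into `model/`, since cogdl fetches and caches it on first use. For example, following cogdl's documented `oagbert` entry point:

```python
from cogdl.oag import oagbert

# Downloads the OAG-BERT weights on first call and caches them locally.
tokenizer, model = oagbert("oagbert-v2")
model.eval()
```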
Run Methods for KDD Cup 2024
We provide three methods: GCN, Xgboost, and ChatGLM [Hugging Face].
For Xgboost, do feature engineering and train an XGBoost model, predicting results with 10-fold cross-validation (see the sketch after the commands below).
- train and predict

```bash
cd Xgboost
# step 1: preprocess data
python norm_data.py
# step 2: compute embedding vectors
python encode.py
# step 3: extract features
python get_feature.py
# step 4: predict
python predict.py
```

or simply:

```bash
bash run.sh
```
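For reference, the 10-fold training and prediction loop follows the standard pattern below (a minimal sketch with assumed variable names `X`, `y`, `X_test`; the actual code is in `predict.py`):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

# Sketch of 10-fold training/averaging; X, y, X_test are assumed to be
# the engineered feature matrices from step 3.
def predict_10fold(X, y, X_test, seed=42):
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    test_pred = np.zeros(len(X_test))
    for train_idx, _ in kf.split(X):
        model = xgb.XGBClassifier(n_estimators=500, eval_metric="auc")
        model.fit(X[train_idx], y[train_idx])
        # Average each fold's test predictions into the final score.
        test_pred += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()
    return test_pred
```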
- predict

The pretrained Xgboost model can be downloaded from xgb_model; put it in `model/xgb_model`, then run:

```bash
python infer.py
```
For GCN, build the graph relational data, then train and predict results using the GCN model (a model sketch follows the commands below).
- train and predict

```bash
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to use
cd GCN
# same as Xgboost's first three commands:
# python norm_data.py
# python encode.py
# python get_feature.py
python build_graph.py
bash train.sh  # includes train and predict
```
- predict

The pretrained GCN model can be downloaded from [gcn_model](https://drive.google.com/drive/folders/17Y7QhOdvkj76dCUxyOS0kyJpTg-0Vrg2?usp=drive_link); put it in `model/gcn_model`, then run:

```bash
python predict.py
```
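For orientation, a two-layer GCN of the kind used here can be sketched with PyTorch Geometric (an illustrative assumption; the actual architecture lives in `GCN/models.py`):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Illustrative two-layer GCN; the real model is defined in GCN/models.py.
class SimpleGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim=1):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        # Per-node score in [0, 1], e.g. "does this paper belong to the author".
        return torch.sigmoid(self.conv2(h, edge_index)).squeeze(-1)
```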
For ChatGLM, two ChatGLM checkpoints fine-tuned with LoRA can be downloaded from ChatGLM-lora (see the LoRA sketch after the commands below).
- finetune with title (len=10000)

```bash
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to use
cd ChatGLM
bash train.sh
# multi-GPU inference
accelerate launch --num_processes 8 inference.py --lora_path your_lora_path --model_path your_model_path --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
# single-GPU inference
python inference.py --lora_path your_lora_checkpoint --model_path path_to_chatglm --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
```
- finetune with title (len=5000) + venue (len=2500)

```bash
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to use
cd ChatGLM
bash train2.sh
# multi-GPU inference
accelerate launch --num_processes 8 inference2.py --lora_path your_lora_path --model_path your_model_path --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
# single-GPU inference
python inference2.py --lora_path your_lora_checkpoint --model_path path_to_chatglm --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
```
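Both train.sh and train2.sh fine-tune ChatGLM through LoRA adapters. A typical LoRA setup with the peft library looks like the sketch below (hyperparameters are illustrative, not the repo's actual config):

```python
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative LoRA config; the real hyperparameters live in
# train.sh / arguments.py and may differ.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.1,
    target_modules=["query_key_value"],   # ChatGLM attention projection
)
# model = get_peft_model(base_chatglm_model, lora_config)
```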
- merge the two ChatGLM results

```bash
cd ChatGLM
python merge.py --first_json your_first_file_path --second_json your_second_file_path --merge_llm_name your_merge_name
```
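Conceptually, merging the two inference outputs is a per-paper average of their scores. A hypothetical sketch, assuming each result JSON maps author id to `{paper id: score}` (the actual format is defined in `ChatGLM/merge.py`):

```python
import json

# Hypothetical sketch of averaging two LLM result files; the real I/O
# format is defined in ChatGLM/merge.py and may differ.
def merge_results(first_json, second_json, out_json):
    with open(first_json) as f:
        a = json.load(f)
    with open(second_json) as f:
        b = json.load(f)
    merged = {
        author: {pid: (score + b[author][pid]) / 2 for pid, score in scores.items()}
        for author, scores in a.items()
    }
    with open(out_json, "w") as f:
        json.dump(merged, f)
```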
All Model Merge
Merge the predictions of all three models into the final submission:

```bash
python merge.py --gcn_rsult your_gcn_file --ml_result your_ml_result --llm_result your_llm_merge_result
```
File Structure

```
.
├── ChatGLM
│   ├── arguments.py
│   ├── configs
│   │   └── deepspeed.json
│   ├── finetune2.py
│   ├── finetune.py
│   ├── inference2.py
│   ├── inference.py
│   ├── merge.py
│   ├── metric.py
│   ├── output
│   ├── README.md
│   ├── requirements.txt
│   ├── train2.sh
│   ├── trainer.py
│   ├── train.sh
│   ├── utils2.py
│   └── utils.py
├── dataset
│   ├── embedding
│   ├── feature
│   ├── graph
│   └── result
├── GCN
│   ├── build_graph.py
│   ├── encode.py
│   ├── get_feature.py
│   ├── models.py
│   ├── norm_data.py
│   ├── README.md
│   ├── requirements.txt
│   ├── train.py
│   ├── predict.py
│   └── train.sh
├── merge.py
├── model
│   ├── gcn_model
│   └── xgb_model
├── README.md
└── Xgboost
    ├── encode.py
    ├── get_feature.py
    ├── norm_data.py
    ├── predict.py
    ├── README.md
    ├── requirements.txt
    ├── run.sh
    └── infer.py
```