- Linux
- Python 3.10
- PyTorch 2.2.0+cu121
Our final approach merges the test-set predictions of the GCN model, the Xgboost machine learning model, and the fine-tuned LLM (ChatGLM).
| Method | WAUC |
| --- | --- |
| GCN | 0.7687 (test) |
| Xgboost (before using oagbert) | 0.7993 (test) |
| (ChatGLM(title5000_venue2500) + ChatGLM(title10000)) / 2 | 0.78790 (test) |
| (GCN * 0.1 + Xgboost * 0.9) * 0.6 + (ChatGLM(title5000_venue2500) + ChatGLM(title10000)) / 2 * 0.4 | 0.8131 (test) |
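For concreteness, the final weighted merge in the last row can be written out as a few lines of Python (a minimal sketch; the function and variable names are illustrative, not from the repo):

```python
# Minimal sketch of the final ensemble formula from the table above.
# Each *_score is one model's predicted score for the same paper.
def final_score(gcn_score, xgb_score, llm_title_venue_score, llm_title_score):
    tree_part = gcn_score * 0.1 + xgb_score * 0.9              # GCN/Xgboost blend
    llm_part = (llm_title_venue_score + llm_title_score) / 2   # average of two ChatGLM runs
    return tree_part * 0.6 + llm_part * 0.4                    # final weighted merge
```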
Clone this repo:

```bash
git clone https://github.com/virtualanimal/kdd2024race1rank4.git
cd kdd2024race1rank4
```
For Xgboost:

```bash
pip install -r Xgboost/requirements.txt
```

For GCN:

```bash
pip install -r GCN/requirements.txt
```

For ChatGLM:

```bash
pip install -r ChatGLM/requirements.txt
```
The dataset can be downloaded from BaiduPan (password: gvza), Aliyun, or DropBox. Unzip the dataset and put the files into the `dataset/` directory.
- DataProcess
Before training, we normalize the paper info and extract features from it (title, abstract, author name, author org, keywords, venue, and year). These features are also used by the GCN method, which means you should first put the dataset in the right place and run Xgboost's first three commands. You can check the details in Xgboost's README file and generate the features (including scores and embeddings) by following Xgboost steps 1 to 3.
This process may take a long time, so please be patient. If needed, we will upload our processed files later.
Alternatively, you can download the processed files from dataset and put them into `dataset/` according to the File Structure below. (Due to Google Drive storage limitations, we are unable to upload the GCN training data; if you only need to run inference, the data provided here is sufficient.)
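As a rough illustration of what the normalization step does (a hypothetical sketch assuming simple text cleaning; the actual rules live in `Xgboost/norm_data.py` and may differ):

```python
import re

# Hypothetical illustration of paper-info normalization; the actual
# rules are in Xgboost/norm_data.py and may differ.
def norm_text(s: str) -> str:
    s = (s or "").lower()                  # case-fold
    s = re.sub(r"[^\w\s]", " ", s)         # drop punctuation
    return re.sub(r"\s+", " ", s).strip()  # collapse whitespace

paper = {"title": "Graph  Neural Networks!", "venue": "KDD, 2024"}
normed = {k: norm_text(v) for k, v in paper.items()}
```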
- Pretrained Model Preparation
We use three word-embedding models: bge-small-en-v1.5, [sci-bert](https://huggingface.co/allenai/scibert_scivocab_uncased), and oag-bert (oag-bert can be downloaded with the cogdl tool, listed in Xgboost's requirements.txt). Download the pretrained models and put them in the `model/` directory (except oag-bert), and remember to rename the downloaded model directories to `bge-small-en-v1.5` and `scibert`.
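oag-bert itself does not need to go into `model/`, since cogdl fetches and caches it on first use. For example, following cogdl's documented `oagbert` entry point:

```python
from cogdl.oag import oagbert

# Downloads the OAG-BERT weights on first call and caches them locally.
tokenizer, model = oagbert("oagbert-v2")
model.eval()
```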
Run Methods for KDD Cup 2024
We provide three methods: GCN, Xgboost, and ChatGLM [Hugging Face].
For Xgboost, do feature engineering and train an XGBoost model, predicting results with 10-fold cross-validation (see the sketch after the commands below).
- train and predict

```bash
cd Xgboost
# step 1: preprocess data
python norm_data.py
# step 2: compute embedding vectors
python encode.py
# step 3: extract features
python get_feature.py
# step 4: predict
python predict.py
```

or simply:

```bash
bash run.sh
```
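For reference, the 10-fold training and prediction loop follows the standard pattern below (a minimal sketch with assumed variable names `X`, `y`, `X_test`; the actual code is in `predict.py`):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

# Sketch of 10-fold training/averaging; X, y, X_test are assumed to be
# the engineered feature matrices from step 3.
def predict_10fold(X, y, X_test, seed=42):
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    test_pred = np.zeros(len(X_test))
    for train_idx, _ in kf.split(X):
        model = xgb.XGBClassifier(n_estimators=500, eval_metric="auc")
        model.fit(X[train_idx], y[train_idx])
        # Average each fold's test predictions into the final score.
        test_pred += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()
    return test_pred
```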
- predict

The pretrained Xgboost model can be downloaded from xgb_model; put it in `model/xgb_model`, then run:

```bash
python infer.py
```
For GCN, build the graph relational data, then train and predict results using the GCN model (a model sketch follows the commands below).
- train and predict

```bash
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to use
cd GCN
# same as Xgboost's first three commands:
# python norm_data.py
# python encode.py
# python get_feature.py
python build_graph.py
bash train.sh  # includes train and predict
```
- predict

The pretrained GCN model can be downloaded from [gcn_model](https://drive.google.com/drive/folders/17Y7QhOdvkj76dCUxyOS0kyJpTg-0Vrg2?usp=drive_link); put it in `model/gcn_model`, then run:

```bash
python predict.py
```
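For orientation, a two-layer GCN of the kind used here can be sketched with PyTorch Geometric (an illustrative assumption; the actual architecture lives in `GCN/models.py`):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Illustrative two-layer GCN; the real model is defined in GCN/models.py.
class SimpleGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim=1):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        # Per-node score in [0, 1], e.g. "does this paper belong to the author".
        return torch.sigmoid(self.conv2(h, edge_index)).squeeze(-1)
```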
For ChatGLM, two ChatGLM checkpoints fine-tuned with LoRA can be downloaded from ChatGLM-lora (see the LoRA sketch after the commands below).
- finetune with title (len=10000)

```bash
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to use
cd ChatGLM
bash train.sh
# multi-GPU inference
accelerate launch --num_processes 8 inference.py --lora_path your_lora_path --model_path your_model_path --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
# single-GPU inference
python inference.py --lora_path your_lora_checkpoint --model_path path_to_chatglm --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
```
- finetune with title (len=5000) + venue (len=2500)

```bash
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to use
cd ChatGLM
bash train2.sh
# multi-GPU inference
accelerate launch --num_processes 8 inference2.py --lora_path your_lora_path --model_path your_model_path --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
# single-GPU inference
python inference2.py --lora_path your_lora_checkpoint --model_path path_to_chatglm --pub_path ../dataset/norm_pid_to_info_all.json --eval_path ../dataset/IND-test-public/ind_test_author_filter_public.json
```
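Both train.sh and train2.sh fine-tune ChatGLM through LoRA adapters. A typical LoRA setup with the peft library looks like the sketch below (hyperparameters are illustrative, not the repo's actual config):

```python
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative LoRA config; the real hyperparameters live in
# train.sh / arguments.py and may differ.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.1,
    target_modules=["query_key_value"],   # ChatGLM attention projection
)
# model = get_peft_model(base_chatglm_model, lora_config)
```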
- merge the two ChatGLM results

```bash
cd ChatGLM
python merge.py --first_json your_first_file_path --second_json your_second_file_path --merge_llm_name your_merge_name
```
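Conceptually, merging the two inference outputs is a per-paper average of their scores. A hypothetical sketch, assuming each result JSON maps author id to `{paper id: score}` (the actual format is defined in `ChatGLM/merge.py`):

```python
import json

# Hypothetical sketch of averaging two LLM result files; the real I/O
# format is defined in ChatGLM/merge.py and may differ.
def merge_results(first_json, second_json, out_json):
    with open(first_json) as f:
        a = json.load(f)
    with open(second_json) as f:
        b = json.load(f)
    merged = {
        author: {pid: (score + b[author][pid]) / 2 for pid, score in scores.items()}
        for author, scores in a.items()
    }
    with open(out_json, "w") as f:
        json.dump(merged, f)
```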
All Model Merge
Merge the predictions of all three models into the final submission:

```bash
python merge.py --gcn_rsult your_gcn_file --ml_result your_ml_result --llm_result your_llm_merge_result
```
File Structure

```
.
├── ChatGLM
│   ├── arguments.py
│   ├── configs
│   │   └── deepspeed.json
│   ├── finetune2.py
│   ├── finetune.py
│   ├── inference2.py
│   ├── inference.py
│   ├── merge.py
│   ├── metric.py
│   ├── output
│   ├── README.md
│   ├── requirements.txt
│   ├── train2.sh
│   ├── trainer.py
│   ├── train.sh
│   ├── utils2.py
│   └── utils.py
├── dataset
│   ├── embedding
│   ├── feature
│   ├── graph
│   └── result
├── GCN
│   ├── build_graph.py
│   ├── encode.py
│   ├── get_feature.py
│   ├── models.py
│   ├── norm_data.py
│   ├── README.md
│   ├── requirements.txt
│   ├── train.py
│   ├── predict.py
│   └── train.sh
├── merge.py
├── model
│   ├── gcn_model
│   └── xgb_model
├── README.md
└── Xgboost
    ├── encode.py
    ├── get_feature.py
    ├── norm_data.py
    ├── predict.py
    ├── README.md
    ├── requirements.txt
    ├── run.sh
    └── infer.py
```