This repository contains the sample code that achieved 5th place in the KDD 2024 OAG-Challenge PST task. The technical report is available here: Leveraging Hybrid Embeddings and Data Augmentation for Identifying Significant References
For details of the method behind this repository, please refer to the report.
- Linux System
- Python 3.10
- CUDA 12.0
- NVIDIA A100 80G
Clone this repository.
git clone
cd kddcup_oag-challenge-pst_rank6
Install the dependencies:
pip install -r requirements.txt
The dataset can be downloaded from BaiduPan (password: bft3), Aliyun, or DropBox. The paper XML files were generated from the paper PDFs using the Grobid API. Also download DBLP-Citation-network V16 from DBLP and OAG version 3.1 from OAG, extract them, and place them in the data folder.
--kddcup2024-oagpst-solution
--script
--...(some files)
--data
--PST
--...(some files)
--paper-xml(load competition dataset)
This notebook processes data for this competition and creates hand-crafted features. It corresponds to the section "Data Extraction from XML Files."
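As a rough illustration of what extraction from the Grobid XML files can look like, the sketch below parses a TEI document, collects the bibliography entries, and counts how often each one is cited in the text. The function name `extract_references`, the `n_citations` feature, and the inline sample document are assumptions made for this example; the notebook's actual features may differ.

```python
import xml.etree.ElementTree as ET

# Grobid emits TEI XML; all elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def extract_references(tei_xml: str) -> dict:
    """Map each bibliography id to its title and in-text citation count.

    Hypothetical helper for illustration, not the repository's code.
    """
    root = ET.fromstring(tei_xml)
    refs = {}
    # Bibliography entries: <listBibl><biblStruct xml:id="b0">...
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        ref_id = bibl.get(XML_ID)
        title_el = bibl.find(".//tei:title", TEI_NS)
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        refs[ref_id] = {"title": title, "n_citations": 0}
    # In-text citation markers: <ref type="bibr" target="#b0">
    for ref in root.iterfind(".//tei:ref[@type='bibr']", TEI_NS):
        target = (ref.get("target") or "").lstrip("#")
        if target in refs:
            refs[target]["n_citations"] += 1
    return refs


# Tiny hand-written TEI document standing in for a real Grobid output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body><p>Prior work <ref type="bibr" target="#b0">[1]</ref> extends
    <ref type="bibr" target="#b1">[2]</ref>; we build on
    <ref type="bibr" target="#b0">[1]</ref>.</p></body>
    <back><listBibl>
      <biblStruct xml:id="b0"><analytic><title level="a">First paper</title></analytic></biblStruct>
      <biblStruct xml:id="b1"><monogr><title level="m">Second paper</title></monogr></biblStruct>
    </listBibl></back>
  </text>
</TEI>"""

refs = extract_references(SAMPLE)
```

Here `refs` maps each reference id (`b0`, `b1`) to its title and the number of times it is cited in the body, which is one simple hand-crafted signal of a reference's importance.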
This notebook generates text embedding features using the text embedding model 'multilingual-E5-large.' It corresponds to the section "Generation of Textual Features."
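A minimal sketch of generating such embeddings is shown below. It assumes the model is loaded through the `sentence-transformers` package under the Hugging Face id `intfloat/multilingual-e5-large`; the notebook may load it differently. The helper names `add_e5_prefix` and `embed_texts` are hypothetical. E5 models expect each input to start with `query: ` or `passage: `, so the prefix is added before encoding.

```python
def add_e5_prefix(texts, kind="passage"):
    """Prepend the 'query: ' or 'passage: ' prefix that E5 models expect."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


def embed_texts(texts, model_name="intfloat/multilingual-e5-large", kind="passage"):
    """Return L2-normalized embeddings for `texts` (hypothetical helper).

    The import is deferred so that the prefix helper above can be used
    without sentence-transformers installed. Calling this function
    downloads the model weights on first use.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(add_e5_prefix(texts, kind), normalize_embeddings=True)


prefixed = add_e5_prefix(["Attention is all you need"])
```

Because the embeddings are normalized, the dot product of two vectors returned by `embed_texts` is their cosine similarity, which can be used directly as a feature.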
This notebook creates graph features using the node embedding model 'node2vec.' It corresponds to the section "Generation of Network Features."
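To make the node2vec idea concrete, the sketch below implements only its walk-generation step in plain Python: second-order random walks biased by the return parameter `p` and the in-out parameter `q`. The adjacency-dict representation, the function names, and the toy graph are assumptions for illustration; the notebook may rely on an existing node2vec library instead.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one biased random walk of up to `length` nodes.

    adj maps each node to a list of its neighbors (undirected graph).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = adj.get(cur, [])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        prev_neighbors = set(adj.get(prev, []))
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in prev_neighbors:
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move outward, distance 2 from prev
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, num_walks, length, p=1.0, q=1.0, seed=0):
    """Start `num_walks` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(adj, node, length, p, q, rng))
    return walks


# Toy citation graph, treated as undirected.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = generate_walks(adj, num_walks=2, length=5, p=0.5, q=2.0)
```

In node2vec, the resulting walks are treated as sentences and passed to a skip-gram model (for example gensim's `Word2Vec`) to obtain one embedding vector per node.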
These notebooks handle model training, ensembling with data augmentation, and inference using the best ensemble model. They correspond to the sections "Model Training and Inference" and "Experiments."
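One common way to ensemble several trained models is a weighted average of their predicted scores. The sketch below shows that generic technique only; it is not necessarily the ensembling scheme used in these notebooks, and the function name `blend_predictions` and the example weights are assumptions.

```python
def blend_predictions(predictions, weights=None):
    """Weighted average of per-model score lists (generic sketch).

    predictions: one list of scores per model, all of the same length.
    weights: one non-negative weight per model; defaults to equal weights.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n)
    ]


# Two hypothetical models scoring the same two references.
blended = blend_predictions([[0.0, 1.0], [1.0, 0.0]], weights=[3, 1])
```

The weights would typically be chosen on a held-out validation split, so that the blend is not tuned on the data used for final evaluation.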