ToyotaInfoTech/kddcup2024-oagpst-solution

Introduction

This repository contains the sample code that achieved 5th place in the KDD 2024 OAG-Challenge PST task. The technical report is available here: "Leveraging Hybrid Embeddings and Data Augmentation for Identifying Significant References".

For details of the approach, please refer to the report.

Prerequisites

  • Linux System
  • Python 3.10
  • CUDA 12.0
  • NVIDIA A100 80G

Installation

Clone this repository.

git clone 
cd kddcup_oag-challenge-pst_rank6

Install the dependencies:

pip install -r requirements.txt

PST Dataset

The dataset can be downloaded from BaiduPan (password: bft3), Aliyun, or Dropbox. The paper XML files were generated from the paper PDFs using the Grobid API. In addition, download DBLP-Citation-network V16 from DBLP and OAG version 3.1 from OAG, extract them, and place them in the data folder.

Directory structure

--kddcup2024-oagpst-solution
	--script
		--...(some files)
	--data
		--PST
			--...(some files)
			--paper-xml (load the competition dataset here)
Main Approach

Description of the files

1_data_manipulation.ipynb

This notebook processes data for this competition and creates hand-crafted features. It corresponds to the section "Data Extraction from XML Files."
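As an illustration of this kind of extraction, here is a minimal sketch that reads a Grobid TEI document, lists its bibliography entries, and counts how often each one is cited in the body. It is not the notebook's code: the function name `extract_references`, the `n_citations` feature, and the sample document are assumptions made for this example. Only the TEI namespace and the `biblStruct` / `ref type="bibr"` layout follow Grobid's output format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree exposes xml:id


def extract_references(tei_xml: str) -> list[dict]:
    """Return one record per bibliography entry (hypothetical feature set)."""
    root = ET.fromstring(tei_xml)

    # Count in-text citation markers such as <ref type="bibr" target="#b0">.
    counts = Counter(
        ref.get("target").lstrip("#")
        for ref in root.iter(f"{{{TEI}}}ref")
        if ref.get("type") == "bibr" and ref.get("target")
    )

    records = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        ref_id = bibl.get(XML_ID)
        title_el = bibl.find(".//tei:title", TEI_NS)
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        records.append(
            {"ref_id": ref_id, "title": title, "n_citations": counts.get(ref_id, 0)}
        )
    return records


# Tiny made-up TEI document, used only to exercise the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>We build on <ref type="bibr" target="#b0">[1]</ref> and compare with
      <ref type="bibr" target="#b1">[2]</ref>. As in
      <ref type="bibr" target="#b0">[1]</ref>, we use the same setup.</p>
    </body>
    <back>
      <listBibl>
        <biblStruct xml:id="b0">
          <analytic><title level="a">First cited paper</title></analytic>
        </biblStruct>
        <biblStruct xml:id="b1">
          <analytic><title level="a">Second cited paper</title></analytic>
        </biblStruct>
      </listBibl>
    </back>
  </text>
</TEI>"""

refs = extract_references(SAMPLE)
for r in refs:
    print(r)
```

The citation count is only one possible hand-crafted feature; the notebook and the report describe the features actually used.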

2_text_embedding.ipynb

This notebook generates text embedding features using the text embedding model multilingual-E5-large. It corresponds to the section "Generation of Textual Features."
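A rough sketch of how such embeddings can be produced is shown below. It assumes the `sentence-transformers` package and the Hugging Face model id `intfloat/multilingual-e5-large`; the notebook may load the model differently. E5 models expect each input to start with `query: ` or `passage: `, so the sketch adds that prefix first. The helper names `with_e5_prefix` and `embed_texts` are hypothetical.

```python
def with_e5_prefix(texts, kind="passage"):
    """Prepend the 'query: ' or 'passage: ' prefix that E5 models expect."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text.strip()}" for text in texts]


def embed_texts(texts, kind="passage", model_name="intfloat/multilingual-e5-large"):
    """Encode texts into L2-normalized vectors (downloads the model on first use)."""
    # Imported lazily so the prefix helper works without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(with_e5_prefix(texts, kind), normalize_embeddings=True)


# Example call (commented out because it downloads a large model):
# vectors = embed_texts(["Title of a cited paper", "Citation context sentence"])
# print(vectors.shape)
```

Because the vectors are normalized, a dot product between two of them gives their cosine similarity.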

3_network_processing.ipynb

This notebook creates graph features using the node embedding model node2vec. It corresponds to the section "Generation of Network Features."
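To give a feel for what node2vec does, the sketch below implements only its biased second-order random walk in plain Python. It is not taken from the notebook: the function `node2vec_walk`, the toy `CITATIONS` graph, and the chosen `p` and `q` values are assumptions for illustration. In the full method, many such walks are fed to a skip-gram model to learn the node embeddings, which is not shown here.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one node2vec walk over an undirected graph given as {node: set(neighbors)}."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj.get(cur, ()))  # sorted for reproducibility
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            # First step has no previous node, so pick uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj.get(prev, ()):
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy undirected citation graph (made-up paper ids).
CITATIONS = {
    "paper_a": {"paper_b", "paper_c"},
    "paper_b": {"paper_a", "paper_c"},
    "paper_c": {"paper_a", "paper_b", "paper_d"},
    "paper_d": {"paper_c"},
}

walk = node2vec_walk(CITATIONS, "paper_a", length=6, p=1.0, q=0.5, rng=random.Random(0))
print(walk)
```

A small `q` pushes walks outward (more exploration), while a small `p` makes them return to the previous node more often.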

4_createMLdataset_and_train.ipynb and 5_inference.ipynb

These notebooks handle model training, ensembling with data augmentation, and inference using the best ensemble model. They correspond to the sections "Model Training and Inference" and "Experiments."
