Skip to content

ZhuLab-Fudan/GORetriever

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GORetriever

Official code for GORetriever

Requirements

python=3.8

sentence-transformers == 2.2.2

pyserini == 0.10.1.0

pygaggle == 0.0.3.1

faiss-cpu == 1.7.4

numpy == 1.23.5

nltk == 3.8.1

Inference

Download cross_encoder from https://drive.google.com/file/d/11W51FnM62Z79qGPkuZHRzAv6Bx_L1Mah/view?usp=sharing into folder cross_model

python predict.py \
  --task [branch] \
  --pro [k] \
  --gpu [n] \
  --file [file]

branch: bp, mf, cc

pro: number of retrieved proteins, set k = 3 for MFO and BPO and k = 2 for CCO

gpu: cuda number, default as 0

file: filename for test set, eg: ./data/test.txt

eg:

python predict.py --task bp --pro 3 --gpu 0

Start your own experiments

If you want to start your new experiment, delete the files in folders test, file, pro_index, go_index. Otherwise the program will reason based on these pre-existing files.

Rewrite:

  • *_pro2go.npy - proteinid -> GO list dictionary in the training data
  • pmid2text.npy - Pubmed ID -> title and abstract dictionary for the PubMed article you will use in the training and testing data
  • proid2name.npy - proteinid -> protein name dictionary for the PubMed article you will use in the training and testing data
  • go_index, pro_index - index built by pyserini

Delete (will be rebuilt after running predict.py):

  • *_dev_t5_texts.npy - extracted sentences
  • *_retrieval_all.npy
  • *_retrieval_pro_3.npy - retrieval result
  • *_t5_scores.npy - score cache

File Structure

├── cross_model - Download

└────── *_PubMedBERT_epoch{}

└────── *_PubMedBERT_epoch1

├── data - training and testing data

└────── golden - evaluate files

└────── *_train.txt - train set

└────── test.txt - test set

├── file

└────── *_pro2go.npy - proteinid -> GO list

└────── pmid2text.npy - Pubmed ID -> title and abstract

└────── proid2name.npy - proteinid -> protein name

├── test

└────── *_dev_t5_texts.npy - extracted sentences

└────── *_retrieval_all.npy

└────── *_retrieval_pro_3.npy - retrieval result

└────── *_t5_scores.npy - score cache

├── go_index

├── pro_index

├── result - final result

├── predict.py

└── data_preprocess.py

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages