
GPN (Genomic Pre-trained Network)

Code and resources from GPN and related genomic language models.

Installation

pip install git+https://github.com/songlab-cal/gpn.git

Quick start

import gpn.model  # registers the GPN model classes with the transformers Auto* loaders
from transformers import AutoModelForMaskedLM, AutoModel

model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")  # GPN
# or
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")  # GPN-MSA
# or
model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)  # PhyloGPN
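
To sanity-check a loaded model, here is a minimal embedding sketch (it assumes the checkpoint also loads under AutoModel and bundles its tokenizer, as in the example notebooks):

import torch
import gpn.model  # registers GPN architectures with transformers
from transformers import AutoModel, AutoTokenizer

model_path = "songlab/gpn-brassicales"
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)  # assumed bundled with the checkpoint
model.eval()

seq = "ACGT" * 128  # toy 512-bp sequence
input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(input_ids=input_ids)
print(out.last_hidden_state.shape)  # (1, sequence_length, hidden_size)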

Modeling frameworks

| Model | Paper | Notes |
|---|---|---|
| GPN | Benegas et al. 2023 | Requires unaligned genomes |
| GPN-MSA | Benegas et al. 2025 | Requires aligned genomes for both training and inference |
| PhyloGPN | Albors et al. 2025 | Uses an alignment during training, but does not require it for inference or fine-tuning |

Applications of the models

| Paper | Model | Dataset | Code | Resources on HuggingFace 🤗 |
|---|---|---|---|---|
| Benegas et al. 2023 | GPN | Arabidopsis and other Brassicales plants | analysis/gpn_arabidopsis | Model, dataset, intermediate results |
| Benegas et al. 2025 | GPN-MSA | Human and other vertebrates | analysis/gpn-msa_human | Model, dataset, benchmarks, predictions |
| Benegas et al. 2025b | GPN | Animal promoters | analysis/gpn_animal_promoter | Model, dataset, benchmarks |

GPN

Also known as GPN-SS (single-sequence).

Examples

  • Play with the model: examples/ss/basic_example.ipynb (also runnable on Colab)

Training on your own data

  1. Snakemake workflow to create a dataset
    • Can automatically download data from NCBI given a list of accessions, or use your own fasta files.
  2. Training
    • Will automatically detect all available GPUs.
    • Track metrics on Weights & Biases
    • Implemented encoders: convnet (default), roformer (Transformer), bytenet
    • Specify config overrides: e.g. --config_overrides encoder=bytenet,num_hidden_layers=30
    • How many steps you can train without overfitting depends on the size and diversity of your dataset.
    • Example (a note on the batch-size flags follows this list):
WANDB_PROJECT=your_project torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_mlm --do_train --do_eval \
    --report_to wandb --prediction_loss_only True --remove_unused_columns False \
    --dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
    --soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
    --weight_decay 0.01 --optim adamw_torch \
    --dataloader_num_workers 16 --seed 42 \
    --save_strategy steps --save_steps 10000 --evaluation_strategy steps \
    --eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
    --learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
    --run_name your_run --output_dir your_output_dir --model_type GPN \
    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1 --total_batch_size 2048 \
    --torch_compile \
    --ddp_find_unused_parameters False \
    --bf16 --bf16_full_eval \
  3. Extract embeddings
    • Input file requires columns chrom, start, end (a sketch for building this file follows the list).
    • Example:
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
    results.parquet --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
  4. Variant effect prediction
    • Input file requires columns chrom, pos, ref, alt (a sketch follows the list).
    • Example:
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
    --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
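
A note on the batch-size flags in the training example: assuming the usual relation total batch size = per-device batch size × number of GPUs × gradient accumulation steps, --total_batch_size 2048 with --per_device_train_batch_size 512 and --gradient_accumulation_steps 1 corresponds to 4 GPUs (512 × 4 × 1 = 2048). Adjust these flags together to match your hardware.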
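
For step 3, the windows input file can be built with pandas. A minimal sketch (column names come from the requirement above; treating coordinates as 0-based half-open is an assumption here):

import pandas as pd

# hypothetical 100-bp windows on chromosome 1 (assumed 0-based, half-open)
windows = pd.DataFrame({
    "chrom": ["1", "1", "1"],
    "start": [0, 100, 200],
    "end": [100, 200, 300],
})
windows.to_parquet("windows.parquet", index=False)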
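
Likewise for step 4, a minimal variants file; treating pos as 1-based (VCF-style) is an assumption here:

import pandas as pd

# hypothetical SNVs; ref/alt are single nucleotides
variants = pd.DataFrame({
    "chrom": ["1", "2"],
    "pos": [1000, 2000],  # assumed 1-based, as in VCF
    "ref": ["A", "C"],
    "alt": ["G", "T"],
})
variants.to_parquet("variants.parquet", index=False)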

GPN-MSA

Examples

  • Play with the model: examples/msa/basic_example.ipynb
  • Variant effect prediction: examples/msa/vep.ipynb
  • Training (human): examples/msa/training.ipynb

Training on other species (e.g. other vertebrates, plants)

PhyloGPN

PhyloGPN is a convolutional neural network that takes encoded DNA sequences as input and outputs rate matrix parameters for Felsenstein's 1981 model (F81 for short). It was trained to maximize the likelihood of columns in the Zoonomia alignment given a phylogenetic tree. The stationary distribution of the F81 substitution process is a probability distribution over the four nucleotides at each locus, indicating the relative viability of each allele; because PhyloGPN predicts this distribution from sequence alone, it is formally a (single-sequence) genomic language model. It can be used for transfer learning and zero-shot SNV deleteriousness prediction, and is especially useful for sequences that are not directly in the human reference genome.
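
For background (standard F81 facts, not specific to this repo): under F81, the substitution rate from nucleotide $i$ to nucleotide $j$ is

$$q_{ij} = \mu \, \pi_j \quad (i \neq j),$$

so the stationary distribution of the process is exactly the parameter vector $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$. PhyloGPN's predicted rate-matrix parameters thus include $\pi$, the per-locus allele distribution referenced above.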

Citation

GPN:

@article{benegas2023dna,
  title={DNA language models are powerful predictors of genome-wide variant effects},
  author={Benegas, Gonzalo and Batra, Sanjit Singh and Song, Yun S},
  journal={Proceedings of the National Academy of Sciences},
  volume={120},
  number={44},
  pages={e2311219120},
  year={2023},
  publisher={National Academy of Sciences}
}

GPN-MSA:

@article{benegas2025dna,
  title={A DNA language model based on multispecies alignment predicts the effects of genome-wide variants},
  author={Benegas, Gonzalo and Albors, Carlos and Aw, Alan J and Ye, Chengzhong and Song, Yun S},
  journal={Nature Biotechnology},
  pages={1--6},
  year={2025},
  publisher={Nature Publishing Group}
}