
GPN (Genomic Pre-trained Network)

Code and resources from GPN and related genomic language models.

Installation

pip install git+https://github.com/songlab-cal/gpn.git

Quick start

import gpn.model  # registers the GPN model classes with the transformers Auto* loaders
from transformers import AutoModelForMaskedLM, AutoModel

model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")  # GPN
# or
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")  # GPN-MSA
# or
model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)  # PhyloGPN
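
To sanity-check a loaded model, here is a minimal embedding sketch (it assumes the checkpoint also loads under AutoModel and bundles its tokenizer, as in the example notebooks):

import torch
import gpn.model  # registers GPN architectures with transformers
from transformers import AutoModel, AutoTokenizer

model_path = "songlab/gpn-brassicales"
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)  # assumed bundled with the checkpoint
model.eval()

seq = "ACGT" * 128  # toy 512-bp sequence
input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(input_ids=input_ids)
print(out.last_hidden_state.shape)  # (1, sequence_length, hidden_size)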

Modeling frameworks

| Model | Paper | Notes |
|---|---|---|
| GPN | Benegas et al. 2023 | Requires unaligned genomes |
| GPN-MSA | Benegas et al. 2025 | Requires aligned genomes for both training and inference |
| PhyloGPN | Albors et al. 2025 | Uses an alignment during training, but does not require it for inference or fine-tuning |

Applications of the models

| Paper | Model | Dataset | Code | Resources on HuggingFace 🤗 |
|---|---|---|---|---|
| Benegas et al. 2023 | GPN | Arabidopsis and other Brassicales plants | analysis/gpn_arabidopsis | Model, dataset, intermediate results |
| Benegas et al. 2025 | GPN-MSA | Human and other vertebrates | analysis/gpn-msa_human | Model, dataset, benchmarks, predictions |
| Benegas et al. 2025b | GPN | Animal promoters | analysis/gpn_animal_promoter | Model, dataset, benchmarks |

GPN

Also known as GPN-SS (single-sequence).

Examples

  • Play with the model: examples/ss/basic_example.ipynb (also runnable on Colab)

Training on your own data

  1. Snakemake workflow to create a dataset
    • Can automatically download data from NCBI given a list of accessions, or use your own fasta files.
  2. Training
    • Will automatically detect all available GPUs.
    • Track metrics on Weights & Biases
    • Implemented encoders: convnet (default), roformer (Transformer), bytenet
    • Specify config overrides: e.g. --config_overrides encoder=bytenet,num_hidden_layers=30
    • How many steps you can train without overfitting depends on the size and diversity of your dataset.
    • Example (a note on the batch-size flags follows this list):
WANDB_PROJECT=your_project torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_mlm --do_train --do_eval \
    --report_to wandb --prediction_loss_only True --remove_unused_columns False \
    --dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
    --soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
    --weight_decay 0.01 --optim adamw_torch \
    --dataloader_num_workers 16 --seed 42 \
    --save_strategy steps --save_steps 10000 --evaluation_strategy steps \
    --eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
    --learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
    --run_name your_run --output_dir your_output_dir --model_type GPN \
    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1 --total_batch_size 2048 \
    --torch_compile \
    --ddp_find_unused_parameters False \
    --bf16 --bf16_full_eval \
  3. Extract embeddings
    • Input file requires columns chrom, start, end (a sketch for building this file follows the list).
    • Example:
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
    results.parquet --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
  4. Variant effect prediction
    • Input file requires columns chrom, pos, ref, alt (a sketch follows the list).
    • Example:
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
    --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
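
A note on the batch-size flags in the training example: assuming the usual relation total batch size = per-device batch size × number of GPUs × gradient accumulation steps, --total_batch_size 2048 with --per_device_train_batch_size 512 and --gradient_accumulation_steps 1 corresponds to 4 GPUs (512 × 4 × 1 = 2048). Adjust these flags together to match your hardware.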
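
For step 3, the windows input file can be built with pandas. A minimal sketch (column names come from the requirement above; treating coordinates as 0-based half-open is an assumption here):

import pandas as pd

# hypothetical 100-bp windows on chromosome 1 (assumed 0-based, half-open)
windows = pd.DataFrame({
    "chrom": ["1", "1", "1"],
    "start": [0, 100, 200],
    "end": [100, 200, 300],
})
windows.to_parquet("windows.parquet", index=False)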
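
Likewise for step 4, a minimal variants file; treating pos as 1-based (VCF-style) is an assumption here:

import pandas as pd

# hypothetical SNVs; ref/alt are single nucleotides
variants = pd.DataFrame({
    "chrom": ["1", "2"],
    "pos": [1000, 2000],  # assumed 1-based, as in VCF
    "ref": ["A", "C"],
    "alt": ["G", "T"],
})
variants.to_parquet("variants.parquet", index=False)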

GPN-MSA

Examples

  • Play with the model: examples/msa/basic_example.ipynb
  • Variant effect prediction: examples/msa/vep.ipynb
  • Training (human): examples/msa/training.ipynb

Training on other species (e.g. other vertebrates, plants)

PhyloGPN

PhyloGPN is a convolutional neural network that takes encoded DNA sequences as input and outputs rate matrix parameters for Felsenstein's 1981 model (F81 for short). It was trained to maximize the likelihood of columns in the Zoonomia alignment given a phylogenetic tree. The stationary distribution of the F81 substitution process is a probability distribution over the four nucleotides at each locus, indicating the relative viability of each allele; because PhyloGPN predicts this distribution from sequence alone, it is formally a (single-sequence) genomic language model. It can be used for transfer learning and zero-shot SNV deleteriousness prediction, and is especially useful for sequences that are not directly in the human reference genome.
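
For background (standard F81 facts, not specific to this repo): under F81, the substitution rate from nucleotide $i$ to nucleotide $j$ is

$$q_{ij} = \mu \, \pi_j \quad (i \neq j),$$

so the stationary distribution of the process is exactly the parameter vector $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$. PhyloGPN's predicted rate-matrix parameters thus include $\pi$, the per-locus allele distribution referenced above.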

Citation

GPN:

@article{benegas2023dna,
  title={DNA language models are powerful predictors of genome-wide variant effects},
  author={Benegas, Gonzalo and Batra, Sanjit Singh and Song, Yun S},
  journal={Proceedings of the National Academy of Sciences},
  volume={120},
  number={44},
  pages={e2311219120},
  year={2023},
  publisher={National Academy of Sciences}
}

GPN-MSA:

@article{benegas2025dna,
  title={A DNA language model based on multispecies alignment predicts the effects of genome-wide variants},
  author={Benegas, Gonzalo and Albors, Carlos and Aw, Alan J and Ye, Chengzhong and Song, Yun S},
  journal={Nature Biotechnology},
  pages={1--6},
  year={2025},
  publisher={Nature Publishing Group}
}