Code and resources from GPN and related genomic language models.
- Installation
- Quick start
- Modeling frameworks
- Applications of the models
- GPN
- GPN-MSA
- PhyloGPN
- Citation
## Installation

```bash
pip install git+https://github.com/songlab-cal/gpn.git
```
## Quick start

```python
import gpn.model  # registers GPN model classes with transformers' Auto* loaders
from transformers import AutoModelForMaskedLM, AutoModel

model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")
# or
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")
# or
model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)
```
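As a rough, unofficial sketch of what you can do with a loaded checkpoint, the snippet below masks one position and reads off the model's nucleotide probabilities. The tokenizer name is taken from the training command further down; the assumptions that it emits one lowercase token per nucleotide and adds no special tokens should be checked against `tokenizer.get_vocab()`.

```python
# Hedged sketch (not an official example): probe a GPN checkpoint's
# masked-nucleotide distribution at one position. Assumes the tokenizer used
# in the training command below, one token per nucleotide with no special
# tokens, and a lowercase a/c/g/t vocabulary -- check tokenizer.get_vocab().
import torch
import gpn.model  # noqa: F401  (registers GPN classes with transformers)
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gonzalobenegas/tokenizer-dna-mlm")
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")
model.eval()

seq = "ACGT" * 128  # toy 512-bp window; use real genomic context in practice
input_ids = tokenizer(
    seq, return_tensors="pt", return_attention_mask=False, return_token_type_ids=False
)["input_ids"]
pos = input_ids.shape[1] // 2          # interrogate the center position
input_ids[0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=input_ids).logits
probs = torch.softmax(logits[0, pos], dim=-1)
for nt in "acgt":
    print(nt, float(probs[tokenizer.get_vocab()[nt]]))
```

The `gpn.ss.run_vep` command shown later automates this kind of masked scoring over a whole table of variants.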
## Modeling frameworks

| Model | Paper | Notes |
|---|---|---|
| GPN | Benegas et al. 2023 | Requires only unaligned genomes |
| GPN-MSA | Benegas et al. 2025 | Requires aligned genomes for both training and inference |
| PhyloGPN | Albors et al. 2025 | Uses an alignment during training, but does not require one for inference or fine-tuning |
## Applications of the models

| Paper | Model | Dataset | Code | Resources on HuggingFace 🤗 |
|---|---|---|---|---|
| Benegas et al. 2023 | GPN | Arabidopsis and other Brassicales plants | analysis/gpn_arabidopsis | Model, dataset, intermediate results |
| Benegas et al. 2025 | GPN-MSA | Human and other vertebrates | analysis/gpn-msa_human | Model, dataset, benchmarks, predictions |
| Benegas et al. 2025b | GPN | Animal promoters | analysis/gpn_animal_promoter | Model, dataset, benchmarks |
## GPN

Can also be called GPN-SS (single sequence).
- Snakemake workflow to create a dataset
  - Can automatically download data from NCBI given a list of accessions, or use your own FASTA files.
- Training
  - Automatically detects all available GPUs.
  - Tracks metrics on Weights & Biases.
  - Implemented encoders: `convnet` (default), `roformer` (Transformer), `bytenet`.
  - Config overrides can be specified on the command line, e.g. `--config_overrides encoder=bytenet,num_hidden_layers=30`.
  - The number of steps you can train for without overfitting depends on the size and diversity of your dataset.
  - Example:
```bash
WANDB_PROJECT=your_project torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_mlm --do_train --do_eval \
    --report_to wandb --prediction_loss_only True --remove_unused_columns False \
    --dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
    --soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
    --weight_decay 0.01 --optim adamw_torch \
    --dataloader_num_workers 16 --seed 42 \
    --save_strategy steps --save_steps 10000 --evaluation_strategy steps \
    --eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
    --learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
    --run_name your_run --output_dir your_output_dir --model_type GPN \
    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 \
    --gradient_accumulation_steps 1 --total_batch_size 2048 \
    --torch_compile \
    --ddp_find_unused_parameters False \
    --bf16 --bf16_full_eval
```
- Extract embeddings
  - Input file requires columns `chrom`, `start`, `end`
  - Example:
```bash
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
    results.parquet --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
```
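A minimal sketch of how the windows file might be prepared with pandas: only the required column names come from the notes above; the chromosome names, coordinates, and the assumption of one output row per window are illustrative.

```python
# Hedged sketch: build the windows input file and inspect the output.
# Only the required columns (chrom, start, end) come from the notes above;
# everything else is illustrative.
import pandas as pd

windows = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "start": [1_000_000, 1_000_512, 2_500_000],
    "end":   [1_000_512, 1_001_024, 2_500_512],
})
windows.to_parquet("windows.parquet", index=False)

# After running gpn.ss.get_embeddings, results.parquet is assumed to hold
# one embedding row per input window.
embeddings = pd.read_parquet("results.parquet")
print(embeddings.shape)
```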
- Variant effect prediction
  - Input file requires columns `chrom`, `pos`, `ref`, `alt`
  - Example:
```bash
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
    --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
```
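Similarly, a hedged sketch of preparing the variants file and reading back the scores: the column names come from the notes above, while the position convention, output layout, and score interpretation are assumptions to verify against your own results.

```python
# Hedged sketch: build the variants input file and inspect the scores.
# Only the required columns (chrom, pos, ref, alt) come from the notes above;
# positions are assumed to be 1-based (VCF-style) and the output layout is
# illustrative.
import pandas as pd

variants = pd.DataFrame({
    "chrom": ["1", "1"],
    "pos":   [1_000_100, 2_500_250],
    "ref":   ["A", "C"],
    "alt":   ["G", "T"],
})
variants.to_parquet("variants.parquet", index=False)

# After running gpn.ss.run_vep, results.parquet is assumed to hold one score
# per variant (a log-likelihood ratio of alt vs. ref, where more negative
# values suggest more disruptive variants).
scores = pd.read_parquet("results.parquet")
print(scores.head())
```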
## GPN-MSA

- Play with the model: `examples/msa/basic_example.ipynb`
- Variant effect prediction: `examples/msa/vep.ipynb`
- Training (human): `examples/msa/training.ipynb`
- See #28 and #40.
- Another source for plant alignments: https://plantregmap.gao-lab.org/download.php#alignment-conservation
## PhyloGPN

PhyloGPN is a convolutional neural network that takes encoded DNA sequences as input and outputs rate matrix parameters for Felsenstein's 1981 substitution model (F81 for short). It was trained to maximize the likelihood of columns in the Zoonomia alignment given a phylogenetic tree. The stationary distribution of the substitution process described by the F81 model indicates the relative viability of each allele at a given locus; since these outputs define a probability distribution over nucleotides at each position conditioned on the surrounding sequence, PhyloGPN is formally a (single-sequence) genomic language model. It can be used for transfer learning and zero-shot SNV deleteriousness prediction, and it is especially useful for sequences that are not directly represented in the human reference genome.
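Purely as an illustration of the F81 connection (not PhyloGPN's actual API or output format): under F81 the substitution rate from base i to base j is proportional to the stationary frequency of base j (π_j), so once per-position parameters are converted to a stationary distribution π, a zero-shot SNV score can be taken as log(π_alt) − log(π_ref).

```python
# Illustrative only: how an F81 stationary distribution at a locus can be
# turned into a zero-shot SNV score. These helpers are hypothetical and do
# not reflect PhyloGPN's actual output format.
import numpy as np

BASES = "ACGT"

def f81_rate_matrix(pi, mu=1.0):
    """F81 rate matrix: off-diagonal rate i -> j is mu * pi_j."""
    pi = np.asarray(pi, dtype=float)
    Q = mu * np.tile(pi, (4, 1))
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows sum to zero
    return Q

def zero_shot_llr(pi, ref, alt):
    """log pi_alt - log pi_ref; more negative suggests a more deleterious SNV."""
    return float(np.log(pi[BASES.index(alt)]) - np.log(pi[BASES.index(ref)]))

# Toy stationary distribution at a locus where C is strongly preferred.
pi = np.array([0.05, 0.85, 0.05, 0.05])  # order A, C, G, T
print(f81_rate_matrix(pi))
print(zero_shot_llr(pi, ref="C", alt="T"))  # large negative value
```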
## Citation

GPN:
```bibtex
@article{benegas2023dna,
  title={DNA language models are powerful predictors of genome-wide variant effects},
  author={Benegas, Gonzalo and Batra, Sanjit Singh and Song, Yun S},
  journal={Proceedings of the National Academy of Sciences},
  volume={120},
  number={44},
  pages={e2311219120},
  year={2023},
  publisher={National Acad Sciences}
}
```
GPN-MSA:
```bibtex
@article{benegas2025dna,
  title={A DNA language model based on multispecies alignment predicts the effects of genome-wide variants},
  author={Benegas, Gonzalo and Albors, Carlos and Aw, Alan J and Ye, Chengzhong and Song, Yun S},
  journal={Nature Biotechnology},
  pages={1--6},
  year={2025},
  publisher={Nature Publishing Group US New York}
}
```