GitHub - hlilab/minisplice: Scoring GT/AG sites for improving spliced alignment

Getting Started

# compile
git clone https://github.com/lh3/minisplice
cd minisplice && make

# download vertebrate-insect pre-trained model and calibration data
wget -O- https://zenodo.org/records/15931054/files/vi2-7k.tgz | tar zxf -

# compute the splice score for GT and AG sites; see below for model training
./minisplice predict -t16 -c vi2-7k.kan.cali vi2-7k.kan genome.fa.gz > score.tsv

# use splice scores (miniprot-v0.16+ or minimap2-r1285+)
miniprot -Iut16 --gff -j2 --spsc=score.tsv genome.fa.gz proteins.faa > align.gff
minimap2 -cxsplice:hq -t16 --spsc=score.tsv genome.fa.gz rna-seq.fq > align.paf

# use pre-calculated human or Drosophila scores
wget https://zenodo.org/records/15931054/files/human-GRCh38.v4.tsv.gz
miniprot -Iut16 --gff -j2 --spsc human-GRCh38.vi2-7k.tsv.gz hg38.fa proteins.faa

Introduction

What: minisplice is a command-line tool to estimate the odds-ratio score of canonical donor (GT) and acceptor (AG) splice sites. It is intended to be used with miniprot (v0.16+) or minimap2 (r1285+) for improving alignment accuracy especially for distant homologs. Pre-trained models and pre-computed splice scores can be found at Zenodo.

Why: protein-to-genome aligners like miniprot and GeneWise are effectively gene-finders that trace open-reading frames and model splice signals. For distant homologs, the splice model plays an important role in resolving ambiguous alignment around splice junctions. Miniprot uses a simplistic model with 4-5 parameters based on human data. While this model is reasonably robust in practice, it losses power in comparison to more sophisticated solutions such as position weight matrix.

How: minisplice trains a 1D convolutional neural network (1D-CNN) on sequences around annotated and random GT- or -AG sites from the genome, calibrates the model output to empirical probability, and computes the odds-ratio score of each GT- or -AG in the genome. Training can be applied to multiple distantly related species.

Usage

Prediction

If your target genome is a vertebrate or insect, you can use pre-trained model:

./minisplice predict -t16 -c vi2-7k.kan.cali vi2-7k.kan genome.fa.gz > score.tsv

where vi2-7k.kan encodes the model and vi2-7k.kan.cali provides calibration data which is used to translate the model output to empirical probability. The output looks like:

2L  1005295  +   A   2
2L  1005339  -   D   13
2L  1005339  -   A   0
2L  1005410  -   A   -5
2L  1005415  -   A   -5

Each line gives contig/chromosome name, offset, strand, D (for donor) or A (for acceptor) and the score, which is 2log2-scaled odds ratio of the estimated probability of the site being real over the genome-wide fraction of annotated splice sites (the null model). Miniprot can take this file as input with option --spsc. Note that this option was added in miniprot-0.14 (r265), but versions before v0.16 may lead to an assertion failure.

Training

The following command lines show how to train and calibrate a model for one genome:

# convert gene annotation in GTF/GFF3 to BED12; requiring https://github.com/lh3/minigff
minigff.js all2bed -a1 anno.gtf.gz | gzip > anno-long.bed.gz  # longest protein-coding only
minigff.js all2bed anno.gtf.gz | gzip > anno-all.bed.gz       # all annotation

# generate training data
./minisplice gentrain anno-long.bed.gz genome-odd.fa.gz | gzip > train.txt.gz

# model training; 8 or 16 threads are recommended
./minisplice train -t16 -o model.kan train.txt.gz

# calibration (computing empirical odds-ratio scores)
./minisplice predict -t16 -b anno-all.bed.gz model.kan genome-even.fa.gz > model.cali

# prediction
./minisplice predict -t16 -c model.cali model.kan genome.fa.gz | gzip > score.tsv.gz

You can train on odd chromosomes and calibrate on even chromosomes. To train a model from multiple species:

./minisplice gentrain anno1-long.bed.gz genome1-odd.fa.gz | gzip > train1.txt.gz
./minisplice gentrain anno2-long.bed.gz genome2-odd.fa.gz | gzip > train2.txt.gz
cat train*.txt.gz | ./minisplice train -t16 -o model.kan -
./minisplice predict -t16 -b anno1-all.bed.gz model.kan genome1-even.fa.gz > cali1.txt
./minisplice predict -t16 -b anno1-all.bed.gz model.kan genome2-even.fa.gz > cali2.txt
script/merge-cali.js 1,1 cali1.txt cali2.txt > model.cali

Usually one genome provides enough training data. To save training time, you can subsample training data from each genome before combining them. If you subsample training data at different rates, it is recommended to provides the rates on the merge-cali.js command line.

Citing minisplice

Yang S, Huang N and Li H (2025) Improving spliced alignment by modeling splice sites with deep learning arXiv:2506.12986. DOI:10.48550/arXiv.2506.12986

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
scripts		scripts
tex		tex
.gitignore		.gitignore
Makefile		Makefile
NEWS.md		NEWS.md
README.md		README.md
bed.c		bed.c
kann.c		kann.c
kann.h		kann.h
kautodiff.c		kautodiff.c
kautodiff.h		kautodiff.h
ketopt.h		ketopt.h
khashl.h		khashl.h
kseq.h		kseq.h
ksort.h		ksort.h
main.c		main.c
minisp.h		minisp.h
misc.c		misc.c
msppriv.h		msppriv.h
predict.c		predict.c
reader.c		reader.c
strmap.c		strmap.c
train.c		train.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Getting Started

Introduction

Usage

Prediction

Training

Citing minisplice

About

Uh oh!

Releases

Packages

Languages

hlilab/minisplice

Folders and files

Latest commit

History

Repository files navigation

Getting Started

Introduction

Usage

Prediction

Training

Citing minisplice

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages