SingleMod

by Ying-Yuan Xie, Zhen-Dong Zhong, Hong-Xuan Chen
Correspondence to: [email protected] and [email protected]

SingleMod is a deep learning model designed for the precise detection of m6A within single RNA molecules using nanopore direct RNA sequencing (DRS) data. It is trained through a deep multiple instance regression framework, carefully tailored to harness the extensive methylation-rate labels. SingleMod is a generalizable framework that can easily be adapted to train models for other nucleic acid modifications.

Note: SingleMod supports m6A prediction from direct RNA sequencing data generated using the RNA002 kit or the latest RNA004 kit.

Requisites

Data preparation:

Data	Note
fast5 or pod5 files	containing raw current signals
reference.fa	genome.fa or transcript.fa, we recommend genome.fa
methylation_rate.bed	methylation-rate labels, needed only for training your own models

Environment:
Our pipeline runs on Linux.
Python 3.6+ is required.

Software:

Tool	Usage	Note
Guppy	generate fastq from fast5 through basecalling	skip if your fast5 files have already been basecalled
dorado	generate fastq from pod5 through basecalling	for RNA004 data; skip if your pod5 files have already been basecalled
minimap2	align reads to reference.fa
samtools	bam files processing
bedtools	bed files processing
Picard	split bam file to multiple ones	allowing for parallel processing, significantly saving time
nanopolish	eventalign, assign current signals to bases	we use version 0.13.2
pod5	convert pod5 format to fast5 format	for RNA004 data
f5c	eventalign, assign current signals to bases	for RNA004 data

Python modules:

Module	Usage	Note
torch	an open source Python machine learning library
pysam	bam files processing	mark m6A onto molecules for visualization in IGV
adabound	optimize the model's parameters	needed only for training your own models

Check that you have the following basic packages: numpy, os, re, random, scipy, datetime, collections, argparse, multiprocessing, array
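
A minimal sketch for checking these packages before running the pipeline. It relies only on Python's standard importlib; the function name missing_modules and the two lists are illustrative and are not part of SingleMod.

```python
import importlib.util

# Modules named in this README; adabound is only needed for training.
REQUIRED = ["torch", "pysam", "numpy", "scipy", "os", "re", "random",
            "datetime", "collections", "argparse", "multiprocessing", "array"]
TRAINING_ONLY = ["adabound"]


def missing_modules(names):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing (prediction):", missing_modules(REQUIRED))
    print("missing (training):", missing_modules(TRAINING_ONLY))
```

An empty list means every module in that group can be imported; anything listed needs to be installed first.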

SingleMod code (https://github.com/xieyy46/SingleMod-v1/tree/main/SingleMod):

Code	Usage	Note
organize_from_eventalign.py	extract and organize raw signals from nanopolish eventalign results
merge_motif_npy.py	extract and organize raw signals into different motifs
SingleMod_m6A_prediction.py	predict m6A modification within each molecule
bam_mark_m6A.py	mark m6A modifications into bam file for visualization of single-molecule m6A
SingleMod_train.py	training your own models	needed only for training your own models

SingleMod models:
RNA002 (mammal): https://github.com/xieyy46/SingleMod-v1/tree/main/models/RNA002/mammal
RNA002 (non-mammal): https://github.com/xieyy46/SingleMod-v1/tree/main/models/RNA002/non-mammal
RNA004: https://github.com/xieyy46/SingleMod-v1/tree/main/models/RNA004

Running SingleMod

By following our pipeline, beginners in DRS can easily generate single-molecule m6A profiles.
You are welcome to use our test data (including both RNA002 and RNA004) for end-to-end practice; we also provide the expected results for each step: https://github.com/xieyy46/SingleMod-v1/tree/main/test.
Some key parameters of SingleMod are described below, but we recommend reviewing all parameters of each script before use by running python xx.py -h.

1, basecalling # skip if your data have already been basecalled

#RNA002:
guppy_basecaller -i fast5_dir -s basecall_output_dir -c rna_r9.4.1_70bps_hac.cfg -x 'cuda:all'

#RNA004:
dorado basecaller [email protected] pod5_dir -x 'cuda:all' > basecall_output_dir/calls.bam
dorado summary basecall_output_dir/calls.bam > basecall_output_dir/calls.summary
samtools fastq basecall_output_dir/calls.bam  > basecall_output_dir/calls.fastq

fast5_dir: path to directory containing your fast5 files (xxx.fast5).
pod5_dir: path to directory containing your pod5 files (xxx.pod5).
basecall_output_dir: path to directory containing outputs during basecalling process.
-x: specify cuda index to use GPU for basecalling.

2, mapping and splitting bam file

mkdir split_bam_dir

#mapping
#RNA002:
cat basecall_output_dir/pass/*fastq > basecall_output_dir/merge.fastq # skip if you have already merged your fastq files

#RNA004:
mv basecall_output_dir/calls.fastq basecall_output_dir/merge.fastq

#if mapping to genome.fa
minimap2 -ax splice -k 14 reference.fa -t 25 --secondary=no basecall_output_dir/merge.fastq -o sample_name.sam # skip if you have already mapped your reads
#if mapping to transcript.fa
minimap2 -ax map-ont -k 14 reference.fa -t 25 --secondary=no basecall_output_dir/merge.fastq -o sample_name.sam # skip if you have already mapped your reads

samtools view -@ 30 -F 2048 -F 4 -b sample_name.sam | samtools sort -O BAM -@ 20  -o sample_name.bam
samtools index -@ 16 sample_name.bam

#splitting bam files for parallel processing
java -jar picard.jar SplitSamByNumberOfReads --INPUT sample_name.bam --SPLIT_TO_N_FILES 25 --OUTPUT split_bam_dir
for bam in split_bam_dir/*bam
do
{
samtools index "$bam"
} &
done
wait #wait for all background indexing jobs to finish

split_bam_dir: path to directory containing split bam files.
--SPLIT_TO_N_FILES 25: how many files sample_name.bam should be split into for the following parallel processing. This value can be adjusted.
The default prefix for split bam files is shard_xxxx.
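
If you change --SPLIT_TO_N_FILES, the list of shard names used by the parallel loops in later steps has to change with it. A small sketch that generates the names, assuming the one-based, four-digit numbering (shard_0001, shard_0002, ...) that appears in this pipeline; shard_names is a hypothetical helper, so check the actual file names in split_bam_dir before relying on it.

```python
def shard_names(n_files, prefix="shard"):
    """Return the expected split-bam prefixes, e.g. shard_0001 ... shard_0025.

    Assumes one-based, four-digit zero-padded numbering (an assumption taken
    from the shard list used in this README, not from the Picard documentation).
    """
    if n_files < 1:
        raise ValueError("n_files must be at least 1")
    return [f"{prefix}_{i:04d}" for i in range(1, n_files + 1)]


if __name__ == "__main__":
    # Space-separated output can be pasted into a bash array: batch=( ... )
    print(" ".join(shard_names(25)))
```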

3, eventalign

mkdir eventalign_output_dir

#making index
#RNA002:
nanopolish index --directory=fast5_dir --sequencing-summary=basecall_output_dir/sequencing_summary.txt basecall_output_dir/merge.fastq
# or, if you do not have sequencing_summary.txt (much slower): nanopolish index --directory=fast5_dir basecall_output_dir/merge.fastq

#RNA004:
pod5 convert to_fast5 pod5_dir/ --output fast5_dir/
f5c index --iop 10 -t 10 -d fast5_dir basecall_output_dir/merge.fastq

#run eventalign in parallel
#RNA002:
for file in split_bam_dir/*.bam
do
{
info=(${file//// })
nanopolish eventalign --reads basecall_output_dir/merge.fastq --bam $file --genome reference.fa -t 15 --scale-events --samples --signal-index --summary eventalign_output_dir/${info[-1]%%.bam}_summary.txt --print-read-names > eventalign_output_dir/${info[-1]%%.bam}_eventalign.txt
} &
done
wait #wait for all background eventalign jobs to finish

#RNA004:
for file in split_bam_dir/*.bam
do
{
info=(${file//// })
f5c eventalign -r basecall_output_dir/merge.fastq -b $file -g reference.fa -t 15 --pore rna004 --rna --scale-events --samples --signal-index --summary eventalign_output_dir/${info[-1]%%.bam}_summary.txt --print-read-names > eventalign_output_dir/${info[-1]%%.bam}_eventalign.txt
} &
done
wait #wait for all background eventalign jobs to finish
#if you run out of memory, run the files in smaller batches

eventalign_output_dir: path to the directory containing the outputs of the eventalign step (nanopolish or f5c).
sequencing_summary.txt: this file is generated during the basecalling step.

4, extracting and organizing features for m6A prediction (or model training)

mkdir tmp_features  
mkdir features

cd split_bam_dir
#convert bam to bed to extract strand information
for file in shard*bam
do
{
bedtools bamtobed -i $file > ${file/.bam/.bed}
} &
done
wait

#run in parallel (replace 002|004 below with either 002 or 004)
batch=(shard_0001 shard_0002 shard_0003 shard_0004 shard_0005 shard_0006 shard_0007 shard_0008 shard_0009 shard_0010 shard_0011 shard_0012 shard_0013 shard_0014 shard_0015 shard_0016 shard_0017 shard_0018 shard_0019 shard_0020 shard_0021 shard_0022 shard_0023 shard_0024 shard_0025)
for i in ${batch[@]}
do
{
python -u SingleMod/organize_from_eventalign.py -v 002|004 -b split_bam_dir/${i}.bed -e eventalign_output_dir/${i}_eventalign.txt -o tmp_features -p $i -s 500000
} &
done
wait
cd tmp_features #required step
wc -l *-extra_info.txt | sed 's/^ *//g' | sed '$d' | tr " " "\t"   > extra_info.txt

python -u SingleMod/merge_motif_npy.py -v 002|004 -d tmp_features -s 500000 -o features

-v: DRS kit used, choose either 002 (RNA002) or 004 (RNA004), with the default setting being 002.
-s: the first dimension (size) of the memmap files; the default is 500000. When run in 25 parallel batches, 500000 is enough to cover 5G (or ~5 million reads) of DRS data. If your data exceed 5G, set a larger size, for example 1000000 for 5-10G and 1500000 for 10-15G.
tmp_features: path to the directory containing intermediate files.
features: path to directory containing final input files to SingleMod for different motifs (including sequence.npy, signal.npy and extra.npy).
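
The guidance for -s can be written as a simple rule of thumb: 500000 for every started block of 5G of data. The sketch below encodes that rule; suggest_memmap_size is a hypothetical helper, it assumes 25 parallel batches as in this pipeline, and values beyond 15G are an extrapolation of the three examples given above rather than tested settings.

```python
import math


def suggest_memmap_size(data_g, step_g=5, step_size=500_000):
    """Suggest a value for -s from the data volume in G.

    Follows the README examples: up to 5G -> 500000, 5-10G -> 1000000,
    10-15G -> 1500000. Larger inputs simply continue the same pattern,
    which is an assumption.
    """
    if data_g <= 0:
        raise ValueError("data_g must be positive")
    return step_size * math.ceil(data_g / step_g)


if __name__ == "__main__":
    for volume in (3, 7, 12):
        print(volume, "G ->", suggest_memmap_size(volume))
```

Pass the same value to -s in both organize_from_eventalign.py and merge_motif_npy.py, as in the commands above.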

5, m6A prediction

mkdir prediction

#predicting
# we now support m6A prediction within 39 motifs for RNA002 data
for motif in AAACA AAACC AAACG AAACT AAATA AAATT AGACA AGACC AGACG AGACT AGATT ATACT CAACT CGACT CTACT GAACA GAACC GAACG GAACT GAATA GAATC GAATG GAATT GGACA GGACC GGACG GGACT GGATA GGATC GGATG GGATT GTACT TAACT TGACA TGACC TGACG TGACT TTACA TTACT
do
python -u SingleMod/SingleMod_m6A_prediction.py -v 002 -d features -k $motif -m models/model_${motif}.pth.tar -g 0 -b 30000 -o prediction/${motif}_prediction.txt
done

#36 motifs for RNA004 data
for motif in AAACA AAACC AAACG AAACT AAATA AAATT AGACA AGACC AGACG AGACT AGATT ATACT CAACT CGACT CTACT GAACA GAACC GAACT GAATA GAATG GAATT GGACA GGACC GGACG GGACT GGATA GGATC GGATG GGATT GTACT TAACA TAACT TGACA TGACC TGACT TTACT
do
python -u SingleMod/SingleMod_m6A_prediction.py -v 004 -d features -k $motif -m models/model_${motif}.pth.tar -g 0 -b 30000 -o prediction/${motif}_prediction.txt
done

#organizing the results
cat prediction/*_prediction.txt > prediction.txt

#calculate the methylation rate of each site from the single-molecule m6A predictions
#if the predicted probability > 0.5, we consider an A base to be m6A-modified
awk 'BEGIN{OFS=FS="\t"}{split($1,info,"|");s=info[1]"|"info[2]"|"info[3]"|"info[5];t[s]=t[s]+1;if($2 > 0.5){m[s]=m[s]+1}}END{for(i in t){split(i,info,"|");if(i in m){print info[1],info[2]-1,info[2],i,m[i]/t[i],info[3],t[i],m[i],info[4]}else{print info[1],info[2]-1,info[2],i,0,info[3],t[i],0,info[4]}}}' prediction.txt | sort -k1,1 -k2,2n > mr.bed

prediction: directory containing m6A prediction results.
models: directory containing SingleMod models.
-g: cuda index, default is using CPU; if you use GPU, please specify the cuda index.
-b: batch size for m6A prediction, default is 30000; if you use CPU to make prediction, you can use a larger batch size.
prediction.txt is the final result containing the single-molecule m6A predictions; its format is as follows:
(chromosome|location|strand|read_name|motif probability) # Tab-separated
chr14|3864706|+|90e1832b-38e5-40c3-944d-b7cfd1407ad6|AAACA	0.9866609573364258
chr5|747885|+|388ca3b1-1353-4dbc-a5c9-b3fdf0ed5818|AAACA	4.8746630335547135e-34
we use a threshold of 0.5 to determine m6A modification (probability > 0.5)
mr.bed contains the methylation rate of each site calculated from the single-molecule m6A predictions; its format is as follows:
(chromosome location-1 location ID methylation_rate strand total_molecule_number modified_molecule_number motif) # Tab-separated
chr1	16677290	16677291	chr1|16677291|+|AAACA	0.156863	+	51	8	AAACA
chr1	16677516	16677517	chr1|16677517|+|AAACA	0.037736	+	53	2	AAACA
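
For readers who prefer Python to awk, the same site-level calculation can be sketched as follows. It assumes prediction.txt is tab-separated, as the awk command in this step expects; the function names are illustrative and are not part of SingleMod.

```python
from collections import defaultdict


def parse_prediction_line(line):
    """Split one prediction.txt record into its fields."""
    key, prob = line.rstrip("\n").split("\t")
    chrom, pos, strand, read_name, motif = key.split("|")
    return chrom, int(pos), strand, read_name, motif, float(prob)


def methylation_rates(lines, threshold=0.5):
    """Aggregate single-molecule predictions into mr.bed-style rows."""
    total = defaultdict(int)     # molecules covering each site
    modified = defaultdict(int)  # molecules called m6A at each site
    for line in lines:
        if not line.strip():
            continue
        chrom, pos, strand, _read, motif, prob = parse_prediction_line(line)
        site = (chrom, pos, strand, motif)
        total[site] += 1
        if prob > threshold:
            modified[site] += 1
    rows = []
    for site in sorted(total):
        chrom, pos, strand, motif = site
        n, m = total[site], modified.get(site, 0)
        site_id = f"{chrom}|{pos}|{strand}|{motif}"
        rows.append((chrom, pos - 1, pos, site_id, m / n, strand, n, m, motif))
    return rows


if __name__ == "__main__":
    with open("prediction.txt") as handle:
        for row in methylation_rates(handle):
            print("\t".join(str(field) for field in row))
```

Unlike the awk pipeline, this sketch sorts chromosomes and positions with plain Python tuple ordering and prints rates at full floating-point precision, so the row order and number formatting can differ slightly from mr.bed.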

6, visualization of single-molecule m6A in IGV (optional)

mkdir marked

#marking m6A modification information into bam file
cd split_bam_dir
for file in *bam
do
{
python -u SingleMod/bam_mark_m6A.py  -p prediction.txt -b $file -o marked/$file
} &
done
wait

#merge and index
cd marked
for file in *bam; do samtools index $file;done
samtools merge -@ 20 merge.bam *bam
samtools index merge.bam
rm shard*

marked: path to directory containing m6A-marked bam file
in bam_mark_m6A.py, we use a threshold of 0.5 to determine m6A modification (probability > 0.5)

Load the m6A-marked bam file into IGV. A representative gene snapshot is shown below (the gray and black vertical bars indicate A and m6A, respectively):
#To achieve a clearer visualization, some IGV settings need to be modified. Please follow: https://github.com/xieyy46/SingleMod-v1/blob/main/test/RNA002/marked/README.md

Training your own models

You can train new models to detect modifications at single-molecule level, including m6A models in other motifs, or models for other types of modifications.

mkdir training
mkdir training/motif
mkdir training/motif/rep

python -u SingleMod/SingleMod_train.py -v 002|004 -s Sample1,Sample2,...,SampleN \
-seq Sample1_features/motif_sequence.npy,Sample2_features/motif_sequence.npy,...,SampleN_features/motif_sequence.npy \
-sig Sample1_features/motif_signal.npy,Sample2_features/motif_signal.npy,...,SampleN_features/motif_signal.npy \
-ext Sample1_features/motif_extra.npy,Sample2_features/motif_extra.npy,...,SampleN_features/motif_extra.npy \
-d Sample1_label,Sample2_label,...,SampleN_label \
-m motif -r 0 -g 0 \
-o training/motif/rep > training/motif/rep/training.log

training: directory containing model training results.
-v: DRS kit used, choose either 002 (RNA002) or 004 (RNA004), with the default setting being 002.
-m: motif specified.
-r: experiment batch index, used to set the random seed when splitting the data into training, validation and test sets; default is 0.
-g: cuda index, default is using CPU
-d: bed file containing the absolute quantification information (methylation rates); its format is as follows:
(chromosome location-1 location . methylation_rate strand kmer) # Tab-separated
chr1 15878 15879 . 0.0290404 - CGCCAAGCT
chr1 15939 15940 . 0.028949549999999998 - AGGGAGCTC

Citing SingleMod

Single-Molecule Direct RNA Sequencing Reveals the Shaping of Epitranscriptome Across Multiple Species, bioRxiv, 2023.11.16.567334; doi: https://doi.org/10.1101/2023.11.16.567334
