Commit b1f5fa5

Authored Nov 18, 2024
Merge pull request #5 from HKU-BAL/v0.2.0
V0.2.0
2 parents 6f6b25f + 27d6f5c commit b1f5fa5

6 files changed, +347 -37 lines changed


README.md (+23 -7)
@@ -3,7 +3,7 @@
 <img src="images/fjsClaiRR.jpg" width = "200" title="aka. ClaiRR. Image credits to Fritz Sedlezeck.">
 </div>
 
-# Clair3-RNA - long-read small variant caller for RNA sequencing data
+# Clair3-RNA - A deep learning-based small variant caller for long-read RNA sequencing data
 
 [![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
 
@@ -14,7 +14,7 @@ Email: {rbluo,zxzheng}@cs.hku.hk
 
 ## Introduction
 
-Clair3-RNA is a small variant caller for RNA long-read data. Clair3-RNA supports ONT complementary DNA sequencing (cDNA) and direct RNA sequencing (dRNA). dRNA sequencing support the ONT latest [SQK-RNA004 kit](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/direct-rna-sequencing-sqk-rna004/v/drs_9195_v4_revd_20sep2023) data for variant calling. Clair3-RNA also supports PacBio Sequel and PacBio MAS-Seq RNA sequencing data.
+Clair3-RNA is a small variant caller for RNA long-read data. Clair3-RNA supports ONT complementary DNA sequencing (cDNA) and direct RNA sequencing (dRNA). dRNA sequencing support the ONT latest [SQK-RNA004 kit](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/direct-rna-sequencing-sqk-rna004/v/drs_9195_v4_revd_20sep2023) data for variant calling. Clair3-RNA also supports PacBio Sequel and PacBio MAS-Seq RNA sequencing data. Clair3-RNA reached a ~95% F1-score for ONT dRNA using SQK-RNA004 kit and ~96% F1-score using PacBio Iso-Seq and MAS-Seq, respectively, with at least ten supporting reads and disregarding the zygosity. With read phased, the performance reached ~97% for ONT and ~98% for PacBio.
 
 For germline small variant calling, please use [Clair3](https://github.com/HKU-BAL/Clair3).
 
@@ -39,6 +39,8 @@ For somatic small variant calling using tumor sample only, please try [ClairS-TO
 ----
 
 ## Latest Updates
+*v0.2.0 (Nov 18, 2024)* : 1. Added a new pileup phasing model (enable by using `--enable_phasing_model` option) for ONT dRNA004(`ont_dorado_drna004`), PacBio Iso-Seq(`hifi_sequel2_minimap2`), and PacBio MAS-Seq(`hifi_mas_minimap2`), the SNP performance improved by ~2% and Indel performance improved by ~6%. 2. Fixed some formatting issues in the calling workflow.
+
 *v0.1.0 (Aug 15, 2024)* : 1. Added a new ONT dRNA004 direct RNA sequencing model (`ont_dorado_drna004`) for SQK-RNA004 kit. 2. Added new PacBio Sequel (`hifi_sequel2_minimap2`) and Revio (`hifi_mas_minimap2`) model to support minimap2 alignment. 3. Enhance model training techniques to boost performance by incorporating strategies such as managing low-coverage sites, verifying variant zygosity, filtering RNA editing sites, etc. 4. Renamed all ONT and PacBio model names, check [here](https://github.com/HKU-BAL/Clair3-RNA?tab=readme-ov-file#pre-trained-models) for more details.
 
 *v0.0.1 (Nov 27, 2023)*: Initial release for early access.
@@ -99,7 +101,8 @@ docker run -it \
 --ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
 --threads ${THREADS} \ ## maximum threads to be used
 --platform ${PLATFORM} \ ## options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}
---tag_variant_using_readiportal ## optional, tag variants using REDIportal dataset
+--tag_variant_using_readiportal \ ## optional, tag variants using REDIportal dataset
+--enable_phasing_model \ ## optional, enable calling using phasing model
 --output_dir ${OUTPUT_DIR} ## output path prefix
 ```
 
@@ -130,7 +133,8 @@ singularity exec \
 --ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
 --threads ${THREADS} \ ## maximum threads to be used
 --platform ${PLATFORM} \ ## options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}
---tag_variant_using_readiportal ## optional, tag variants using REDIportal dataset
+--tag_variant_using_readiportal \ ## optional, tag variants using REDIportal dataset
+--enable_phasing_model \ ## optional, enable calling using phasing model
 --output_dir ${OUTPUT_DIR} \ ## output path prefix
 --conda_prefix /opt/conda/envs/clair3_rna
 ```
@@ -156,7 +160,7 @@ conda create -n clair3_rna -c conda-forge -c bioconda clair3 mosdepth bedtools -
 source activate clair3_rna
 
 git clone https://github.com/HKU-BAL/Clair3-RNA.git
-cd Clair-RNA
+cd Clair3-RNA
 
 # make sure in conda environment
 # download pre-trained models
@@ -196,7 +200,8 @@ docker run -it hkubal/clair3-rna:latest /opt/bin/clair3_rna --help
 --ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
 --threads ${THREADS} \ ## maximum threads to be used
 --platform ${PLATFORM} \ ## options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}
---tag_variant_using_readiportal ## optional, tag variants using REDIportal dataset
+--tag_variant_using_readiportal \ ## optional, tag variants using REDIportal dataset
+--enable_phasing_model \ ## optional, enable calling using phasing model
 --output_dir ${OUTPUT_DIR} ## output path prefix
 
 ## Final output file: ${OUTPUT_DIR}/output.vcf.gz
@@ -233,6 +238,8 @@ docker run -it hkubal/clair3-rna:latest /opt/bin/clair3_rna --help
 -G GENOTYPING_MODE_VCF_FN, --genotyping_mode_vcf_fn GENOTYPING_MODE_VCF_FN
 VCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided.
 -q QUAL, --qual QUAL If set, variants with >QUAL will be marked as PASS, or LowQual otherwise.
+--enable_phasing_model
+Enable phasing with whatshap or longphase. Usually leads to performance improvement when coverage is sufficient. Default: False.
 --snp_min_af SNP_MIN_AF
 Minimal SNP AF required for a variant to be called. Decrease SNP_MIN_AF might increase a bit of sensitivity, but in trade of precision, speed and accuracy. Default: 0.08.
 --indel_min_af INDEL_MIN_AF
@@ -256,11 +263,20 @@ docker run -it hkubal/clair3-rna:latest /opt/bin/clair3_rna --help
 --include_all_ctgs Call variants on all contigs, otherwise call in chr{1..22} and {1..22}.
 --print_ref_calls Show reference calls (0/0) in VCF file.
 -d, --dry_run Print the commands that will be ran.
+--min_mq MIN_MQ Minimal mapping quality required for an alignment to be considered. Default: 5.
+--phased_pileup_model_path PHASED_PILEUP_MODEL_PATH
+Specify the path prefix to your own pileup phasing model. Including ${phased_pileup_model_path}.data-00000-of-00001, ${phased_pileup_model_path}.index.
 --python PYTHON Absolute path of python, python3 >= 3.9 is required.
 --pypy PYPY Absolute path of pypy3, pypy3 >= 3.6 is required.
 --samtools SAMTOOLS Absolute path of samtools, samtools version >= 1.10 is required.
 --parallel PARALLEL Absolute path of parallel, parallel >= 20191122 is required.
---min_mq MIN_MQ Minimal mapping quality required for an alignment to be considered. Default: 5.
+--longphase LONGPHASE
+Absolute path of longphase, longphase >= 1.7 is required.
+--whatshap WHATSHAP Absolute path of whatshap, whatshap >= 1.0 is required.
+--use_longphase_for_intermediate_phasing USE_LONGPHASE_FOR_INTERMEDIATE_PHASING
+Use longphase for intermediate phasing. Default:False.
+--use_longphase_for_intermediate_haplotagging USE_LONGPHASE_FOR_INTERMEDIATE_HAPLOTAGGING
+Use longphase for intermediate haplotagging. Default:False.
 ```
 
 #### Call variants in one or multiple chromosomes using the `-C/--ctg_name` parameter
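
Note on the README change: the release note lists a phasing model for three platforms only (`ont_dorado_drna004`, `hifi_sequel2_minimap2`, `hifi_mas_minimap2`). A minimal sketch of a wrapper that adds `--enable_phasing_model` only for those platforms is shown below. The helper name `build_clair3_rna_args` and the decision to reject other platforms are assumptions made for illustration; the diff does not show how the tool itself reacts to the flag on other platforms, and arguments outside the diff context (such as the input alignment) are left out.

```python
import shlex

# Platforms the v0.2.0 release note lists as having a phasing model.
PHASING_PLATFORMS = {"ont_dorado_drna004", "hifi_sequel2_minimap2", "hifi_mas_minimap2"}


def build_clair3_rna_args(ref_fn, threads, platform, output_dir, enable_phasing=False):
    """Return an argument list for /opt/bin/clair3_rna (hypothetical helper).

    Only flags visible in the README hunks are emitted; arguments that the
    diff context elides are intentionally not guessed here.
    """
    if enable_phasing and platform not in PHASING_PLATFORMS:
        # Assumption: treat platforms without a listed phasing model as an error.
        raise ValueError("no phasing model listed for platform {!r}".format(platform))
    args = [
        "/opt/bin/clair3_rna",
        "--ref_fn", ref_fn,
        "--threads", str(threads),
        "--platform", platform,
    ]
    if enable_phasing:
        args.append("--enable_phasing_model")
    args += ["--output_dir", output_dir]
    return args


if __name__ == "__main__":
    cmd = build_clair3_rna_args("ref.fa", 8, "ont_dorado_drna004", "out", enable_phasing=True)
    print(shlex.join(cmd))  # shell-quoted command line, ready to extend
```

Building the command as a list keeps quoting out of the way until the final `shlex.join`, which makes it easy to append the remaining arguments for a real run.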

clair3_rna/call_var_bam.py (+5)
@@ -150,6 +150,7 @@ def Run(args):
     call_snp_only_mode = CommandOption('call_snp_only', args.call_snp_only)
     enable_long_indel_mode = CommandOption('enable_long_indel', args.enable_long_indel)
     keep_iupac_bases_mode = CommandOption('keep_iupac_bases', args.keep_iupac_bases)
+    enable_phasing_model = CommandOption('add_phasing_feature', args.enable_phasing_model)
 
     ctgStart = None
     ctgEnd = None
@@ -214,6 +215,7 @@ def Run(args):
         CommandOption('sampleName', args.sampleName),
         CommandOption('minCoverage', args.minCoverage),
         CommandOption('minMQ', args.minMQ),
+        enable_phasing_model,
         ctgStart,
         ctgEnd,
         chunk_id,
@@ -391,6 +393,9 @@ def main():
     parser.add_argument('--fast_mode', type=str2bool, default=False,
                         help="EXPERIMENTAL: Skip variant candidates with AF <= 0.15, default: %(default)s")
 
+    parser.add_argument('--enable_phasing_model', type=str2bool, default=False,
+                        help="EXPERIMENTAL: Keep IUPAC (non ACGTN) reference and alternate bases, default: convert all IUPAC bases to N")
+
     parser.add_argument('--minCoverage', type=int, default=param.min_coverage,
                         help="EXPERIMENTAL: Minimum coverage required to call a variant, default: %(default)f")
 
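
Note on the `--enable_phasing_model` argument: its help string in this hunk describes IUPAC base handling and looks copied from the `keep_iupac_bases` option. A minimal sketch of the same argument with a help string matching the README wording is given below. `str2bool` and `command_option` here are stand-ins written for this example, not the project's own `str2bool` and `CommandOption`, whose exact behavior the diff does not show.

```python
import argparse


def str2bool(value):
    """Parse common true/false spellings (stand-in for the project's helper)."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("yes", "true", "t", "y", "1"):
        return True
    if lowered in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected.")


def command_option(name, value):
    """Hypothetical stand-in for CommandOption: '--name value', or None if unset."""
    return None if value is None else "--{} {}".format(name, value)


def build_parser():
    parser = argparse.ArgumentParser()
    # Help text reworded to describe the phasing model rather than IUPAC bases.
    parser.add_argument('--enable_phasing_model', type=str2bool, default=False,
                        help="EXPERIMENTAL: Enable calling using the phasing model, default: %(default)s")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(['--enable_phasing_model', 'True'])
    # The diff forwards the flag downstream under the name 'add_phasing_feature'.
    print(command_option('add_phasing_feature', args.enable_phasing_model))
```

The sketch keeps the renaming visible: the user-facing flag is `--enable_phasing_model`, while the downstream option built in `Run` is named `add_phasing_feature`.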

clair3_rna/utils.py (+7 -2)
@@ -61,7 +61,7 @@ def batches_from(iterable, item_from, batch_size=1):
         yield chunk
 
 
-def tensor_generator_from(tensor_file_path, batch_size, pileup, platform):
+def tensor_generator_from(tensor_file_path, batch_size, pileup, platform, add_phasing=False):
     global param
     float_type = 'int32'
     if pileup:
@@ -109,9 +109,14 @@ def item_from(row):
         tensors = np.empty(([batch_size, prod_tensor_shape]), dtype=np.dtype(float_type))
         positions = []
         alt_info_list = []
-        for tensor, pos, seq, alt_info in batch:
+        for idx, (tensor, pos, seq, alt_info) in enumerate(batch):
             if seq[param.flankingBaseNum] not in BASE2NUM:
                 continue
+            if len(tensor) != prod_tensor_shape:
+                prod_tensor_shape = len(tensor)
+                tensor_shape[1] += param.phased_channel_size
+                if idx == 0:
+                    tensors = np.empty(([batch_size, prod_tensor_shape]), dtype=np.dtype(float_type))
             tensors[len(positions)] = tensor
             positions.append(pos)
             alt_info_list.append(alt_info)
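
Note on the batch buffer change: the hunk resizes the buffer when a tensor is longer than expected (phased channels appended), but reallocates only when `idx == 0`. If the first row of a batch is skipped by the `continue`, that branch may not run for the batch. The sketch below shows one way to key the allocation on the first usable row instead. It is an illustration under stated assumptions, not the project's code: plain Python lists stand in for NumPy arrays, and `fill_batch`, `valid_bases`, and `center` are hypothetical names.

```python
def fill_batch(batch, batch_size, expected_len, valid_bases="ACGT", center=0):
    """Copy the usable rows of one batch into a fixed-size buffer.

    The row width is taken from the first usable tensor, so rows carrying
    extra (phased) channels widen the buffer instead of failing to fit.
    Returns (buffer, positions, row_len).
    """
    row_len = expected_len
    buffer = None
    positions = []
    for tensor, pos, seq in batch:
        if seq[center] not in valid_bases:
            continue  # skip rows whose centre base is not a plain nucleotide
        if buffer is None:
            # The first usable row decides the width of every row in this batch.
            row_len = len(tensor)
            buffer = [[0] * row_len for _ in range(batch_size)]
        if len(tensor) != row_len:
            raise ValueError("tensor length changed within a batch")
        buffer[len(positions)] = list(tensor)
        positions.append(pos)
    if buffer is None:
        # No usable rows: fall back to the expected width.
        buffer = [[0] * row_len for _ in range(batch_size)]
    return buffer, positions, row_len


if __name__ == "__main__":
    # First row has an unusable centre base; the remaining rows are 6 wide
    # although 4 was expected, as if two phased channels had been appended.
    demo = [([9] * 6, 100, "N"), ([1] * 6, 101, "A"), ([2] * 6, 102, "C")]
    buf, pos, width = fill_batch(demo, batch_size=4, expected_len=4)
    print(width, pos, buf[0])
```

Allocating lazily on the first accepted row avoids depending on the loop index, at the cost of one extra check per row.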
