RNAIndel

RNAIndel calls coding indels and classifies them into somatic, germline, and artifact from tumor RNA-Seq data. Users can also classify indels called by their own callers by supplying a VCF file. RNAIndel supports GRCh38 as well as GRCh37.

Prerequisites

Please make sure that the dependencies are satisfied before installing RNAIndel.

Installation

Install RNAIndel.

pip install rnaindel

Data directory set up

Download the data file for your reference genome: data_dir_37.tar.gz for GRCh37 or data_dir_38.tar.gz for GRCh38. Place the gzipped file in a directory of your choice and unpack it.

tar xzvf data_dir_37.tar.gz  # for GRCh37
tar xzvf data_dir_38.tar.gz  # for GRCh38

Usage (demo)

Indels are called by the built-in caller Bambino, which is optimized for RNA-Seq indel calling, and classified into somatic, germline, and artifact.

rnaindel -b BAM -o OUTPUT_VCF -f FASTA -d DATA_DIR [other options]

Users can also classify indel entries in a VCF file generated by their callers (indel calling by the built-in caller will not be performed). Specify the input VCF file by -c.

rnaindel -b BAM -c INPUT_VCF -o OUTPUT_VCF -f FASTA -d DATA_DIR [other options]

Options

  • -b input STAR-mapped BAM file (required)
  • -c VCF file from another caller (required when using another caller, e.g., GATK)
  • -o output VCF file (required)
  • -f reference genome (GRCh37 or 38) FASTA file (required)
  • -d data directory containing trained models and databases (required; see Data directory set up)
  • -q MAPQ value that STAR assigns to uniquely mapped reads (default=255)
  • -p number of cores (default=1)
  • -m maximum heap space (default=6000m)
  • -n user-defined panel of non-somatic indels in VCF format
  • -l directory to store log files
  • -h print usage message
  • --version print version
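
As an illustration of how the options combine, the sketch below wraps the call in a small shell function. The function name `call_indels` and every path in the usage line are placeholders invented for this example; only the flags come from the option list above.

```shell
# Sketch: wrap the rnaindel call and fill in the documented defaults.
# call_indels is a hypothetical helper, not part of RNAIndel.

# Usage: call_indels BAM OUTPUT_VCF FASTA DATA_DIR [CORES] [HEAP]
call_indels() (
    # Runs in a subshell, so the variables below do not leak to the caller.
    if [ "$#" -lt 4 ]; then
        echo "usage: call_indels BAM OUTPUT_VCF FASTA DATA_DIR [CORES] [HEAP]" >&2
        exit 2
    fi
    bam=$1 out_vcf=$2 fasta=$3 data_dir=$4
    cores=${5:-1}       # -p, default taken from the option list
    heap=${6:-6000m}    # -m, default taken from the option list

    rnaindel -b "$bam" -o "$out_vcf" -f "$fasta" -d "$data_dir" \
        -p "$cores" -m "$heap"
)

# Example with placeholder paths:
#   call_indels tumor.bam tumor.rnaindel.vcf GRCh38.fa data_dir_38 4 8000m
```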

CWL

cwl-runner rnaindel.cwl INPUT_YML

A sample input YAML file is here.

Input BAM file

Please prepare your BAM file as follows:

  1. Map your reads with the STAR 2-pass mode to GRCh37 or 38.
  2. Add read groups, sort, mark duplicates, and index the BAM file with Picard.

Use the BAM file from Step 2 as input, without caller-specific preprocessing such as indel realignment.
Additional processing steps may cause RNAIndel to behave unexpectedly.
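
The two preparation steps could be scripted roughly as follows. This is a sketch under several assumptions that do not come from this README: paired-end gzipped FASTQ input, a prebuilt STAR genome index, a `picard` launcher on the PATH (otherwise substitute `java -jar picard.jar`), Picard's `KEY=value` argument style, and placeholder read-group values. Check the flags against the STAR and Picard versions you have installed.

```shell
# Sketch of the BAM preparation; prepare_bam is a hypothetical helper.

# Usage: prepare_bam R1_FASTQ_GZ R2_FASTQ_GZ STAR_GENOME_DIR SAMPLE
prepare_bam() (
    r1=$1 r2=$2 genome_dir=$3 sample=$4

    # Step 1: map with STAR in 2-pass mode.
    # With this prefix STAR writes ${sample}.Aligned.out.bam.
    STAR --genomeDir "$genome_dir" \
        --readFilesIn "$r1" "$r2" --readFilesCommand zcat \
        --twopassMode Basic --outSAMtype BAM Unsorted \
        --outFileNamePrefix "${sample}." &&

    # Step 2a: add read groups and coordinate-sort in one Picard call.
    # RGLB, RGPL and RGPU are placeholder values.
    picard AddOrReplaceReadGroups \
        I="${sample}.Aligned.out.bam" O="${sample}.rg.sorted.bam" \
        RGID="$sample" RGSM="$sample" RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 \
        SORT_ORDER=coordinate &&

    # Step 2b: mark duplicates and write the BAM index next to the output.
    picard MarkDuplicates \
        I="${sample}.rg.sorted.bam" O="${sample}.markdup.bam" \
        M="${sample}.markdup.metrics.txt" CREATE_INDEX=true
)

# Example with placeholder paths:
#   prepare_bam tumor_R1.fastq.gz tumor_R2.fastq.gz star_index_GRCh38 tumor
```

In this sketch `tumor.markdup.bam` would be the file to pass to `-b`.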

Panel of non-somatic indels (PONS)

Somatic prediction can be refined by applying a user-defined indel panel. Putative somatic indels found in the panel will be reclassified as germline or artifact, whichever has the higher probability. Indels predicted as germline or artifact are not subject to reclassification by PONS. Such panels can be compiled in two ways:

from normal RNA-Seq data

The normal RNA-Seq data may be a single dataset (ideally matched to the tumor) or a pooled dataset.

  1. Perform variant calling on the RNA-Seq data and generate a VCF file.
  2. Index the VCF with Tabix.
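
For the indexing step, a minimal sketch is shown below. It assumes the VCF from step 1 is uncompressed and already coordinate-sorted, and that `bgzip` and `tabix` from HTSlib are installed; the function name `index_panel` is made up for illustration.

```shell
# Sketch: compress and index a panel VCF; index_panel is a hypothetical helper.

# Usage: index_panel PANEL_VCF   (uncompressed, coordinate-sorted VCF)
index_panel() (
    vcf=$1
    bgzip -f "$vcf" &&            # replaces PANEL_VCF with PANEL_VCF.gz
    tabix -p vcf "${vcf}.gz"      # writes PANEL_VCF.gz.tbi
)

# Example with a placeholder file name:
#   index_panel normal_panel.vcf
```

The resulting `normal_panel.vcf.gz` is the file that would be supplied with `-n`.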

from a cohort dataset of tumor RNA-Seq and tumor/normal-paired DNA-Seq

In this approach, non-somatic indels recurrently misclassified as somatic are collected from a large cohort.

  1. Apply RNAIndel on the RNA-Seq data.
  2. Validate indels predicted as somatic (putative somatic indels) with the DNA-Seq data.
  3. Collect putative somatic indels which are validated as germline or artifact in N samples or more (recurrent non-somatic indels).
  4. Format the recurrent non-somatic indels in a VCF file and index with Tabix.
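
Steps 3 and 4 amount to counting, for each indel, the number of distinct samples in which it was validated as non-somatic. The sketch below assumes the validation results from step 2 have been gathered into a tab-separated table with the columns sample, chromosome, position, REF, ALT, and validated class (`somatic`, `germline`, or `artifact`). That table layout and both function names are assumptions made for this example, not something RNAIndel produces.

```shell
# Sketch: collect recurrent non-somatic indels from a hypothetical
# validation table and format them as a minimal VCF.

# Usage: collect_recurrent MIN_SAMPLES < validated.tsv
# Prints CHROM, POS, REF, ALT for indels validated as germline or artifact
# in at least MIN_SAMPLES distinct samples, sorted by chromosome and position.
collect_recurrent() {
    awk -F '\t' -v n="$1" '
        $6 == "germline" || $6 == "artifact" {
            key = $2 FS $3 FS $4 FS $5
            # Count each sample at most once per indel.
            if (!seen[key, $1]++) count[key]++
        }
        END {
            for (key in count)
                if (count[key] >= n) print key
        }
    ' | sort -t "$(printf '\t')" -k1,1 -k2,2n
}

# Usage: collect_recurrent N < validated.tsv | to_vcf > panel.vcf
# Wraps the four-column records in a minimal VCF with empty ID, QUAL,
# FILTER and INFO fields.
to_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    awk -F '\t' -v OFS='\t' '{ print $1, $2, ".", $3, $4, ".", ".", "." }'
}
```

The resulting file can then be compressed with `bgzip` and indexed with `tabix -p vcf`. The threshold N is left to the user, as in step 3.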

A sample panel compiled by the second approach is included in the data package. It was built from a cohort of 330 samples with RNA-Seq and T/N-paired WES and PCR-free WGS. When no custom panel is available, apply this panel by appending the following option:

-n path/to/data_dir/non_somatic/non_somatic.vcf.gz

Reference

  1. Hagiwara, K., Ding, L., Edmonson, M.N., Rice, S.V., Newman, S., Meshinchi, S., Ries, R.E., Rusch, M., Zhang, J. RNAIndel: a machine-learning framework for discovery of somatic coding indels using tumor RNA-Seq data. (preprint)