This repository has been archived by the owner on Nov 17, 2021. It is now read-only.
Mike Lin edited this page Feb 15, 2015 · 17 revisions

EBOV NGS Pipeline on DNAnexus

** WORKING DRAFT: may be out-of-date or wrong. ** ** Contact: Mike Lin ([email protected]) **

DNAnexus and the Sabeti lab have collaborated to provide a cloud-based version of the Broad Institute Viral Genomics pipeline for assembly of EBOV genomes from metagenomic RNA-seq data. The pipeline uses paired-end reads from Illumina NGS instruments (MiSeq or HiSeq), formatted either as a pair of FASTQ files or as an "unmapped BAM" file.

To begin using the pipeline, you'll first create a DNAnexus account and project. Then you'll securely upload your NGS data and a couple of required software packages. The pipeline can take several hours to run, depending on the experimental design and sequencing depth. It begins by depleting the input dataset of any reads matching the human genome/transcriptome, so that the remaining "cleaned" metagenomic data may be shared or published with reduced privacy risks. The final result is an assembly in FASTA format, and a BAM file containing the cleaned reads mapped back to the assembly.

If your samples are multiplexed across sequencing lanes, the data currently must be demultiplexed before upload. If you'd like to include basecalling or demultiplexing in the cloud workflow, contact us and we can set this up for you.

Step-by-step: assembling an EBOV genome

Create a DNAnexus project

First, create a DNAnexus platform account using the signup form, which will take you through the usual flow of creating a username/password and confirming your e-mail address. Your new account will have free credits sufficient to process numerous samples, and DNAnexus may sponsor further usage for EBOV research upon request.

Log in and create a new project with a name of your choice. In this example, we've named the project My EBOV assemblies.

Upload sequencing data

You can upload paired-end reads in either of two formats: a pair of FASTQ files or an "unmapped BAM" file. FASTQ files may be gzipped, which makes them faster to upload, and their filenames must end in .fastq or .fastq.gz.
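Before uploading, it can save a round trip to confirm that your FASTQ filenames satisfy the extension rule above. The following is a minimal sketch; check_fastq_name is a helper name made up for this example and is not part of any DNAnexus tool.

```shell
#!/bin/sh
# check_fastq_name is a hypothetical helper, not part of the DNAnexus tools.
# It succeeds only if the filename ends in .fastq or .fastq.gz.
check_fastq_name() {
    case "$1" in
        *.fastq|*.fastq.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on every filename given on the command line.
for f in "$@"; do
    if check_fastq_name "$f"; then
        echo "ok: $f"
    else
        echo "rename before upload: $f"
    fi
done
```

Saved as a script, it could be run with your FASTQ filenames as arguments; any file reported as needing a rename should be fixed before it is uploaded.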

In this example, we'll use one of the samples from Gire et al. (2014), exported from SRA. Here are links to these files: SRR1553554_1.fastq.gz SRR1553554_2.fastq.gz

Click the Add Data button in your new project, then drag or choose the files to upload.

Press Add Data again, and the transfer will begin.
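If you prefer the terminal, the same upload can probably be done with the DNAnexus command-line client. The sketch below assumes dx is installed and you are logged in, and that the files should land in the project's root folder. The run and upload_reads helpers are our own names; by default the script only prints the dx upload commands so you can review them first.

```shell
#!/bin/sh
# Sketch: upload a FASTQ pair with the dx command-line client.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# to execute them, which assumes dx is installed and you are logged in.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper: print the command or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# upload_reads is a hypothetical helper.
upload_reads() {
    # $1 = project name; remaining arguments = local files to upload
    project="$1"
    shift
    for f in "$@"; do
        # Assumption: files go to the root folder of the project.
        run dx upload "$f" --path "$project:/"
    done
}

upload_reads "My EBOV assemblies" SRR1553554_1.fastq.gz SRR1553554_2.fastq.gz
```

Note that the dry-run output drops the quoting around the project name, so treat it as a preview rather than something to paste back into a shell.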

With DNAnexus, your data is transferred and stored with clinical-grade security controls. Other DNAnexus user accounts cannot access your data unless you share your project or make it public. DNAnexus also does not access your file contents without your permission, barring exigent circumstances.

Upload Novocraft and GATK software packages

The assembly workflow requires Novoalign and GATK, which aren't bundled due to their licensing restrictions. Instead, you'll need to drop in tarballs for these software packages. First, please ensure you're entitled to use them without commercial licenses (as in most not-for-profit projects), or else that you have the necessary licenses. Then, download the tarballs from the following websites, and upload them to your project:

| Version | Tarball file | MD5 | Website |
|---|---|---|---|
| Novocraft Programs V3.02.08 X86-64 Linux 3.0 Kernel | novocraftV3.02.08.Linux3.0.tar.gz | 05810e0da23340300482eba2e47bc45e | Link |
| GATK v3.3-0-g37228af | GenomeAnalysisTK-3.3-0.tar.bz2 | e3d9d6e87825078d1a574c5bb469a1b4 | Link |

The workflow has been validated using these exact versions, but other recent versions will probably work too.
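To confirm that you downloaded exactly the validated tarballs, you can compare their MD5 digests with the values listed above before uploading. This is a minimal sketch that assumes md5sum from GNU coreutils is available (on other systems the equivalent tool may have a different name); verify_md5 is a helper name invented for this example.

```shell
#!/bin/sh
# verify_md5 is a hypothetical helper: it compares a file's MD5 digest
# (computed with md5sum from GNU coreutils) against an expected value.
verify_md5() {
    # $1 = file, $2 = expected MD5 hex digest
    actual="$(md5sum "$1" | cut -d ' ' -f 1)"
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)" >&2
        return 1
    fi
}

# Check each tarball that is present in the current directory.
if [ -f novocraftV3.02.08.Linux3.0.tar.gz ]; then
    verify_md5 novocraftV3.02.08.Linux3.0.tar.gz 05810e0da23340300482eba2e47bc45e
fi
if [ -f GenomeAnalysisTK-3.3-0.tar.bz2 ]; then
    verify_md5 GenomeAnalysisTK-3.3-0.tar.bz2 e3d9d6e87825078d1a574c5bb469a1b4
fi
```

A mismatch does not necessarily mean the tarball is unusable, since other recent versions will probably work, but it does tell you that you are outside the validated configuration.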

Copy the assembly workflow into your project

The assembly workflow has been published in a public DNAnexus project, Broad Inst Viral NGS. We'll make a copy of it in your new project to prepare it for use.

Open the Broad Inst Viral NGS project, and find the viral-ngs-assembly workflow object in the assembly folder. Select it, and press the Copy button.

Navigate to your project and press Copy into this folder.

Run the workflow and monitor its progress

Back in your project, click on the workflow you just copied. This opens the Run Analysis dialog, where we'll supply the required inputs and then launch the workflow.

If you uploaded a pair of FASTQ files, click on the file input to the deplete stage, then select the first FASTQ file. Then, click on the paired_fastq input just below, and select the second FASTQ file.

If you uploaded an unmapped BAM file, click on the file input to the deplete stage, then select the BAM file. Leave the paired_fastq input empty.
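The two cases above map onto different sets of inputs, which matters if you later script the workflow. The sketch below prints the input arguments for each case in the -i name=value form used by dx run. The input names file and paired_fastq come from the dialog described above; reads_inputs is a hypothetical helper, and deciding the format from the file extension (including .bam) is an assumption of this example.

```shell
#!/bin/sh
# reads_inputs is a hypothetical helper that prints the dx run input
# arguments matching the two cases above. The input names file and
# paired_fastq are the ones shown in the Run Analysis dialog.
reads_inputs() {
    case "$1" in
        *.bam)
            # Unmapped BAM: a single input, paired_fastq left unset.
            echo "-i file=$1"
            ;;
        *.fastq|*.fastq.gz)
            # FASTQ pair: the mate file is required as the second argument.
            if [ -z "${2:-}" ]; then
                echo "error: second FASTQ file required" >&2
                return 1
            fi
            echo "-i file=$1 -i paired_fastq=$2"
            ;;
        *)
            echo "error: unsupported input: $1" >&2
            return 1
            ;;
    esac
}

reads_inputs SRR1553554_1.fastq.gz SRR1553554_2.fastq.gz
reads_inputs sample.bam
```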

Next, scroll down to the workflow's scaffold stage, and supply the Novocraft and GATK tarballs you uploaded.

If you'd like, you can specify a sample name/ID to be used in the output filenames and headers. To set this, click on the gear icon of the deplete stage, fill in the sample_name field, and click Save. If you don't set a sample name, the workflow will derive something from the input filename. You may also wish to set the analysis name, in the upper left of the Run Analysis dialog, which can help in distinguishing different analyses running concurrently.

The workflow is now ready. Click the Run Analysis button, which will then take you to the Monitor project view, where you can watch its progress.

You'll also receive an e-mail notification of analysis completion, potentially after several hours.

Collect results

Upon completion, the workflow will output:

  • <sample_name>.fasta the assembly
  • <sample_name>.mapped.bam the reads remaining after human depletion, mapped back to the assembly, excluding any reads not mapping to the assembly
  • <sample_name>.all.bam the reads remaining after human depletion, mapped back to the assembly, with unmapped reads also included

Additionally, you'll find a new intermediates folder containing various by-products of the workflow stages. For example, the <sample_name>.cleaned.bam file is an unmapped BAM file containing the reads remaining after human depletion.

From this point, you can view or download your results. For example, select the FASTA file and click Open in New Tab to view its contents. You could then take the sequence over to BLAT at the UCSC Ebola Portal.
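Once you have downloaded the FASTA file, a quick sanity check is to count its sequences and bases; an EBOV genome is roughly 19 kb, so a total far from that may deserve a closer look. The sketch below uses only awk; fasta_summary is a helper name invented for this example, and the tiny FASTA it runs on is made up, not real assembly output.

```shell
#!/bin/sh
# fasta_summary is a hypothetical helper: it prints the number of
# sequences and the total number of bases in a FASTA file.
fasta_summary() {
    awk '
        /^>/ { nseq++; next }                            # header line
        { gsub(/[ \t\r]/, ""); nbases += length($0) }    # sequence line
        END { printf "sequences=%d bases=%d\n", nseq, nbases }
    ' "$1"
}

# Example on a tiny made-up FASTA (not real assembly output).
tmp="$(mktemp)"
printf '>example\nACGT\nAC\n' > "$tmp"
fasta_summary "$tmp"
rm -f "$tmp"
```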

Sharing with others

You can share your project with another DNAnexus user by clicking the blue Share button. Enter their username or e-mail address and choose an appropriate permissions level. Another option is to make your project "public", meaning that any DNAnexus user can discover and view it. This is found in the project settings view (the gear icon in the project toolbar).

To share only some of your data (for example, only the final products), create a second project, copy the desired data into it, and share that project.
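The second-project approach can likely also be scripted with the dx client. The sketch below is an assumption-laden illustration: it presumes the final outputs sit in the root folder of the source project, that the collaborator is identified by a DNAnexus username, and that view-only access is what you want. The run and share_results helpers are our own names, and by default the commands are only printed.

```shell
#!/bin/sh
# Sketch: share only final products through a second project.
# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper: print the command or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# share_results is a hypothetical helper.
share_results() {
    # $1 = source project, $2 = new project, $3 = sample name, $4 = invitee
    run dx new project "$2"
    # Assumption: outputs are in the root folder of the source project.
    run dx cp "$1:/$3.fasta" "$2:/"
    run dx cp "$1:/$3.mapped.bam" "$2:/"
    # Assumption: view-only access is sufficient for the collaborator.
    run dx invite "$4" "$2" VIEW
}

share_results "My EBOV assemblies" "EBOV shared results" SRR1553554 collaborator_username
```

The project name "EBOV shared results" and collaborator_username are placeholders; substitute your own before setting DRY_RUN=0.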

Troubleshooting failures

Given properly formatted inputs, the workflow has two important failure modes:

  • The filter stage, which extracts EBOV reads from the cleaned metagenomic dataset based on a database of known EBOV genomes, fails if too few such reads/bases are found.
  • The scaffold stage, which builds the initial contiguous assembly of the EBOV genome, fails if the assembly doesn't meet certain quality thresholds. An example of this error message in the Monitor view is shown below.

Both of these errors indicate that the input reads contain too little EBOV data to proceed with the assembly. When the workflow fails in a certain stage, the results of the previous stages are still output to the project. These results, along with the logs (standard output and standard error) of each job, may provide additional useful information.

If you encounter other types of internal errors, please contact us and/or use Send Failure Report.

Advanced topics

Saving workflow modifications

You can modify the workflow's configuration and save the changes so that you don't have to repeat them each time you run a sample. For example, you could pre-set the Novocraft and GATK tarballs, so that you don't have to fill them in each time, or set a default output folder, to help keep your project tidy. To do this, select the workflow object and press Edit. This will take you to a workflow editor view that looks similar to the Run Analysis dialog, but the changes you make will be saved to the workflow and reflected each time you run a new analysis using that workflow.

You can always find an unmodified or up-to-date version of the workflow in the Broad Inst Viral NGS public project.

Skipping human read depletion

The first stage of the workflow depletes the input dataset of reads matching the human genome/transcriptome, so that the remaining "cleaned" data may be shared or published with reduced privacy risks. This tends to be the most time-consuming step, and you have the option to skip it if it's not needed for your purposes. To do so, flip the skip_depletion setting to True in the configuration of the deplete stage (accessed through its gear icon in the Run Analysis dialog).
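When launching from the command line, the same setting can presumably be passed as an input. The sketch below only prints a candidate command. It assumes skip_depletion can be given as -i skip_depletion=true without a stage qualifier, which we have not confirmed; skip_depletion_command is a hypothetical helper and sample.bam is a placeholder filename.

```shell
#!/bin/sh
# Sketch: print a dx run command that skips human read depletion.
# The command is only printed; review it before running it yourself.
# skip_depletion_command is a hypothetical helper.
skip_depletion_command() {
    # $1 = name of your copy of the workflow, $2 = unmapped BAM input
    # Assumption: skip_depletion needs no stage qualifier on the command line.
    printf 'dx run "%s" -i file=%s -i skip_depletion=true' "$1" "$2"
    printf ' -i novocraft_tarball=%s -i gatk_tarball=%s\n' \
        novocraftV3.02.08.Linux3.0.tar.gz GenomeAnalysisTK-3.3-0.tar.bz2
}

skip_depletion_command viral-ngs-assembly sample.bam
```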

Standalone human read depletion

You can also run the human read depletion stage independently of the remainder of the workflow. Copy the viral-ngs-human-depletion applet from the utilities folder of Broad Inst Viral NGS into your project. Then, click on this applet in your project to run it, and supply the FASTQs or unmapped BAM input as you would for the workflow. You can leave the applet's other inputs blank, and run it. The .cleaned.bam output file is an unmapped BAM containing the reads remaining after depletion.

Command-line scripting

You can run the workflow or automate any other operation using the DNAnexus command-line interface. Once you have installed it and logged in to your project, run:

$ dx run "viral-ngs-assembly (Copy: Feb 8th 2015 5:26pm)" \
    -i file=reads1.fastq.gz -i paired_fastq=reads2.fastq.gz \
    -i novocraft_tarball=novocraftV3.02.08.Linux3.0.tar.gz \
    -i gatk_tarball=GenomeAnalysisTK-3.3-0.tar.bz2

Of course, substitute the exact name of your copy of the workflow and input filenames. (You could change the workflow object's name to make it easier to enter here.)

The command-line interface can be scripted to run large numbers of samples, or to automate data upload and analysis from your LIMS.
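As one possible starting point for such scripting, the sketch below launches one analysis per sample name read from standard input. It assumes the FASTQ pairs are already in the project and named <sample>_1.fastq.gz and <sample>_2.fastq.gz, that your copy of the workflow is named viral-ngs-assembly, and that sample_name can be passed without a stage qualifier. The run, launch_sample and launch_all helpers are our own names, and by default the commands are only printed.

```shell
#!/bin/sh
# Sketch: launch one analysis per FASTQ pair already uploaded to the project.
# Assumes pairs are named <sample>_1.fastq.gz / <sample>_2.fastq.gz and that
# WORKFLOW is the name of your copy of the workflow (both are assumptions).
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs them via dx.
DRY_RUN="${DRY_RUN:-1}"
WORKFLOW="${WORKFLOW:-viral-ngs-assembly}"

# run is a hypothetical helper: print the command or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# launch_sample is a hypothetical helper.
launch_sample() {
    # $1 = sample name, e.g. SRR1553554
    # -y skips the confirmation prompt; --name labels the analysis.
    run dx run "$WORKFLOW" -y --name "$1" \
        -i file="${1}_1.fastq.gz" -i paired_fastq="${1}_2.fastq.gz" \
        -i sample_name="$1" \
        -i novocraft_tarball=novocraftV3.02.08.Linux3.0.tar.gz \
        -i gatk_tarball=GenomeAnalysisTK-3.3-0.tar.bz2
}

# launch_all reads sample names, one per line, from standard input.
launch_all() {
    while IFS= read -r sample; do
        if [ -n "$sample" ]; then
            # Keep the launched command from consuming the sample list.
            launch_sample "$sample" </dev/null
        fi
    done
    return 0
}

printf 'SRR1553554\n' | launch_all
```

In practice the sample list could come from a text file, for example launch_all < samples.txt, where samples.txt is a placeholder name for a file you maintain.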