Welcome to the Sanders Lab NGS-pipeline!

The following NGS pipeline was developed by the UCSF Psychiatry Bioinformatics Core (PsychCore) team to call variants on large cohorts of human sequencing data. The current build can run on either whole genome or whole exome/targeted sequencing data, and can call variants against either reference genome GRCh37 (hg19) or GRCh38 (hg38). Please note that this pipeline is still under active development.

What follows is a brief quick-start guide to running the pipeline. Full documentation can be found on readthedocs here: http://ngs-pipeline.readthedocs.io/en/latest/index.html

Getting Started

Before running the pipeline, you'll need to have done the following:

  • Created and set up an Amazon Web Services (AWS) account
  • Downloaded the reference files and uploaded them to S3
  • Obtained a Sentieon license file
  • Installed Conda and set up your development environment

If you need to set up any of the above, please see the docs; once you have these in place, you can execute the command at the bottom to run the pipeline.

Creating and Setting up your Amazon Web Services (AWS) Account

To create an AWS account, follow the instructions at AWS. The pipeline's infrastructure is made up of several AWS services (see Pipeline infrastructure).

With a newly created AWS account, certain default limits restrict runs to 5 samples. To scale up to larger datasets, you can request limit increases by following the instructions on the Limits page under the EC2 dashboard in the AWS console. Note that these requests may take some time to process, so they should be submitted early.

You will also need at least one SSH KeyPair created in your account in order to run a pipeline (link to instructions). The SSH KeyPair allows you to remotely connect to the machines that carry out the variant calling, should you ever need to.
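If you prefer the command line, a KeyPair can also be created with the AWS CLI; the key name below is a hypothetical example:

$ aws ec2 create-key-pair --key-name ngs-pipeline-key --query 'KeyMaterial' --output text > ngs-pipeline-key.pem
$ chmod 400 ngs-pipeline-key.pem   # restrict permissions so SSH will accept the private key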

In addition to an SSH KeyPair, you will need to set up cloud storage locations for the pipeline's input and output. In AWS, this storage service is called S3. Data in S3 is held in containers called buckets; you will need at least one. Additional details on creating buckets can be found here.
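As a quick sketch, a bucket can also be created from the AWS CLI; the bucket name and region below are hypothetical (bucket names must be globally unique):

$ aws s3 mb s3://my-ngs-pipeline-data --region us-west-2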

Download and Upload Reference Files to S3

The pipeline performs many operations that require reference files (e.g., the human reference genome FASTA and its indexes). These must be uploaded to AWS S3 before the pipeline can be run. The standard reference files are provided by the Broad Institute's GATK Resource Bundle. Currently, the pipeline supports two builds of the human reference genome: GRCh37 (hg19) and GRCh38 (hg38). The GRCh37 files are located on the Broad Institute's FTP site, while the GRCh38 files are hosted on Google Cloud Storage.
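As a rough sketch, the bundles can be fetched as shown below; the exact paths are assumptions and may have changed, so please check the GATK Resource Bundle documentation for the current locations:

$ # GRCh38 bundle from Google Cloud Storage (requires gsutil; path is an assumption)
$ gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/ ./hg38-bundle/
$ # GRCh37 (b37) bundle from the Broad FTP site (anonymous login; path is an assumption)
$ wget -r "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/"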

In order to upload the reference files to AWS S3, you'll need to install the AWS Command Line Interface (see AWS CLI Installation). For uploading files to S3, please see the AWS S3 documentation.
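For example, a local directory of reference files can be copied to a bucket with aws s3 sync; the local path and bucket name below are hypothetical:

$ aws s3 sync ./hg38-bundle/ s3://my-ngs-pipeline-data/references/hg38/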

Obtain a Sentieon License File

Currently, the pipeline uses only Sentieon for its haplotyping and joint genotyping steps. Thus, in order to use the pipeline you must first contact Sentieon and obtain a license. They also offer a free trial.

Install Conda and your Dev Environment

In order to run the pipeline, you'll need to install Conda.

  • If you currently have Python 2.7 installed, pick the Python 2.7 installer.
  • If you currently have Python 3.6 installed, pick the Python 3.6 installer.
  • Run the installer; the defaults should be fine (see the sketch below).
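As a minimal sketch, assuming a Linux machine and the Miniconda (Python 3) installer:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh   # accept the defaults when prompted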

Then, create a Python 3.6 environment:

$ conda create -n psy-ngs python=3.6

Activate the newly created environment (you may need to start a new terminal session first):

$ source activate psy-ngs

You can verify that the environment is active by checking the Python version (if it differs from your base environment):

$ python --version

You should also see the environment name prepended to your shell prompt.
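For example, with the psy-ngs environment active you might see something like the following (the exact patch version will differ):

$ python --version
Python 3.6.8
(psy-ngs) $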

Running the Pipeline

$ python rkstr8_driver.py -p <pipeline-name> [ -a access_key_id ] [ -s secret_access_key ]
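For example, an invocation might look like the following; the pipeline name and credentials shown are placeholders (valid pipeline names are listed in the full documentation), and the bracketed credential arguments are optional per the usage above:

$ python rkstr8_driver.py -p wgs -a AKIAEXAMPLEKEYID -s wJalrEXAMPLESECRETKEY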

Additional Information

In order to run the pipeline with more than 5 samples, you'll need to increase some of your limits for certain EC2 instance types. By default, the pipeline makes use of the following instance types:

  • c5.9xlarge, c5.18xlarge, r4.2xlarge, r4.4xlarge.

The pricing specification for each of the AWS EC2 instance types can be found on the AWS Instance Pricing page.

Creating a Google Cloud Platform (GCP) Account (Optional)

Note that this is only required for running the Validation pipeline (see the GCP documentation).
