Merge pull request #2 from databio/dev
Development changes into master
nsheff authored Apr 17, 2017
2 parents f172220 + 47dc737 commit 1b0503e
Showing 20 changed files with 522 additions and 181 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.pyc
.~lock*
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,20 @@
# Change log
All notable changes to this project will be documented in this file.

## [0.2.0]
### Added
- FRiP can now be calculated based on reference peaks
- Pipeline now reports Picard estimated library size statistic
- Added option for pyadapt trimming
- Added example project using 'gold standard' data
- Added new resource package grades
- Added preliminary 'exact cuts' scripts, but they are not yet used

### Changed
- Improved README
- Changed filename of the TSS file
- Reorganized structure of alignment code

## [0.1.0]
### Added
- First release of ATAC-seq pypiper pipeline
42 changes: 41 additions & 1 deletion README.md
@@ -2,6 +2,22 @@

This repository contains a pipeline to process ATAC-seq data. It does adapter trimming, mapping, peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs.

## Pipeline features outlined

**Decoy alignments.** Before aligning to the genome, we first align to decoy sequences. This has several advantages: it speeds up the process dramatically, reduces noise from erroneous alignments, and provides potential to analyze signal at repeats. The pipeline will align *sequentially* to these decoy sequences (if provided):

- chrM (doubled; for non-circular aligners, to draw away reads from NuMTs)
- Alu elements
- alpha satellites
- rDNA
- repbase

We have provided indexed assemblies for download for each of these **for human** in the [ref_decoy](https://github.com/databio/ref_decoy) repository (excluding repbase, which is not publicly available). Any assemblies not provided are skipped.
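
The sequential decoy strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the decoy labels in `DECOY_ORDER`, the `prealign` function, and the `toy_align` stand-in (which treats a read as "mapped" if it is a substring of the reference) are all hypothetical names introduced here. A real run would call an aligner in place of `toy_align`.

```python
# Hypothetical labels for the decoys, in the order listed above.
DECOY_ORDER = ["chrM2x", "alu", "alphasat", "rDNA", "repbase"]


def prealign(reads, available, align, order=DECOY_ORDER):
    """Align reads to each available decoy in turn.

    reads:     list of read sequences (stand-ins for a FASTQ file)
    available: dict mapping decoy label -> index object understood by `align`
    align:     callable(reads, index) -> (mapped, unmapped)
    Returns (reads left for genome alignment, mapped-read counts per decoy).
    """
    counts = {}
    remaining = list(reads)
    for name in order:
        if name not in available:
            continue  # decoys that are not provided are skipped
        mapped, remaining = align(remaining, available[name])
        counts[name] = len(mapped)  # only unmapped reads move to the next decoy
    return remaining, counts


def toy_align(reads, index):
    """Toy aligner: a read 'maps' if it occurs verbatim in the reference string."""
    mapped = [r for r in reads if r in index]
    unmapped = [r for r in reads if r not in index]
    return mapped, unmapped


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "TTAA", "CCCC"]
    available = {"chrM2x": "AAACGTAA", "rDNA": "GGGCCG"}  # alu etc. not provided
    left, counts = prealign(reads, available, toy_align)
    print(left, counts)
```

Because each step only sees the reads the previous step failed to place, the final genome alignment works on a smaller input, which is where the speed-up comes from.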

**Fraction of reads in peaks (FRiP).** By default, the pipeline calculates FRiP as a quality-control metric, using the peaks it calls internally. It can **additionally** calculate FRiP against a reference peak set (for example, from another experiment). To enable this, provide the reference peaks as a BED file: add a column named `FRIP_ref` to your annotation sheet (see [pipeline_interface.yaml](/config/pipeline_interface.yaml)) and put the reference peak filename in it (or use a derived column and specify the path in the `data_sources` section of the project config file).
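
To make the metric concrete, here is a minimal sketch of a fraction-of-reads-in-peaks calculation. It is not the pipeline's implementation: the function names `merge_peaks` and `frip` are hypothetical, each read is reduced to a single `(chrom, position)` pair, and peaks are assumed to be BED-style half-open intervals `(chrom, start, end)`.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_peaks(peaks):
    """Merge overlapping (chrom, start, end) intervals, per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:          # overlaps or touches the previous peak
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def frip(read_positions, peaks):
    """Return the fraction of reads whose position falls inside any peak."""
    merged = merge_peaks(peaks)
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in merged.items()}
    hits = 0
    total = 0
    for chrom, pos in read_positions:
        total += 1
        intervals = merged.get(chrom)
        if not intervals:
            continue
        # Last peak starting at or before pos; half-open check on its end.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
    reads = [("chr1", 100), ("chr1", 299), ("chr1", 300), ("chr2", 49), ("chr3", 10)]
    print(frip(reads, peaks))
```

Passing the pipeline's own peaks gives the default value; passing a reference peak set gives the additional, reference-based value.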



## Installing

**Prerequisites**. This pipeline uses [pypiper](https://github.com/epigen/pypiper) to run a pipeline for a single sample, and [looper](https://github.com/epigen/looper) to handle multi-sample projects (for either local or cluster computation). You can do a user-specific install of both like this:
@@ -18,13 +34,14 @@ export PATH=$PATH:~/.local/bin

**Required executables**. To run the pipeline, you will also need some common bioinformatics tools installed. The list is specified in the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) tools section.

**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). The pipeline aligns serially to decoy sequences if you have them set up, which greatly improves pipeline performance. You can set up the decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You can set up the (optional) decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy).

**Clone the pipeline**. Then, clone this repository using one of these methods:
- using SSH: `git clone git@github.com:databio/ATACseq.git`
- using HTTPS: `git clone https://github.com/databio/ATACseq.git`

## Configuring

You can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, there is a pre-made config file and project template. Follow the instructions on the [Chang lab configuration](examples/chang_project) page.

Option 1: **Default configuration** ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)).
@@ -68,6 +85,29 @@ Your annotation file must specify these columns:

Run your project as above, by passing your project config file to `looper run`. More detailed instructions and advanced options for how to define your project are in the [Looper documentation on defining a project](http://looper.readthedocs.io/en/latest/define-your-project.html). Of particular interest may be the section on [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns).

## TSS enrichments

In order to calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. Here's code to generate that.

From refGene:

```
# Provide genome string and gene file
GENOME="hg38"
URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"
wget -O ${GENOME}_TSS_full.txt.gz ${URL}
zcat ${GENOME}_TSS_full.txt.gz | awk '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' | LC_COLLATE=C sort -k1,1 -k2,2n -u > ${GENOME}_TSS.tsv
echo ${GENOME}_TSS.tsv
```

Another option from Gencode GTF:

```
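# Assumes GENOME is set as above and ${GENOME}.gtf is a Gencode annotation.
# Keep only level 1 records whose feature type (column 3) is "gene".
grep "level 1" ${GENOME}.gtf | awk '$3=="gene"{if($7=="+"){print $1"\t"$4"\t"$4"\t"$7}else{print $1"\t"$5"\t"$5"\t"$7}}' | LC_COLLATE=C sort -u -k1,1V -k2,2n > ${GENOME}_TSS.tsv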
```
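
For readers who prefer Python, the refGene recipe above can be sketched roughly as follows. This is an illustrative equivalent, not part of the pipeline: the function name `refgene_to_tss` is hypothetical, and it assumes the same tab-separated refGene column layout the `awk` command relies on (chromosome, strand, txStart, txEnd and gene name in columns 3, 4, 5, 6 and 13).

```python
def refgene_to_tss(lines):
    """Convert refGene rows into sorted (chrom, tss, tss, strand, name) tuples.

    The TSS is txStart on the + strand and txEnd on the - strand.
    Only one row is kept per (chrom, tss) pair, loosely mirroring `sort -u`
    on those two keys.
    """
    seen = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, strand = fields[2], fields[3]
        tx_start, tx_end = int(fields[4]), int(fields[5])
        name = fields[12]
        tss = tx_start if strand == "+" else tx_end
        seen.setdefault((chrom, tss), (chrom, tss, tss, strand, name))
    # Plain string comparison on chrom matches C-locale ordering for ASCII names.
    return [seen[key] for key in sorted(seen)]


if __name__ == "__main__":
    import gzip
    import sys

    # Usage: python refgene_tss.py hg38_TSS_full.txt.gz > hg38_TSS.tsv
    with gzip.open(sys.argv[1], "rt") as handle:
        for row in refgene_to_tss(handle):
            print("\t".join(str(value) for value in row))
```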

## Using a cluster

Once you've specified your project to work with this pipeline, you will also inherit all the power of looper for your project. You can submit these jobs to a cluster with a simple change to your configuration file. Follow instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html).
6 changes: 3 additions & 3 deletions cmd.sh
@@ -1,3 +1,3 @@
# using pre-fix of fastq file
#python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G mm9 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
# using pre-fix of fastq file
#python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G mm9 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz
python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml -gs mm -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz
15 changes: 13 additions & 2 deletions config/pipeline_interface.yaml
@@ -11,10 +11,21 @@ ATACseq.py:
"--input2": read2
"-gs mm": null
"--single-or-paired": read_type
optional_arguments:
"--frip-ref-peaks": FRIP_ref
resources:
default:
file_size: "0"
cores: "2"
mem: "4000"
time: "0-04:00:00"
normal:
file_size: "0.5"
cores: "4"
mem: "16000"
time: "2-00:00:00"
large:
file_size: "6"
cores: "8"
mem: "32000"
time: "2-00:00:00"
partition: "parallel"
time: "3-00:00:00"
3 changes: 2 additions & 1 deletion config/protocol_mappings.yaml
@@ -1 +1,2 @@
ATAC: ATACseq.py
ATAC: ATACseq.py
ATAC-SEQ: ATACseq.py
22 changes: 22 additions & 0 deletions examples/gold_atac/README.md
@@ -0,0 +1,22 @@

# Gold ATAC

Testing the ATAC-seq pipeline on gold standard public ATAC-seq data.

## Grab data, project setup

Download raw `fastq.gz` files (use `fastq-dump` from SRA). You may also use `get_geo.py` to download raw ATAC-seq reads from SRA and metadata from GEO:

```
python get_geo.py -i ~/code/ATACseq/examples/gold_atac/metadata/gold_atac_gse.csv -r --fastq
```

I used the resulting file [metadata/annocomb_gold_atac_gse.csv](metadata/annocomb_gold_atac_gse.csv) to create the looper metadata sheet, [metadata/gold_atac_annotation.csv](metadata/gold_atac_annotation.csv).

I then created the project config file and a sampled test dataset. The SRA fastq files should be stored in a folder `${SRAFQ}`; the project will then run with looper with no additional changes.
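
To illustrate how an entry such as `SRA_1: "${SRAFQ}{SRR}_1.fastq.gz"` in the project config turns into a concrete file path, here is a minimal sketch. It is not looper's code: the function `resolve_source` is a hypothetical helper, and the two-step expansion (environment variables first, then `{column}` placeholders filled from the sample's annotation row) is an assumption about the intended behavior.

```python
import os
from string import Template


def resolve_source(sample, data_sources, column, env=None):
    """Resolve a derived column (e.g. read1) to a file path.

    sample:       dict of annotation-sheet values for one sample
    data_sources: dict of path templates from the project config
    column:       name of the derived column to resolve
    env:          mapping of environment variables (defaults to os.environ)
    """
    key = sample[column]                    # e.g. "SRA_1"
    template = data_sources[key]            # e.g. "${SRAFQ}{SRR}_1.fastq.gz"
    variables = os.environ if env is None else env
    # Template only touches $-style variables, so {SRR} survives this step.
    expanded = Template(template).substitute(variables)
    # Fill {column_name} placeholders from the sample's own values.
    return expanded.format(**sample)


if __name__ == "__main__":
    sample = {"sample_name": "gold1", "SRR": "SRR5210416", "read1": "SRA_1"}
    data_sources = {"SRA_1": "${SRAFQ}{SRR}_1.fastq.gz"}
    # "/data/sra/" is a made-up location for this example.
    print(resolve_source(sample, data_sources, "read1", env={"SRAFQ": "/data/sra/"}))
```

Note that the templates join `${SRAFQ}` and `{SRR}` without a separator, so the environment variable needs to end with a trailing slash.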

## Run pipeline

```
looper run ${CODE}ATACseq/examples/gold_atac/metadata/project_config.yaml -d
```
6 changes: 6 additions & 0 deletions examples/gold_atac/metadata/annocomb_gold_atac_gse.csv
@@ -0,0 +1,6 @@
sample_name,Sample_title,Sample_source_name_ch1,organism,Sample_organism_ch1,library,Sample_library_selection,Sample_library_strategy,data_source,Sample_type,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model
ATAC-seq_from_dendritic_cell_(ENCLB065VMV),ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000
ATAC-seq_from_dendritic_cell_(ENCLB811FLK),ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000
ATAC-seq_from_dendritic_cell_(ENCLB887PKE),ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500
ATAC-seq_from_dendritic_cell_(ENCLB586KIS),ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000
ATAC-seq_from_dendritic_cell_(ENCLB384NOX),ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000
7 changes: 7 additions & 0 deletions examples/gold_atac/metadata/gold_atac_annotation.csv
@@ -0,0 +1,7 @@
sample_name,sample_description,treatment_description,organism,library,data_source,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model,read1,read2
test1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,TEST_1,TEST_2
gold1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
gold2,ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
gold3,ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500,SRA_1,SRA_2
gold4,ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
gold5,ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2
5 changes: 5 additions & 0 deletions examples/gold_atac/metadata/gold_atac_gse.csv
@@ -0,0 +1,5 @@
GSE94182
GSE94222
GSE94177
GSE94196
GSE94173
27 changes: 27 additions & 0 deletions examples/gold_atac/metadata/project_config.yaml
@@ -0,0 +1,27 @@
# This project config file describes your project. See looper docs for details.

metadata: # relative paths are relative to this config file
sample_annotation: gold_atac_annotation.csv # sheet listing all samples in the project
output_dir: ${PROCESSED}gold_atac # ABSOLUTE PATH to the parent, shared space where project results go
pipelines_dir: "${CODEBASE}ATACseq" # ABSOLUTE PATH the directory where looper will find the pipeline repository

# in your sample_annotation, columns with these names will be populated as described
# in the data_sources section below
derived_columns: [read1, read2]

data_sources: # This section describes paths to your data
# specify the ABSOLUTE PATH of input files using variable path expressions
# These keys then correspond to values in your sample annotation columns.
# Variables specified using brackets are populated from sample_annotation columns.
# Variable syntax: {column_name}. For example, use {sample_name} to populate
# the file name with the value in the sample_name column for each sample.
# example_data_source: "/path/to/data/{sample_name}_R1.fastq.gz"
SRA: "${SRABAM}{SRR}.bam"
SRA_1: "${SRAFQ}{SRR}_1.fastq.gz"
SRA_2: "${SRAFQ}{SRR}_2.fastq.gz"
TEST_1: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r1.fastq.gz"
TEST_2: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r2.fastq.gz"

genomes:
human: hg38
mouse: mm10
Binary file added examples/test_data/test1_r1.fastq.gz
Binary file not shown.
Binary file added examples/test_data/test1_r2.fastq.gz
Binary file not shown.