Commit 1e3e403

committed
updated docs
1 parent dd2f227 commit 1e3e403

9 files changed

+358
-47
lines changed

README.md

+310-30
@@ -7,40 +7,320 @@
77

88
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).
99

# CONTENTS

* [Aim](https://github.com/niekwit/damid-seq?tab=readme-ov-file#aim)
* [DamID](https://github.com/niekwit/damid-seq?tab=readme-ov-file#damid)
* [Experimental considerations](https://github.com/niekwit/damid-seq?tab=readme-ov-file#experimental-considerations)
* [Requirements](https://github.com/niekwit/damid-seq?tab=readme-ov-file#requirements)
* [Dependency graph of Snakemake rules](https://github.com/niekwit/damid-seq?tab=readme-ov-file#dependency-graph-of-snakemake-rules)
* [Installation of Conda/Mamba](https://github.com/niekwit/damid-seq?tab=readme-ov-file#installation-of-condamamba)
* [Installation of Snakemake](https://github.com/niekwit/damid-seq?tab=readme-ov-file#installation-of-snakemake)
* [Cloning `damid-seq` GitHub repository](https://github.com/niekwit/damid-seq?tab=readme-ov-file#cloning-damid-seq-github-repository)
* [Preparing raw sequencing data](https://github.com/niekwit/damid-seq?tab=readme-ov-file#preparing-raw-sequencing-data)
* [Sample meta data and analysis settings](https://github.com/niekwit/damid-seq?tab=readme-ov-file#sample-meta-data-and-analysis-settings)
* [Configuration of Snakemake](https://github.com/niekwit/damid-seq?tab=readme-ov-file#configuration-of-snakemake)
* [Running the analysis with test data](https://github.com/niekwit/damid-seq?tab=readme-ov-file#running-the-analysis-with-test-data)
* [Dry-run of the analysis](https://github.com/niekwit/damid-seq?tab=readme-ov-file#dry-run-of-the-analysis)
* [Running the analysis](https://github.com/niekwit/damid-seq?tab=readme-ov-file#running-the-analysis)
* [Report of the results](https://github.com/niekwit/damid-seq?tab=readme-ov-file#report-of-the-results)
* [Literature](https://github.com/niekwit/damid-seq?tab=readme-ov-file#literature)
28+
1029
## Aim
1130

12-
`damid-seq` is a (containerized) Snakemake pipeline for reproducible analysis of single/paired-end DamID-seq short read Illumina data.
31+
`damid-seq` is a Snakemake pipeline for reproducible analysis of single/paired-end DamID-seq short read Illumina data.
1332

1433
The core of the pipeline is the Perl script [damidseq_pipeline](https://github.com/owenjm/damidseq_pipeline), which is a great tool for the first steps of analysing DamID-seq data. However, it does not process biological replicate data, and is not written with deployment to server, cluster, grid and cloud environments in mind.
1534

1635
`damid-seq` implements the [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow management system, which overcomes the above issues. In addition, we have added many features to the DamID-seq analysis workflow.
1736

18-
## Documentation
19-
20-
Documentation of how to use `damid-seq` can be found at https://damid-seq.readthedocs.io/en/latest/.
21-
22-
### Overview of documentation
23-
24-
* DamID
25-
- DamID principle
26-
- Experimental considerations
27-
* Installation
28-
- Conda/Mamba
29-
- Snakemake
30-
- Apptainer
31-
- Snakefetch
32-
* Usage
33-
- Preparing raw sequencing data
34-
- Sample meta data and analysis settings
35-
- Configuration of Snakemake
36-
- Dry-run of the analysis
37-
- Visualization of the workflow
38-
- Running the analysis
39-
- Report of the results
40-
* Output
41-
- Quality control
42-
- Visualization of damid-seq data
43-
- Peaks
44-
* Citation
45-
- Workflow
46-
- Software used in workflow
37+
The output of `damid-seq` is as follows:
38+
39+
1. Quality control of the adapter-trimmed sequencing data using FastQC/MultiQC.
2. Bigwig files for visualisation of binding in genome browsers, such as IGV.
3. PCA and correlation plots for checking consistency of biological replicates.
4. Identified and annotated peaks using MACS2 and/or find_peaks.pl.
5. Profile plot/heatmap to visualise binding around genomic features, such as transcription start sites, using deeptools.
48+
49+
## DamID
50+
51+
![DamID principle (Adapted from Van den Ameele et al. 2019 Current Opinion in Neurobiology)](/images/damid.png)
52+
Figure adapted from Van den Ameele et al. 2019 Current Opinion in Neurobiology
53+
54+
## Experimental considerations
55+
56+
TO DO
57+
58+
## Requirements
59+
60+
`damid-seq` has been extensively tested on GNU/Linux-based operating systems, so we advise running your analysis on, for example, Ubuntu or Fedora.
61+
62+
Hardware requirements depend on the kind of data to be analysed: for mammalian data sets, more than 32 GB of RAM is recommended. Much less RAM is needed for data from organisms with much smaller genomes, such as _Drosophila_.
63+
64+
## Dependency graph of Snakemake rules
65+
66+
![Dependency graph of rules](/images/rule_graph.png)
67+
68+
## Installation of Conda/Mamba
69+
70+
For reproducible analysis, `damid-seq` uses Conda environments in the Snakemake workflow.
71+
72+
Please follow the instructions [here](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) for a detailed guide to install Conda/Mamba.
73+
74+
## Installation of Snakemake
75+
76+
To install Snakemake create the following environment with `mamba`:
77+
78+
```shell
$ mamba create -n snakemake snakemake
```
81+
82+
Activate the environment as follows:
83+
84+
```shell
$ mamba activate snakemake
```
87+
88+
If you want to deploy Snakemake on an HPC system that uses Slurm, also run:
89+
90+
```shell
$ pip install snakemake-executor-plugin-slurm
```
93+
94+
## Cloning `damid-seq` GitHub repository
95+
96+
The easiest way to obtain the workflow code is to use [snakefetch](https://pypi.org/project/snakefetch/):
97+
98+
```shell
$ pip install snakefetch
$ snakefetch --outdir /path/to/analysis --repo-version v0.4.0 --url https://github.com/niekwit/damid-seq
Downloading archive file for version v0.4.0 from https://github.com/niekwit/damid-seq...
Extracting config and workflow directories from tar.gz file to /home/niek/Downloads/TEST...
Done!
```
105+
106+
This will copy the config and workflow directories to the path set with the `--outdir` flag.
107+
108+
## Preparing raw sequencing data
109+
110+
In the directory containing the `config` and `workflow` directories, create a directory called `reads`:
111+
112+
```shell
$ cd /path/to/analysis
$ mkdir -p reads
```
116+
117+
Data files from each group of biological replicates should be placed into a unique folder, e.g.:
118+
119+
```shell
reads
├── exp1
│   ├── Dam.fastq.gz
│   ├── HIF1A.fastq.gz
│   └── HIF2A.fastq.gz
├── exp2
│   ├── Dam.fastq.gz
│   ├── HIF1A.fastq.gz
│   └── HIF2A.fastq.gz
└── exp3
    ├── Dam.fastq.gz
    ├── HIF1A.fastq.gz
    └── HIF2A.fastq.gz
```
134+
135+
> [!IMPORTANT]
136+
> Single-end fastq files should always end with *fastq.gz*, while paired-end reads should end with *\_R1\_001.fastq.gz/\_R2\_001.fastq.gz*
137+
138+
> [!IMPORTANT]
139+
> The Dam-only control should always be called Dam.*relevant_extension*
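
The naming rules above can be checked before starting a run. The following is a minimal sketch and not part of the workflow itself: the function names `sample_name` and `check_reads_dir` are hypothetical, and it assumes that a paired-end Dam-only control is named `Dam_R1_001.fastq.gz`/`Dam_R2_001.fastq.gz`.

```python
from pathlib import Path
from typing import List, Optional

# Suffixes taken from the naming rules above. Paired-end suffixes are tried
# first because those file names also end with ".fastq.gz".
PAIRED_SUFFIXES = ("_R1_001.fastq.gz", "_R2_001.fastq.gz")
SINGLE_SUFFIX = ".fastq.gz"


def sample_name(filename: str) -> Optional[str]:
    """Return the sample name in a fastq file name, or None if the name
    does not follow the naming convention."""
    for suffix in PAIRED_SUFFIXES + (SINGLE_SUFFIX,):
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def check_reads_dir(reads: Path) -> List[str]:
    """Collect naming problems for every replicate folder inside `reads`."""
    problems = []
    for group in sorted(p for p in Path(reads).iterdir() if p.is_dir()):
        names = set()
        for fastq in sorted(group.iterdir()):
            name = sample_name(fastq.name)
            if name is None:
                problems.append(f"{fastq}: name does not end with fastq.gz")
            else:
                names.add(name)
        # Assumption: the control's sample name is exactly "Dam".
        if "Dam" not in names:
            problems.append(f"{group}: no Dam-only control found")
    return problems
```

Calling `check_reads_dir(Path("reads"))` from the analysis directory returns an empty list when every replicate folder follows the conventions.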
140+
141+
## Sample meta data and analysis settings
142+
143+
The `config` directory contains `samples.csv` with sample meta data as follows:
144+
145+
| sample    | genotype | treatment |
|-----------|----------|-----------|
| HIF1A     | WT       | Hypoxia   |
| HIF2A     | WT       | Hypoxia   |
| Dam       | WT       | Hypoxia   |
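
To catch mismatches between `samples.csv` and the files in `reads` early, the table can be cross-checked against a replicate folder. This is a hedged sketch only: `load_samples` and `samples_missing_reads` are hypothetical helper names, and it assumes that every value in the `sample` column must correspond to a fastq file with the same base name in each replicate folder, as the example table and directory tree suggest.

```python
import csv
from pathlib import Path
from typing import Dict, List

# Column names as shown in the example table above.
REQUIRED_COLUMNS = ("sample", "genotype", "treatment")


def load_samples(csv_path) -> List[Dict[str, str]]:
    """Read samples.csv and return one dict per row with the required columns."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"{csv_path}: missing column(s): {', '.join(missing)}")
        return [
            {c: (row.get(c) or "").strip() for c in REQUIRED_COLUMNS}
            for row in reader
        ]


def samples_missing_reads(samples: List[Dict[str, str]], group_dir) -> List[str]:
    """Return sample names that have neither a single-end file nor a
    complete R1/R2 pair in the given replicate folder."""
    group = Path(group_dir)
    missing = []
    for row in samples:
        name = row["sample"]
        single = group / f"{name}.fastq.gz"
        pair = [group / f"{name}_R1_001.fastq.gz", group / f"{name}_R2_001.fastq.gz"]
        if not (single.exists() or all(p.exists() for p in pair)):
            missing.append(name)
    return missing
```

For example, `samples_missing_reads(load_samples("config/samples.csv"), "reads/exp1")` lists the samples that have no matching reads in `reads/exp1`.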
150+
151+
`config.yaml` in the same directory contains the settings for the analysis:
152+
153+
```yaml
genome: dm6
ensembl_genome_build: 110
plasmid_fasta: none
fusion_genes: FBgn0038542,FBgn0085506 # Genes from these proteins will be removed from the analysis
bowtie2:
  extra: ""
damidseq_pipeline:
  normalization: kde # kde, rpm or rawbins
  binsize: 300
  extra: "" # extra argument for damidseq_pipeline
quantile_normalisation:
  apply: True
  extra: "" # extra arguments for quantile_normalization
deeptools:
  bamCoverage: # bam to bigwig conversion for QC
    binSize: 10
    normalizeUsing: RPKM
    extra: ""
  matrix: # Settings for computeMatrix
    mode: scale-regions # scale-regions or reference-point
    referencePoint: TSS # TSS, TES, center (only for reference-point mode)
    regionBodyLength: 6000
    upstream: 3000
    downstream: 3000
    binSize: 100
    averageTypeBins: mean
    regionsFileName: "" # BED or GTF file(s) with regions of interest (optional, whole genome if not specified)
    no_whole_genome: False # If True, will omit whole genome as region and only use regionsFileName(s)
    extra: "" # Any additional parameters for computeMatrix
  plotHeatmap:
    interpolationMethod: auto
    plotType: lines # lines, fill, se, std
    colorMap: viridis # https://matplotlib.org/2.0.2/users/colormaps.html
    alpha: 1.0
    extra: ""
peak_calling_perl:
  run: True
  iterations: 5 # N argument
  fdr: 0.01
  fraction: 0 # Fraction of random fragments to consider per iteration
  min_count: 2 # Minimum number of reads to consider a peak
  min_quantile: 0.95 # Minimum quantile for considering peaks
  step: 0.01 # Stepping for quantiles
  unified_peaks: max # Method for calling peak overlaps. 'min': call minimum overlapping peak area. 'max': call maximum overlap as peak
  extra: ""
peak_calling_macs2:
  run: False
  mode: narrow
  qvalue: 0.05 # for narrow peaks
  broad_cutoff: 0.1 # for broad peaks
  extra: ""
consensus_peaks:
  max_size: 10 # Maximum size of peaks to be extended
  extend_by: 40 # Number of bp to extend peaks on either side
  keep: 2 # Minimum number peaks that must overlap to keep
resources: # computing resources
  trim:
    cpu: 8
    time: 60
  fastqc:
    cpu: 4
    time: 60
  damid:
    cpu: 24
    time: 720
  index:
    cpu: 40
    time: 60
  deeptools:
    cpu: 8
    time: 90
  plotting:
    cpu: 2
    time: 20
```
229+
230+
A lot of the DamID signal can come from the plasmids that were used to express the Dam-POIs, and this can skew the analysis.
231+
232+
To prevent this, two approaches are available:
233+
234+
1. The genes (Ensembl gene IDs) fused to Dam can be set in config.yaml["fusion_genes"] (separated by commas if multiple plasmids are used). This will mask the genomic locations of these genes in the fasta file that will be used to build the Bowtie2 index, hence excluding these regions from the analysis.
235+
236+
> [!NOTE]
237+
> To disable this function set the value of config.yaml["fusion_genes"] to "".
238+
239+
2. If a plasmid is used that, for example, also contains an endogenous promoter besides the Dam fusion protein, one can set the path to a fasta file containing all the plasmid sequences in config.yaml["plasmid_fasta"]. Trimmed reads are first aligned to these sequences, and only the reads that do not align are then processed as normal.
240+
241+
It is recommended to store this file in a directory called resources within the analysis folder (this folder will also contain all other non-experimental files such as fasta and gtf files).
242+
243+
> [!NOTE]
244+
> To disable this function set the value of config.yaml["plasmid_fasta"] to none.
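
The two settings and their "disabled" values can be summarised in a short sketch. It is only an illustration of the conventions described above, not the code the workflow uses: `plasmid_settings` is a hypothetical helper that takes the already-parsed contents of `config.yaml` as a dictionary, and `resources/plasmids.fa` in the usage note is a made-up file name.

```python
from typing import Dict


def plasmid_settings(config: Dict[str, object]) -> Dict[str, object]:
    """Interpret the two plasmid-related settings of config.yaml.

    fusion_genes: comma-separated Ensembl gene IDs, "" disables masking.
    plasmid_fasta: path to a fasta file, "none" disables the pre-alignment.
    """
    raw_genes = config.get("fusion_genes") or ""
    genes = [g.strip() for g in str(raw_genes).split(",") if g.strip()]

    raw_fasta = str(config.get("plasmid_fasta") or "").strip()
    # Treat both an empty value and the literal "none" as "not set".
    fasta = None if raw_fasta.lower() in ("", "none") else raw_fasta

    return {"mask_genes": genes, "plasmid_fasta": fasta}
```

With the example configuration shown earlier, this yields two gene IDs to mask and no plasmid fasta; with `fusion_genes: ""` and `plasmid_fasta: resources/plasmids.fa` it yields an empty gene list and the fasta path.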
245+
246+
247+
## Configuration of Snakemake
248+
249+
Running Snakemake can entail quite a few command line flags. To simplify this, they can be set in a global profile that is stored in a user-specific configuration directory.
250+
251+
For example, a profile `config.yaml` can be stored at /home/user/.config/snakemake/profile:
252+
```yaml
cores: 40
latency-wait: 20
use-conda: True
use-apptainer: True
keep-going: False
rerun-incomplete: True
printshellcmds: True
show-failed-logs: True
```
262+
263+
When running on a slurm-based HPC, the following lines should be included in `config.yaml`:
264+
```yaml
executor: slurm
jobs: 100
apptainer-args: "--bind '/parent_dir/of/analysis'" # if analysis is not in /home/$USER
default-resources:
  slurm_partition: icelake
  slurm_account: <ACCOUNT>
```
272+
273+
Some systems have limited space allocated to `/tmp`, which can be problematic when using Apptainer. Add the following line to `~/.bashrc` to set a different temporary directory location:
274+
275+
```shell
export APPTAINER_TMPDIR=~/rds/hpc-work/apptainer_tmp
```
278+
279+
## Dry-run of the analysis
280+
281+
Before running the actual analysis with your own data, a dry-run can be performed:
282+
283+
```shell
$ cd path/to/analysis/directory
$ snakemake -np
```
287+
288+
Snakemake will create the DAG of jobs and print the shell commands, but it will not execute anything.
289+
290+
## Visualization of workflow
291+
292+
To visualize the workflow, run the following (this command excludes the target rule):
293+
```shell
$ mkdir -p images
$ snakemake --forceall --rulegraph | grep -v '\-> 0\|0\[label = \"all\"' | dot -Tpng > images/rule_graph.png
```
297+
298+
## Running the analysis
299+
300+
Once you know that the test and/or dry run has worked, the actual analysis can be initiated as follows:
301+
```shell
$ snakemake --profile /home/user/.config/snakemake/profile
```
304+
305+
> [!IMPORTANT]
306+
> Always provide the location of the profile as a full absolute path (i.e. /home/user/.config/...) rather than the tilde shorthand (~/.config/...).
307+
308+
## Report of the results
309+
310+
When the analysis has finished successfully, an HTML report can be created as follows:
311+
312+
```shell
$ snakemake --report report.html
```
315+
316+
This report will contain run time information for the Snakemake rules, as well as figures generated by the workflow, and the code used to create these.
317+
318+
## Literature
319+
320+
:information_source: Some key DamID papers:
321+
322+
Van Steensel and Henikoff. Identification of in vivo DNA targets of chromatin proteins using tethered Dam methyltransferase. 2000 Nature Biotechnology.
323+
324+
Marshall et al. Cell-type-specific profiling of protein–DNA interactions without cell isolation using targeted DamID with next-generation sequencing. 2016 Nature Protocols.
325+
326+
Van den Ameele, Krautz and Brand. TaDa! Analysing cell type-specific chromatin in vivo with Targeted DamID. 2019 Current Opinion in Neurobiology.

docs/citation.rst

+10-1
@@ -12,4 +12,13 @@ Software used in workflow
1212

1313
Please also cite the software used in this workflow:
1414

15-
TO DO
15+
* FastQC
16+
* MultiQC
17+
* samtools
18+
* Trim Galore
19+
* Bowtie2
20+
* deeptools
21+
* damidseq_pipeline
22+
* bedtools
23+
* MACS2
24+
* Biopython

docs/damid.rst

+10-7
@@ -1,6 +1,15 @@
11
DamID principle
22
---------------
33

4+
DamID is a method that identifies genomic binding sites of a protein of interest (POI). In contrast to ChIP-seq, no antibodies against the POI are required.
5+
6+
The method is based on fusing the POI to a DNA adenine methyltransferase (Dam) from *E. coli*. Dam methylates adenine bases in GATC motifs, a modification that is largely absent from eukaryotic genomes.
7+
8+
To identify the genomic binding sites of the POI, the Dam-fusion protein is expressed in cells. Over time, the Dam-fusion protein will bind to the chromatin and methylate the GATC motifs in the vicinity of the binding sites. These methylation marks effectively function as a recording of the history of the chromatin binding of the POI.
9+
10+
For quantification, DNA is isolated and methylated DNA is cloned and subsequently sequenced.
11+
12+
413
.. figure:: images/damid.png
514
:align: center
615
:width: 1000
@@ -11,10 +20,4 @@ DamID principle
1120
Experimental considerations
1221
---------------------------
1322

14-
#. Make sure that you have sequence validated your Dam expressing plasmids prior to your experiment. Cryptic expression of Dam can cause toxicity in *E. coli* and can cause selection for Dam inactivating mutations.
15-
16-
17-
Critical reagents
18-
-----------------
19-
20-
TO DO
23+
#. Make sure that you have sequence validated your Dam expressing plasmids prior to your experiment. Cryptic expression of Dam can cause toxicity in *E. coli* and can cause selection for Dam inactivating mutations.
