If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).
# CONTENTS

* [Aim](https://github.com/niekwit/damid-seq?tab=readme-ov-file#aim)
* [DamID](https://github.com/niekwit/damid-seq?tab=readme-ov-file#damid)
* [Experimental considerations](https://github.com/niekwit/damid-seq?tab=readme-ov-file#experimental-considerations)
* [Requirements](https://github.com/niekwit/damid-seq?tab=readme-ov-file#requirements)
* [Dependency graph of Snakemake rules](https://github.com/niekwit/damid-seq?tab=readme-ov-file#dependency-graph-of-snakemake-rules)
* [Installation of Conda/Mamba](https://github.com/niekwit/damid-seq?tab=readme-ov-file#installation-of-condamamba)
* [Installation of Snakemake](https://github.com/niekwit/damid-seq?tab=readme-ov-file#installation-of-snakemake)
* [Cloning `damid-seq` GitHub repository](https://github.com/niekwit/damid-seq?tab=readme-ov-file#cloning-damid-seq-github-repository)
* [Preparing raw sequencing data](https://github.com/niekwit/damid-seq?tab=readme-ov-file#preparing-raw-sequencing-data)
* [Sample meta data and analysis settings](https://github.com/niekwit/damid-seq?tab=readme-ov-file#sample-meta-data-and-analysis-settings)
* [Configuration of Snakemake](https://github.com/niekwit/damid-seq?tab=readme-ov-file#configuration-of-snakemake)
* [Running the analysis with test data](https://github.com/niekwit/damid-seq?tab=readme-ov-file#running-the-analysis-with-test-data)
* [Dry-run of the analysis](https://github.com/niekwit/damid-seq?tab=readme-ov-file#dry-run-of-the-analysis)
* [Visualization of workflow](https://github.com/niekwit/damid-seq?tab=readme-ov-file#visualization-of-workflow)
* [Running the analysis](https://github.com/niekwit/damid-seq?tab=readme-ov-file#running-the-analysis)
* [Report of the results](https://github.com/niekwit/damid-seq?tab=readme-ov-file#report-of-the-results)
* [Literature](https://github.com/niekwit/damid-seq?tab=readme-ov-file#literature)
## Aim

`damid-seq` is a Snakemake pipeline for reproducible analysis of single- or paired-end DamID-seq short-read Illumina data.
The core of the pipeline is the Perl script [damidseq_pipeline](https://github.com/owenjm/damidseq_pipeline), which is a great tool for the first steps of analysing DamID-seq data. However, it does not process biological replicate data, and is not written with deployment to server, cluster, grid and cloud environments in mind.

`damid-seq` implements the [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow management system, which overcomes the above issues. In addition, we have added many features to the DamID-seq analysis workflow.
The output of `damid-seq` is as follows:

1. Quality control of the adapter-trimmed sequencing data using FastQC/MultiQC.

2. Bigwig files for visualisation of binding in genome browsers, such as IGV.

3. PCA and correlation plots for checking the consistency of biological replicates.

4. Identified and annotated peaks using MACS2 and/or `find_peaks.pl`.

5. Profile plots/heatmaps to visualise binding around genomic features, such as transcription start sites, using deepTools.
## DamID

Figure adapted from Van den Ameele et al. (2019), *Current Opinion in Neurobiology*.
## Experimental considerations

TO DO
## Requirements

`damid-seq` has been extensively tested on GNU/Linux-based operating systems, so we advise running your analysis on, for example, Ubuntu or Fedora.

Hardware requirements depend on the kind of data to be analysed: for mammalian data sets, > 32 GB of RAM is recommended. Much less RAM is needed for data from organisms with much smaller genomes, such as _Drosophila_.
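On most Linux systems you can quickly check the available memory and CPU count before starting an analysis, for example:

```shell
$ free -h   # total and available RAM
$ nproc     # number of available CPU cores
```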
## Dependency graph of Snakemake rules
## Installation of Conda/Mamba

For reproducible analysis, `damid-seq` uses Conda environments in the Snakemake workflow.

Please follow the instructions [here](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) for a detailed guide to install Conda/Mamba.
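At the time of writing, the linked Snakemake guide recommends a Miniforge-based installation. The commands below are a sketch of that route (check the guide for the current instructions):

```shell
# Download and run the Miniforge installer for your platform
$ curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
$ bash Miniforge3-$(uname)-$(uname -m).sh
```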
## Installation of Snakemake

To install Snakemake, create the following environment with `mamba`:

```shell
$ mamba create -n snakemake snakemake
```

Activate the environment as follows:

```shell
$ mamba activate snakemake
```

If you want to deploy Snakemake on an HPC system using Slurm, also run:

```shell
$ pip install snakemake-executor-plugin-slurm
```
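To confirm that the environment is set up correctly, check that Snakemake is available from the activated environment:

```shell
$ snakemake --version
```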
## Cloning `damid-seq` GitHub repository

The easiest way to obtain the workflow code is to use [snakefetch](https://pypi.org/project/snakefetch/):

```shell
$ pip install snakefetch
$ snakefetch --outdir /path/to/analysis --repo-version v0.4.0 --url https://github.com/niekwit/damid-seq
Downloading archive file for version v0.4.0 from https://github.com/niekwit/damid-seq...
Extracting config and workflow directories from tar.gz file to /home/niek/Downloads/TEST...
Done!
```

This will copy the config and workflow directories to the path set with the `--outdir` flag.
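Alternatively, if you prefer a full clone of the repository (for example to track a specific branch), the standard git workflow also works; the `config` and `workflow` directories can then be copied into your analysis directory:

```shell
$ git clone https://github.com/niekwit/damid-seq.git
$ cp -r damid-seq/config damid-seq/workflow /path/to/analysis/
```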
## Preparing raw sequencing data

In the directory containing the `config` and `workflow` directories, create a directory called `reads`:

```shell
$ cd /path/to/analysis
$ mkdir -p reads
```

Data files from each group of biological replicates should be placed into a unique folder, e.g.:

```shell
reads
├── exp1
│   ├── Dam.fastq.gz
│   ├── HIF1A.fastq.gz
│   └── HIF2A.fastq.gz
├── exp2
│   ├── Dam.fastq.gz
│   ├── HIF1A.fastq.gz
│   └── HIF2A.fastq.gz
└── exp3
    ├── Dam.fastq.gz
    ├── HIF1A.fastq.gz
    └── HIF2A.fastq.gz
```

> [!IMPORTANT]
> Single-end fastq files should always end with *.fastq.gz*, while paired-end reads should end with *\_R1\_001.fastq.gz*/*\_R2\_001.fastq.gz*.

> [!IMPORTANT]
> The Dam-only control should always be called Dam.*relevant_extension*.
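For paired-end data the same layout applies, with two files per sample following the naming rule above. As an illustration, a hypothetical `exp1` folder with paired-end reads might look like this:

```shell
reads
└── exp1
    ├── Dam_R1_001.fastq.gz
    ├── Dam_R2_001.fastq.gz
    ├── HIF1A_R1_001.fastq.gz
    ├── HIF1A_R2_001.fastq.gz
    ├── HIF2A_R1_001.fastq.gz
    └── HIF2A_R2_001.fastq.gz
```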
## Sample meta data and analysis settings

The `config` directory contains `samples.csv` with sample meta data as follows:

| sample | genotype | treatment |
|--------|----------|-----------|
| HIF1A  | WT       | Hypoxia   |
| HIF2A  | WT       | Hypoxia   |
| Dam    | WT       | Hypoxia   |
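As a sketch of the underlying file format (assuming the column layout shown above), the raw `samples.csv` is simply a comma-separated file:

```
sample,genotype,treatment
HIF1A,WT,Hypoxia
HIF2A,WT,Hypoxia
Dam,WT,Hypoxia
```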
`config.yaml` in the same directory contains the settings for the analysis:

```yaml
genome: dm6
ensembl_genome_build: 110
plasmid_fasta: none
fusion_genes: FBgn0038542,FBgn0085506 # These genes will be masked and excluded from the analysis
bowtie2:
  extra: ""
damidseq_pipeline:
  normalization: kde # kde, rpm or rawbins
  binsize: 300
  extra: "" # extra arguments for damidseq_pipeline
quantile_normalisation:
  apply: True
  extra: "" # extra arguments for quantile_normalization
deeptools:
  bamCoverage: # bam to bigwig conversion for QC
    binSize: 10
    normalizeUsing: RPKM
    extra: ""
  matrix: # Settings for computeMatrix
    mode: scale-regions # scale-regions or reference-point
    referencePoint: TSS # TSS, TES, center (only for reference-point mode)
    regionBodyLength: 6000
    upstream: 3000
    downstream: 3000
    binSize: 100
    averageTypeBins: mean
    regionsFileName: "" # BED or GTF file(s) with regions of interest (optional, whole genome if not specified)
    no_whole_genome: False # If True, will omit whole genome as region and only use regionsFileName(s)
    extra: "" # Any additional parameters for computeMatrix
  plotHeatmap:
    interpolationMethod: auto
    plotType: lines # lines, fill, se, std
    colorMap: viridis # https://matplotlib.org/2.0.2/users/colormaps.html
    alpha: 1.0
    extra: ""
peak_calling_perl:
  run: True
  iterations: 5 # N argument
  fdr: 0.01
  fraction: 0 # Fraction of random fragments to consider per iteration
  min_count: 2 # Minimum number of reads to consider a peak
  min_quantile: 0.95 # Minimum quantile for considering peaks
  step: 0.01 # Stepping for quantiles
  unified_peaks: max # Method for calling peak overlaps. 'min': call minimum overlapping peak area. 'max': call maximum overlap as peak
  extra: ""
peak_calling_macs2:
  run: False
  mode: narrow
  qvalue: 0.05 # for narrow peaks
  broad_cutoff: 0.1 # for broad peaks
  extra: ""
consensus_peaks:
  max_size: 10 # Maximum size of peaks to be extended
  extend_by: 40 # Number of bp to extend peaks on either side
  keep: 2 # Minimum number of peaks that must overlap to keep
resources: # computing resources
  trim:
    cpu: 8
    time: 60
  fastqc:
    cpu: 4
    time: 60
  damid:
    cpu: 24
    time: 720
  index:
    cpu: 40
    time: 60
  deeptools:
    cpu: 8
    time: 90
  plotting:
    cpu: 2
    time: 20
```
A lot of the DamID signal can come from the plasmids that were used to express the Dam-POIs, and this can skew the analysis.

To prevent this, two approaches are available (see the example below this list):

1. The genes fused to Dam can be set as Ensembl gene IDs in `fusion_genes` in `config.yaml` (separated by commas if multiple plasmids are used). This will mask the genomic locations of these genes in the fasta file that will be used to build the Bowtie2 index, hence excluding these regions from the analysis.

> [!NOTE]
> To disable this function, set the value of `fusion_genes` in `config.yaml` to "".

2. If a plasmid is used that, for example, also contains an endogenous promoter besides the Dam fusion protein, one can set the path to a fasta file containing all the plasmid sequences in `plasmid_fasta` in `config.yaml`. Trimmed reads are first aligned to these sequences, and the resulting non-aligning reads are then processed as normal.

It is recommended to store this file in a directory called `resources` within the analysis folder (this folder will also contain all other non-experimental files, such as fasta and gtf files).

> [!NOTE]
> To disable this function, set the value of `plasmid_fasta` in `config.yaml` to none.
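For example, to mask the two fusion genes shown above and additionally pre-filter reads against plasmid sequences stored in `resources` (the fasta file name below is a placeholder), the relevant `config.yaml` entries would look like this:

```yaml
# Mask the Dam-fusion genes in the Bowtie2 index (set to "" to disable)
fusion_genes: FBgn0038542,FBgn0085506
# Pre-filter trimmed reads against plasmid sequences (set to none to disable)
plasmid_fasta: resources/plasmids.fasta
```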
## Configuration of Snakemake

Running Snakemake can entail quite a few command line flags. To simplify this, they can be set in a global profile defined in a user-specific configuration directory.

For example, a profile `config.yaml` can be stored at /home/user/.config/snakemake/profile:

```yaml
cores: 40
latency-wait: 20
use-conda: True
use-apptainer: True
keep-going: False
rerun-incomplete: True
printshellcmds: True
show-failed-logs: True
```

When running on a Slurm-based HPC, the following lines should be included in `config.yaml`:

```yaml
executor: slurm
jobs: 100
apptainer-args: "--bind '/parent_dir/of/analysis'" # if the analysis is not in /home/$USER
default-resources:
  slurm_partition: icelake
  slurm_account: <ACCOUNT>
```

Some systems have limited space allocated to `/tmp`, which can be problematic when using Apptainer. Add the following line to `~/.bashrc` to set a different temporary directory location:

```shell
export APPTAINER_TMPDIR=~/rds/hpc-work/apptainer_tmp
```
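To check that Snakemake picks up the profile, it can be combined with a dry-run (see the next section) before starting the real analysis:

```shell
$ snakemake --profile /home/user/.config/snakemake/profile -np
```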
## Dry-run of the analysis

Before running the actual analysis with your own data, a dry-run can be performed:

```shell
$ cd /path/to/analysis
$ snakemake -np
```

Snakemake will create the DAG of jobs and print the shell commands, but it will not execute anything.
## Visualization of workflow

To visualize the workflow, run the following (this command excludes the target rule):

```shell
$ mkdir -p images
$ snakemake --forceall --rulegraph | grep -v '\-> 0\|0\[label = \"all\"' | dot -Tpng > images/rule_graph.png
```
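This assumes that Graphviz (which provides the `dot` command) is available; if it is not, it can be installed into the active Conda environment, for example:

```shell
$ mamba install -c conda-forge graphviz
```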
## Running the analysis

Once you know that the test and/or dry-run has worked, the actual analysis can be initiated as follows:

```shell
$ snakemake --profile /home/user/.config/snakemake/profile --directory /path/to/analysis
```

> [!IMPORTANT]
> Always make sure to use the absolute path (i.e. /home/user/.config/...) rather than the relative path (~/.config/...) when providing the path to the profile.
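A full run can take a long time, so when working on a local machine (rather than submitting via Slurm) it may be worth detaching the run from your terminal session, for example with `nohup` (a `tmux` or `screen` session works equally well):

```shell
$ nohup snakemake --profile /home/user/.config/snakemake/profile > snakemake.log 2>&1 &
```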
## Report of the results

When the analysis has finished successfully, an HTML report can be created as follows:

```shell
$ snakemake --report report.html
```

This report contains run time information for the Snakemake rules, as well as the figures generated by the workflow and the code used to create them.
## Literature

:information_source: Some key DamID papers:

Van Steensel and Henikoff (2000). Identification of in vivo DNA targets of chromatin proteins using tethered Dam methyltransferase. *Nature Biotechnology*.

Marshall et al. (2016). Cell-type-specific profiling of protein–DNA interactions without cell isolation using targeted DamID with next-generation sequencing. *Nature Protocols*.

Van den Ameele, Krautz and Brand (2019). TaDa! Analysing cell type-specific chromatin in vivo with Targeted DamID. *Current Opinion in Neurobiology*.