License: GPL v3

Data, Code and Workflows Guideline

To give eBook authors a better sense of the workflow layout, we briefly describe the purpose of each directory.

  1. cache: Stores intermediate datasets or results generated during the preprocessing steps.
  2. graphs: The graphs/figures produced during the analysis.
  3. input: Stores the raw input data. Files larger than 100 MB are not allowed. We recommend using a small sample dataset to illustrate the workflow. If you have files larger than 100 MB, please contact the chapter editor to find a solution.
  4. lib: The source code, functions, or algorithms used within the workflow.
  5. output: The final output results of the workflow.
  6. workflow: Step by step pipeline. It may contain some sub-directories.
    • We suggest using a numbering system and keywords to indicate the order and the main purpose of each script, e.g., 1_fastq_quality_checking.py, 2_cleaned_reads_alignment.py.
    • To ensure reproducibility, please use relative paths within the workflow.
  7. README: In the readme file, please briefly describe the purpose of the repository, the installation, and the input data format.
    • We recommend using a diagram to describe the workflow briefly.
    • Provide the installation details.
    • Show a small portion of the input data (e.g., the head or tail of the file), unless the data file is in a well-known standard format.
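
The relative-path and file-size recommendations above can be sketched in Python. This is only an illustrative sketch: the helper names project_root and oversized_inputs are hypothetical, the script is assumed to live directly inside workflow/, and "100M" is interpreted here as 100 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: the "100M" limit is read as 100 * 1024 * 1024 bytes.
MAX_INPUT_BYTES = 100 * 1024 * 1024


def project_root(script_path):
    """Return the repository root, assuming the script sits directly in workflow/."""
    return Path(script_path).resolve().parent.parent


def oversized_inputs(input_dir, limit=MAX_INPUT_BYTES):
    """Return a sorted list of files under input_dir that are larger than limit bytes."""
    input_dir = Path(input_dir)
    if not input_dir.is_dir():
        return []
    return sorted(
        path
        for path in input_dir.rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Inside a workflow script, calling project_root(__file__) and joining the result with "input" or "output" keeps every path independent of the machine the workflow runs on, and oversized_inputs(project_root(__file__) / "input") lists any files that would need to be discussed with the chapter editor.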

Overview of an example workflow: Fastq data quality checking

This is an example workflow that checks the quality of paired-end FASTQ files using the FastQC software.

Installation

Input Data

The example data used here are paired-end FASTQ files generated on an Illumina platform.

  • R1 FASTQ file: input/reads1.fastq
  • R2 FASTQ file: input/reads2.fastq

Each entry in a FASTQ file consists of 4 lines:

  1. A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
  2. The sequence (the base calls; A, C, T, G and N).
  3. A separator, which is simply a plus (+) sign.
  4. The base call quality scores. These are Phred+33 encoded, using ASCII characters to represent the numerical quality scores.
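
The four-line layout described above can be read with a few lines of Python. This is a minimal sketch rather than a full FASTQ parser: the function names read_fastq and phred_scores are hypothetical, and it assumes strict four-line records with no blank lines. The offset is left as a parameter because not every file uses the same quality encoding.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from an iterable of FASTQ lines."""
    iterator = iter(lines)
    while True:
        # Take the next four lines and strip the line endings.
        block = [line.rstrip("\r\n") for line in islice(iterator, 4)]
        if not block:
            return
        if len(block) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = block
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to numeric scores by subtracting the offset."""
    return [ord(char) - offset for char in quality]
```

An open file handle can be passed directly, for example read_fastq(open("input/reads1.fastq")), since iterating over a text file yields its lines.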

The first entry of the input data:

@HWI-ST361_127_1000138:2:1101:1195:2141/1
CGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGAGGGGTTNNNNNNNNNNNNNNN
+
[[[_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Major steps

Step 1: run FastQC to conduct quality checking

  • Note that you may have to adjust the paths in the shell script to match your environment.
sh workflow/1_run_fastqc.sh

Step 2: aggregate results from FastQC

sh workflow/2_aggregate_results.sh
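
For convenience, the two shell steps above could also be launched in order from a small Python driver. This is a sketch under assumptions: the name run_steps and the injectable runner argument are hypothetical, and the driver must be started from the repository root so that the relative script paths resolve.

```python
import subprocess

# Step scripts named in this README, in execution order.
STEPS = (
    "workflow/1_run_fastqc.sh",
    "workflow/2_aggregate_results.sh",
)


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each step with sh; check=True stops the run at the first failing step."""
    for script in steps:
        print(f"Running {script}")
        runner(["sh", script], check=True)
```

Calling run_steps() runs both steps and raises subprocess.CalledProcessError if either script exits with a non-zero status, so a failed quality check does not silently feed into the aggregation step.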

Step 3: view the results

  • Results can be viewed by opening output/multiqc_report.html in a web browser.
  • Alternatively, you can plot the results yourself using the R code below.
3_visualize_results.Rmd

Expected results

License

This is free and open source software, licensed under (choose a license from the suggested list: GPLv3, MIT, or CC BY 4.0).