created 27.01.2020 last edit: 21.06.2020 by Marc Ruebsam
================================
Information text about the folder: what data is stored here and who is directly responsible for the contents and data structure.
Project Folder: 16S_Metabarcoding

Description:
Microbiota profiling with long amplicons using nanopore sequencing. This folder is used for 16S metabarcoding analysis using the Nanopore MinION platform. It contains subfolders for storage of raw data (created during the sequencing run), intermediate processing steps, analysis results and reports (created by the pipeline). The first three are organized per run (unique run ID), the last one is separated by time points and sample IDs. Additional information regarding the origin and processing of each sample can be found in the METADATA directory. The pipeline used for analysis of the data is based on the snakemake files in the Pipeline directory. Documentation contains written evaluations of the results, presentations and references.

Type of Data: Fast5/Fastq/Pictures/Report files

Participants:
Silke Grauling-Halama ([email protected])
Pornpimol Charoentong ([email protected])
Marc Ruebsam ([email protected])

Correspondence: Silke Grauling-Halama ([email protected])
The raw data is created by the MinKNOW software during the MinION run. Crucial information regarding the samples, library preparation and sequencing run is stored in the METADATA/EXPERIMENT_SEQUENCING.xlsx file.
== Sequencing with MinKNOW ==
- Follow the protocol for flow cell preparation and loading of the library
- Note down the number of pores during the flow cell check
- Select the folder specified as Project (see above) as output directory (will always be a "00_raw_data" folder inside the project specific folder)
- Specify an experiment and sample name (as above)
- Please do not use spaces or special characters (e.g. äöüß.,<>|), except "_"
- Activate VBZ compression
- Deactivate live basecalling
- Start the sequencing run
- MinKNOW will create the output directory inside 00_raw_data according to the experiment and sample names specified
- if you want to create and save PDF reports, create a folder called "reports" inside the run ID folder (the same folder that contains the fast5 folder)
- move the PDFs in there (a sketch of these steps follows below)
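A minimal sketch of these two manual steps, assuming a hypothetical run ID (taken from the rerun example further below) and that the PDFs were saved to the current directory:
  ## hypothetical run ID and PDF names, for illustration only
  run="20191007_1559_MN31344_FAK76605_2bf006ff"
  mkdir -p "00_raw_data/${run}/reports"
  mv ./report_*.pdf "00_raw_data/${run}/reports/"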
== After the run has finished ==
- Make sure the sequencing run has finished successfully and the raw data is in the expected location
- Close all open folders and files related to the current run
MeBaPiNa (metabarcoding analysis pipeline for Nanopore datasets) is a pipeline implemented in snakemake. It takes raw fast5 files and automatically processes them according to the specifications in the METADATA/PIPELINE_CONFIG.yaml file. Statistics and figures for the requested samples are reported in the 03_report directory.
== Installing snakemake ==
- Has to be done once on the machine used for running the Pipeline
- PLEASE THINK BEFORE COPY & PASTING
- Install conda
  ## update machine
  sudo yum update -y
  sudo yum upgrade -y
  sudo yum install -y epel-release ## EPEL repository for yum
  sudo yum install -y cifs-utils ## mounting of windows network shares
  sudo yum install -y wget ## query downloads
  sudo yum install -y git ## git

  ## get latest miniconda
  wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh

  ## create directory for miniconda owned by user
  sudo mkdir -p /opt/miniconda
  pid_me=$(id --user)
  gid_me=$(id --group)
  sudo chown ${pid_me}:${gid_me} /opt/miniconda
  unset pid_me gid_me

  ## execute installation script
  bash miniconda.sh -b -p /opt/miniconda/miniconda3
  export PATH="/opt/miniconda/miniconda3/bin:$PATH"
  rm miniconda.sh

  ## update and initialize
  hash -r
  conda update -y -q conda
  conda init bash
- Install snakemake (create snakemake environment with conda)
  ## create conda environment for snakemake
  conda create -y -c defaults -c bioconda -c conda-forge -n snakemake snakemake xlrd
  ## alternatively use the configuration for 5.10
  conda env create -f Pipeline/MeBaPiNa/envs/snakemake.yml -n snakemake
  ## remove temp files
  conda clean -y --all
- Install guppy standalone
  - download guppy from https://europe.oxfordnanoportal.com/software/analysis/ont_guppy_3.4.1-1~xenial_amd64.deb
  - install
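To check that the installation succeeded, something along these lines can be run (a sketch; it assumes conda and guppy were installed as above and the shell was restarted after "conda init bash"):
  ## activate the environment and print tool versions
  conda activate snakemake
  snakemake --version
  guppy_basecaller --version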
== Before running the Pipeline ==
- Update the METADATA/PIPELINE_CONFIG.yaml file (an illustrative sketch of the file follows this list)
- Experiment specifications
- project, tmp: build the path to the project directory like {project}/{tmp}
- meta: path to the EXPERIMENT_SEQUENCING.xlsx inside the project directory like {project}/{tmp}/{meta}
- log: path to the ANALYSIS_PROGRESS_MANAGEMENT.csv inside the project directory like {project}/{tmp}/{log}
- samples: list of sample names from the EXPERIMENT_SEQUENCING.xlsx to analyze
- Methods for analysis
- methodologie: list of method abbreviations to use for analysis
- Workstation specifications
- gpu: whether or not a guppy compatible GPU is available
- Filtering options
- q_min: minimal quality score for read filtering
- len_min: minimal length for read filtering
- len_max: maximal length for read filtering
- min_featurereads: minimum number of reads per cluster or taxon
- min_readidentity: minimum read identity for clustering (OTU)
- min_confidence: confidence used for classification of taxa at {refrank} level
- plot_sample: down-sampling reads in plots to this number to reduce computational complexity (distorts statistics shown in some plots)
- Reference database (currently only {refsource} is implemented)
- source: which reference database to use
- rank: at which rank the taxonomy should be determined
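The following is an illustrative sketch of a PIPELINE_CONFIG.yaml. The key names follow the list above, but the flat layout and all values are assumptions and must be checked against the real file in METADATA/:
  project: "16S_Metabarcoding"                     ## combined like {project}/{tmp}
  tmp: ""
  meta: "METADATA/EXPERIMENT_SEQUENCING.xlsx"      ## combined like {project}/{tmp}/{meta}
  log: "METADATA/ANALYSIS_PROGRESS_MANAGEMENT.csv" ## combined like {project}/{tmp}/{log}
  samples: ["sample01", "sample02"]                ## hypothetical sample names
  methodologie: ["align", "kmer", "otu"]           ## hypothetical method abbreviations
  gpu: false
  q_min: 7                                         ## example filtering values
  len_min: 1200
  len_max: 1800
  min_featurereads: 3
  min_readidentity: 0.97
  min_confidence: 0.8
  plot_sample: 10000
  source: "silva"                                  ## {refsource}; currently the only implemented one
  rank: "Species"                                  ## {refrank}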
== Run MeBaPiNa ==
I. Run full pipeline
- Activate the environment if conda is used for snakemake
  ## activate conda environment
  conda activate snakemake
- Change directory into the project directory (or use absolute paths below)
- DRY RUN: Optionally execute the pipeline as a dry run showing all rules that are going to be executed
- the '-n' invokes a dry run and '-pr' includes higher verbosity
- change the path to the config file and Pipeline Snakefile if necessary
  snakemake -npr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile
- Execute the pipeline
- change the path to the config file and Pipeline Snakefile if necessary
  snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile
- Please note that --cores 'all' might not succeed in requesting all available threads on the machine and may have to be changed to an integer manually (see the example below)
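For example, on a machine with 8 available threads (the thread count here is just an assumption) the call becomes:
  snakemake -pr --use-conda --cores 8 --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile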
II. Perform only basecalling
- In case you only want to basecall the data and have a look at the data before running the full pipeline
- Or in case you want to perform basecalling on a separate machine from the rest of the pipeline
- you can run
  snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile only_basecall
- output will be generated inside the 03_report/ directory and can be reviewed
- please note that this instance of the pipeline has to finish before you can start further analysis by invoking the command from I.
- in other words: no dependency between this instance and another can be established automatically, so the order has to be ensured manually (see the sketch below)
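One way to ensure this order manually is to chain both invocations, so the full pipeline only starts after basecalling finished successfully (a sketch, assuming both instances run on the same machine):
  snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile only_basecall \
  && snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile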
III. Update report file / add missing reports
- In case you want to update the log file ANALYSIS_PROGRESS_MANAGEMENT.csv
- you can run
  snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile update_report
IV. Rerun certain step
- You can rerun a certain step in the pipeline or reproduce a certain file (e.g. if it was corrupted)
- find the rule corresponding to the processing step in question
- check the output location for the output files
- if the output files still exist, delete or move them (rerunning only works if the file doesn't already exist)
- get the path to the desired output file and call snakemake with it (see also the sketch at the end of this section), e.g.:
  snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile 16S_Metabarcoding/02_analysis_results/03_kmer_mapping/20191007_1559_MN31344_FAK76605_2bf006ff/barcode06/silva_Species/krona.html
- you can list multiple files to rerun all of them
- Note! Rerunning an intermediate step does not automatically update all downstream dependencies
- if this is desired, proceed as above, by deleting or moving the file in question
- then call the pipeline as normal
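A sketch of this rerun procedure, reusing the krona.html example from above (the .bak suffix is just one way to move the old file aside):
  ## example target file from above
  target="16S_Metabarcoding/02_analysis_results/03_kmer_mapping/20191007_1559_MN31344_FAK76605_2bf006ff/barcode06/silva_Species/krona.html"
  ## move the existing file aside (rerunning only works if the file doesn't already exist)
  mv "$target" "${target}.bak"
  ## request just this file
  snakemake -pr --use-conda --cores 'all' --configfile METADATA/PIPELINE_CONFIG.yaml --snakefile Pipeline/MeBaPiNa/Snakefile "$target"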
== After running the pipeline ==
- Check the snakemake log file (in .snakemake/log/) for any errors (e.g. with the sketch below)
- Repeat a dry run to check that only the "all_target" rule is left (this rule will be executed every time)
- check the results inside the 03_report directory
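A quick way to scan the most recent snakemake log for problems (a sketch; adjust the search pattern as needed):
  ## print suspicious lines from the newest log file
  grep -iE "error|exception|fail" "$(ls -t .snakemake/log/*.log | head -n 1)"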
The following list gives an overview of the files inside the 16S_Metabarcoding directory. The list does not include temporary files or log files. Please have a look at the Documentation/ThesisManuscript/Figures/DAG.pdf for an overview of the rule dependencies.
== METADATA ==
./METADATA/
- directory containing important metadata files and reference sequences
PIPELINE_CONFIG.yaml
- configurations of the pipeline
- see above
Reference_Sequences/
- directory containing reference sequences
primers/
- primers used for the extraction of the amplicon region from the references
{refsource}/
- directory containing the reference database files and subdirectories for the methodology specific files
reference.fasta
- reference sequences trimmed to the amplicon region, filtered by length, and with duplicates (100% identity) removed
reference.mmi
- reference file for alignments with minimap2
kraken2/
- kraken2 and bracken specific reference database files
qiime/
- qiime specific reference database files
- these files might be incomplete due to an unsuccessful construction
krona/
- krona specific reference database files
lambda/
- lambda phage reference sequence
zymobiomics/
- reference files containing only species present in the zymobiomics mock community
== Pipeline ==
./Pipeline/
- directory containing software and scripts associated with the pipeline
MeBaPiNa/
- directory containing files required to run the pipeline
Snakefile
- central file of the pipeline
- starts execution and sources other files
envs/
- conda environment files used in the pipeline
rules/
- definition of rules used in the pipeline
scripts/
- scripts used in the rules
- some additional scripts
== Raw data ==
./00_raw_data/
- directory containing raw sequencing files of all runs
- priority:
- these files are of highest priority and cannot be replaced
{run}/
- run specific sub-directory
fast5/
- directory containing raw fast5 sequencing files
reports/
- optional directory with run reports as PDFs
- created manually in MinKNOW (feature automation was promised by ONT)
== Intermediate processing files ==
./01_processed_data/
- directory containing intermediate processing files of all steps and runs
- priority:
- these files are produced during key steps in the analysis processes
- they can be replaced if the files from ./00_raw_data/ are still present, but this might require a considerable amount of recomputation
- they are required in case some steps should be rerun (e.g. with new parameters)
01_basecalling/
- directory containing the basecalled raw-reads of all runs
{run}/
- run specific sub-directory
guppy_basecaller_logs/
- logs from the basecaller
sequencing_summary.txt
- table with information about all basecalled reads
pass/
- directory containing basecalled fastq files of all barcodes
{barcode}/
- sub-directories with fastq files assigned to the barcode
unclassified/
- sub-directory with fastq files without a detected barcode (contains all files if the library wasn't multiplexed)
sequencing_summary/
- same information as in sequencing_summary.txt
sequencing_summary_sorted.txt
- but sorted by barcode column and start time
split/
- and as separate files per barcode (can be used to extract raw reads per barcode)
02_trimming_filtering/
- directory containing the trimmed and filtered basecalled-reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
trimmed.fastq
- basecalled reads after trimming and a second demultiplexing
other/
- reads assigned to other barcodes during the second demultiplexing
- these reads are not used for anything and could be deleted
filtered.fastq
- length and quality filtered reads
03_alignment/
- directory containing the alignment of trimmed and filtered reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
{refsource}/
- reference database specific sub-directory
filteredsorted.bam
- filtered alignment of the trimmed and filtered reads
filteredsorted.bam.bai
- index of the bam file
03_kmer_mapping/
- directory containing the k-mer mapping of trimmed and filtered reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
{refsource}_{refrank}/
- reference database and rank specific sub-directory
filtered.kraken2
- per read classification information
filtered.kreport2
- per taxa classification information
Species.bracken
- per taxa reestimation information
Species.kreport2
- per taxa classification information after reestimation
03_otu_picking/
- directory containing the clustered OTUs and the taxonomic assignments of trimmed and filtered reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
{refsource}/
- reference database specific sub-directory
cluster_centseq.qza
- center sequence per OTU of the trimmed and filtered reads
cluster_ftable.qza
- table with counts per OTU of the trimmed and filtered reads
cluster_newrefseq.qza
- new reference sequences
- can be used in subsequent runs of open-reference clustering for consistent definitions of features across open-reference feature tables
filt_ftable.qza
- table with counts per OTU after chimera removal and minimum coverage filtering
filt_centseq.qza
- center sequence per OTU after chimera removal and minimum coverage filtering
{other directories}
- storage location for temporary files during conversion
{refsource}_{refrank}/
- reference database and rank specific sub-directory
filtered.kraken2
- per OTU classification information
filtered.kreport2
- per taxa classification information
== Result files ==
./02_analysis_results/
- directory containing result files of all steps and runs
- priority:
- these files are of low priority
- they can easily be replaced if the files from ./01_processed_data/ are still present
- important files are copied to ./03_report/
01_basecalling/
- directory containing plots and statistics of the basecalled raw-reads of all runs
{run}/
- run specific sub-directory
fastqc/
- read QC: all passed reads
nanocomp/
- barcode QC: per barcode
nanoplot/
- general QC: all reads, including calibration strands
nanoqc/
- per base QC: all reads
pycoqc/
- general QC: all reads
02_trimming_filtering/
- directory containing plots and statistics of the trimmed and filtered basecalled-reads of all runs
{run}/
- run specific sub-directory
fastqc/
- read QC: trimmed and filtered barcoded reads
nanocomp/
- barcode QC: trimmed and filtered barcoded reads
nanoplot/
- general QC: trimmed and filtered barcoded reads
nanoqc/
- per base QC: trimmed and filtered barcoded reads
03_alignment/
- directory containing plots and statistics of the alignment of trimmed and filtered reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
{refsource}/
- reference database specific sub-directory
covdist.pdf
covpos.pdf
- coverage of the reference sequences
- covdist.pdf seems broken when executed on the VM
pycoqc.html
pycoqc.json
- alignment QC
{refsource}_{refrank}/
- reference database and rank specific sub-directory
aligned.counttaxlist
- taxonomic classification
krona.html
- visualization of the taxonomic classification
03_kmer_mapping/
- directory containing plots and statistics of the k-mer mapping of trimmed and filtered reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
{refsource}_{refrank}/
- reference database and rank specific sub-directory
kmer.counttaxlist
- taxonomic classification
krona.html
- visualization of the taxonomic classification
krona_bracken.html
- visualization of the taxonomic classification after reestimation
03_otu_picking/
- directory containing plots and statistics of the clustered OTUs and the taxonomic assignments of trimmed and filtered reads of all runs
{run}/
- run specific sub-directory
{barcode}/
- barcode specific sub-directory
{refsource}/
- reference database specific sub-directory
q2otupick/
- clustered reads overview
q2filter/
- clustered reads overview after filtering
{refsource}_{refrank}/
- reference database and rank specific sub-directory
kmer.counttaxlist
- taxonomic classification
krona.html
- visualization of the taxonomic classification
== Report per sample ==
./03_report/
- directory containing result files, plots and statistics per timepoint and sample
- priority:
- the files in this directory are of high priority
- they can be recreated if the files from ./01_processed_data/ and ./02_analysis_results/ are still present
- they represent the final output of the pipeline
Reference_Sequences/
- directory with plots and statistics of the reference database
{refsource}/
- reference database specific sub-directory
reference_lengthdist.pdf
reference_lengthdist.tsv
- length of reference sequences
reference_taxaranks.tsv
- distribution of reference taxa ranks
{timepoint}/
- timepoint specific folder or non-PROMISE
{sample}/
- sample specific folder
{run}-{barcode}/
- sample per run specific folder
- in case the same sample is sequenced in multiple runs
read_base_counts.tsv
- statistics of reads and bases of raw, basecalled, trimmed and filtered reads
03_alignment-{refsource}-alignment_rates.tsv
- alignment statistics and error rates
03_otu_picking-{refsource}-feature_counts.tsv
- number of clusters and reads before and after filtering
03_alignment-{refsource}_{refrank}-taxa_counts.tsv
- statistics of classified taxa and reads
03_kmer_mapping-{refsource}_{refrank}-taxa_counts.tsv
- statistics of classified taxa and reads
03_kmer_mapping-{refsource}_{refrank}-retaxa_counts.tsv
- statistics of classified taxa and reads after reestimation
03_otu_picking-{refsource}_{refrank}-taxa_counts.tsv
- statistics of classified taxa and reads
03_alignment-{refsource}_{refrank}-taxa_covdist.pdf
- distribution of taxa abundance
03_alignment-{refsource}_{refrank}-taxa_diversity.tsv
- diversity, richness and evenness of the community
03_kmer_mapping-{refsource}_{refrank}-taxa_covdist.pdf
- distribution of taxa abundance
03_kmer_mapping-{refsource}_{refrank}-taxa_diversity.tsv
- diversity, richness and evenness of the community
03_otu_picking-{refsource}_{refrank}-taxa_covdist.pdf
- distribution of taxa abundance
03_otu_picking-{refsource}_{refrank}-taxa_diversity.tsv
- diversity, richness and evenness of the community
{other files}
- copies of the similarly named files in the ./02_analysis_results/ directory