
ILL HUMAnN custom pipeline User Manual




Requirements

All pipelines are self-contained. The only requirement is Apptainer. The Apptainer executable "singularity" should be available in your PATH.

Note: On an interactive node, add module load StdEnv/2020 apptainer/1.1.5 to your ~/.bashrc file


Installation

ILL pipelines is already installed on ip34. Please include the following commands in your ~/.bashrc:

module load StdEnv/2020 apptainer
export ILL_PIPELINES=/home/def-ilafores/programs/ILL_pipelines

To load your new bashrc definitions, log out and log back in to the server.

To install ILL pipelines you need to:

  • Install Apptainer and make sure singularity executable is in your PATH

  • Create a clone of the repository:

    git clone https://github.com/jflucier/ILL_pipelines.git

    Note: Cloning the repository requires Git to be installed.

  • For convenience, set environment variable ILL_PIPELINES in your ~/.bashrc:

    export ILL_PIPELINES=/path/to/ILL_pipelines

  • Go to $ILL_PIPELINES/containers and run these commands:

cd $ILL_PIPELINES/containers
sh build_all.sh


How to run

To run the pipelines you need to create a sample sheet (TSV) with 3 columns, like this table:

sample1 /path/to/sample1.R1.fastq /path/to/sample1.R2.fastq
sample2 /path/to/sample2.R1.fastq /path/to/sample2.R2.fastq
etc... etc... etc...

Important note: TSV files must not have a header line.
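A sample sheet in this format can be written with a few lines of shell. The sample names and fastq paths below are hypothetical placeholders; substitute your own:

```shell
# Write a headerless, tab-separated sample sheet (paths are placeholders).
printf 'sample1\t/path/to/sample1.R1.fastq\t/path/to/sample1.R2.fastq\n' >  samples.tsv
printf 'sample2\t/path/to/sample2.R1.fastq\t/path/to/sample2.R2.fastq\n' >> samples.tsv
```

Double-check that the file uses real tab characters (not spaces) between columns.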

Preprocess kneaddata

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_preprocess.kneaddata.sh -h

Usage: generateslurm_preprocess.kneaddata.sh --sample_tsv /path/to/tsv --out /path/to/out [--db] [--trimmomatic_options "trim options"] [--bowtie2_options "bowtie2 options"]
Options:

	--sample_tsv STR	path to sample tsv (3 columns: sample name<tab>fastq1 path<tab>fastq2 path)
	--out STR	path to output dir
	--db	kneaddata database path (default /nfs3_ib/nfs-ip34/fast/def-ilafores/host_genomes/GRCh38_index/grch38_1kgmaj)
	--trimmomatic_options	options to pass to trimmomatic (default ILLUMINACLIP:/cvmfs/soft.mugqic/CentOS6/software/trimmomatic/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:30 MINLEN:100)
	--bowtie2_options	options to pass to bowtie2 (default --very-sensitive-local)

Slurm options:
	--slurm_alloc STR	slurm allocation (default def-ilafores)
	--slurm_log STR	slurm log file output directory (default to output_dir/logs)
	--slurm_email "[email protected]"	Slurm email setting
	--slurm_walltime STR	slurm requested walltime (default 24:00:00)
	--slurm_threads INT	slurm requested number of threads (default 24)
	--slurm_mem STR	slurm requested memory (default 30)

  -h --help	Display help

Most default values should be fine on ip34. Make sure you specify sample_tsv and the output path.

Here is how to generate the slurm script with default parameters:


$> bash $ILL_PIPELINES/generateslurm_preprocess.kneaddata.sh \
> --sample_tsv /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/data/testset-projet_PROVID19/saliva_samples/sample_provid19.saliva.test.tsv \
> --out /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/preprocess \
> --db /cvmfs/datahub.genap.ca/vhost34/def-ilafores/host_genomes/GRCh38_index/grch38_1kgmaj \
> --slurm_email "[email protected]" \
> --bowtie2_options "--very-sensitive" \
> --slurm_walltime "6:00:00"
## Will use sample file: /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/data/testset-projet_PROVID19/saliva_samples/sample_provid19.saliva.test.tsv
## Results wil be stored to this path: /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/preprocess
## Will output logs in: /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/preprocess/logs
outputting preprocess slurm script to /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/preprocess/preprocess.kneaddata.slurm.sh
Generate preprocessed reads sample tsv: /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/preprocess/preprocessed_reads.sample.tsv
To submit to slurm, execute the following command:
sbatch --array=1-5 /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/preprocess/preprocess.kneaddata.slurm.sh

Notice that the preprocess script generates a sample TSV file (i.e. preprocess/preprocessed_reads.sample.tsv) that should be used for the taxonomic profile and functional profile pipelines.

Finally, the preprocess script can be executed on a single sample. Use the -h option to view usage:


$ bash $ILL_PIPELINES/scripts/preprocess.kneaddata.sh -h

Usage: preprocess.kneaddata.sh -s sample_name -o /path/to/out [--db] [--trimmomatic_options "trim options"] [--bowtie2_options "bowtie2 options"]
Options:

	-s STR	sample name
	-o STR	path to output dir
	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-m	memory (default 40G)
	-fq1	path to fastq1
	-fq2	path to fastq2
	--db	kneaddata database path (default /nfs3_ib/nfs-ip34/fast/def-ilafores/host_genomes/GRCh38_index/grch38_1kgmaj)
	--trimmomatic_options	options to pass to trimmomatic (default ILLUMINACLIP:/cvmfs/soft.mugqic/CentOS6/software/trimmomatic/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:30 MINLEN:100)
	--bowtie2_options	options to pass to bowtie2 (default --very-sensitive-local)

  -h --help	Display help
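As a sketch, a single-sample run might look like the following, assuming ILL_PIPELINES is set as described in the Installation section; the sample name and fastq paths are hypothetical placeholders:

```shell
# Hypothetical single-sample invocation; adjust paths to your data.
SAMPLE=sample1
OUT=preprocess_single

bash "$ILL_PIPELINES/scripts/preprocess.kneaddata.sh" \
    -s "$SAMPLE" -o "$OUT" \
    -t 8 -m 40G \
    -fq1 /path/to/sample1.R1.fastq \
    -fq2 /path/to/sample1.R2.fastq
```

Omitted flags (--db, --trimmomatic_options, --bowtie2_options) fall back to the defaults listed in the help above.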


Sourmash taxonomic abundance per sample

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_taxonomic_abundance.sourmash.sh  -h

Usage: generateslurm_taxonomic_abundance.sourmash.sh --sample_tsv /path/to/tsv --out /path/to/out [--SM_db /path/to/sourmash/db] [--SM_db_prefix sourmash_db_prefix] [--kmer kmer_size]
Options:

   --sample_tsv STR     path to sample tsv (3 columns: sample name<tab>fastq1 path<tab>fastq2 path)
   --out STR    path to output dir
   --SM_db sourmash databases directory path (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/sourmash_db/)
   --SM_db_prefix  sourmash database prefix, allowing wildcards (default genbank-2022.03)
   --kmer  choice of k-mer, dependent on database choices (default 51, make sure to have them available)

Slurm options:
   --slurm_alloc STR    slurm allocation (default def-ilafores)
   --slurm_log STR      slurm log file output directory (default to output_dir/logs)
   --slurm_email "[email protected]"       Slurm email setting
   --slurm_walltime STR slurm requested walltime (default 24:00:00)
   --slurm_threads INT  slurm requested number of threads (default 12)
   --slurm_mem STR      slurm requested memory (default 62G)

   -h --help    Display help

Notice that the preprocess script generates the sample TSV file needed here (i.e. preprocess/preprocessed_reads.sample.tsv).
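For example, generating the sourmash slurm script from the preprocessed sample sheet might look like this (output paths are hypothetical; ILL_PIPELINES is assumed to be set per the Installation section):

```shell
# Hypothetical paths; the sample TSV comes from the preprocess step.
SAMPLE_TSV=preprocess/preprocessed_reads.sample.tsv
OUT=taxonomy/sourmash

bash "$ILL_PIPELINES/generateslurm_taxonomic_abundance.sourmash.sh" \
    --sample_tsv "$SAMPLE_TSV" \
    --out "$OUT" \
    --kmer 51
```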

The sourmash taxonomic abundance script can also be executed on a single sample. Use the -h option to view usage:


$ bash $ILL_PIPELINES/scripts/taxonomic_abundance.sourmash.sh -h

Usage: taxonomic_abundance.sourmash.sh -s sample_name -o /path/to/out [-t threads] -fq1 /path/to/fastq1 -fq2 /path/to/fastq2 [--SM_db /path/to/sourmash/db] [--SM_db_prefix sourmash_db_prefix] [--kmer kmer_size]
Options:

        -s STR  sample name
        -o STR  path to output dir
        -tmp STR        path to temp dir (default output_dir/temp)
        -t      # of threads (default 8)
        -fq1    path to fastq1
        -fq2    path to fastq2
        --SM_db sourmash databases directory path (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/sourmash_db/)
        --SM_db_prefix  sourmash database prefix, allowing wildcards (default genbank-2022.03)
        --kmer  choice of k-mer size, dependent on available databases (default 51, make sure database is available)

  -h --help     Display help



Metaphlan taxonomic abundance

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_taxonomic_abundance.metaphlan.sh -h

Usage: generateslurm_taxonomic_abundance.metaphlan.sh --sample_tsv /path/to/tsv --out /path/to/out [--db /path/to/metaphlan/db]
Options:

   --sample_tsv STR     path to sample tsv (3 columns: sample name<tab>fastq1 path<tab>fastq2 path)
   --out STR    path to output dir
   --db   metaphlan db path (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/metaphlan4_db/mpa_vOct22_CHOCOPhlAnSGB_202212)

Slurm options:
   --slurm_alloc STR    slurm allocation (default def-ilafores)
   --slurm_log STR      slurm log file output directory (default to output_dir/logs)
   --slurm_email "[email protected]"       Slurm email setting
   --slurm_walltime STR slurm requested walltime (default 24:00:00)
   --slurm_threads INT  slurm requested number of threads (default 12)
   --slurm_mem STR      slurm requested memory (default 25G)

   -h --help    Display help

Notice that the preprocess script generates the sample TSV file needed here (i.e. preprocess/preprocessed_reads.sample.tsv).
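A typical invocation might look like this, with the default MetaPhlAn database (output path is a hypothetical placeholder; ILL_PIPELINES is assumed to be set):

```shell
# Hypothetical paths; --db falls back to the default metaphlan db.
SAMPLE_TSV=preprocess/preprocessed_reads.sample.tsv
OUT=taxonomy/metaphlan

bash "$ILL_PIPELINES/generateslurm_taxonomic_abundance.metaphlan.sh" \
    --sample_tsv "$SAMPLE_TSV" \
    --out "$OUT"
```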

The metaphlan taxonomic abundance script can also be executed on a single sample. Use the -h option to view usage:

$ bash $ILL_PIPELINES/scripts/taxonomic_abundance.metaphlan.sh -h

Usage: taxonomic_abundance.metaphlan.sh -s sample_name -o /path/to/out [-db /path/to/metaphlan/db] -fq1 /path/to/fastq1 -fq2 /path/to/fastq2 [-fq1_single /path/to/single1.fastq] [-fq2_single /path/to/single2.fastq]
Options:

        -s STR  sample name
        -o STR  path to output dir
        -tmp STR        path to temp dir (default output_dir/temp)
        -t      # of threads (default 8)
        -fq1    path to fastq1
        -fq1_single     path to fastq1 unpaired reads
        -fq2    path to fastq2
        -fq2_single     path to fastq2 unpaired reads
        -db     metaphlan db path (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/metaphlan4_db/mpa_vOct22_CHOCOPhlAnSGB_202212)

  -h --help     Display help

Once MetaPhlAn has run on all samples, you can merge the result tables by running the following script:

$ bash $ILL_PIPELINES/scripts/taxonomic_abundance.metaphlan.all.sh -h

Usage: taxonomic_abundance.metaphlan.all.sh -profiles /path/to/metaphlan_out/*_profile.txt -o /path/to/out
Options:

        -profiles Path to metaphlan outputs (i.e. /path/to/metaphlan_out/*_profile.txt)
        -o STR  path to output dir
        -tmp STR        path to temp dir (default output_dir/temp)

  -h --help     Display help
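For instance, merging all per-sample profiles could be sketched as follows (the profile glob and output directory are hypothetical placeholders matching the per-sample output naming shown in the help):

```shell
# Hypothetical locations; the glob matches per-sample metaphlan outputs.
PROFILES='taxonomy/metaphlan/*_profile.txt'
OUT=taxonomy/metaphlan_merged

# Left unquoted so the shell expands the glob into the profile list.
bash "$ILL_PIPELINES/scripts/taxonomic_abundance.metaphlan.all.sh" \
    -profiles $PROFILES \
    -o "$OUT"
```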

Kraken2 taxonomic profile per sample

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_taxonomic_profile.sample.sh -h

Usage: generateslurm_taxonomic_profile.sample.sh --sample_tsv /path/to/tsv --out /path/to/out [--kraken_db "kraken database"]
Options:

	--sample_tsv STR	path to sample tsv (3 columns: sample name<tab>fastq1 path<tab>fastq2 path)
	--out STR	path to output dir
	--kraken_db	kraken2 database path (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/kraken2_dbs/k2_pluspfp_16gb_20210517)
	--bracken_readlen	bracken read length option (default 150)

Slurm options:
	--slurm_alloc STR	slurm allocation (default def-ilafores)
	--slurm_log STR	slurm log file output directory (default to output_dir/logs)
	--slurm_email "[email protected]"	Slurm email setting
	--slurm_walltime STR	slurm requested walltime (default 6:00:00)
	--slurm_threads INT	slurm requested number of threads (default 24)
	--slurm_mem STR	slurm requested memory (default 125)

  -h --help	Display help

Notice that the preprocess script generates the sample TSV file needed here (i.e. preprocess/preprocessed_reads.sample.tsv).
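A minimal sketch of generating the kraken2 slurm script, assuming ILL_PIPELINES is set and using the default kraken2 database (output path is hypothetical):

```shell
# Hypothetical paths; --kraken_db falls back to the default database.
SAMPLE_TSV=preprocess/preprocessed_reads.sample.tsv
OUT=taxonomy/kraken2

bash "$ILL_PIPELINES/generateslurm_taxonomic_profile.sample.sh" \
    --sample_tsv "$SAMPLE_TSV" \
    --out "$OUT" \
    --bracken_readlen 150
```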

The taxonomic profile script can also be executed on a single sample. Use the -h option to view usage:


$ bash $ILL_PIPELINES/scripts/taxonomic_profile.sample.sh -h

Usage: taxonomic_profile.sample.sh [--kraken_db /path/to/krakendb] [--bracken_readlen int] [--confidence float] [-t thread_nbr] [-m mem_in_G] -fq1 /path/fastq1 -fq2 /path/fastq2 -o /path/to/out
Options:

	-s STR	sample name
	-o STR	path to output dir
	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-m	memory (default 40G)
	-fq1	path to fastq1
	-fq2	path to fastq2
	--kraken_db	kraken2 database path (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/kraken2_dbs/k2_pluspfp_16gb_20210517)
	--bracken_readlen	bracken read length option (default 150)
    --confidence	kraken confidence level to reduce false-positive rate (default 0.05)
    
  -h --help	Display help

Generate HUMAnN bugs list

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_taxonomic_profile.allsamples.sh -h

Usage: generateslurm_taxonomic_profile.allsamples.sh [--chocophlan_db /path/to/chocophlan_db ] --kreports '/path/to/*_kraken_report_regex' --out /path/to/out --bowtie_index_name idx_name
Options:

        --kreports STR	base path regex to retrieve species level kraken reports (i.e.: '/path/to/taxonomic_profile/*/*_bracken/*_bracken_S.kreport'). Must be specified between single quotes. See usage example or github documentation.
        --out STR       path to output dir
        --bowtie_index_name  name of the bowtie index that will be generated
        --chocophlan_db path to the full chocoplan db (default: /nfs3_ib/nfs-ip34/fast/def-ilafores/humann_dbs/chocophlan)

Slurm options:
        --slurm_alloc STR       slurm allocation (default def-ilafores)
        --slurm_log STR slurm log file output directory (default to output_dir/logs)
        --slurm_email "[email protected]"  Slurm email setting
        --slurm_walltime STR    slurm requested walltime (default 24:00:00)
        --slurm_threads INT     slurm requested number of threads (default 48)
        --slurm_mem STR slurm requested memory (default 251G)

  -h --help     Display help


The kreports parameter is a regular expression that points to all kraken reports generated at the species level. The analysis first creates the bug list, then builds the bowtie index on the bug list, and finishes by generating the taxonomic table for each taxonomic level.
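The single quotes matter: they keep the shell from expanding the pattern before the script receives it. You can preview which reports a pattern matches by running the unquoted glob through ls first. The sketch below builds dummy report files purely to illustrate the expansion:

```shell
# Create dummy report files to illustrate how the kreports pattern expands.
mkdir -p taxonomic_profile/GQ1/GQ1_bracken taxonomic_profile/GQ2/GQ2_bracken
touch taxonomic_profile/GQ1/GQ1_bracken/GQ1_bracken_S.kreport
touch taxonomic_profile/GQ2/GQ2_bracken/GQ2_bracken_S.kreport

# Unquoted: the shell expands the glob so you can inspect the match list.
ls taxonomic_profile/*/*_bracken/*_bracken_S.kreport
```

Once the listing looks right, pass the same pattern, single-quoted, as --kreports.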

This generated script can also be run locally on ip34 as follows:

bash /path/to/out/taxonomic_profile.allsamples.slurm.sh

Or you can call the script directly:

$ bash $ILL_PIPELINES/scripts/taxonomic_profile.allsamples.sh -h

Usage: taxonomic_profile.allsample.sh --kreports '/path/to/*_kraken_report_regex' --out /path/to/out --bowtie_index_name idx_name
Options:

	--kreports STR	base path regex to retrieve species level kraken reports (i.e.: "$PWD"/taxonomic_profile/*/*_bracken/*_bracken_S.kreport). Must be specified between single quotes. See usage example or github documentation.
	--out STR	path to output dir
	--tmp STR	path to temp dir (default output_dir/temp)
	--threads	# of threads (default 8)
	--bowtie_index_name  name of the bowtie index that will be generated
	--chocophlan_db	path to the full chocoplan db (default: /nfs3_ib/nfs-ip34/fast/def-ilafores/humann_dbs/chocophlan)

  -h --help	Display help

HUMAnN functional profile

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_functionnal_profile.humann.sh -h

Usage: generateslurm_functionnal_profile.humann.sh --sample_tsv /path/to/tsv --out /path/to/out --nt_db "nt database path" [--search_mode "search mode"] [--prot_db "protein database path"]
Options:

  --sample_tsv STR      path to sample tsv (5 columns: sample name<tab>fastq1 path<tab>fastq2 path<tab>fastq1 single path<tab>fastq2 single path). Generated in preprocess step.
        --out STR       path to output dir
        --search_mode   Search mode. Possible values are: dual, nt, prot (default prot)
        --nt_db the nucleotide database to use (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/humann_dbs/chocophlan)
        --prot_db       the protein database to use (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/humann_dbs/uniref)
        --utility_map_db        the utility mapping database to use (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/humann_dbs/utility_mapping)

Slurm options:
        --slurm_alloc STR       slurm allocation (default def-ilafores)
        --slurm_log STR slurm log file output directory (default to output_dir/logs)
        --slurm_email "[email protected]"  Slurm email setting
        --slurm_walltime STR    slurm requested walltime (default 24:00:00)
        --slurm_threads INT     slurm requested number of threads (default 24)
        --slurm_mem STR slurm requested memory (default 30G)

  -h --help     Display help



The sample_tsv used here was created in the preprocess step (i.e. preprocess/preprocessed_reads.sample.tsv).
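Generating the HUMAnN slurm script might then look like this (output path is a hypothetical placeholder; ILL_PIPELINES is assumed to be set, and the default databases are used):

```shell
# Hypothetical paths; the 5-column sample TSV comes from the preprocess step.
SAMPLE_TSV=preprocess/preprocessed_reads.sample.tsv
OUT=functional/humann

bash "$ILL_PIPELINES/generateslurm_functionnal_profile.humann.sh" \
    --sample_tsv "$SAMPLE_TSV" \
    --out "$OUT" \
    --search_mode prot
```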

The functional profile script can also be executed on a single sample. Use the -h option to view usage:


$ bash $ILL_PIPELINES/scripts/functionnal_profile.humann.sh -h

Usage: functionnal_profile.humann.sh -s sample_name -o /path/to/out --nt_db "nt database path" [--search_mode "search mode"] [--prot_db "protein database path"]
Options:

        -s STR  sample name
        -o STR  path to output dir
        -tmp STR        path to temp dir (default output_dir/temp)
        -t      # of threads (default 8)
        -fq1    path to fastq1
        -fq1_single     path to fastq1 unpaired reads
        -fq2    path to fastq2
        -fq2_single     path to fastq2 unpaired reads
        --search_mode   Search mode. Possible values are: dual, nt, prot (default prot)
        --nt_db the nucleotide database to use (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/humann_dbs/chocophlan)
        --prot_db       the protein database to use (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/humann_dbs/uniref)
        --utility_map_db        the utility mapping database to use (default /cvmfs/datahub.genap.ca/vhost34/def-ilafores/humann_dbs/utility_mapping)

  -h --help     Display help


MetaWRAP assembly binning and bin refinement

Before running this pipeline, make sure the singularity and BBMap executables are in your PATH. On ip29, just do the following:

module load singularity mugqic/BBMap/38.90

For full list of options:


$ bash ${ILL_PIPELINES}/generateslurm_assembly_bin_refinement.metawrap.sh -h

Usage: generateslurm_denovo_assembly_bin_refinement.metawrap.sh --sample_tsv /path/to/tsv --out /path/to/out [--assembly] [--binning] [--refinement]
Options:

    --sample_tsv STR	path to sample tsv (3 columns: sample name<tab>fastq1 path<tab>fastq2 path)
	--out STR	path to output dir
	--assembly	perform assembly
	--binning	perform binning step
	--refinement	perform refinement step

Metawrap options:
	--metaspades	use metaspades for assembly (default: true)
	--megahit	use megahit for assembly (default: true)
	--metabat2	use metabat2 for binning (default: true)
	--maxbin2	use maxbin2 for binning (default: true)
	--concoct	use concoct for binning (default: true)
	--run-checkm	run checkm for binning (default: true)
	--refinement_min_compl INT	refinement bin minimum completion percent (default 50)
	--refinement_max_cont INT	refinement bin maximum contamination percent (default 10)

Slurm options:
	--slurm_alloc STR	slurm allocation (default def-ilafores)
	--slurm_log STR	slurm log file output directory (default to output_dir/logs)
	--slurm_email "[email protected]"	Slurm email setting
	--slurm_walltime STR	slurm requested walltime (default 24:00:00)
	--slurm_threads INT	slurm requested number of threads (default 48)
	--slurm_mem STR	slurm requested memory (default 251G)

  -h --help	Display help


Most default values should be fine in a cluster environment. Make sure you specify sample_tsv, the output path, and the steps you wish to execute (assembly and/or binning and/or refinement). Note that the assembly step must be completed before binning.

The sample_tsv used here was created in the preprocess step (i.e. preprocess/preprocessed_reads.sample.tsv).

Here are some example commands for this pipeline:

# on ip29, load singularity and bbmap in path
module load singularity mugqic/BBMap/38.90

# Run all steps with default parameters
$ bash ${ILL_PIPELINES}/generateslurm_assembly_bin_refinement.metawrap.sh \
--out path/to/out --sample_tsv /path/to/tsv

# Run only the assembly step
$ bash ${ILL_PIPELINES}/generateslurm_assembly_bin_refinement.metawrap.sh \
--out path/to/out --sample_tsv /path/to/tsv \
--assembly

# Run only the assembly step using only megahit assembler
$ bash ${ILL_PIPELINES}/generateslurm_assembly_bin_refinement.metawrap.sh \
--out path/to/out --sample_tsv /path/to/tsv \
--assembly --megahit

# Run the assembly step using only megahit assembler and binning step with
# concoct and maxbin binner software
$ bash ${ILL_PIPELINES}/generateslurm_assembly_bin_refinement.metawrap.sh \
--out path/to/out --sample_tsv /path/to/tsv \
--assembly --megahit \
--binning --maxbin2 --concoct

# Run assembly and binning with default parameters and the refinement step using
# specific bin completion and contamination values
$ bash ${ILL_PIPELINES}/generateslurm_assembly_bin_refinement.metawrap.sh \
--out path/to/out --sample_tsv /path/to/tsv \
--assembly --binning \
--refinement --refinement_min_compl 90 --refinement_max_cont 5

Finally, the assembly, binning and refinement scripts can be executed on a single sample. Use the -h option to view usage:

# on ip29, load singularity and bbmap in path
module load singularity mugqic/BBMap/38.90


## assembly script usage:
$ bash $ILL_PIPELINES/scripts/assembly.metawrap.sh -h

Usage: assembly.metawrap.sh [-tmp /path/tmp] [-t threads] [-m memory] [--metaspades] [--megahit] -s sample_name -o /path/to/out -fq1 /path/to/fastq1 -fq2 /path/to/fastq2
Options:

	-s STR	sample name
	-o STR	path to output dir
	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-m	memory (default 40G)
	-fq1	path to fastq1
	-fq2	path to fastq2
	--metaspades	use metaspades for assembly (default: true)
	--megahit	use megahit for assembly (default: true)

  -h --help	Display help

## Binning script usage:
$ bash $ILL_PIPELINES/scripts/binning.metawrap.sh -h

Usage: binning.metawrap.sh [-tmp /path/tmp] [-t threads] [-m memory] [--metabat2] [--maxbin2] [--concoct] [--run-checkm] -s sample_name -o /path/to/out -a /path/to/assembly -fq1 /path/to/fastq1 -fq2 /path/to/fastq2
Options:

	-s STR	sample name
	-o STR	path to output dir
	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-m	memory (default 40G)
	-a	assembly fasta filepath
	-fq1	path to fastq1
	-fq2	path to fastq2
	--metabat2	use metabat2 for binning (default: true)
	--maxbin2	use maxbin2 for binning (default: true)
	--concoct	use concoct for binning (default: true)
	--run-checkm	run checkm on bins (default: true)

  -h --help	Display help


## Binning refinement script usage:
$ $ILL_PIPELINES/scripts/bin_refinement.metawrap.sh -h

Usage: bin_refinement.metawrap.sh [-tmp /path/tmp] [-t threads] [-m memory] --metabat2_bins /path/to/bins --maxbin2_bins /path/to/bins --concoct_bins /path/to/bins -s sample_name -o /path/to/out
Options:

	-s STR	sample name
	-o STR	path to output dir
	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-m	memory (default 40G)
	--metabat2_bins	path to metabat2 bin directory
	--maxbin2_bins	path to maxbin2 bin directory
	--concoct_bins	path to concoct bin directory
	--refinement_min_compl INT	refinement bin minimum completion percent (default 50)
	--refinement_max_cont INT	refinement bin maximum contamination percent (default 10)

  -h --help	Display help

Bin dereplication

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_dereplicate_bins.sh -h 
Usage: generateslurm_dereplicate_bins.sh [slurm options] [-a {fastANI,ANIn,gANI,ANImf,goANI}] [-p_ani value] [-s_ani value] [-cov value] [-comp value] [-con value] -bin_path_regex '/path/regex/to/*_genome_bins_path_regex' -o /path/to/out 
Options:

	-bin_path_regex	A regex path to bins, i.e. /path/to/bin/*/*.fa. Must be specified between single quotes. See usage example or github documentation.
	-o STR	path to output dir
	-a	algorithm {fastANI,ANIn,gANI,ANImf,goANI} (default: ANImf). See dRep documentation for more information.
	-p_ani	ANI threshold to form primary (MASH) clusters (default: 0.95)
	-s_ani	ANI threshold to form secondary clusters (default: 0.99)
	-cov	Minimum level of overlap between genomes when doing secondary comparisons (default: 0.1)
	-comp	Minimum genome completeness (default: 50)
	-con	Maximum genome contamination (default: 5)

Slurm options:
	--slurm_alloc STR	slurm allocation (default def-ilafores)
	--slurm_log STR	slurm log file output directory (default to output_dir/logs)
	--slurm_email "[email protected]"	Slurm email setting
	--slurm_walltime STR	slurm requested walltime (default 24:00:00)
	--slurm_threads INT	slurm requested number of threads (default 48)
	--slurm_mem STR	slurm requested memory (default 251G)

  -h --help	Display help

The bins to be dereplicated must be specified using the bin_path_regex parameter. All fasta files matched by the regex will be included in the analysis. Make sure you use a full-path regex. For example, a bin regex like "/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ*/metawrap_30_25_bins/*.fa" will use the following fasta files:

$ ls /nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ*/metawrap_30_25_bins/*.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ10/metawrap_30_25_bins/GQ10.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ10/metawrap_30_25_bins/GQ10.bin.2.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ13/metawrap_30_25_bins/GQ13.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ14/metawrap_30_25_bins/GQ14.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ14/metawrap_30_25_bins/GQ14.bin.2.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ15/metawrap_30_25_bins/GQ15.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ17b/metawrap_30_25_bins/GQ17b.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ17b/metawrap_30_25_bins/GQ17b.bin.2.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ18/metawrap_30_25_bins/GQ18.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ1/metawrap_30_25_bins/GQ1.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ20/metawrap_30_25_bins/GQ20.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ21/metawrap_30_25_bins/GQ21.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ22/metawrap_30_25_bins/GQ22.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ23/metawrap_30_25_bins/GQ23.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ24/metawrap_30_25_bins/GQ24.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ26/metawrap_30_25_bins/GQ26.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ29/metawrap_30_25_bins/GQ29.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ29/metawrap_30_25_bins/GQ29.bin.2.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ29/metawrap_30_25_bins/GQ29.bin.3.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ2/metawrap_30_25_bins/GQ2.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ2/metawrap_30_25_bins/GQ2.bin.2.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ3/metawrap_30_25_bins/GQ3.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ5/metawrap_30_25_bins/GQ5.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ5/metawrap_30_25_bins/GQ5.bin.2.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ6/metawrap_30_25_bins/GQ6.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ7/metawrap_30_25_bins/GQ7.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ8/metawrap_30_25_bins/GQ8.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ9/metawrap_30_25_bins/GQ9.bin.1.fa
/nfs3_ib/nfs-ip34/home/def-ilafores/analysis/20230216_metagenome_test/testset-projet_PROVID19-saliva/bin_refinement/GQ9/metawrap_30_25_bins/GQ9.bin.2.fa
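With the match list verified, the generator can be invoked as sketched below (the regex and output directory are hypothetical placeholders; ILL_PIPELINES is assumed to be set):

```shell
# Hypothetical invocation; keep the regex quoted so the shell does not
# expand it before the script receives it.
BIN_REGEX='/path/to/bin_refinement/GQ*/metawrap_30_25_bins/*.fa'
OUT=dereplication

bash "$ILL_PIPELINES/generateslurm_dereplicate_bins.sh" \
    -bin_path_regex "$BIN_REGEX" \
    -o "$OUT" \
    -comp 50 -con 5
```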

This generated script can also be run locally on ip34 as follows:

bash /path/to/out/submit_dRep.slurm.sh

Or you can call the script directly:


$ bash $ILL_PIPELINES/scripts/dereplicate_bins.dRep.sh -h

Usage: dereplicate_bins.dRep.sh [-tmp /path/tmp] [-t threads] -bin_path_regex '/path/regex/to/*_genome_bins_path_regex' -o /path/to/out [-a algorithm] [-p_ani value] [-s_ani value] [-cov value] [-comp value] [-con value]  
Options:

	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-bin_path_regex	A regex path to bins, i.e. /path/to/bin/*/*.fa. Must be specified between single quotes. See usage example or github documentation.
	-o STR	path to output dir
	-a	algorithm {fastANI,ANIn,gANI,ANImf,goANI} (default: ANImf). See dRep documentation for more information.
	-p_ani	ANI threshold to form primary (MASH) clusters (default: 0.95)
	-s_ani	ANI threshold to form secondary clusters (default: 0.99)
	-cov	Minimum level of overlap between genomes when doing secondary comparisons (default: 0.1)
	-comp	Minimum genome completeness (default: 50)
	-con	Maximum genome contamination (default: 5)

  -h --help	Display help

Bin annotation

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_annotate_bins.sh -h

Usage: generateslurm_annotate_bins.sh -drep /path/to/drep/genome --out /path/to/out --bowtie_index_name idx_name
Options:

	-o STR	path to output dir
	-drep STR	dereplicated genome path (drep output directory). See dereplicate_bins.dRep.sh for more information.
	-ma_db	MicrobeAnnotator DB path (default: /cvmfs/datahub.genap.ca/vhost34/def-ilafores/MicrobeAnnotator_DB).
	-gtdb_db	GTDBTK DB path (default: /cvmfs/datahub.genap.ca/vhost34/def-ilafores/GTDB/release207_v2).

Slurm options:
	--slurm_alloc STR	slurm allocation (default def-ilafores)
	--slurm_log STR	slurm log file output directory (default to output_dir/logs)
	--slurm_email "[email protected]"	Slurm email setting
	--slurm_walltime STR	slurm requested walltime (default 24:00:00)
	--slurm_threads INT	slurm requested number of threads (default 24)
	--slurm_mem STR	slurm requested memory (default 31G)

  -h --help	Display help
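A minimal sketch of generating the annotation script, assuming ILL_PIPELINES is set and using the default MicrobeAnnotator and GTDB databases (the dRep and output paths are hypothetical placeholders):

```shell
# Hypothetical paths; -drep points at the dRep dereplicated genomes directory.
DREP=dereplication/dereplicated_genomes
OUT=annotation

bash "$ILL_PIPELINES/generateslurm_annotate_bins.sh" \
    -drep "$DREP" \
    -o "$OUT"
```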

This generated script can also be run locally on ip34 as follows:

bash /path/to/out/submit_annotate.slurm.sh

Or you can call the script directly:


$ bash $ILL_PIPELINES/scripts/annotate_bins.sh -h

Usage: annotate_bins.sh [-tmp /path/tmp] [-t threads] [-ma_db /path/to/microannotatordb] [-gtdb_db /path/to/GTDB] -drep /path/to/drep/dereplicated_genomes -o /path/to/out 
Options:

	-tmp STR	path to temp dir (default output_dir/temp)
	-o STR	path to output dir
	-t	# of threads (default 24)
	-drep dereplicated genome path (drep output directory). See dereplicate_bins.dRep.sh for more information.
	-ma_db	MicrobeAnnotator DB path (default: /cvmfs/datahub.genap.ca/vhost34/def-ilafores/MicrobeAnnotator_DB).
	-gtdb_db	GTDB DB path (default: /cvmfs/datahub.genap.ca/vhost34/def-ilafores/GTDB/release207_v2).

  -h --help	Display help


Bin quantification

For full list of options:

$ bash $ILL_PIPELINES/generateslurm_quantify_bins.sh -h

Usage: generateslurm_quantify_bins.sh -sample_tsv /path/to/samplesheet -drep /path/to/drep/genome -o /path/to/out --bowtie_index_name idx_name
Options:

  -sample_tsv STR	path to sample tsv (5 columns: sample name<tab>fastq1 path<tab>fastq2 path<tab>fastq1 single path<tab>fastq2 single path). Generated in preprocess step.
	-drep STR	dereplicated genome path (drep output directory). See dereplicate_bins.dRep.sh for more information.
	-o STR	path to output dir

Slurm options:
	--slurm_alloc STR	slurm allocation (default def-ilafores)
	--slurm_log STR	slurm log file output directory (default to output_dir/logs)
	--slurm_email "[email protected]"	Slurm email setting
	--slurm_walltime STR	slurm requested walltime (default 24:00:00)
	--slurm_threads INT	slurm requested number of threads (default 24)
	--slurm_mem STR	slurm requested memory (default 31G)

  -h --help	Display help
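Generating the quantification script might look like the following sketch (all paths are hypothetical placeholders; ILL_PIPELINES is assumed to be set):

```shell
# Hypothetical paths; sample TSV from preprocess, genomes from dRep.
SAMPLE_TSV=preprocess/preprocessed_reads.sample.tsv
DREP=dereplication/dereplicated_genomes
OUT=quantification

bash "$ILL_PIPELINES/generateslurm_quantify_bins.sh" \
    -sample_tsv "$SAMPLE_TSV" \
    -drep "$DREP" \
    -o "$OUT"
```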


This generated script can also be run locally on ip34 as follows:

bash /path/to/out/submit_quantify.slurm.sh

Or you can call the script directly:


$ bash $ILL_PIPELINES/scripts/quantify_bins.salmon.sh -h

Usage: quantify_bins.salmon.sh [-tmp /path/tmp] [-t threads] -sample_tsv /path/to/samplesheet -drep /path/to/drep_output -o /path/to/out
Options:

	-tmp STR	path to temp dir (default output_dir/temp)
	-t	# of threads (default 8)
	-sample_tsv	A 3 column tsv of samples. Columns should be sample_name<tab>/path/to/fastq1<tab>/path/to/fastq2. No headers! HINT: preprocess step generates this file
	-drep STR	dereplicated genome path (drep output directory). See dereplicate_bins.dRep.sh for more information.
	-o STR	path to output dir

  -h --help	Display help