Skip to content

Tutorial

David Molik edited this page Dec 2, 2022 · 24 revisions

otb utilizes a nextflow directory structure in order to run the workflow. Optionally, you can run some singularity setup before running otb. In general though, your otb run will look like:

Setup

Clone down the otb directory:

git clone https://github.com/molikd/otb.git

Note: We recommend that you create a /data directory in the top-level otb directory to keep track of input files. These can be soft links.

There are two types of input you'll need to run otb, a bam file of your HiFi reads, and a left and right fastq/fasta of your Hi-C reads.

If $NXF_SINGULARITY_LIBRARYDIR is set in your .bashrc or more generally in your environment, otb will try to download to that directory. You can prefetch the containers with prefetch_containers.sh

Note: We recommend setting NXF_SINGULARITY_LIBRARYDIR in your bashrc (or similar CLI environment) so that otb singularity containers are not downloaded more than once.

Running otb

After cloning otb, and getting your data ready, it is time to start an otb.sh run, it's best to start with finding out what options there are:

./otb.sh --help

When you run otb.sh you will give it a number of command line flags to customize the run, the total flags are as follows:

  [NOTE] otb must be run from the directory it is in, only ./otb.sh will work
  [NOTE] run.nf must be in the same directory as otb.sh
  [NOTE] ./config holds grid config files
  [NOTE] ./scr holds scripts on which otb relies

  utilities
    -h or --help
       display this message and exit

    -v or --version
       display otb version and exit

    -s or --supress
       supress "stop and gather user input", and submit without sending to background

    -c or --check
       perform checks to insure smoother operation

    --lite
       don't perform anything with HiC

  required:
    -f or --forward
       a fastq or fastq.gz file for the pipline, in HiC runs this is one of the sequencing files, in a trio run, this is either the maternal or paternal sequences, order does not matter

    -r or --reverse
       another fastq or fastq.gz file for the pipeline, in HiC runs this is one of the sequencing files, in a trio run, this is either the maternal or paternal sequences, order does not matter

    --matf and --matr
       optional maternal sequences forward and reverse

    --patf and --patr
       optional paternal sequences forward and reverse

    -in or --reads
       path to HiFi reads (generally from pacbio), may include a wildcard for multiple files, can be fastq or bam


  suggested:
    -m or --mode
       mode to use, must be one of "phasing","default","trio","primary". default: "default"

    -t or --threads
       number of threads to use, clusters sometimes use this as number of cores, default: 20

    -n or --name
       a name for the assembly

    -y or --yahs
       run yahs as well

    --purge-dups
       [0-3], the amount of dups purging requested in HiFiASM, by default no purging is done. default: 0

    grid computing:
       select one of the following, defaults to local which is highly not-recomended
    -u or --runner
       runner choices are:
          "sge": generic sge
          "slurm": generic slurm
          "slurm_usda": USDA Ceres Cluster
          "slurm_usda_atlas": USDA Atlas Cluster, unstable
          "slurm_usd_mem": USDA Ceres Large Memory on Ceres
          "none": Local running
        you can also overload runner by giving a custom config in the /config directory without the .cfg.
        An exmple: there exists a config in the config directory called torque.cfg the following would
        be used:
          --runner torque
        the /config would not be included, otb.sh only looks in the config directory, and the .cfg is appended

    --polish-type
       turn on polishing one of:
          "simple": ragtag patching and shhquis rearangement
          "merfin": merfin variant calls ontop of ragtag, bcftools consensus
          "dv": deep variant calls ontop of ragtag, bcftools consensus

    --patch 
       perform error corrected read patching as part of polishing

    --hapscaffold
       turn on ragtag scaffolding for the haplotypes to algin and add N's to the haplotypes from the primary/parental genome

    busco:
       busco options, requires a lineage option
    --busco
       busco flag turns on busco
    select one of the following (all imply --busco):
       --auto-lineage
          try to use auto lineage finder from busco
       --auto-lineage-prok
          try to use auto lineage finder from busco, but limit to prokaryotes
       --auto-lineage-euk
          try to use auto linage finder from busco, but limit to eukaryotes
       -l or --lineage
          use a specific lineage with busco (recomended)
       -p or --busco-path
          run busco in offline mode, with path to database, or with database name to try and download
       --evalue
          change default busco evalue, default is 0.001

    K-mer counting
        options for k-mer counting
    -k or --kmer
        one of:
          "kmc": use kmc for k-mer counting
          "jellyfish": use jellyfish for k-mer counting

    Shhquis.jl:
       Shhquis options, optional
       --hclust-linkage
          linkage is one of single, average, complete, ward, ward_presquared. Defaults to average.

The most important options are the mode of operation, the type of polishing (if any), and busco options. The mode of operations is denoted by --mode or -m and is followed by one of:

  • phasing - phased genomes with Hi-C data
  • default - for homozygous and heterozygous genomes
  • trio - trio binned genomes with maternal and paternal data
  • primary - run for a primary and alternate assembly

Polishing can be one of three pipelines, a simple (ragtag.py with error corrected reads), a merfin, or a google deep variant polish, you can also choose not to polish.

  • simple: ragtag patching and shhquis rearangement
  • merfin: merfin variant calls ontop of ragtag, bcftools consensus
  • dv: deep variant calls ontop of ragtag, bcftools consensus

Busco will try to find existing busco databases if they exist. Busco can be turned on with --busco

in general then your call from otb.sh will look like, you can find this in /reports/${NAME}.nextflow.command.txt (more information on reports dir):

nextflow run run.nf -c config/slurm_usda.cfg --mode="heterozygous" --threads="40" --readf="RawHiC/JBNL_HiC___CAGTGCTT_Neodiprion_virginianus_I1014_L2_R1.fastq.gz" --readr="RawHiC/JBNL_HiC___CAGTGCTT_Neodiprion_virginianus_I1014_L2_R2.fastq.gz" --busco --polish --polishtype="simple" --readbam='RawHiFi/*.bam' --assembly="Neodiprion_virginianus_male"

Note: that any number of bam files can be used, by denoting the --readbam command with single quotes.

There is the possibility that otb will fail, you can submit a ticket for help, but you can start starting to troubleshoot yourself. It's possible if you are running otb on an older cluster that singularity may give issues, there are some simple changes you can check.

You can also create an execution script for otb, you'll use the -s flag for a headless run. Here is an example slurm script:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 2
#SBATCH -p long
#SBATCH -o "otb.stdout.%j.%N" 		# standard out %j adds job number to outputfile name and %N adds the node
#SBATCH -e "otb.stderr.%j.%N" 		#optional but it prints our standard error
#SBATCH --mail-user=<username>
#SBATCH --mail-type=START,END,FAIL               #will receive an email when job ends or fails

module load nextflow/20.07.1

#sometimes HiC data comes out as bz2, otb likes gz so here is how you would convert:
#bzcat RawHiC/JNBN_HiC_NA_NA_GTGAAA_Bombus_huntii-Bombus_huntii_HiC_I1143_L4_R1.fastq.bz2 | gzip -c > RawHiC/JNBN_HiC_NA_NA_GTGAAA_Bombus_huntii-Bombus_huntii_HiC_I1143_L4_R1.fastq.gz
#bzcat RawHiC/JNBN_HiC_NA_NA_GTGAAA_Bombus_huntii-Bombus_huntii_HiC_I1143_L4_R2.fastq.bz2 | gzip -c > RawHiC/JNBN_HiC_NA_NA_GTGAAA_Bombus_huntii-Bombus_huntii_HiC_I1143_L4_R2.fastq.gz

# We need an assembly name, generally this is just the name of the organism"
# Assembly_Name="Bombus_huntii"
Assembly_Name=$(basename "$( pwd )")

# Foward and reverse reads
Forward="RawData/*_R1.fastq.gz"
Reverse="RawData/*_R2.fastq.gz"
CCS='RawData/*.fastq'

#Comment/Uncommment for busco
#Busco="--busco" #Busco will be run
Busco="" #Busco will not be run

#Comment/Uncommment for Yahs
Yahs="-y" #Yahs will be run
#Yahs="" #Yahs will not be run

#Comment/Uncomment for Polishing (only select one of)
#Polish_Type="" #No polishing
Polish_Type="simple" #Simple Polishing
#Polish_Type="dv" #Deep Variant Polishing
#Polish_Type="merfin" #merfin Polishing

#Comment/Uncomment for Type (only select one of)
#HiFi_Type="phasing"
HiFi_Type="default"
#HiFi_Type="trio"

#Comment/Uncomment for Runner (only select one of)
#Runner="slurm_usda"
Runner="slurm_usda_mem"

Threads="32"

Busco_Location="-l /project/ag100pest/software/OTB_test/busco"
Busco_DB="-p insecta_odb10"

if [[ -z "$BUSCO" ]]; then
  ./otb.sh -n ${Assembly_Name} -f "$( echo ${Forward})" -r "$(echo ${Reverse})" -in "$(echo ${CCS})" -m ${HiFi_Type} -t ${Threads} ${Yahs} ${Busco} --polish_type ${Polish_Type} --runner ${Runner} -c -s
else
  ./otb.sh -n ${Assembly_Name} -f "$( echo ${Forward})" -r "$(echo ${Reverse})" -in "$(echo ${CCS})" -m ${HiFi_Type} -t ${Threads} ${Yahs} ${Busco} ${Busco_Location} ${Busco_DB} --polish_type ${Polish_Type} --runner ${Runner} -c -s
fi

Results

Your results will be in /results, and will look like this example:

results/
├── 00_ordination
│   ├── genomescope
│   │   ├── kcov.txt -> ../../../work/50/d9490a0840a70f1833e73c9384edb6/kcov.txt
│   │   ├── prosapia_bivincta
│   │   │   ├── fitted_hist.png -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/fitted_hist.png
│   │   │   ├── linear_plot.png -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/linear_plot.png
│   │   │   ├── log_plot.png -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/log_plot.png
│   │   │   ├── lookup_table.txt -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/lookup_table.txt
│   │   │   ├── model.txt -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/model.txt
│   │   │   ├── progress.txt -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/progress.txt
│   │   │   ├── summary.txt -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/summary.txt
│   │   │   ├── transformed_linear_plot.png -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/transformed_linear_plot.png
│   │   │   └── transformed_log_plot.png -> ../../../../work/50/d9490a0840a70f1833e73c9384edb6/prosapia_bivincta/transformed_log_plot.png
│   │   └── version.txt -> ../../../work/50/d9490a0840a70f1833e73c9384edb6/version.txt
│   ├── left.fastq.gz.stats
│   ├── log
│   │   ├── filtering
│   │   │   ├── filtering_information.log.txt
│   │   │   ├── hic_check.log.txt
│   │   │   └── hifi_check.log.txt
│   │   └── genomescope
│   │       ├── genomescope2.log.txt
│   │       └── kmer_counter.log.txt
│   └── right.fastq.gz.stats
├── 01_hifiasm
│   ├── busco
│   │   ├── busco_downloads -> ../../../work/f2/97d831cd35a759902c36dc40aa1fe1/busco_downloads
│   │   ├── busco.flag.txt -> ../../../work/f2/97d831cd35a759902c36dc40aa1fe1/busco.flag.txt
│   │   └── prosapia_bivincta_prosapia_bivincta.hic.p_ctg.gfa.fasta_busco -> ../../../work/f2/97d831cd35a759902c36dc40aa1fe1/prosapia_bivincta_prosapia_bivincta.hic.p_ctg.gfa.fasta_busco
│   ├── log
│   │   ├── busco.log.txt
│   │   ├── gfa2fasta.log.txt
│   │   └── HiFiASM.log.txt
│   ├── prosapia_bivincta.hic.hap1.p_ctg.gfa.fasta -> ../../work/65/7ccfc208649c919f86eefc913f5f06/prosapia_bivincta.hic.hap1.p_ctg.gfa.fasta
│   ├── prosapia_bivincta.hic.hap1.p_ctg.gfa.fasta.stats
│   ├── prosapia_bivincta.hic.hap2.p_ctg.gfa.fasta -> ../../work/de/9b8f07cbac43d70d49f59bf651b2ea/prosapia_bivincta.hic.hap2.p_ctg.gfa.fasta
│   ├── prosapia_bivincta.hic.hap2.p_ctg.gfa.fasta.stats
│   ├── prosapia_bivincta.hic.p_ctg.gfa.fasta -> ../../work/87/cd5bdd2594bb489eee21ed303b8ddb/prosapia_bivincta.hic.p_ctg.gfa.fasta
│   └── prosapia_bivincta.hic.p_ctg.gfa.fasta.stats
├── 02_hicstuff
│   ├── hicstuff_out
│   │   ├── abs_fragments_contacts_weighted.bg2 -> ../../../work/85/864455fedbceda667a3503ff190452/hicstuff_out/abs_fragments_contacts_weighted.bg2
│   │   ├── fragments_list.txt -> ../../../work/85/864455fedbceda667a3503ff190452/hicstuff_out/fragments_list.txt
│   │   ├── info_contigs.txt -> ../../../work/85/864455fedbceda667a3503ff190452/hicstuff_out/info_contigs.txt
│   │   └── plots
│   │       └── frags_hist.pdf -> ../../../../work/85/864455fedbceda667a3503ff190452/hicstuff_out/plots/frags_hist.pdf
│   ├── log
│   │   ├── hicstuff.log.txt
│   │   └── ragtag.log.txt
│   └── ragtag.patch.fasta.stats
├── 03_polish
│   └── log
│       └── faidx.log.txt
├── 04_yahs
│   ├── log
│   │   ├── bam.sort.log.txt
│   │   ├── bwa.log.txt
│   │   └── yahs.log.txt
│   ├── yahs.no_polish.bin -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish.bin
│   ├── yahs.no_polish.JBAT -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish.JBAT
│   ├── yahs.no_polish.JBAT.assembly -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish.JBAT.assembly
│   ├── yahs.no_polish.JBAT.assembly.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish.JBAT.assembly.agp
│   ├── yahs.no_polish.JBAT.liftover -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish.JBAT.liftover
│   ├── yahs.no_polish.JBAT.txt -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish.JBAT.txt
│   ├── yahs.no_polish_no_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_no_break.agp
│   ├── yahs.no_polish_r01.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r01.agp
│   ├── yahs.no_polish_r01_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r01_break.agp
│   ├── yahs.no_polish_r02.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r02.agp
│   ├── yahs.no_polish_r02_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r02_break.agp
│   ├── yahs.no_polish_r03.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r03.agp
│   ├── yahs.no_polish_r03_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r03_break.agp
│   ├── yahs.no_polish_r04.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r04.agp
│   ├── yahs.no_polish_r04_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r04_break.agp
│   ├── yahs.no_polish_r05.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r05.agp
│   ├── yahs.no_polish_r05_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r05_break.agp
│   ├── yahs.no_polish_r06.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r06.agp
│   ├── yahs.no_polish_r06_break.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_r06_break.agp
│   ├── yahs.no_polish_scaffolds_final.agp -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_scaffolds_final.agp
│   └── yahs.no_polish_scaffolds_final.fa -> ../../work/76/048cf591c9791adb49455d55ae02b6/yahs.no_polish_scaffolds_final.fa
├── 05_yahs_on_polish
│   └── log
├── 05_yahs_on_pol     ish
│   └── log
└── software_versions
    ├── any2fasta_version.txt
    ├── bcftools_version.txt
    ├── busco_version.txt
    ├── bwa_mem_2_version.txt
    ├── deepvariant_version.txt
    ├── genomescope_version.txt
    ├── gfastats_version.txt
    ├── hicstuff_version.txt
    ├── hifiasm_version.txt
    ├── jellyfish_version.txt
    ├── juicer_tools_version.txt
    ├── pbadapterfilt_version.txt
    ├── ragtag_version.txt
    ├── samtools_version.txt
    ├── shhquis_version.txt
    └── yahs_version.txt
  • 00_ordination holds logs and outputs from getting the data ready, importantly genomescope outputs exist in this directory
  • 01_hifiasm holds outputs from running hifiasm, as well as busco for hifiasm
  • 02_hicstuff holds outputs from running hicstuff
  • 03_polish holds outputs from any polishing, normally variant reduction and scaffold combinations where possible
  • 04_yahs holds outputs from yahs being run on 01/02, the unpolished vresions
  • 05_yahs_on_polish holds outputs from yahs on polished outputs

Further Reading: Making sense of your results

Running your first otb run, headless, on an HPC, and with slurm

On a typical HPC cluster singularity and nextflow are usally already installed. otb is a wrapped nextflow script, so if a user wanted they could run otb from nextflow. However, A layer of abstraction was added to otb through the use of a bash script otb.sh this script provides some environment checking, and nextflow construction methods. otb.sh also will try it's best to setup the required singularity environment and busco if required.

Setup

Data

In our first time scenario, let's assume that our data is in subdirectories, and that both sets of data are in the .fastq.gz filetype:

/project/ag100pest/Prosapia_bivincta_otb_test/RawHiFiData
/project/ag100pest/Prosapia_bivincta_otb_test/RawHiCData

we'll need to remember to location of this data.

Environment

In our scenario, let's set the Singularity Nextflow directory in a .bashrc environment file:

vim ~/.bashrc

and add the line:

NXF_SINGULARITY_CACHEDIR="/some/directory"

Doing this ensures that singularity containers will always be placed in the same directory, and will prevent replication of singularity containers between runs.

Getting otb

Now that the environment is set up for running otb we can grab the otb package itself, it is recommended to clone the entirety of otb for each otb run that the user may want to complete, since the singularity containers across runs are all stored in a central location, the net effect on disk space for the otb code is minimal.

git clone [email protected]:molikd/otb.git

Note: an ssh git clone from git hub requires that ssh keys are setup to your git account, this requires that you have a github.com account, and that you have added an ssh key attached to your github account from your HPC system.

After cloning otb.sh can be run, there are several ways to do this, if you are on an HPC that uses the slrum submission script, you can edit the otb.template.slurm or otb.lite.template.slurm. ./otb.sh could be run from the command line as well.

It's a good idea to start off by making a copy of the slurm template, this way the defaults are stored in the directory:

cp otb.template.slurm otb.Prosapia_bivincta.slurm

Note: make sure to keep the copy in the same directory as the original, like shown. otb can not take softlinks due to nextflow.

In this case since we have both HiFi and HiC data, modifying the lines:

Forward="RawData/*_R1.fastq.gz"
Reverse="RawData/*_R2.fastq.gz"
CCS='RawData/*.fastq'

to

Forward="/project/ag100pest/Prosapia_bivincta_otb_test/RawHiCData/Pbis-GTCCGC_HiC_S2_L001_R1_001.fastq.gz"
Reverse="/project/ag100pest/Prosapia_bivincta_otb_test/RawHiCData/Pbis-GTCCGC_HiC_S2_L001_R2_001.fastq.gz"
CCS='/project/ag100pest/Prosapia_bivincta_otb_test/RawHiFiData/*.hifi_reads.fastq.gz'

to point otb at the right data.

We should also make sure that slurm emails the user so modify the line:

#SBATCH --mail-user=<username>

with the users email replacing ""

we'll also make sure busco doesn't run by commenting out the line that would make busco run:

#Comment/Uncommment for busco
#Busco="--busco" #Busco will be run
Busco="" #Busco will not be run

We won't run yahs:

#Comment/Uncommment for Yahs
#Yahs="-y" #Yahs will be run
Yahs="" #Yahs will not be run

We'll turn off the polishing:

#Comment/Uncomment for Polishing (only select one of)
Polish_Type="" #No polishing
#Polish_Type="simple" #Simple Polishing
#Polish_Type="dv" #Deep Variant Polishing
#Polish_Type="merfin" #merfin Polishing

and

keep the run type as default:

#Comment/Uncomment for Type (only select one of)
#HiFi_Type="phasing"
HiFi_Type="default"
#HiFi_Type="trio"

This should be a fairly quick run ( several hours) without the extra bells and whistles.

The runner and threads we can keep to defaults as well.

#Comment/Uncomment for Runner (only select one of)
Runner="slurm_usda"
#Runner="slurm_usda_mem"
Threads="32"

Running otb

Now that the template file has been edited, running otb in headless mode is as simple as:

sbatch otb.Prosapia_bivincta.slurm