BWASP HOWTO - an example for how to use the software

For the impatient or those just needing a reminder

Please go to data/README for production-level workflow scripting (pretty cool!).

Preparation

At this stage, you should have completed the BWASP installation steps documented in the INSTALL document; we'll assume that you have downloaded the bwasp.sif Singularity container.

The BWASP script xgetSRAacc uses the NCBI SRA Toolkit to download data from NCBI SRA. If you have been using the SRA Toolkit already and allow local file-caching, you need to make sure that your file-caching location is accessible to Singularity. We recommend disabling local file-caching. To do this, run:

  singularity exec -e bwasp.sif  vdb-config -i

navigate to CACHE by entering C, disable local file-caching by toggling i, then enter x to exit and, if prompted, o for ok.

Note: If this is the first time you are using SRA Toolkit on the current machine, you will have to invoke vdb-config at least once to set your preferences (as per NCBI instructions).

We explain BWASP use with an example from our 2021 publication.

Samples

Our goal is to analyze BS-seq data sets from the Patalano et al. (2015) study of the paper wasp Polistes canadensis. Three BS-seq data sets from queen adult brains were deposited in NCBI's Sequence Read Archive: SRR1519132 (sample Pcan-21Q), SRR1519133 (Pcan-43Q), and SRR1519134 (Pcan-75Q).

Preparing the directory structure

BWASP expects a certain directory structure starting from the BWASP_ROOT/data directory (BWASP_DATA).

  • BWASP_DATA/
    • species/
      • genome/
      • study/
        • caste/
          • replicate/

This hierarchy is designed to organize the various data sets. Although not mandatory, following it is recommended for easier operation. The xmkdirstr script creates this structure and populates it with the relevant links and Makefiles for each sample. Please note that the Makefile for each sample needs to be modified prior to running the workflow.

cd data
./xmkdirstr Pcan Patalano2015 Queen 3 p

Here, Pcan is an alias for Polistes canadensis, and the created Pcan directory might eventually hold several studies on this species, the Patalano2015 study being one example. Under Patalano2015, the subdirectory Queen will contain 3 replicates of paired-end reads from queens. Each replicate subdirectory will be populated with a link to the genome directory and a copy of Makefile_pe_template.
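For our example, the layout created under BWASP_DATA should thus look roughly as follows (replicate directories are numbered; each one receives the genome link and Makefile copy mentioned above):

  • Pcan/
    • genome/
    • Patalano2015/
      • Queen/
        • replicate1/ (link to the genome directory plus a copy of Makefile_pe_template)
        • replicate2/
        • replicate3/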

Getting the genome

The BWASP workflow requires a reference genome, so we need to obtain it and place it in the genome directory. The Polistes canadensis genome can be found on its NCBI Genome page. We simply take the direct download links for the genome assembly (FASTA) and annotation (GFF) files and download them into the Pcan/genome directory as follows:

cd Pcan/genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/313/835/GCF_001313835.1_ASM131383v1/GCF_001313835.1_ASM131383v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/313/835/GCF_001313835.1_ASM131383v1/GCF_001313835.1_ASM131383v1_genomic.gff.gz

Assuming everything downloaded correctly, we decompress the files and link them to more convenient file names:

gunzip GCF_*.gz
ln GCF_001313835.1_ASM131383v1_genomic.fna Pcan.gdna.fa
ln GCF_001313835.1_ASM131383v1_genomic.gff Pcan.gff3

Note that the annotation is not necessary for the basic BWASP run, but we download it anyway for use in downstream analysis. Also note that we prefer linking the files rather than moving (renaming) them, so that the original filenames remain available for future reference.
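As a quick, optional sanity check (assuming standard FASTA and GFF3 formats), you can count the sequences in the assembly and the gene records in the annotation:

grep -c '^>' Pcan.gdna.fa
awk '$3 == "gene"' Pcan.gff3 | wc -l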

Getting the reads

Note that the appropriate template Makefile was copied into each replicate directory. This Makefile contains the necessary commands (fasterq-dump) to download reads from NCBI SRA, so all we need to do is to fill in the appropriate accession numbers.

For our example, the template Makefile already contains the accession number for the first queen replicate. We could manually edit the other two Makefiles or, more conveniently, use the following commands to substitute the SRA accession numbers and sample labels:

sed -i -e "s/SRR1519132/SRR1519133/; s/Pcan-21Q/Pcan-43Q/;" replicate2/Makefile
sed -i -e "s/SRR1519132/SRR1519134/; s/Pcan-21Q/Pcan-75Q/;" replicate3/Makefile

Now we are ready to start the heavy data processing.

Running the workflow

Because the rest of the workflow is fire-and-forget, proofreading the Makefiles is highly recommended at this point. The section titled Variable Settings is the only part that should need modification, and it should be double-checked before proceeding.

Once that is done, it is recommended to source the bwasp_env.sh script (as per the installation instructions) and confirm that the BWASP_EXEC variable is set by checking the output of echo $BWASP_EXEC. This is merely a convenience variable that holds a command with all the relevant Singularity parameters set for the user. For example, if user bumblebee has access to plenty of disk space on /bigdata/bumblebee/, this user's $BWASP_EXEC might look like the following:

singularity exec -e -B /bigdata/bumblebee/BWASP/data /bigdata/bumblebee/BWASP/bwasp.sif
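If BWASP_EXEC is not set in your shell, sourcing the environment script (its location depends on where you installed BWASP) should define it; for example:

source bwasp_env.sh
echo $BWASP_EXEC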

Finally, we can run the make-enabled workflow (from directory replicateX):

$BWASP_EXEC make -n
$BWASP_EXEC make Bisulfite_Genome
$BWASP_EXEC make &> bwasp.log

The preceding $BWASP_EXEC ensures that make runs inside the Singularity container, where all the moving parts are known to be in working condition (i.e., all required binaries are of the correct version and on the path).

The first make command, with the -n flag, simply shows what make will do and is valuable for reference. The second make command runs the preparatory genome processing step, which needs to be done only once and is shared by all samples of a given species. The third make command runs the main workflow.
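If you want to verify that the shared genome-preparation output is in place before moving on, one simple check is to look for the Bisulfite_Genome directory that bismark_genome_preparation creates inside the genome folder; assuming the directory layout created above, from a replicate directory this would be:

ls ../../../genome/Bisulfite_Genome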

Once the common bismark_genome_preparation step is done, you could start make in the other replicate directories. However, we strongly suggest finishing one run first to make sure that everything works and that your computer has enough resources to run multiple BWASP workflows simultaneously. Check the err file and your system monitor frequently.
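For example, a simple way to keep an eye on a running workflow (output was redirected to bwasp.log in the command above) is:

tail -f bwasp.log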

Output

After completion of the BWASP workflow, the working directory should contain a fair number of output files. Please refer to the documentation of the various constituent programs, as well as our 2021 publication, for details. To remove unneeded intermediate files and to archive files that may be of interest later but are not needed in subsequent BWASP analysis steps, we recommend running the following commands at this stage:

$BWASP_EXEC make cleanup
$BWASP_EXEC make finishup

This will create the archive STORE-SRR1519132.zip and substantially reduce the disk space used. The remaining output files should be as follows:

Methylation calls and statistics

  • Pcan-21Q.mstats
  • Pcan-21Q.CHGhsm.mcalls Pcan-21Q.CHGnsm.mcalls Pcan-21Q.CHGscd.mcalls
  • Pcan-21Q.CHHhsm.mcalls Pcan-21Q.CHHnsm.mcalls Pcan-21Q.CHHscd.mcalls
  • Pcan-21Q.CpGhsm.mcalls Pcan-21Q.CpGnsm.mcalls Pcan-21Q.CpGscd.mcalls
  • Pcan-21Q.HSMthresholds
  • SRR1519132.stats

Read preparation, mapping, and quality reports

  • SRR1519132.stats
  • Pcan-21Q.bam
  • Pcan-21Q_splitting_report.txt
  • CpG_OT_Pcan-21Q.txt
  • CpG_OB_Pcan-21Q.txt
  • Pcan-21Q.M-bias.eval
  • Pcan-21Q_mbias_only_splitting_report.txt
  • Pcan-21Q.M-bias_R1.png
  • Pcan-21Q.M-bias_R2.png
  • Pcan-21Q.M-bias.txt
  • FilterMsam-Report-Pcan-21Q-deduplicated
  • Rejected-Reads10-Pcan-21Q-deduplicated.sam
  • Rejected-Reads01-Pcan-21Q-deduplicated.sam
  • Rejected-Reads11-Pcan-21Q-deduplicated.sam
  • SRR1519132_1_val_1.fq_bismark_bt2_pe.deduplication_report.txt
  • SRR1519132_1_val_1.fq_bismark_bt2_PE_report.txt
  • FastQC/
  • SRR1519132_2.fastq_trimming_report.txt
  • SRR1519132_1.fastq_trimming_report.txt

Genome statistics

  • Pcan.gdna.stats

Take a look and explore. The .stats and report files would be good starting points.
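For a first look, standard command-line tools are sufficient; for example (using the file names listed above, and the archive created by the finishup step):

less Pcan-21Q.mstats
head Pcan-21Q.CpGhsm.mcalls
unzip -l STORE-SRR1519132.zip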

Working with large data sets

For large data sets, the simple fasterq-dump command in the Makefile may not be the best choice. You may want to review the options to fasterq-dump. For example,

fasterq-dump SRRaccession -e 8 -t /dev/shm -p

would use 8 threads to download and convert SRRaccession, using /dev/shm for temporary files, which should be much faster than disk storage. The -p option shows the progress of the download. You would then deposit the read files (possibly after splitting them into manageable chunks that could be treated as pseudo-replicates, as sketched below) into your working directories, from where you would execute the make command.
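As a rough sketch of such a splitting step (assuming GNU split and uncompressed paired-end FASTQ files; the chunk size and output prefixes here are arbitrary), keep in mind that each FASTQ record spans four lines, so the line count per chunk must be a multiple of four, and both mate files must be split with identical settings so that the chunks stay in sync:

split -l 100000000 -d --additional-suffix=.fastq SRR1519132_1.fastq chunk_1_
split -l 100000000 -d --additional-suffix=.fastq SRR1519132_2.fastq chunk_2_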

Merging data from multiple replicates

While it is of interest to look at the methylation statistics across different replicate data sets, typically the replicate data are pooled when comparing between samples/conditions (e.g., Queen versus Worker samples).

BWASP provides an additional Makefile to merge replicates and compute cumulative statistics over all of them. The xmkdirstr script in our example has already set this up for us in the Queen directory. We only need to specify the desired output label (SYNONYM = Pcan-queen in the Makefile) and run (from directory Queen):

$BWASP_EXEC make

optionally followed by cleanup and finishup targets here as well:

$BWASP_EXEC make cleanup
$BWASP_EXEC make finishup

This leaves us with the following:

Combined methylation calls and statistics

  • Pcan-queen.mstats
  • Pcan-queen.CHGhsm.mcalls Pcan-queen.CHGnsm.mcalls Pcan-queen.CHGscd.mcalls
  • Pcan-queen.CHHhsm.mcalls Pcan-queen.CHHnsm.mcalls Pcan-queen.CHHscd.mcalls
  • Pcan-queen.CpGhsm.mcalls Pcan-queen.CpGnsm.mcalls Pcan-queen.CpGscd.mcalls
  • Pcan-queen.HSMthresholds