Please go to data/README for production-level workflow scripting (pretty cool!).
At this stage, you should have completed the BWASP installation steps
documented in the INSTALL document; we'll assume that you have downloaded the
bwasp.sif Singularity container.
The BWASP script xgetSRAacc uses the NCBI SRA Toolkit to download data from NCBI SRA. If you already use the SRA Toolkit with local file-caching enabled, you need to make sure that your file-caching location is accessible to Singularity. We recommend disabling local file-caching. To do this, run:
singularity exec -e bwasp.sif vdb-config -i
Navigate to CACHE by entering C, disable local file-caching by toggling i, then press x to exit and, if prompted, o for OK.
Note: If this is the first time you are using SRA Toolkit on the current machine, you will have to invoke vdb-config at least once to set your preferences (as per NCBI instructions).
We explain BWASP use with an example from our 2021 publication.
Our goal is to analyze BS-seq data sets from the Patalano et al., 2015, study of the paper wasp Polistes canadensis. Three BS-seq data sets from queen adult brains were deposited in the NCBI Sequence Read Archive:
BWASP expects a certain directory structure starting from the BWASP_ROOT/data
directory (BWASP_DATA).
- BWASP_DATA/
  - species/
    - genome/
    - study/
      - caste/
        - replicate/
      - caste/
  - species/
This hierarchy is designed to organize various data sets. Although not mandatory, it is recommended to follow it for easier operation. The xmkdirstr script helps create this structure and populates it with the relevant links and makefiles for each sample. Please note that makefiles for each sample need to be modified prior to running the workflow.
cd data
./xmkdirstr Pcan Patalano2015 Queen 3 p
Here, Pcan is an alias for Polistes canadensis, and the created Pcan directory might eventually hold several studies on this species, the Patalano2015 study being one example. Under Patalano2015, the subdirectory Queen will contain 3 replicates of paired-end reads from queens. Each replicate subdirectory will be populated with a link to the genome directory and a copy of Makefile_pe_template.
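For orientation, the layout described above can be sketched by hand. This is only an illustration under stated assumptions, not the xmkdirstr script itself: the function name sketch_layout is hypothetical, the replicate directories are assumed to be named replicate1, replicate2, ... (following the replicate2/replicate3 names used later in this walkthrough), and the sketch creates empty directories only, with no genome links and no Makefile copies.

```shell
# Hypothetical helper: mimic the directory tree described above.
# It does NOT create the genome links or copy the Makefile templates;
# use xmkdirstr for the real setup.
# The function body runs in a subshell so its variables do not leak.
sketch_layout() (
    species=$1
    study=$2
    caste=$3
    nreps=$4
    mkdir -p "$species/genome"
    i=1
    while [ "$i" -le "$nreps" ]; do
        mkdir -p "$species/$study/$caste/replicate$i"
        i=$((i + 1))
    done
)
```

Running sketch_layout Pcan Patalano2015 Queen 3 in a scratch directory and then find Pcan -type d lists the resulting tree, which can be compared against what xmkdirstr actually produced.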
The BWASP workflow requires a reference genome, so we need to obtain and place
it in the appropriate directory.
The Polistes canadensis genome can be found at the appropriate
NCBI Genome page.
We can use the direct download links for the genome assembly (FASTA) and
annotation (GFF) files to fetch them into the Pcan/genome directory as
follows:
cd Pcan/genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/313/835/GCF_001313835.1_ASM131383v1/GCF_001313835.1_ASM131383v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/313/835/GCF_001313835.1_ASM131383v1/GCF_001313835.1_ASM131383v1_genomic.gff.gz
Assuming the downloads completed successfully, we decompress the files and link them to more convenient file names:
gunzip GCF_*.gz
ln GCF_001313835.1_ASM131383v1_genomic.fna Pcan.gdna.fa
ln GCF_001313835.1_ASM131383v1_genomic.gff Pcan.gff3
Note that the annotation is not needed for the basic BWASP run; we download it anyway for use in downstream analysis. Also note that we link the files rather than move (rename) them, so that the original file names are kept for future reference.
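Before moving on, it can be worth confirming that the decompressed genome file is not empty or truncated. The following is a minimal sketch, not part of BWASP; the function name count_fasta_records is hypothetical, and it simply counts FASTA header lines.

```shell
# Hypothetical helper: count sequence records in a FASTA file,
# i.e. the number of header lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}
```

From the Pcan/genome directory, count_fasta_records Pcan.gdna.fa should print a positive number. Because ln without -s creates a hard link, ls -li should also show the same inode number for Pcan.gdna.fa and the original GCF_... file name.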
Note that the appropriate template Makefile was copied into each replicate directory. This Makefile contains the necessary commands (fasterq-dump) to download reads from NCBI SRA, so all we need to do is to fill in the appropriate accession numbers.
For our example, the template Makefile already has the accession number for the first queen replicate. We could manually edit numbers for the other two replicates or, better, use the following commands to substitute the SRA accession number and sample labels for the other two replicates:
sed -i -e "s/SRR1519132/SRR1519133/; s/Pcan-21Q/Pcan-43Q/;" replicate2/Makefile
sed -i -e "s/SRR1519132/SRR1519134/; s/Pcan-21Q/Pcan-75Q/;" replicate3/Makefile
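Since a mistyped substitution would silently leave the wrong accession in place, a quick check after editing can help. The sketch below is an illustration only; the function name list_accessions is hypothetical, and it assumes that the accession appears literally in each Makefile as SRR followed by digits, as the sed commands above imply.

```shell
# Hypothetical helper: for each Makefile given as an argument, print one
# line per distinct SRA run accession (SRR followed by digits) found in it,
# in the form "<file><TAB><accession>".
list_accessions() {
    for mk in "$@"; do
        grep -o 'SRR[0-9][0-9]*' "$mk" | sort -u | while read -r acc; do
            printf '%s\t%s\n' "$mk" "$acc"
        done
    done
}
```

Run from the Queen directory as list_accessions replicate*/Makefile; each replicate should report exactly one accession, and the three accessions should all differ.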
Now we are ready to start the heavy data processing.
Because the rest of the workflow is fire-and-forget, proofreading the Makefiles is highly recommended at this point. The section titled Variable Settings is the only part that must be modified, and it should be double-checked before proceeding.
Once that is done, we recommend sourcing bwasp_env.sh (as per the installation
instructions) and confirming that the BWASP_EXEC variable is set by checking
the output of echo $BWASP_EXEC.
This is merely a convenience variable that holds a command with all the
relevant Singularity parameters set for the user.
For example, if user bumblebee has access to plenty of disk space on
/bigdata/bumblebee/, this user's $BWASP_EXEC might look like the following:
singularity exec -e -B /bigdata/bumblebee/BWASP/data /bigdata/bumblebee/BWASP/bwasp.sif
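If BWASP_EXEC is empty, the commands that follow would run make directly on the host rather than inside the container. A small guard can catch this early. The sketch below is an assumption-labeled illustration, not part of BWASP; the function name require_bwasp_exec is hypothetical.

```shell
# Hypothetical guard: fail early if BWASP_EXEC is empty or unset,
# otherwise print its value for a visual check.
require_bwasp_exec() {
    if [ -z "${BWASP_EXEC:-}" ]; then
        echo "BWASP_EXEC is not set; source bwasp_env.sh first" >&2
        return 1
    fi
    echo "BWASP_EXEC = $BWASP_EXEC"
}
```

Calling require_bwasp_exec before starting a run prints the command prefix that will be used, or returns a non-zero status with a hint if the environment file has not been sourced.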
Finally, we can run the make-enabled workflow (from directory replicateX):
$BWASP_EXEC make -n
$BWASP_EXEC make Bisulfite_Genome
$BWASP_EXEC make &> bwasp.log
The preceding $BWASP_EXEC makes sure that make runs from inside the
Singularity container, where we made sure all the moving parts are in working
condition (i.e., all required binaries are of the correct version and in the
path).
The first make command with the -n flag simply shows what make will do and is valuable for reference. The second make command runs the preparatory genome processing step which needs to be done once and is shared for all samples of a given species. The third make command will run the main workflow.
Once the common bismark_genome_preparation step is done, you could start make in the other replicate directories. However, we strongly suggest you finish one run first to make sure that everything works and that you have enough resources on your computer to run multiple BWASP workflows simultaneously. Check the err file and your system monitor frequently.
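If resources are too tight for simultaneous runs, the remaining replicates can be processed one after another. The following is a minimal sketch under stated assumptions, not a BWASP script: the function name run_replicates is hypothetical, BWASP_EXEC is assumed to be set as described above, and the log file name bwasp.log is taken from the make command shown earlier.

```shell
# Hypothetical helper: run the workflow in each given replicate directory,
# one after another, stopping at the first failure.
run_replicates() {
    for rep in "$@"; do
        echo "Starting BWASP workflow in $rep"
        # $BWASP_EXEC is deliberately unquoted so that it splits into
        # the singularity command and its arguments.
        ( cd "$rep" && $BWASP_EXEC make > bwasp.log 2>&1 ) || {
            echo "Workflow failed in $rep; see $rep/bwasp.log" >&2
            return 1
        }
    done
}
```

From the Queen directory, run_replicates replicate2 replicate3 would start the second run only after the first has finished successfully. Stopping at the first failure avoids spending hours on later replicates when a shared problem (for example, a misconfigured Makefile) affects all of them.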
After completion of the BWASP workflow, the working directory should contain a fair number of output files. For details, please refer to the documentation of the constituent programs and to our 2021 publication. To remove unneeded intermediate files and to archive files that may be of interest later but are not needed in subsequent BWASP analysis steps, we recommend running the following commands at this stage:
$BWASP_EXEC make cleanup
$BWASP_EXEC make finishup
This will create the archive STORE-SRR1519132.zip and substantially reduce the disk space used. The remaining output files should be as follows:
- Pcan-21Q.mstats
- Pcan-21Q.CHGhsm.mcalls Pcan-21Q.CHGnsm.mcalls Pcan-21Q.CHGscd.mcalls
- Pcan-21Q.CHHhsm.mcalls Pcan-21Q.CHHnsm.mcalls Pcan-21Q.CHHscd.mcalls
- Pcan-21Q.CpGhsm.mcalls Pcan-21Q.CpGnsm.mcalls Pcan-21Q.CpGscd.mcalls
- Pcan-21Q.HSMthresholds
- SRR1519132.stats
- Pcan-21Q.bam
- Pcan-21Q_splitting_report.txt
- CpG_OT_Pcan-21Q.txt
- CpG_OB_Pcan-21Q.txt
- Pcan-21Q.M-bias.eval
- Pcan-21Q_mbias_only_splitting_report.txt
- Pcan-21Q.M-bias_R1.png
- Pcan-21Q.M-bias_R2.png
- Pcan-21Q.M-bias.txt
- FilterMsam-Report-Pcan-21Q-deduplicated
- Rejected-Reads10-Pcan-21Q-deduplicated.sam
- Rejected-Reads01-Pcan-21Q-deduplicated.sam
- Rejected-Reads11-Pcan-21Q-deduplicated.sam
- SRR1519132_1_val_1.fq_bismark_bt2_pe.deduplication_report.txt
- SRR1519132_1_val_1.fq_bismark_bt2_PE_report.txt
- FastQC/
- SRR1519132_2.fastq_trimming_report.txt
- SRR1519132_1.fastq_trimming_report.txt
- Pcan.gdna.stats
Take a look and explore. The .stats and report files would be good starting points.
For large data sets, the simple fasterq-dump command in the Makefile may not be the best choice. You may want to review the options to fasterq-dump. For example,
fasterq-dump SRRaccession -e 8 -t /dev/shm -p
would use 8 threads to download SRRaccession and place the temporary files in /dev/shm (memory-backed storage on most Linux systems), which should be much faster than disk storage; the final read files are written to the current directory unless an output directory is given with -O. The -p option shows the progress of the download. You would then deposit the read files (possibly after splitting them into manageable chunks that could be treated as pseudo-replicates) into your working directories, from where you would execute the make command.
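Splitting read files into chunks can be done with standard tools. The sketch below is an illustration only, not part of BWASP: the function name split_fastq is hypothetical, and it assumes that every FASTQ record spans exactly four lines, so that chunk boundaries fall between records.

```shell
# Hypothetical helper: split a FASTQ file into chunks of N reads each.
# Assumes every FASTQ record spans exactly 4 lines.
# Output files are named <prefix>aa, <prefix>ab, ... by split.
# The function body runs in a subshell so its variables do not leak.
split_fastq() (
    fastq=$1
    reads_per_chunk=$2
    prefix=$3
    split -l $((reads_per_chunk * 4)) "$fastq" "$prefix"
)
```

For paired-end data, both read files must be split with the same chunk size so that mates stay in the same order within corresponding chunks, for example split_fastq SRRaccession_1.fastq 10000000 chunk_1_ followed by split_fastq SRRaccession_2.fastq 10000000 chunk_2_. The chunk size here is arbitrary and should be chosen to suit the available resources.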
While it is of interest to look at the methylation statistics across different replicate data sets, typically the replicate data are pooled when comparing between samples/conditions (e.g., Queen versus Worker samples).
BWASP provides an additional makefile to merge replicates and provide cumulative statistics over all replicates. The xmkdirstr script in our example already set this up for us in the Queen directory. We only need to specify the desired output label (SYNONYM = Pcan-queen in the Makefile) and run (from directory Queen):
$BWASP_EXEC make
optionally followed by cleanup and finishup targets here as well:
$BWASP_EXEC make cleanup
$BWASP_EXEC make finishup
Leaving us with
- Pcan-queen.mstats
- Pcan-queen.CHGhsm.mcalls Pcan-queen.CHGnsm.mcalls Pcan-queen.CHGscd.mcalls
- Pcan-queen.CHHhsm.mcalls Pcan-queen.CHHnsm.mcalls Pcan-queen.CHHscd.mcalls
- Pcan-queen.CpGhsm.mcalls Pcan-queen.CpGnsm.mcalls Pcan-queen.CpGscd.mcalls
- Pcan-queen.HSMthresholds