WGBS:bam_statistics always fails with Docker profile #11

Open
kubu4 opened this issue Jul 18, 2022 · 8 comments

@kubu4

kubu4 commented Jul 18, 2022

This step of the pipeline fails for every sample, and it has done so across several runs with different data sets. The stats files are generated, though, so I'm not sure which part of the step constitutes the failure.

Here's an example from a recent run. Error message from .nextflow.log:

Jul-15 12:59:53.874 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 12; name: WGBS:bam_statistics (zr1394_11_s456); status: COMPLETED; exit: 25; error: -; workDir: /home/shared/8TB_HDD_01/sam/analyses/20220715-olur-nextflow_epidiverse-wgbs/work/36/5380cd17196a4057ec8d5af0975f15]
Jul-15 12:59:55.889 [Task monitor] INFO  nextflow.processor.TaskProcessor - [36/5380cd] NOTE: Process `WGBS:bam_statistics (zr1394_11_s456)` terminated with an error exit status (25) -- Execution is retried (1)
Jul-15 12:59:55.900 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Jul-15 12:59:55.900 [Task submitter] INFO  nextflow.Session - [58/fdda1e] Re-submitted process > WGBS:bam_statistics (zr1394_11_s456)
Jul-15 13:00:12.201 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 14; name: WGBS:bam_statistics (zr1394_11_s456); status: COMPLETED; exit: 25; error: -; workDir: /home/shared/8TB_HDD_01/sam/analyses/20220715-olur-nextflow_epidiverse-wgbs/work/58/fdda1e47e922a4644bce5a8621826c]
Jul-15 13:00:12.203 [Task monitor] INFO  nextflow.processor.TaskProcessor - [58/fdda1e] NOTE: Process `WGBS:bam_statistics (zr1394_11_s456)` terminated with an error exit status (25) -- Error is ignored
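
To pull only the failing tasks out of a long log, something along these lines may help. It is a sketch only: `failed_tasks` is a made-up helper name, and the pattern assumes the `terminated with an error exit status (N)` wording shown in the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nextflow or the pipeline): list each task
# that ended with an error exit status, based on the NOTE lines in the log.
failed_tasks() {
  local log="${1:-.nextflow.log}"
  # Keep only the NOTE lines, then print "<process name> exit=<status>".
  grep 'terminated with an error exit status' "$log" |
    sed -E 's/.*Process `([^`]+)` terminated with an error exit status \(([0-9]+)\).*/\1 exit=\2/'
}
```

Calling `failed_tasks .nextflow.log` from the launch directory prints one line per failed attempt, so retried tasks appear more than once.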

Here's what I see in /home/shared/8TB_HDD_01/sam/analyses/20220715-olur-nextflow_epidiverse-wgbs/work/36/5380cd17196a4057ec8d5af0975f15:

proc.erne-bs5.bam@ sorted.bam zr1394_11_s456/

Then, if I look in the zr1394_11_s456/ directory:

stats/ zr1394_11_s456.bam.stats

A quick glance at the zr1394_11_s456.bam.stats looks good:

# This file was produced by samtools stats (1.9+htslib-1.9) and can be plotted using plot-bamstats
# This file contains statistics for all reads.
# The command line was:  stats sorted.bam
# CHK, Checksum [2]Read Names   [3]Sequences    [4]Qualities
# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)
CHK     14312793        6ca7c36e        835557d7
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      raw total sequences:    4599400
SN      filtered sequences:     0
SN      sequences:      4599400
SN      is sorted:      1
SN      1st fragments:  4599400
SN      last fragments: 0
SN      reads mapped:   4599400
SN      reads mapped and paired:        0       # paired-end technology bit set + both mates mapped
SN      reads unmapped: 0
SN      reads properly paired:  0       # proper-pair bit set
SN      reads paired:   0       # paired-end technology bit set
SN      reads duplicated:       0       # PCR or optical duplicate bit set
SN      reads MQ0:      2466606 # mapped and MQ=0
SN      reads QC failed:        0
SN      non-primary alignments: 0
SN      total length:   230727266       # ignores clipping
SN      total first fragment length:    230727266       # ignores clipping
SN      total last fragment length:     0       # ignores clipping
SN      bases mapped:   230727266       # ignores clipping
SN      bases mapped (cigar):   228541966       # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       0
SN      mismatches:     4136909 # from NM fields
SN      error rate:     1.810131e-02    # mismatches / bases mapped (cigar)
SN      average length: 50
SN      average first fragment length:  50
SN      average last fragment length:   0
SN      maximum length: 51
SN      maximum first fragment length:  0
SN      maximum last fragment length:   0
SN      average quality:        34.3
SN      insert size average:    0.0
SN      insert size standard deviation: 0.0
SN      inward oriented pairs:  0
SN      outward oriented pairs: 0
SN      pairs with other orientation:   0
SN      pairs on different chromosomes: 0
SN      percentage of properly paired reads (%):        0.0
# First Fragment Qualities. Use `grep ^FFQ | cut -f 2-` to extract this part.
# Columns correspond to qualities and rows to cycles. First column is the cycle number.
# Last Fragment Qualities. Use `grep ^LFQ | cut -f 2-` to extract this part.
# Columns correspond to qualities and rows to cycles. First column is the cycle number.
# GC Content of first fragments. Use `grep ^GCF | cut -f 2-` to extract this part.
GCF     0.75    12090

So, it seems like BAM stats were generated...

Any ideas on what might be happening?
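
One way to see what actually failed is to look at the hidden files Nextflow leaves in each task work directory. A plain `ls` does not show them, which is why the listing above contains only the BAM files and the sample directory. The sketch below assumes the usual `.exitcode`, `.command.sh`, and `.command.err` files are present; `inspect_task_dir` is a made-up helper name, not part of Nextflow or the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise a Nextflow task work directory.
inspect_task_dir() {
  local dir="$1"
  # .exitcode holds the exit status recorded for the task.
  if [ -f "$dir/.exitcode" ]; then
    echo "exit status: $(cat "$dir/.exitcode")"
  else
    echo "exit status: (no .exitcode file)"
  fi
  # .command.sh is the script the task ran; .command.err is its stderr.
  for f in .command.sh .command.err; do
    echo "== $f =="
    if [ -f "$dir/$f" ]; then
      tail -n 20 "$dir/$f"
    else
      echo "(missing)"
    fi
  done
}
```

For the run above this would be called as `inspect_task_dir /home/shared/8TB_HDD_01/sam/analyses/20220715-olur-nextflow_epidiverse-wgbs/work/36/5380cd17196a4057ec8d5af0975f15`. The last lines of `.command.err` usually show which command in the script returned the non-zero status.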

@bio15anu
Member

bio15anu commented Jul 19, 2022

Thanks for posting! At some point previously we were encountering an issue where the plot-bamstats tool was failing because of an undocumented dependency on gnuplot<=5.2.6. This has been fixed for a while now when using conda and the corresponding environment.yml file, but it could be that the fix did not propagate properly to the Docker image.

I will investigate further this afternoon. In the meantime, is it possible for you to test it instead with conda and see if this is resolved there for you?

@bio15anu
Member

As discussed in issue #7, this seems solved now by simplifying L458 in lib/wgbs.nf as per 17d09d1.

When running with Singularity/Docker, that line previously created an empty header file, which resulted in a truncated subset.bam file. The WGBS:bam_statistics process could not parse the truncated file, hence this error.
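
A quick way to check a BAM file for the missing-header symptom described here is to count its header lines. This is a minimal sketch, assuming `samtools` is on the PATH; `bam_has_header` is a hypothetical helper name, not something the pipeline provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the BAM file has a non-empty header.
bam_has_header() {
  local bam="$1"
  local n
  # `samtools view -H` prints only the header lines (@HD, @SQ, @RG, ...).
  # grep -c exits non-zero on a count of 0, so `|| true` keeps the pipeline quiet.
  n="$(samtools view -H "$bam" 2>/dev/null | grep -c '^@' || true)"
  [ "${n:-0}" -gt 0 ]
}
```

Example: `bam_has_header subset.bam && echo "header present" || echo "header missing or unreadable"`. `samtools quickcheck` performs a related integrity check on the file as a whole.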

@kubu4
Author

kubu4 commented Jul 20, 2022

I'm not savvy enough to understand how to run the pipeline with a commit that hasn't been issued in a release.

I've been running this:

NXF_VER=20.07.1 /home/shared/nextflow run epidiverse/wgbs -profile test,docker

But, now I see this message rightly indicating that I'm working with a release which is behind the most current commit:

NOTE: Your local project version looks outdated - a different revision is available in the remote repository [17d09d16d8]

If you don't mind giving me some guidance on how to use the version with the newest commit, I'll happily test this out! Thanks!

@kubu4
Author

kubu4 commented Jul 20, 2022

Sorry, I should have thought this through first before responding!

So, to use the most current version available, I cloned the repo. Then called the pipeline by specifying path to the cloned repo. In my case, it looks like this:

NXF_VER=20.07.1 /home/shared/nextflow run /home/shared/epidiverse-pipelines/wgbs-current -profile test,docker

Will report back when test run and actual run using some of our data are complete!

@kubu4
Author

kubu4 commented Jul 20, 2022

Well, this is still failing when running the test profile, using the most recent commits.

Confirming that I'm using the most recent version of the repo:

Jul-20 09:30:37.739 [main] INFO  nextflow.cli.CmdRun - Launching `/home/shared/epidiverse-pipelines/wgbs-current/main.nf` [serene_hodgkin] - revision: 62309191e1

Side note: I don't see that revision in the repo commit log, so does that reference something else?

Jul-20 09:37:11.565 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 37; name: WGBS:bam_statistics (sampleA); status: COMPLETED; exit: 25; error: -; workDir: /home/shared/8TB_HDD_01/sam/analyses/wgbs-docker-test/work/b4/226be72a104187348d5a181209d6d3]
Jul-20 09:37:11.567 [Task monitor] INFO  nextflow.processor.TaskProcessor - [b4/226be7] NOTE: Process `WGBS:bam_statistics (sampleA)` terminated with an error exit status (25) -- Error is ignored

Here's what's in that working directory:

sampleA/  sorted.bam  subset.bam@

Both of the BAMs present look good. Here's head of subset.bam:

42339_genome:1006-1163	163	genome	1007	60	111M	=	1053	157	CAAAATCCTCAACCCAATTAAAAAATAATCAACAAAAATCTATCAAATTCAAGAATTCATCAACCAAATATTAAAATCATCAACCAAATCTAAAAATTCATAAACCAAAAC	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
1429881_genome:1029-1153	163	genome	1030	60	111M	=	1043	124	AATAATCAACAAAAATCTATCAAATTCAAAAATTCATCAACCAGATATTAAAATCATCAACCAAATCTAAAAATTCATAAACCAAAACAACAAAATCCTCAACAATAACAA	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
1011412_genome:1030-1390	163	genome	1031	60	111M	=	1280	360	ATAATCAACAAAAATCTATCAAATTCAAAAATTCATCAACCAAATATTAAAATCATCAACCAAATCTAAAAATTCATAAACCAAAACAACAAAATCCTCAACAATAACAAA	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
1429881_genome:1029-1153	83	genome	1043	60	111M	=	1030	-124	AATCTATCAAATTCAAAAATTCATCAACCAAATATTAAAATCATCAACCAAATCTAAAAATTCATAAACCAAAACAACAAAATCCTCAACAATAACAAAATATAATCAAAA	556778899:::;;<<===>>>???@@@AAAABBBCCCCCDDDDDEEEEEEEFFFFFFFFGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
354090_genome:1047-1481	163	genome	1048	60	111M	=	1371	434	ATCAAATTCAAAAATTCATCAACCAAATATTAAAATCATCAACCAAATCTAAAAATTCATAAACCAAAACAACAAAATCCTCAACAATAACAAAATATAATCAAAAAAAAT	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
974355_genome:1052-1545	99	genome	1052	60	111M	=	1434	493	GGTTTGAGGATTTGTTGATTAGGTGTTGGAATTGTTGATTGAGTTTGAGAATTTGTAGATTAGGATGGTGGAATTTTTGATAATGATGAGGTATGGTTGAGGAAAATTTAT	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
42339_genome:1006-1163	83	genome	1053	60	111M	=	1007	-157	ATTCAAGAATTCATCAACCAAATATTAAAATCATCAACCAAATCTAAAAATTCATAAACCAAAACAACAAAATCCTCAACAATAACAAAATATAATCAAAAAAAATCTATC	556778899:::;;<<===>>>???@@@AAAABBBCCCCCDDDDDEEEEEEEFFFFFFFFGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
2218_genome:1080-1237	99	genome	1080	57	111=	=	1126	157	GAATTGTTGATTGAGTTTGAGAATTTGTAGATTAGGATGGTGGAATTTTTGACAATGATGAGGTATGGTTGAGGAAAATTTATTGGGTTTGAGGATTTGTTTATTAGGTGA	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	HI:i:0	NH:i:1	NM:i:0	MD:Z:111	XD:i:22	XF:i:0	XB:Z:F1/CT	YZ:Z:0	RG:Z:sampleA
1152697_genome:1089-1432	99	genome	1089	60	111M	=	1321	343	ATTGAGTTTGAGAATTTGTAGATTAGGATGGTGGAATTTTTGATAATGATGAGGTATGGTTGAGGAAAATTTATTGGGTTTGAGGATTTGTTTATTAGGTGATGGAATTTT	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE
843916_genome:1095-1265	99	genome	1095	60	111M	=	1154	170	TTTGAGAATTTGTAGATTAGGATGGTGGAATTTTTGATAATGATGAGGTATGGTTGAGGAAAATTTATTGGGTTTGAGGATTTGTTTATCAGGTGATGGAATTTTTGATTA	HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGGGFFFFFFFFEEEEEEEDDDDDCCCCCBBBAAAA@@@???>>>===<<;;:::998877655	NM:i:0	NH:i:1	IH:i:1	HI:i:0	XB:Z:XX/XX	RG:Z:sampleA	PG:Z:ERNE

Corresponding stats file in the sampleA subdirectory looks normal, too:

# This file was produced by samtools stats (1.9+htslib-1.9) and can be plotted using plot-bamstats
# This file contains statistics for all reads.
# The command line was:  stats sorted.bam
# CHK, Checksum	[2]Read Names	[3]Sequences	[4]Qualities
# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)
CHK	d731acc5	4ea886af	5d47e054
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN	raw total sequences:	2245803
SN	filtered sequences:	0
SN	sequences:	2245803

Any thoughts?

kubu4 changed the title from "WGBS:bam_statistics always fails" to "WGBS:bam_statistics always fails with Docker profile" on Jul 20, 2022
@kubu4
Author

kubu4 commented Jul 20, 2022

> In the meantime, is it possible for you to test it instead with conda and see if this is resolved there for you?

Tested with conda and the test profile runs without issue.

I've updated the issue name to specifically reflect that the failure is a Docker-specific problem.

With that said, I'll probably just move forward with the conda environment and ditch using Docker (for now).

@bio15anu
Member

My bad, I jumped the gun a bit here as the other issue was closely intertwined - but #7 should at least be fixed for you now as of the last commit 17d09d1?

I managed to isolate the present issue with the Docker container and rebuilt the image now with a fix. Seems like there were a couple of things going on here which obscured the situation somewhat.

To verify, you can add the option -process.container=epidiverse/wgbs:latest when you try running the test profile with Docker/Singularity. If that works for you then I will push a commit to get the new container image incorporated with the default config.
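
Combined with the command used earlier in this thread, the check might look like the sketch below. The wrapper `wgbs_docker_test_cmd` is hypothetical and only prints the command so it can be reviewed before running. The `nextflow` path and the `NXF_VER` pin are copied from the earlier comments and will differ on other systems.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the test command with the container override.
# Nothing is executed; copy the printed line to actually launch the run.
wgbs_docker_test_cmd() {
  local container="${1:-epidiverse/wgbs:latest}"
  echo "NXF_VER=20.07.1 /home/shared/nextflow run epidiverse/wgbs" \
       "-profile test,docker -process.container=${container}"
}

wgbs_docker_test_cmd
```

Passing a different image tag as the first argument swaps the container without touching the pipeline's default config.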

Thanks again for your help here!

@bio15anu
Member

> Sorry, I should have thought this through first before responding!
>
> So, to use the most current version available, I cloned the repo. Then called the pipeline by specifying path to the cloned repo. In my case, it looks like this:
>
> NXF_VER=20.07.1 /home/shared/nextflow run /home/shared/epidiverse-pipelines/wgbs-current -profile test,docker
>
> Will report back when test run and actual run using some of our data are complete!

No worries at all! In fact, it's even easier than that. If you ever want to get the latest version of any Nextflow pipeline hosted on GitHub, you can simply run e.g. nextflow pull epidiverse/wgbs and it will be downloaded automatically to the $HOME/.nextflow/assets directory. After that, you can run it offline.
