
Cleanup of failed folders #80

Closed
spficklin opened this issue Jan 9, 2019 · 7 comments
Labels
bug release-blocker Items that need fixing before the next major release.
@spficklin
Member

spficklin commented Jan 9, 2019

Using the newest GEMmaker, when I run the 475-sample rice dataset the output directory only consumes 1.9 GB of storage. Previously, all of the output would have consumed around 30 TB. So this is great!

However, we still have some problems with the work directory. If GEMmaker fails along the way, leaving partially created files (e.g. SAM/BAM/FASTQ files), and then restarts, it creates a new process directory, so those failed attempts never get cleaned up. Temp files (e.g. temp.*.sam) don't get cleaned up either. So while the output directory holds 1.9 GB, the work directory in my case is still consuming 5 TB.

Note that work directories may be abandoned if any part of GEMmaker fails. For example, if you are running 100 samples and 1 fails for some reason, the working directories for the other 99 samples are potentially abandoned (need to verify this).

Perhaps the solution is to run a 'cleanup' process when the -resume flag is used that looks for incomplete work directories. However, I have no clue how to determine which directories contain results from failed processes.
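One partial answer: Nextflow writes a `.exitcode` file into each task's work directory, so directories whose recorded exit status is non-zero can be found with a simple scan. A sketch only, assuming the standard `work/xx/hash` layout (it will miss tasks that were killed before `.exitcode` was written):

```shell
# Print work directories whose recorded exit status is non-zero.
# Tasks killed before writing .exitcode will not be detected this way.
list_failed_tasks() {
    for f in work/*/*/.exitcode; do
        [ -e "$f" ] || continue        # glob matched nothing
        code=$(cat "$f")
        if [ "$code" != "0" ]; then
            echo "failed: $(dirname "$f") (exit $code)"
        fi
    done
}
```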

@spficklin spficklin changed the title Cleanup of failed folderes Cleanup of failed folders Jan 9, 2019
@spficklin spficklin added this to the Release v1.1 milestone Jan 17, 2019
@spficklin spficklin added the bug label Jan 17, 2019
@bentsherman
Member

bentsherman commented Jan 18, 2019

Nextflow provides two directives, beforeScript and afterScript, which allow you to run arbitrary scripts before and after a process. We could create "pre" and "post" scripts that create an empty marker file in the work directory and then delete it to signify that the process completed:

# pre-script
touch _PROCESS_INCOMPLETE

# execute process...

# post-script
rm _PROCESS_INCOMPLETE

Then it would be easy to scan for orphaned work directories. We could create a cleanup process like you said or we could use the onError handler:

workflow.onError {
  // remove orphaned work directories
}

The latter option would have to be written in Groovy instead of Bash though.
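Put together, the marker-file idea might look roughly like this (an untested sketch; `some_process` is a placeholder, and the onError cleanup assumes every task that completed removed its marker):

```groovy
process some_process {
    beforeScript 'touch _PROCESS_INCOMPLETE'   // mark task as started
    afterScript  'rm -f _PROCESS_INCOMPLETE'   // mark task as finished

    // ... inputs, outputs, script ...
}

workflow.onError {
    // remove any work directory still holding the marker file
    new File("${workflow.workDir}").eachFileRecurse { f ->
        if( f.name == '_PROCESS_INCOMPLETE' )
            f.parentFile.deleteDir()
    }
}
```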

@JohnHadish JohnHadish added the release-blocker Items that need fixing before the next major release. label Aug 27, 2019
@JohnHadish
Collaborator

Possibly helpful references on fixing this problem: Nextflow issue 452 has Paolo considering that this is something that needs to be added to the language, as many people have commented on this issue. This is currently causing a hangup for GEMmaker, which quickly runs out of space when run on a local machine.
The prefetch function often fails with exit status 141, causing a large build-up of half-downloaded SRA files from NCBI that clog up a user's machine. This prevents GEMmaker from being run on a machine without a few spare TB of storage if a user wants to run more than ~100 files or so.

nextflow-io/nextflow#165
nextflow-io/nextflow#19
nextflow-io/nextflow#452

https://www.nextflow.io/blog/2016/error-recovery-and-automatic-resources-management.html

@JohnHadish
Collaborator

The error status 141 is strange, as the ascp step appears to finish properly. Looking at the run logs, the downloads complete successfully:

2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - connected from '134.121.89.157' to www.ncbi.nlm.nih.gov (130.14.29.110) 
2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - verifying CA cert 
2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - connected from '134.121.89.157' to sra-download.ncbi.nlm.nih.gov (130.14.250.26) 
2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - verifying CA cert 
2019-08-27T04:15:48 prefetch.2.9.6: 1) Downloading 'SRR7006145'...
2019-08-27T04:15:48 prefetch.2.9.6:  Downloading via fasp...
SRR7006145                                      
2019-08-27T04:23:00 prefetch.2.9.6:  fasp download succeed
2019-08-27T04:23:00 prefetch.2.9.6: 1) 'SRR7006145' was downloaded successfully
/opt/aspera/connect/bin/ascp /opt/aspera/connect/bin/ascp -i /opt/aspera/connect/etc/asperaweb_id_dsa.openssh -pQTk1 -k 1 -T -l 1000m [email protected]:data/sracloud/traces/sra62/SRR/006841/SRR7006145 /home/john/Documents/GEMmaker/work/b7/2d24964d5b785a09d3307fd34e27ff/SRR7006145.sra.tmp.32494.tmp

There is no indication that they failed besides the .exitcode of 141.

Example of failed directory:

.....b7/2d24964d5b785a09d3307fd34e27ff$ ll
total 6341336
drwxr-xr-x  2 john john       4096 Aug 26 21:23 ./
drwxr-xr-x 11 john john       4096 Aug 27 03:10 ../
-rw-r--r--  1 john john          0 Aug 26 21:15 .command.begin
-rw-r--r--  1 john john          0 Aug 26 21:15 .command.err
-rw-r--r--  1 john john       1416 Aug 26 21:23 .command.log
-rw-r--r--  1 john john       1416 Aug 26 21:23 .command.out
-rw-r--r--  1 john john       8999 Aug 26 21:15 .command.run
-rw-r--r--  1 john john        240 Aug 26 21:15 .command.sh
-rw-r--r--  1 john john        233 Aug 26 21:23 .command.trace
-rw-r--r--  1 john john          3 Aug 26 21:23 .exitcode
-rw-r--r--  1 john john 6493480127 Apr 15  2018 SRR7006145.sra

@JohnHadish
Collaborator

nextflow-io/nextflow#1126

Potentially the reason why some were failing.

@spficklin
Member Author

exit 141 is commonly a SIGPIPE error, meaning that a process wrote to a pipe after the downstream tool in the piped command had already stopped reading from it. I'm getting 141 errors on other processes too, so I don't think this problem is specific to downloading SRAs. I'm currently wrapping the SRA download in a Python script so we have greater control over problems with it. For example, prefetch will sometimes indicate that the download failed when in fact it has not. I'm not sure why. Maybe some last bit of communication didn't happen correctly.

This may help resolve the 141 here too, as I suspect it may have something to do with the for loop in the bash code.
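For reference, exit status 141 is 128 + 13, where 13 is SIGPIPE. It is easy to reproduce in bash when the downstream command closes the pipe early:

```shell
# 'head' exits after one line, so 'yes' takes SIGPIPE (signal 13) on its
# next write; bash reports that as 128 + 13 = 141 (PIPESTATUS is bash-only).
yes | head -n 1 > /dev/null
echo "yes exited with: ${PIPESTATUS[0]}"
```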

@spficklin
Member Author

Oh, and I should mention that the Python wrapper script I'm writing will clean up the files if it detects a failure :-)

@spficklin
Member Author

We've done all we can to fix this issue. The new Python scripts created for issue #138 will clean up failed SRA download attempts and failed fastq_dump runs. We have processes to do cleanup; we just need to wait on Nextflow to provide more control.
