
Cleanup of failed folders #80

Closed
spficklin opened this issue Jan 9, 2019 · 7 comments
Labels
bug release-blocker Items that need fixing before the next major release.
@spficklin
Member

spficklin commented Jan 9, 2019

Using the newest GEMmaker, when I run the 475-sample rice dataset the output directory only consumes 1.9 GB of storage. Previously, all of the output would have consumed around 30 TB. So this is great!

However, we still have some problems with the work directory. If GEMmaker fails along the way, leaving partially created files (e.g. SAM/BAM/FASTQ files), and then restarts, it creates a new process directory, so those failed attempts never get cleaned up. Temp files (e.g. temp.*.sam) don't get cleaned up either. So while the output directory holds 1.9 GB, the work directory in my case is still consuming 5 TB.

Note that work directories may be abandoned if any part of GEMmaker fails. For example, if you are running 100 samples and 1 fails for some reason, the working directories for the other 99 samples are potentially abandoned (need to verify this).

Perhaps the solution is to run a 'cleanup' process when the -resume flag is used that looks for incomplete work directories. However, I have no clue how to determine which directories contain results from failed processes.
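One partial answer: Nextflow writes a `.exitcode` file into each task's work directory, so directories whose recorded exit status is non-zero can be found with a simple scan. A sketch only, assuming the standard `work/xx/hash` layout (it will miss tasks that were killed before `.exitcode` was written):

```shell
# Print work directories whose recorded exit status is non-zero.
# Tasks killed before writing .exitcode will not be detected this way.
list_failed_tasks() {
    for f in work/*/*/.exitcode; do
        [ -e "$f" ] || continue        # glob matched nothing
        code=$(cat "$f")
        if [ "$code" != "0" ]; then
            echo "failed: $(dirname "$f") (exit $code)"
        fi
    done
}
```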

@spficklin spficklin changed the title Cleanup of failed folderes Cleanup of failed folders Jan 9, 2019
@spficklin spficklin added this to the Release v1.1 milestone Jan 17, 2019
@spficklin spficklin added the bug label Jan 17, 2019
@bentsherman
Member

bentsherman commented Jan 18, 2019

Nextflow provides two directives, beforeScript and afterScript, which allow you to run arbitrary scripts before and after a process. We could create "pre" and "post" scripts that create an empty marker file in the work directory and then delete it to signify that the process completed:

# pre-script
touch _PROCESS_INCOMPLETE

# execute process...

# post-script
rm _PROCESS_INCOMPLETE

Then it would be easy to scan for orphaned work directories. We could create a cleanup process like you said or we could use the onError handler:

workflow.onError {
  // remove orphaned work directories
}

The latter option would have to be written in Groovy instead of Bash though.
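Put together, the marker-file idea might look roughly like this (an untested sketch; `some_process` is a placeholder, and the onError cleanup assumes every task that completed removed its marker):

```groovy
process some_process {
    beforeScript 'touch _PROCESS_INCOMPLETE'   // mark task as started
    afterScript  'rm -f _PROCESS_INCOMPLETE'   // mark task as finished

    // ... inputs, outputs, script ...
}

workflow.onError {
    // remove any work directory still holding the marker file
    new File("${workflow.workDir}").eachFileRecurse { f ->
        if( f.name == '_PROCESS_INCOMPLETE' )
            f.parentFile.deleteDir()
    }
}
```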

@JohnHadish JohnHadish added the release-blocker Items that need fixing before the next major release. label Aug 27, 2019
@JohnHadish
Collaborator

Possibly helpful references on fixing this problem: Nextflow issue 452 has Paolo considering that this is something that needs to be added to the language, as many people have commented on this issue. This is currently causing a hangup for GEMmaker, which quickly runs out of space when run on a local machine.
The prefetch function often fails with exit status 141, causing a large build-up of half-downloaded SRA files from NCBI that clog up a user's machine. This prevents GEMmaker from being run on a machine without a few spare TB of storage if a user wants to run more than ~100 files or so.

nextflow-io/nextflow#165
nextflow-io/nextflow#19
nextflow-io/nextflow#452

https://www.nextflow.io/blog/2016/error-recovery-and-automatic-resources-management.html

@JohnHadish
Collaborator

The error status 141 is strange, as the ascp step appears to finish properly. Looking at the run logs, the downloads complete successfully:

2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - connected from '134.121.89.157' to www.ncbi.nlm.nih.gov (130.14.29.110) 
2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - verifying CA cert 
2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - connected from '134.121.89.157' to sra-download.ncbi.nlm.nih.gov (130.14.250.26) 
2019-08-27T04:15:48 prefetch.2.9.6: KClientHttpOpen - verifying CA cert 
2019-08-27T04:15:48 prefetch.2.9.6: 1) Downloading 'SRR7006145'...
2019-08-27T04:15:48 prefetch.2.9.6:  Downloading via fasp...
SRR7006145                                      
2019-08-27T04:23:00 prefetch.2.9.6:  fasp download succeed
2019-08-27T04:23:00 prefetch.2.9.6: 1) 'SRR7006145' was downloaded successfully
/opt/aspera/connect/bin/ascp /opt/aspera/connect/bin/ascp -i /opt/aspera/connect/etc/asperaweb_id_dsa.openssh -pQTk1 -k 1 -T -l 1000m [email protected]:data/sracloud/traces/sra62/SRR/006841/SRR7006145 /home/john/Documents/GEMmaker/work/b7/2d24964d5b785a09d3307fd34e27ff/SRR7006145.sra.tmp.32494.tmp

There is no indication that they failed besides the .exitcode of 141.

Example of failed directory:

.....b7/2d24964d5b785a09d3307fd34e27ff$ ll
total 6341336
drwxr-xr-x  2 john john       4096 Aug 26 21:23 ./
drwxr-xr-x 11 john john       4096 Aug 27 03:10 ../
-rw-r--r--  1 john john          0 Aug 26 21:15 .command.begin
-rw-r--r--  1 john john          0 Aug 26 21:15 .command.err
-rw-r--r--  1 john john       1416 Aug 26 21:23 .command.log
-rw-r--r--  1 john john       1416 Aug 26 21:23 .command.out
-rw-r--r--  1 john john       8999 Aug 26 21:15 .command.run
-rw-r--r--  1 john john        240 Aug 26 21:15 .command.sh
-rw-r--r--  1 john john        233 Aug 26 21:23 .command.trace
-rw-r--r--  1 john john          3 Aug 26 21:23 .exitcode
-rw-r--r--  1 john john 6493480127 Apr 15  2018 SRR7006145.sra

@JohnHadish
Collaborator

nextflow-io/nextflow#1126

Potentially the reason why some were failing.

@spficklin
Member Author

exit 141 is commonly a SIGPIPE error, meaning that a process wrote to a pipe after the downstream tool in the piped command had already stopped reading from it. I'm getting 141 errors on other processes too, so I don't think this problem is specific to downloading SRAs. I'm currently wrapping the SRA download in a Python script so we have greater control over problems with it. For example, prefetch will sometimes indicate that the download failed when in fact it has not. I'm not sure why. Maybe some last bit of communication didn't happen correctly.

This may help resolve the 141 here too, as I suspect it may have something to do with the for loop in the bash code.
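For reference, exit status 141 is 128 + 13, where 13 is SIGPIPE. It is easy to reproduce in bash when the downstream command closes the pipe early:

```shell
# 'head' exits after one line, so 'yes' takes SIGPIPE (signal 13) on its
# next write; bash reports that as 128 + 13 = 141 (PIPESTATUS is bash-only).
yes | head -n 1 > /dev/null
echo "yes exited with: ${PIPESTATUS[0]}"
```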

@spficklin
Member Author

Oh, and I should mention that the Python wrapper script I'm writing will clean up the files if it detects a failure :-)

@spficklin
Member Author

We've done all we can to fix this issue. The new Python scripts created for issue #138 will clean up failed SRA download attempts and failed fastq_dump runs. We have processes to do cleanup; we just need to wait on Nextflow to provide more control.
