
Replace file system as source for status data to enable at scale compute tasks #1245

Closed
PeteClapham opened this issue Jul 24, 2019 · 8 comments

@PeteClapham

New feature


Usage scenario


Large-scale compute tasks that run in parallel (e.g. 10,000 tasks running concurrently) currently create many small data files that are required to track workflow status and enable a workflow restart.

As parallel workloads scale over large clusters or cloud environments, this creates data-management issues, I/O bottlenecks and lock contention, all of which impede data analysis at scale.

Suggested implementation


Managing status data within a database, ideally one with a resilient architecture, would significantly reduce the impact of these many small files and consolidate workflow state in a single service location.

Various databases support resilient deployments, e.g. MongoDB, MySQL, PostgreSQL. Would it be reasonable to push status data into one such backend and, ideally, manage connection pooling to remove or reduce the overhead of repeatedly establishing new connections?
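
For illustration, a minimal sketch of what such a backend could look like, assuming a PostgreSQL instance, the HikariCP pooling library, and a hypothetical task_status table (none of which exist in Nextflow today):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/**
 * Hypothetical sketch: record task status in PostgreSQL through a shared
 * connection pool instead of writing per-task status files to shared storage.
 * Table name and columns are illustrative only.
 */
public class TaskStatusStore implements AutoCloseable {

    private static final String UPSERT_SQL =
        "INSERT INTO task_status (run_id, task_id, state, exit_code, updated_at) " +
        "VALUES (?, ?, ?, ?, now()) " +
        "ON CONFLICT (run_id, task_id) DO UPDATE " +
        "SET state = EXCLUDED.state, exit_code = EXCLUDED.exit_code, updated_at = now()";

    private final HikariDataSource pool;

    public TaskStatusStore(String jdbcUrl, String user, String password) {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl(jdbcUrl);      // e.g. jdbc:postgresql://db-host/nextflow
        cfg.setUsername(user);
        cfg.setPassword(password);
        cfg.setMaximumPoolSize(20);   // bounded pool: no per-task connection setup cost
        this.pool = new HikariDataSource(cfg);
    }

    /** Upsert the state of a single task; called instead of touching a status file. */
    public void recordState(String runId, String taskId, String state, Integer exitCode)
            throws SQLException {
        try (Connection conn = pool.getConnection();
             PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setString(1, runId);
            ps.setString(2, taskId);
            ps.setString(3, state);
            if (exitCode == null) ps.setNull(4, java.sql.Types.INTEGER);
            else ps.setInt(4, exitCode);
            ps.executeUpdate();
        }
    }

    @Override
    public void close() {
        pool.close();
    }
}
```

With a pool like this, 10,000 concurrent tasks would share a small, fixed number of database connections rather than each hitting the file system independently.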

@pditommaso
Member

Using a DB would not remove the need to write the job script (and wrapper) to the file system; otherwise, how would you submit it to the batch scheduler?

@PeteClapham
Author

PeteClapham commented Jul 26, 2019 via email

@pditommaso
Member

Hi Pete, for what files are you suggesting a database: command input/output files, Nextflow control files (i.e. .command.*), or both?

@PeteClapham
Author

During the course of a job run, the TES and applications provide information about the current state of the job in flight. This enables restarts and workflow/job management, which is both a core component of NF and also essential at scale.

With many jobs in flight concurrently, writing the individual job state files to disk increases the load on the backend storage systems to the point where they can no longer respond to the jobs themselves in a timely manner. This in turn limits the number of jobs that can run and significantly increases the number of files the storage services need to support.

Comparison examples include ehive, which maintains its job state within an SQL database, and Cromwell, which also provides the option to store state in an SQL backend, e.g. https://cromwell.readthedocs.io/en/stable/Configuring/ (search for SQL).

I hope this helps to separate the need to keep binaries and general reference files on disk from the desire to keep job state in a backend database, either as the default or as an option.
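
For illustration, a minimal sketch of how a resume could read prior task states back from the same hypothetical task_status table in a single query, rather than scanning thousands of per-task files on shared storage:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: on restart, load prior task states from the database
 * instead of stat-ing per-task status files. Assumes the illustrative
 * task_status table from the previous sketch.
 */
public class ResumeStateLoader {

    private static final String SELECT_SQL =
        "SELECT task_id, state, exit_code FROM task_status WHERE run_id = ?";

    /** Returns task_id -> exit code for tasks that completed successfully. */
    public static Map<String, Integer> loadCompleted(Connection conn, String runId)
            throws SQLException {
        Map<String, Integer> completed = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(SELECT_SQL)) {
            ps.setString(1, runId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    if ("COMPLETED".equals(rs.getString("state")) && rs.getInt("exit_code") == 0) {
                        completed.put(rs.getString("task_id"), rs.getInt("exit_code"));
                    }
                }
            }
        }
        return completed;
    }
}
```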

Thanks
Pete

@pditommaso
Member

Hi Pete, thanks for your reply and for bringing in the experience of other projects, but I still don't see exactly how a DB fits into the NF model.

In your proposal, what would be stored in the DB: command input/output files, Nextflow control files (i.e. .command.*), or both?

@danielecook

Hey @pditommaso -

What about a way of picking up from an intermediate step even when work directories have been removed? This would be quite useful if you are continually processing data for a large analysis but need to clean up scratch space.

For example:

trim_fastqs → align → merge → output bam

This is not a trivial feature, but the idea is that nextflow would retain a database of storeDir files with information on their data lineage, and use this before checking workdirs.

When rerunning the pipeline, during DAG construction, nextflow could check whether an output file had previously been run through all the steps rather than relying on work directories.

Another benefit to this approach: You could develop a way for nextflow to automatically clean up work directories once a storeDir file has been successfully created.
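
A minimal sketch of the lineage idea, with hypothetical names and an in-memory map standing in for whatever persistent store Nextflow would actually use:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: map each storeDir output to the process and input hash
 * that produced it, so a rerun can decide a step is up to date even after its
 * work directory has been deleted. Names are illustrative only.
 */
public class LineageIndex {

    /** One record per published output file. */
    public record LineageRecord(String outputPath, String processName, String inputsHash) {}

    private final Map<String, LineageRecord> byOutputPath = new ConcurrentHashMap<>();

    /** Called when a storeDir file is successfully published. */
    public void register(LineageRecord record) {
        byOutputPath.put(record.outputPath(), record);
    }

    /**
     * During DAG construction: if the output already exists with the same
     * process and input hash, the task can be skipped and its work directory
     * could safely have been cleaned up.
     */
    public boolean isUpToDate(String outputPath, String processName, String inputsHash) {
        return Optional.ofNullable(byOutputPath.get(outputPath))
            .map(r -> r.processName().equals(processName) && r.inputsHash().equals(inputsHash))
            .orElse(false);
    }
}
```

A persistent version of this index is what would allow work directories to be cleaned up automatically once the storeDir copy has been registered.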

@pditommaso
Member

@danielecook Yes, that would be quite useful. There's an open issue for that, #452, even though it's more complicated than expected.

@stale

stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 27, 2020
@pditommaso pditommaso added stale and removed wontfix labels Apr 27, 2020
@stale stale bot closed this as completed Jun 27, 2020