Description
New feature
Usage scenario
Large-scale compute workloads that run in parallel (e.g. 10,000 tasks running concurrently) currently create many small data files, which are required to track workflow status and enable a workflow restart.
As parallel workloads scale across large clusters or cloud environments, these files cause data management problems, I/O bottlenecks and lock contention, all of which impede data analysis at scale.
Suggest implementation
Managing status data in a database, ideally a resilient one, would significantly reduce the impact of these many small files and consolidate the data in a single service location.
Various databases support resilient deployments, e.g. MongoDB, MySQL and PostgreSQL. Would it be reasonable to push the status data into one such backend, ideally with connection pooling to remove or reduce the overhead of establishing new connections?
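To make the idea concrete, here is a minimal sketch of what I have in mind. It is not how Nextflow works today and not a proposed API. Everything in it is an assumption made for illustration: the `task_status` table, the `ConnectionPool` class and the helper functions are hypothetical names, and SQLite from the Python standard library stands in for a resilient backend such as PostgreSQL or MySQL.

```python
import os
import queue
import sqlite3
import tempfile
from contextlib import contextmanager

# Hypothetical schema: one row per task instead of one small file per task.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_status (
    task_hash TEXT PRIMARY KEY,
    name      TEXT NOT NULL,
    status    TEXT NOT NULL,
    exit_code INTEGER
)
"""


class ConnectionPool:
    """Tiny fixed-size pool: connections are opened once and then reused."""

    def __init__(self, database, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used by any thread.
            self._pool.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._pool.get()  # blocks until a connection is free
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self._pool.put(conn)  # hand the connection back for reuse

    def close(self):
        # Close every connection currently idle in the pool.
        while not self._pool.empty():
            self._pool.get_nowait().close()


def init_store(pool):
    with pool.connection() as conn:
        conn.execute(SCHEMA)


def record_status(pool, task_hash, name, status, exit_code=None):
    # Insert the task, or overwrite its row if the hash is already known.
    with pool.connection() as conn:
        conn.execute(
            "INSERT OR REPLACE INTO task_status "
            "(task_hash, name, status, exit_code) VALUES (?, ?, ?, ?)",
            (task_hash, name, status, exit_code),
        )


def completed_hashes(pool):
    # Tasks that a restarted workflow could skip.
    with pool.connection() as conn:
        rows = conn.execute(
            "SELECT task_hash FROM task_status "
            "WHERE status = 'COMPLETED' AND exit_code = 0"
        ).fetchall()
    return {row[0] for row in rows}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pool = ConnectionPool(os.path.join(tmp, "status.db"), size=4)
        init_store(pool)
        record_status(pool, "ab/12cd", "align (1)", "RUNNING")
        record_status(pool, "ab/12cd", "align (1)", "COMPLETED", 0)
        record_status(pool, "ef/34gh", "align (2)", "FAILED", 1)
        print(completed_hashes(pool))
        pool.close()
```

On restart, a single query like `completed_hashes` would replace scanning thousands of small status files, and the pool keeps the number of open connections fixed no matter how many tasks report in. With a real server-side database the same shape should apply, using that database's own driver and pooling support.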