Replace file system as source for status data to enable at scale compute tasks #1245

Closed
@PeteClapham

New feature

Hi! Thanks for using Nextflow and submitting the proposal
for a new feature or the enhancement of an existing functionality.

Please replace this text providing a short description of your
proposal.

Usage scenario

(What's the main usage case and the deployment scenario addressed by this proposal)

Large-scale compute workloads that run in parallel (e.g. 10,000 tasks running concurrently) currently create a very large number of small data files, which are required to track workflow status and enable a workflow restart.

As parallel workloads scale across large clusters or cloud environments, this creates data management issues, IO bottlenecks and lock contention, all of which impede data analysis at scale.

Suggest implementation

(Highlight the main building blocks of a possible implementation and/or related components)

By managing status data in a database, ideally a resilient one, the impact of these many small files would be significantly reduced and the status data would be managed in a single service location.

Various databases support resilient infrastructure, e.g. MongoDB, MySQL, PostgreSQL etc. Would it be reasonable to push status data into one such backend, and ideally manage connection pooling to remove or reduce the overhead of establishing new connections?
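
To make the idea concrete, here is a minimal sketch of what a database-backed status store with pooled connections could look like. It is not Nextflow code and does not reflect how Nextflow stores its state: the `task_status` table, the `ConnectionPool` class and the function names are all hypothetical, and SQLite from the Python standard library stands in for a resilient backend such as PostgreSQL or MySQL purely so the example is self-contained.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool: connections are opened once and then reused."""

    def __init__(self, db_path, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be borrowed
            # by a different thread than the one that created it.
            conn = sqlite3.connect(db_path, check_same_thread=False, timeout=30)
            self._pool.put(conn)

    @contextmanager
    def connection(self):
        conn = self._pool.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._pool.put(conn)  # hand it back instead of closing it


def init_schema(pool):
    # Hypothetical schema: one row per task instead of one file per task.
    with pool.connection() as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS task_status ("
            " task_id TEXT PRIMARY KEY,"
            " status TEXT NOT NULL,"
            " exit_code INTEGER)"
        )
        conn.commit()


def record_status(pool, task_id, status, exit_code=None):
    # Insert the row, or overwrite it if the task already has one.
    with pool.connection() as conn:
        conn.execute(
            "INSERT OR REPLACE INTO task_status (task_id, status, exit_code)"
            " VALUES (?, ?, ?)",
            (task_id, status, exit_code),
        )
        conn.commit()


def completed_tasks(pool):
    # On restart, a single query replaces scanning many small status files.
    with pool.connection() as conn:
        rows = conn.execute(
            "SELECT task_id FROM task_status"
            " WHERE status = 'COMPLETED' AND exit_code = 0"
        ).fetchall()
    return {row[0] for row in rows}
```

With a pool created via `ConnectionPool("status.db")` and `init_schema(pool)` called once, each task would call `record_status(pool, task_id, "COMPLETED", 0)` on completion, and a restart would call `completed_tasks(pool)` to decide what to skip. A production backend would also need to address authentication, failover and write throughput, which this sketch leaves out.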
