Configuration execution dynamic VS static #59

Open
cpelley opened this issue Oct 7, 2024 · 0 comments
cpelley commented Oct 7, 2024

[Diagram: dynamic_workflow]

This diagram demonstrates how we might handle the execution of a configuration, given that the model outputs different batches of leadtimes at different times (hence the clock triggers).
We see three forms of execution for a configuration:

  1. Static workloads (in-house scheduler/dask single-machine scheduler/dask distributed)
    • 1.1 Multiple rose tasks per configuration.
      • Executing a subset of a configuration (each subset corresponding to a single leadtime batch).
      • Positives:
        • Each batch can execute on a separate node if desired.
        • Executing a subset of the configuration results in shorter queue times.
        • Little benefit in making dask aware of the memory footprint of each processing module (plugin).
      • Negatives:
        • Still a fixed workflow: we request a set amount of resources, so there is some wastage.
        • Any link between different leadtime batches needs explicit handling (saving data from one batch and loading it in another).
    • 1.2 Single rose task per configuration.
      • Executing our configuration as a single static workflow, single task.
      • Positives:
        • Ideal for basic workflows.
        • Handling leadtime batches is straightforward.
      • Negatives:
        • This is the least efficient use of the resources requested on a platform, since for large parts of the execution we are waiting on the polling clock trigger. That is, we underutilise the resources we requested and waste money.
        • Using these schedulers means being stuck with single-node execution.
        • A configuration that needs more resources than we request will likely require making dask memory-footprint aware. That reduces the likelihood of dask having to spill to disk (on reaching its memory threshold limits). Spilling data to disk makes the computation of a configuration less efficient.
  2. Dynamic workloads (dask-job-queue)
    • Positives:
      • The most efficient form of execution, where we can dynamically scale our cluster based on the workload.
      • Simplest and most flexible configurations (executing everything within a single execution).
      • Easily scale to multiple nodes.
      • Shortest queue times (each worker creation becomes a PBS/SLURM submission).
      • Utilising dask memory-footprint awareness gives total flexibility to use as few or as many resources as we want.
    • Negatives:
      • Exploratory work is required to understand how best to feed dask the anticipated memory footprint of each processing module execution, and potentially the memory footprint of the input data too (spill-to-disk capability).
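
To put a rough number on the wastage described under 1.2, here is a minimal sketch of a single static task that holds its resources from cycle start until the last leadtime batch has been processed. The batch arrival times and processing durations are hypothetical, chosen only for illustration. The comparison assumes that per-batch tasks (1.1) or dynamically scaled workers (2) hold resources only while processing, ignoring queue time and worker start-up.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """One leadtime batch (all values are hypothetical)."""
    arrival: float   # minutes after cycle start when the model output is available
    duration: float  # minutes of processing on the requested resources


def static_single_task_held(batches):
    """Minutes the resources are held when a single task processes every batch.

    The task starts at cycle start and waits on each clock trigger in turn.
    """
    clock = 0.0
    for batch in sorted(batches, key=lambda b: b.arrival):
        # Wait (idle) until the batch is available, then process it.
        clock = max(clock, batch.arrival) + batch.duration
    return clock


def busy_minutes(batches):
    """Minutes actually spent processing, regardless of execution form."""
    return sum(batch.duration for batch in batches)


# Hypothetical cycle: three batches, 30 minutes apart, 10 minutes of work each.
batches = [Batch(0, 10), Batch(30, 10), Batch(60, 10)]

held = static_single_task_held(batches)
busy = busy_minutes(batches)
utilisation = busy / held

print(f"held={held:.0f} min, busy={busy:.0f} min, utilisation={utilisation:.0%}")
```

With these made-up numbers the single task holds its resources for 70 minutes but is only busy for 30, whereas the per-batch and dynamic forms would hold them for roughly the 30 busy minutes. Real figures depend on the actual batch timings, queue times and worker start-up costs.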

Proposed plan/roadmap

Related issues

@cpelley cpelley self-assigned this Oct 7, 2024
@cpelley cpelley changed the title Managing varying workloads Configuration execution dynamic VS static Oct 7, 2024