Configuration execution dynamic VS static #59

cpelley · 2024-10-07T07:32:48Z

This diagram shows how we might execute a configuration, given that the model outputs different batches of leadtimes at different times (hence the clock triggers).
We see three forms of execution for a configuration:

Static workloads (in-house scheduler/dask single-machine scheduler/dask distributed)
- 1.1 Multiple rose tasks per configuration.
  - Executing a subset of a configuration (each subset corresponding to a single leadtime batch).
  - Positives:
    - Each batch can execute on a separate node if desired.
    - Executing a subset of the configuration results in shorter queue times.
    - Little need to make dask aware of the memory footprint of each processing module (plugin).
  - Negatives:
    - Still a fixed workflow: we request a set amount of resources up front, so some of it is wasted.
    - Any link between different leadtime batches needs explicit handling (saving data from one batch and loading it in another).
- 1.2 Single rose task per configuration.
  - Executing our configuration as a single static workflow, single task.
  - Positives:
    - Ideal for basic workflows.
    - Handling leadtime batches is straightforward.
  - Negatives:
    - This is the least efficient use of the resources requested on a platform, since for large parts of the execution the task is waiting on the polling clock trigger. That is, we underutilise the resources we requested and waste money.
    - Using these schedulers means being stuck with single node execution.
    - A configuration that needs more resources than we request will likely require making dask memory-footprint aware, to reduce the likelihood of dask spilling to disk (on reaching memory threshold limits). Spilling to disk would make computing the configuration less efficient.
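
To make the resource-wastage trade-off between options 1.1 and 1.2 concrete, here is a minimal sketch in plain Python. The batch timings, the `Batch` class and the helper names are hypothetical, chosen only for illustration; queue time and task start-up cost are ignored.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    arrival_min: float  # minutes after cycle start when the batch becomes available
    compute_min: float  # minutes of processing the batch needs


def held_minutes_single_task(batches):
    """Option 1.2: one task holds its allocation from first arrival to last finish."""
    ordered = sorted(batches, key=lambda b: b.arrival_min)
    start = ordered[0].arrival_min
    clock = start
    for b in ordered:
        # Idle on the clock trigger if the batch has not arrived yet, then compute.
        clock = max(clock, b.arrival_min) + b.compute_min
    return clock - start


def held_minutes_per_batch_tasks(batches):
    """Option 1.1: each task starts on arrival and holds resources only while computing."""
    return sum(b.compute_min for b in batches)


# Hypothetical cycle: three leadtime batches, 30 minutes apart, 10 minutes of work each.
batches = [Batch(0, 10), Batch(30, 10), Batch(60, 10)]
busy = sum(b.compute_min for b in batches)
single = busy / held_minutes_single_task(batches)
split = busy / held_minutes_per_batch_tasks(batches)
print(f"single task: {single:.0%} utilised, per-batch tasks: {split:.0%} utilised")
```

With these assumed timings the single task holds its allocation for 70 minutes to do 30 minutes of work, while the per-batch tasks hold resources only for the 30 minutes they compute. Real figures depend on how far apart the leadtime batches arrive and on queue times, which this sketch does not model.
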
Dynamic workloads (dask-job-queue)
- Positives:
  - The most efficient form of execution, where we can dynamically scale our cluster based on the workload.
  - Simplest and most flexible configurations (executing everything within a single execution).
  - Easily scale to multiple nodes.
  - Shortest queue times (each worker creation becomes a PBS/SLURM submission).
  - Utilising dask memory footprint awareness gives total flexibility to use as few or as many resources as we want.
- Negatives:
  - Exploratory work is required to understand how best to feed dask the anticipated memory footprint of each processing module's execution, and potentially the memory footprint of the input data too (spill-to-disk capability).
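
As a rough illustration of the dynamic option, the sketch below pairs a footprint-based sizing estimate with an adaptive cluster from the dask-jobqueue package (imported as `dask_jobqueue`). It is a sketch under stated assumptions, not a proposed implementation: `workers_needed` and `start_adaptive_cluster` are hypothetical helpers, the queue name, walltime and memory sizes are placeholders, and the first-fit-decreasing estimate is a stand-in for however anticipated footprints end up being fed to dask. It is not how dask itself schedules.

```python
def workers_needed(footprints_gb, worker_memory_gb):
    """First-fit-decreasing estimate of the workers needed to hold all tasks at once."""
    free = []  # spare memory (GB) on each worker opened so far
    for need in sorted(footprints_gb, reverse=True):
        if need > worker_memory_gb:
            raise ValueError(f"task needs {need} GB but a worker has {worker_memory_gb} GB")
        for i, spare in enumerate(free):
            if need <= spare:
                free[i] = spare - need  # place the task on an existing worker
                break
        else:
            free.append(worker_memory_gb - need)  # open a new worker
    return len(free)


def start_adaptive_cluster(max_workers, worker_memory_gb=16, cores=4):
    """Create a PBS-backed cluster that scales between 0 and max_workers jobs."""
    # Imported lazily so the sizing helper above works without dask installed.
    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    cluster = PBSCluster(
        cores=cores,
        processes=1,                     # one worker per job, so jobs == workers
        memory=f"{worker_memory_gb}GB",
        queue="normal",                  # hypothetical queue name
        walltime="01:00:00",             # hypothetical walltime
    )
    # Each worker dask asks for becomes its own PBS submission.
    cluster.adapt(minimum_jobs=0, maximum_jobs=max_workers)
    return Client(cluster)


# Hypothetical anticipated footprints (GB) for the plugins in one configuration.
limit = workers_needed([10, 6, 6, 4, 4, 2], worker_memory_gb=16)
```

On a PBS platform, `client = start_adaptive_cluster(limit)` would then let dask scale between zero and `limit` jobs as the workload changes. A SLURM site would swap in `SLURMCluster` from the same package.
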

Proposed plan/roadmap

Related issues

INV: Real-time resource recording, usage and feedback mechanism #5 (making dask memory aware)
Multi-node execution #51 (dask-job-queue)

cpelley self-assigned this Oct 7, 2024

cpelley changed the title ~~Managing varying workloads~~ Configuration execution dynamic VS static Oct 7, 2024

cpelley added the decision log label Oct 22, 2024
