Implement Work Partitioner V2 #17111

glyh · 2025-04-30T09:41:00Z

This is part of the Rework Snark Worker project. It makes the Snark work coordinator issue finer-grained jobs, so that more parallelism is possible.

Although this PR is a no-op on its own, it affects a later PR where we switch the RPC logic completely. If this code is buggy, CI will fail on the later PR that actually makes the switch.

Mechanism

Basics

The new layer is called a Work_partitioner. It sits between a Work_selector and a Snark_worker. Its duties are:

After receiving a request from a Snark_worker, try to request some work from the Work_selector.
- If the work is "too big", break it into smaller jobs (defined here: https://github.com/MinaProtocol/mina/blob/corvo/implement-work-partitioner/src/lib/snark_work_lib/partitioned.ml), store most of them in the partitioner's state, and issue only one small job to the snark worker.
- On top of that, when a snark worker requests work, the partitioner should query its own state first, and only query the underlying Work_selector if there are no "small jobs" to issue.
After receiving a response from a Snark_worker:
- If it is a small job, try to combine it with the other small jobs. Once the combination is complete, submit the job. The submitted job should have the same shape as the one originally requested from the underlying Work_selector.
- If it is not a small job (some jobs distributed by the Work_selector are already small enough), just return it to the Work_selector.
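
The request side of the flow above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the types `small_job` and `partitioner`, the function `request_work`, and the `select` callback standing in for the Work_selector are all hypothetical and are not the actual code in this PR. A full job is modelled as a plain list of payloads.

```ocaml
(* Hypothetical, simplified types: a "full" job from the selector is modelled
   as a list of small-job payloads. This is not the real Mina API. *)
type small_job = { parent_id : int; index : int; payload : string }

type partitioner =
  { pending : small_job Queue.t (* small jobs stored but not yet issued *)
  ; mutable next_id : int (* id for the next full job taken from the selector *)
  }

let create () = { pending = Queue.create (); next_id = 0 }

(* [select] stands in for the underlying Work_selector: it returns the pieces
   of the next full job, or None when there is no work. *)
let request_work (t : partitioner) ~(select : unit -> string list option) :
    small_job option =
  match Queue.take_opt t.pending with
  | Some job ->
      (* Serve from our own state first. *)
      Some job
  | None -> (
      (* Nothing on hand: only now query the selector. *)
      match select () with
      | None | Some [] ->
          None
      | Some (first :: rest) ->
          let parent_id = t.next_id in
          t.next_id <- t.next_id + 1 ;
          (* Store all but the first piece, and issue the first one now. *)
          List.iteri
            (fun i payload ->
              Queue.add { parent_id; index = i + 1; payload } t.pending )
            rest ;
          Some { parent_id; index = 0; payload = first } )
```

With a selector that yields one two-piece job, the first call issues piece 0 and stores piece 1, and the second call is served from the stored state without touching the selector again.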

Small jobs

A "small job" is smaller than the original job provided by a Work_selector. There are 2 cases:

The Work_selector provides a "`Two" in a One_or_two. In this case, the partitioner should distribute the 2 specs separately.
The Work_selector distributes a Zkapp command. In this case, the partitioner should distribute all base snark jobs, and the merge jobs between them, separately.
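
A minimal sketch of the two cases, with assumptions stated up front: the `which` tag and both function names are hypothetical, and the job count assumes that the proofs of a command's segments are folded into one proof by pairwise merges, so `n` segments need `n` base jobs and `n - 1` merge jobs.

```ocaml
type 'a one_or_two = [ `One of 'a | `Two of 'a * 'a ]

(* Hypothetical tag recording where a single spec came from, so that the
   results can be paired up again later. *)
type which = [ `Only | `First | `Second ]

(* Case 1: a `Two is distributed as two independent single specs. *)
let split_one_or_two (w : 'a one_or_two) : (which * 'a) list =
  match w with
  | `One a ->
      [ (`Only, a) ]
  | `Two (a, b) ->
      [ (`First, a); (`Second, b) ]

(* Case 2: number of (base, merge) jobs for a command with [segments]
   segments, assuming pairwise merging down to a single proof. *)
let zkapp_job_counts ~segments =
  if segments <= 0 then invalid_arg "zkapp_job_counts" ;
  (segments, segments - 1)
```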

Design Issue

We don't want to break other parts of the system, most notably the GraphQL APIs, so we bolt a new layer on top of the Work_selector. Ideally, in a future refactor, this layer should be unified with the Work_selector.
This layer doesn't handle reissuing a full job (from the Work_selector's perspective), because the Work_selector already has a mechanism for that. We only redistribute "small jobs".
It's assumed that we don't care about the order in which a zkapp command's base snarks / merge snarks are performed.

RPCs

Use of assertion when merging 2 Single work to a One_or_two

The function merge_to_one_result_exn here uses assertions. The assumptions should hold unless there's a bug in the code. Still, this is worth some attention.
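
To make the trade-off concrete, here is a sketch of the two styles. It is not the body of merge_to_one_result_exn: the record types, the function names `merge_two_singles_exn` and `merge_two_singles`, and the specific invariants checked (same prover, same fee) are assumptions chosen for illustration.

```ocaml
(* Hypothetical result records; the real result types carry more data. *)
type single_result = { prover : string; fee : int; proof : string }

type combined_result =
  { c_prover : string; c_fee : int; proofs : string * string }

(* Assertion style: raises Assert_failure if an assumed invariant is broken. *)
let merge_two_singles_exn (a : single_result) (b : single_result) :
    combined_result =
  assert (String.equal a.prover b.prover) ;
  assert (Int.equal a.fee b.fee) ;
  { c_prover = a.prover; c_fee = a.fee; proofs = (a.proof, b.proof) }

(* Alternative: surface the same checks as a result the caller must handle. *)
let merge_two_singles (a : single_result) (b : single_result) :
    (combined_result, string) result =
  if not (String.equal a.prover b.prover) then Error "prover mismatch"
  else if not (Int.equal a.fee b.fee) then Error "fee mismatch"
  else Ok { c_prover = a.prover; c_fee = a.fee; proofs = (a.proof, b.proof) }
```

The assertion version keeps call sites short, but a violated invariant raises at runtime. The result version pushes the decision of what to do with a mismatch to the caller.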

Several failwith when unwrapping errors in work partitioner

See here.

glyh · 2025-04-30T12:40:51Z

For now I haven't designed any prioritization here. Should I? We do have some prioritization, because we never query the underlying selector while we have pending zkapp commands / single specs on hand. But that's not full prioritization, because there's no priority among the zkapp commands themselves.

EDIT:

I'm using a FIFO queue for the zkapp command pool. So in theory, completion of earlier-arriving commands is prioritized, rather than prioritizing by completion percentage. This should be fine as long as the expected completion time for each zkapp command is roughly the same.

If we instead prioritize by completion rate, e.g. sort by the number of jobs left to complete, there are likely to be starvation issues. That said, starvation would only occur among already pending zkapp commands rather than across all work in the pool, since we prioritize any pending zkapp command over any unpartitioned work.

Here's the background thread. I guess this needs input from @georgeee.
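
A rough sketch of the FIFO behaviour described above, under assumptions: the `pending` record and `issue_fifo` are hypothetical names, and each pending command is reduced to a list of jobs that are ready to issue right now. The pool is walked in arrival order, and the first command with a ready job wins.

```ocaml
(* Hypothetical model: a pending command is a name plus the small jobs that
   are ready to be issued right now. *)
type pending = { name : string; mutable ready : string list }

(* Walk the pool from earliest to latest arrival and issue from the first
   command that has a ready job. Returns None when no pending command can
   issue anything, so the caller can fall back to other sources. *)
let issue_fifo (pool : pending Queue.t) : (string * string) option =
  Queue.fold
    (fun acc p ->
      match (acc, p.ready) with
      | Some _, _ | None, [] ->
          (* Already issued, or this command has nothing ready: move on. *)
          acc
      | None, job :: rest ->
          p.ready <- rest ;
          Some (p.name, job) )
    None pool
```

Sorting by jobs left instead would amount to replacing the arrival-order walk with a walk over the commands ordered by remaining job count, which is where the starvation concern comes from.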

glyh · 2025-05-12T09:49:09Z

Thanks @georgeee for drafting this graph during our session :)

glyh requested a review from a team as a code owner April 30, 2025 09:41

glyh changed the title ~~Implement Work Partitioner~~ Implement Work Partitioner V2 Apr 30, 2025

glyh mentioned this pull request Apr 30, 2025

PR Train for Project Rework Snark Worker #17083

Open

glyh force-pushed the corvo/refactor-snark-worker-2 branch from b990aed to 2d90d4a Compare May 1, 2025 04:08

glyh force-pushed the corvo/implement-work-partitioner-3 branch from 5e80f8a to 29325da Compare May 1, 2025 08:56

glyh force-pushed the corvo/refactor-snark-worker-2 branch from ff0f1a9 to 40580a1 Compare May 3, 2025 10:18

glyh force-pushed the corvo/implement-work-partitioner-3 branch from e519fd0 to 5fe965c Compare May 3, 2025 10:20

glyh force-pushed the corvo/refactor-snark-worker-2 branch 2 times, most recently from b45b9a6 to 21b678d Compare May 6, 2025 03:43

glyh added 4 commits May 6, 2025 11:47

Work_partitioner: set up library

dba44c6

(Work Partitioner) Add module Id_generator to Work_partitioner

06693b6

(Work Partitioner) Add functor Job_pool

139b539

Snark Worker, Work Partitioner: Factor out common function extract_zkapp_segment_works that's used by both Snark_worker and Work_partitioner to Work_partitioner.Snark_worker_shared

a75bd0f

glyh force-pushed the corvo/implement-work-partitioner-3 branch from 5fe965c to 08b2b67 Compare May 6, 2025 03:50

glyh added 15 commits May 6, 2025 11:52

Work Partitioner: add Work_partitioner.Pending_zkapp_command

f89b177

Work Partitioner: Add {Zkapp_command_job_pool, Sent_job_pool}

f116bdb

(Work Partitioner) Add Mergable_single_work

54a7b94

Work Partitioner: add Work_partitioner.t and corresponding initializer

9f530d0

Work Partitioner: Add Work_partitioner.reissue_old_task

ae7c6e6

Work Partitioner: Add Work_partitioner.issue_from_zkapp_command_work_pool

1952c76

Work Partitioner: Add Work_partitioner.Work_partitioner.{issue_from_tmp_slot, convert_single_work_from_selector, issue_job_from_partitioner}

d976deb

Work Partitioner: Add function Work_partitioner.consume_job_from_selector

87c7314

Work Partitioner: Implement {request_from_selector_and_consume_by_partitioner, request_partitioned_work}

51920ed

Work Partitioner: add Work_partitioner.submit_result

77d72e9

Work Partitioner: implement submit_single

f49b4d4

Work Partitioner: implement submit_into_pending_zkapp_command

cd50d63

Work Partitioner: implement submit_partitioned_work

9d26f6b

Work Partitioner: Refactor Job_pool.{find_first_ready -> fold_until}

7bfb930

FIX(Work Partitioner): go through all pending zkapp command to issue a work before falling back to other means

8c501fe

FIX(Work Partitioner): we should prioritize completion of work on hand, so issue from pending zkapp command should be prioritized over issuing from tmp slot

daa37a2

glyh force-pushed the corvo/implement-work-partitioner-3 branch from 08b2b67 to daa37a2 Compare May 6, 2025 04:00

glyh added 2 commits May 12, 2025 19:34

Work Partitioner: remove recycle mechanism in ID generator and use a recuring series of int64, warns on overflow now

a348a99

Work Partitioner: accept timeouted SNARK worker to submit a proof

866ae90

glyh closed this May 20, 2025
