Skip to content

Implement Work Partitioner V2 #17111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

glyh
Copy link
Member

@glyh glyh commented Apr 30, 2025

This is part of the project Rework Snark Worker, intended to make job issuing from the Snark work coordinator issue finer-grained jobs, so that more parallelism is possible.

This, although is indeed a noop, will affect later PR where we switch the RPC logic completely. If this is buggy, CI will fail on later PR that actually does the switch.

Mechanism

Basics

The new layer is called a Work_partitioner. It's in function between a Work_selector and a Snark_worker. It's duty is:

  • After receiving requests from Snark_worker, try to request the work_selector to get some work to do.
  • After receiving responses from Snark_worker,
    • Try to combine them together if they're indeed small jobs. And upon completion on the combination. Submit the job. This job should have same shape as it originally requested from an underlying Work_selector.
    • If it's not a small job(some jobs distributed by the work_selector) are already small enough, just return it to the work_selector.

Small jobs

A "small job" is smaller than the original job provided by a Work_selector, in 2 possibilities:

  • The Work_selector provide a "`Two" in a One_or_two, in this case, partitioner should distribute 2 specs separately
  • The Work_selector distribute a Zkapp command, in this case, partitioner should distribute all base snark jobs, and merge jobs between them separately

Design Issue

  • We don't want to not break other parts of the system, most notably GraphQL APIs. We bolt on a new layer atop the Work_selector. Ideally, when there's refactor, we should unite with the Work_selector.

  • This layer, doesn't take care of the work reissue of a full job in the Work_selector's perspective, this is because they already have mechanism to do so there. So we only redistribute "small jobs".

  • It's assumed that we don't care the order of zkapp command base snarks / merge snarks being performed.

RPCs

Use of assertion when merging 2 Single work to a One_or_two

In the function merge_to_one_result_exn here, we are using assertion. The assumptions should be true unless there's bug in code. Still, this worth some attention.

Several failwith when unwrapping errors in work partitioner

See here.

@glyh glyh requested a review from a team as a code owner April 30, 2025 09:41
@glyh glyh changed the title Implement Work Partitioner Implement Work Partitioner V2 Apr 30, 2025
@glyh
Copy link
Member Author

glyh commented Apr 30, 2025

For now I haven't design priority here. Should I? We kinda have prioritization because we'll never query underlying selector if we ever have pending zkapp commands/single spec on hand. But that's not full priority because there's no priority with in between the zkapp commands.

EDIT:

I'm using a FIFO queue for zkapp command pools. So in theory completion of early arrival commands is prioritized instead of based on completion percentage. This should be fine as long as the expected completion time for each zkapp command is roughly the same.

If we instead prioritize based on completion rate, e.g. we sort by the number of jobs left to complete, there's likely some starving issues. Although we only starve across already pending zkapp commands rather than all works in the pool. Since we prioritize any pending zkapp commands to any unpartitioned work.

Here's the background thread. And I guess this need input from @georgeee.

@glyh glyh force-pushed the corvo/refactor-snark-worker-2 branch from b990aed to 2d90d4a Compare May 1, 2025 04:08
@glyh glyh force-pushed the corvo/implement-work-partitioner-3 branch from 5e80f8a to 29325da Compare May 1, 2025 08:56
@glyh glyh force-pushed the corvo/refactor-snark-worker-2 branch from ff0f1a9 to 40580a1 Compare May 3, 2025 10:18
@glyh glyh force-pushed the corvo/implement-work-partitioner-3 branch from e519fd0 to 5fe965c Compare May 3, 2025 10:20
@glyh glyh force-pushed the corvo/refactor-snark-worker-2 branch 2 times, most recently from b45b9a6 to 21b678d Compare May 6, 2025 03:43
@glyh glyh force-pushed the corvo/implement-work-partitioner-3 branch from 5fe965c to 08b2b67 Compare May 6, 2025 03:50
…d, so issue from pending zkapp command should be prioritized over issuing from tmp slot
@glyh glyh force-pushed the corvo/implement-work-partitioner-3 branch from 08b2b67 to daa37a2 Compare May 6, 2025 04:00
@glyh
Copy link
Member Author

glyh commented May 12, 2025

Thanks @georgeee for drafting this graph during our session :)

image

@glyh glyh closed this May 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant