
Add Anchor Processor #57

Merged — 14 commits merged into sigp:unstable on Dec 9, 2024

Conversation

dknopik
Member

@dknopik dknopik commented Nov 27, 2024

Issue Addressed

Proposed Changes

Add a "processor" component, inspired by the beacon processor from lighthouse. It differs from Lighthouses' processor:

  • Multiple input channels instead of a single input channel that is split into queues (a rough sketch of this layout is included below).
  • No reprocessing queues. These could be added, of course, but I think we should avoid them if possible and instead design around synchronisation primitives such as Notify.
  • No "overflowing queues", which drop existing items instead of logging an error. These can be implemented later if needed.

Additional Info

I've abandoned efforts to create a common base processor module used by both Lighthouse and Anchor, as this would basically replace one kind of glue with another. Also, Lighthouse has a lot of specialised behaviour that would be tough to abstract into a common module.

This is only a preliminary version we can use for development - we will of course need to reconsider and/or fine-tune later when we can actually test performance.

This is a draft; things like metrics are still missing.

@Zacholme7
Member

Zacholme7 commented Nov 27, 2024

My very initial thoughts

  1. I like the distinction between async and blocking tasks, but I'm not sure about the WorkItem actually containing the function to be executed. In my mind, a WorkItem is just a signal of something that has to be done. It is just message + state. It is then up to the processor to handle the correct routing of that item to the thing that will fulfill it - something like registering a handler for each item. Also, we would have to Box::pin all the async functions, which might get a little icky, and there's some overhead with dynamic dispatch.
  2. Following up on 1, maybe we should consider whether we want the processor to act as more of a central executor or a manager. As a central executor, the processor receives messages from the channels and then executes them. Maybe it makes sense for the processor to act as more of a manager with a set of workers it coordinates/checks on, while the workers actually receive the messages and do the heavy lifting. Just something I thought of on the fly; it may have no relevance or not be practical at all.

Other than those two high-level thoughts, I like it. The priority via `biased` is nice. It is also difficult to design this without knowing the full workload right now. Interested to see what others think too.

@dknopik
Member Author

dknopik commented Nov 27, 2024

Thank you for looking through the PR :)

  1. I like the distinction between async and blocking tasks, but I'm not sure about the WorkItem actually containing the function to be executed. In my mind, a WorkItem is just a signal of something that has to be done. It is just message + state. It is then up to the processor to handle the correct routing of that item to the thing that will fulfill it - something like registering a handler for each item. Also, we would have to Box::pin all the async functions, which might get a little icky, and there's some overhead with dynamic dispatch.

This is one of the aspects of the beacon processor I kept in. I like it because it saves us infrastructure (having to register a handler for each queue) and allows us to easily submit work items that are different but belong to the same priority category (= queue).
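
For illustration, a minimal sketch of such a closure-carrying work item - the type and field names are hypothetical, not the PR's actual definitions:

```rust
use tokio::sync::mpsc;

// Hypothetical sketch: a work item is just a name plus the closure to run,
// so different kinds of work can share the same priority queue without the
// processor registering a handler per message type.
pub struct WorkItem {
    pub name: &'static str,
    pub func: Box<dyn FnOnce() + Send>,
}

fn submit_example(tx: &mpsc::Sender<WorkItem>, data: Vec<u8>) {
    let item = WorkItem {
        name: "sign_attestation",
        func: Box::new(move || {
            // The closure captures whatever state it needs (`data` here),
            // so the processor does not need to know the concrete task type.
            let _ = data.len();
        }),
    };
    // Errors (queue full / closed) are ignored for brevity in this sketch.
    let _ = tx.try_send(item);
}
```

Submitting, say, a gossip-verification task instead would only change the name and the closure body; the queue and the processor stay the same.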

  2. Following up on 1, maybe we should consider whether we want the processor to act as more of a central executor or a manager. As a central executor, the processor receives messages from the channels and then executes them. Maybe it makes sense for the processor to act as more of a manager with a set of workers it coordinates/checks on, while the workers actually receive the messages and do the heavy lifting. Just something I thought of on the fly; it may have no relevance or not be practical at all.

What do you mean by worker? Basically a thread that continually runs and receives messages from the manager, servicing a single queue?

@Zacholme7
Member

Zacholme7 commented Nov 27, 2024

This is one of the aspects of the beacon processor I kept in. I like it because it saves us infrastructure (having to register a handler for each queue) and allows us to easily submit work items that are different but belong to the same priority category (= queue).

Ah okay cool that makes sense. I have not dug too much into the beacon processor.

What do you mean by worker? Basically a thread that continually runs and receives messages from the manager, servicing a single queue?

Say we have a slot where multiple validators each have a duty to fulfill. The duty manager would send a message to the specialized workers which in turn would do any specific processing needed and spin up their own QBFT instances and handle that entire life cycle. The processor here is then used to make sure everything is progressing as intended and the workers are still alive, etc.

I suppose here we also have to make the distinction between having workers based on queue priority, as is currently done, or having workers that specialize in some duty such as attestations. But we would have to somehow work priority into the latter.

Might make sense to just not over engineer and route it all through the processor and have that manage all the QBFT instances.

Just food for thought and thinking out loud.

@eserilev
Contributor

re: a common base processor module:

I've done some work in this PR to generalize aspects of the Lighthouse beacon processor, mostly to experiment with a different scheduling algorithm. It's still very much a WIP and I'm not sure if these changes would be useful for Anchor, but mentioning it here just in case.

@dknopik
Member Author

dknopik commented Nov 28, 2024

@eserilev thanks! Being able to switch scheduling strategies is definitely valuable. However, your PR does not yet generalize over the possible work items. It motivated me to think again about a good way to do that, though - it would be nice to come up with something that supports both :)

@Zacholme7 ah, got you :) yeah, until now I was thinking about the processor purely as a scheduler for task admission. You are right, having it manage e.g. QBFT instances and "route" QBFT messages to the correct instances might be worth it, so I will explore that as well. Thanks for your input!

@jking-aus
Contributor

just reviewed -- good start mate and I think it's on the right track. Fair bit of work to come in refining queues, scheduling and prioritisation imo but keen to hear where your head is at.

On some of the discussion above:

  • Yes the intent is to have the processor instantiate and manage the QBFT instances based on inputs from the duties service, which is what I think Zach's comment is saying.
  • It should queue tasks/events from the duties service as work items, as well as messages between QBFT instances, the networking layer, and the crypto module.
  • The scheduling stuff in the beacon processor and Eitan's PR is worth looking at. We need some prioritisation logic eventually.

Member

@AgeManning AgeManning left a comment

Looks good!

I read some of the comments. In terms of routing messages to QBFT instances, we had envisaged each instance being tagged with a unique InstanceId.

The Processor (or some state attached to it) would build a mapping of <InstanceId, Sender> and <InstanceId, Receiver> when creating them.
So we can always just send/recv messages from each instance.

I don't think we need to worry too much about checking whether they are running or not. We can have metrics to see how many we have running at any given time, but in general I think it's fine to assume they will always complete, with either success or failure, in at most X time. There is no scenario where a QBFT task cannot end after X time (this saves us checking on them; we can just spawn and forget).
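
A minimal sketch of the envisaged routing - the InstanceId and message types below are placeholders, not Anchor's actual types:

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

// Placeholder types for illustration only.
type InstanceId = u64;
struct QbftMessage;

/// The processor (or some state attached to it) keeps a Sender per running
/// QBFT instance and forwards incoming messages by id.
#[derive(Default)]
struct QbftRouter {
    senders: HashMap<InstanceId, mpsc::Sender<QbftMessage>>,
}

impl QbftRouter {
    fn spawn_instance(&mut self, id: InstanceId) {
        let (tx, mut rx) = mpsc::channel(64);
        self.senders.insert(id, tx);
        // Spawn and forget: the instance always ends (success or failure)
        // within bounded time, so nothing needs to check up on it.
        tokio::spawn(async move {
            while let Some(_msg) = rx.recv().await {
                // ...drive the QBFT state machine...
            }
        });
    }

    async fn route(&self, id: InstanceId, msg: QbftMessage) {
        if let Some(tx) = self.senders.get(&id) {
            let _ = tx.send(msg).await;
        }
    }
}
```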

@dknopik
Member Author

dknopik commented Dec 3, 2024

I think we should go with the simple variant for the processor instead of sharing code with Lighthouse for now - Lighthouse's needs are pretty complex, and Anchor's needs are not fully known yet, so for time-budget reasons it is probably better to start with the non-modularized version and later revisit an implementation using the modular processor @eserilev is cooking up. :)

jking-aus previously approved these changes Dec 6, 2024
Contributor

@jking-aus jking-aus left a comment

good work mate, concise and logical.

I think for todos:

  • We should use a more structured type definition for the queues, e.g.
    `pub struct QueueConfig { name: &'static str, capacity: usize, permit: bool }`, then a vec of queues or something to define the queue type. This would make a few things like monitoring, runtime config and queue creation easier (see the sketch after this list).
  • Maybe graceful shutdown handling (nice to have)
  • Consider resource limits and performance monitoring (nice to have)
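
A rough sketch of what that config-driven queue creation could look like - the struct and helper below are hypothetical, not code from this PR or review:

```rust
use tokio::sync::mpsc;

// Hypothetical sketch of the suggested structured queue definition.
pub struct QueueConfig {
    pub name: &'static str,
    pub capacity: usize,
    /// Whether work from this queue must acquire a permit before running.
    pub permit: bool,
}

pub struct Queue<T> {
    pub config: QueueConfig,
    pub receiver: mpsc::Receiver<T>,
}

/// Build all queues from a single list of configs, which also gives one
/// obvious place to hook in monitoring or runtime configuration later.
pub fn build_queues<T>(configs: Vec<QueueConfig>) -> (Vec<mpsc::Sender<T>>, Vec<Queue<T>>) {
    let mut senders = Vec::new();
    let mut queues = Vec::new();
    for config in configs {
        let (sender, receiver) = mpsc::channel(config.capacity);
        senders.push(sender);
        queues.push(Queue { config, receiver });
    }
    (senders, queues)
}
```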

/// The [`DropOnFinish`] should be dropped when the work is done, for proper permit accounting
/// and metrics. This includes any work triggered by the closure, so [`DropOnFinish`] should
/// be sent along if any other process such as a QBFT instance is messaged.
pub fn new_immediate(name: &'static str, func: ImmediateFn) -> Self {
Contributor

this is a good addition for state change and queue management
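
For readers unfamiliar with the pattern, an illustrative sketch of what a DropOnFinish-style guard can look like; the fields and exact behaviour here are assumptions for illustration, not the PR's actual implementation:

```rust
use std::time::Instant;
use tokio::sync::OwnedSemaphorePermit;

// Assumed sketch: whoever ends up holding this value last (the closure, or a
// QBFT instance it was forwarded to) finishes the work item by dropping it.
pub struct DropOnFinish {
    permit: Option<OwnedSemaphorePermit>,
    started: Instant,
    name: &'static str,
}

impl Drop for DropOnFinish {
    fn drop(&mut self) {
        // Dropping the permit frees a worker slot for the next item.
        drop(self.permit.take());
        // Metrics (e.g. time from submission to completion, per queue name)
        // would be recorded here.
        let _elapsed = self.started.elapsed();
        let _name = self.name;
    }
}
```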

@jking-aus jking-aus linked an issue Dec 6, 2024 that may be closed by this pull request
@dknopik
Member Author

dknopik commented Dec 6, 2024

Thanks for the review!

* We should use a more structured type definition for the queues, e.g.
  `pub struct QueueConfig { name: &'static str, capacity: usize, permit: bool }`, then a vec of queues or something to define the queue type. This would make a few things like monitoring, runtime config and queue creation easier.

Yeah, that might make sense. I am not sure about the configurability of needing a permit though - this is currently very ingrained into the way I implemented the main select!. Also, it doesn't really add value to have multiple permitless input queues, as the semantic meaning of such a queue is basically "throw it into tokio ASAP and let it deal with it". The tasks in there should therefore be very quick or very non-blocking (basically awaiting most of the time) - and if we had to prioritise which task we add to tokio first (just add, not execute!), that would mean we produce these tasks faster than we can add them, and then something is seriously wrong anyway :)

* Maybe graceful shutdown handling (nice to have)

Oh yeah - that's a good idea. I wonder if we need an explicit shutdown queue or if dropping the senders is enough as a shutdown signal 🤔 probably not, due to the risk of deadlocking?
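
For reference, the "drop the senders" variant works because recv() yields None once every Sender is gone, which naturally ends the processor loop - a minimal sketch with a hypothetical item type:

```rust
use tokio::sync::mpsc;

// Sketch: no explicit shutdown queue; the loop ends when all senders are dropped.
async fn processor_loop(mut rx: mpsc::Receiver<Box<dyn FnOnce() + Send>>) {
    while let Some(work) = rx.recv().await {
        work();
    }
    // All senders dropped: run any cleanup here and return.
}
```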

* Consider resource limits and performance monitoring (nice to have)

We have the semaphore as a limit already. What kind of limit do you envision? Also, for monitoring, what do you have in mind in addition to the metrics?

@dknopik dknopik marked this pull request as ready for review December 6, 2024 07:44
@jking-aus
Contributor

Thanks for the review!

* We should use a more structured type definition for the queues, e.g.
  `pub struct QueueConfig { name: &'static str, capacity: usize, permit: bool }`, then a vec of queues or something to define the queue type. This would make a few things like monitoring, runtime config and queue creation easier.

Yeah, that might make sense. I am not sure about the configurability of needing a permit though - this is currently very ingrained into the way I implemented the main select!. Also, it doesn't really add value to have multiple permitless input queues...

Yep, all good - a permitless queue could be a separate type and it would be fine. It just made sense to me to have it all in one for ease of use, even if it's almost always false.

* Maybe graceful shutdown handling (nice to have)

Oh yeah - that's a good idea. I wonder if we need an explicit shutdown queue or if dropping the senders is enough as a shutdown signal 🤔 probably not, due to the risk of deadlocking?

I could be wrong, but I think the only deadlock risk is blocked tasks with indefinite awaits. For now, dropping senders is OK, but I think we need to handle the blocking tasks more carefully in the next version. Could we just use Notify?
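
For context, tokio's Notify can provide exactly such an explicit signal, independent of whether any senders are still alive - a minimal sketch (the function and queue type are hypothetical):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Notify};

// Sketch: the processor selects on an explicit shutdown signal alongside its
// work queue; callers trigger shutdown with `shutdown.notify_one()`.
async fn run_with_shutdown(
    shutdown: Arc<Notify>,
    mut rx: mpsc::Receiver<Box<dyn FnOnce() + Send>>,
) {
    loop {
        tokio::select! {
            _ = shutdown.notified() => break,
            maybe_work = rx.recv() => match maybe_work {
                Some(work) => work(),
                None => break, // all senders dropped
            },
        }
    }
    // Graceful shutdown: drain or drop remaining work, flush metrics, etc.
}
```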

* Consider resource limits and performance monitoring (nice to have)

We have the semaphore as a limit already. What kind of limit do you envision? Also, for monitoring, what do you have in mind in addition to the metrics?

I was thinking more for the future, when we need to optimise throughput or check backpressure -- queue latency, processing time, resource usage, etc. Not for this version, but an idea for when we've got an MVP and are fine-tuning.

Contributor

@jking-aus jking-aus left a comment

lgtm

@jking-aus jking-aus merged commit 04176ae into sigp:unstable Dec 9, 2024
8 checks passed
@dknopik dknopik deleted the processor branch February 5, 2025 14:15

Successfully merging this pull request may close these issues.

Implement Anchor Processor