-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
frequent unknown instance ID
error when running multiple backend instances
#69
Comments
Thanks - I believe I understand the issue to be a limitation in how the server APIs are implemented. For context, the server APIs were primarily designed for use in the context of a sidecar process, in which case there is a 1:1 relationship between workers (clients) and servers. I think in order for your client/server architecture to work, you'll need to implement some kind of sticky affinity between workers and the servers that they communicate with. |
Thank you for your response I was thinking of what would be needed in order to support an scenario where there can be many clients and many servers, and maybe is not that big of a change... The solution could consist of replacing the maps
Then it becomes responsibility of the backend how to keep that state, for a quick implementation one could do the same as before and continue using maps to track the pending orchestrators and activities. Then the functions |
I think the approach of making work item tracking a responsibility of the backend makes sense. The tradeoff is that it makes backend development a bit more complex since backend implementations must now track work-item state themselves. I didn't quite understand what |
here is a PR showing the proposal #73 , I hope it helps it may be a bit confusing, hence your comment, that now the Backend interface have functions such as:
But these functions serve different purposes: The However the makes sense? Maybe to avoid confusion the Backend interface could be splitted into multiple interface... an in process or internal interface and another interface only meant to be used from the GRPC service |
I've just found this PR also fixes other issues I've had #61 I'll re-think the proposal to try to accomodate forthe changes in that PR |
I wanted to discard my original proposal because it suggested having functions such as So then I started thinking how could we decouple the functions The functions I already started a POC implementation where |
Well, I have re-iterated with our team on the proposal made here #69 (comment) PR showing a rough version of it here #73 We have changed our mind and we think the idea of adding functions such as From the dapr backend implementation POV we should be able to have an implementation that leverages reminders to notify the instance locked on wdyt @cgillum |
Hi @famarting - it might help if we could create a simple visual sketch of the proposed interaction pattern between the engine and the backend (even better if we can show the before and after) just to make sure I'm correctly understanding the high-level design. There are a few details that I'm having trouble visualizing based on the text description and PR. |
HI @cgillum sorry for the late response, I've been reviewing my proposal and trying a POC implementation of the proposed changes and its counterpart in dapr. btw I was initially confused by your comment
The proposal made here aims to abstract state from the GRPC Executor implementation. In the current system, the functions The proposed solution keep the same behavior for |
@cgillum ping |
Revisiting this. I appreciate the diagrams! One thing I'm a bit confused about still is what "Client" refers to. Maybe I'm confused primarily because "Client" and "Worker" are essentially the same thing, but you've separated them out into two distinct things in the diagrams above? The other thing that confuses me about the diagram above is that the "Before" diagram doesn't include anything about the "Backend", yet the "After" diagram does. I think the Backend should exist in both cases, correct? You're just adding a couple new methods onto the Backend interface? I understand the desire to abstract state out of the Executor and agree that's probably a good thing to do.
Can you go into more details here? I think of reminders as being associated with actors (in the context of Dapr), but didn't see any mention of actors in this proposal. I'm also curious what exactly would trigger these reminders. |
We have found an scenario where there is a transient error that happens frequently
The
unknown instance ID
could be considered transient, because after the server returning this error to the client, the server stops giving that error after retries, but IMO it shows a more fundamental problem with the server side implementation.In our scenario we can run multiple instances of the server, so there are multiple grpc servers behind a load balancer. So it can happen that a request to
CompleteOrchestratorTask
lands in a server where there is no "pending orchestrator" to serve that request.Here is the series of steps I went through to come to that conclusion:
ProcessWorkItem
callsExecuteOrchestrator
ExecuteOrchestrator
, here https://github.com/microsoft/durabletask-go/blob/main/backend/executor.go#L100 , adds the instance id to be executed into apendingOrchestrators
map , then puts a work item into the work queue and then the function waits for the execution to complete by expecting a signal in a channel attached to the original instance stored in thependingOrchestrators
mapCompleteOrchestratorTask
https://github.com/microsoft/durabletask-go/blob/main/backend/executor.go#L230CompleteOrchestratorTask
is received by a different server instance than the one that originally put the instance id into thependingOrchestrators
map, then the unknown instance ID error will happen.The text was updated successfully, but these errors were encountered: