
frequent unknown instance ID error when running multiple backend instances #69

Open
famarting opened this issue Apr 25, 2024 · 11 comments

Comments

@famarting
Contributor

We have found a scenario where there is a transient error that happens frequently:

failed to complete orchestration task: rpc error: code = Unknown desc = unknown instance ID: 5f7b2345-897d-4471-af96-6c8e590a29bf

The unknown instance ID error could be considered transient, because after the server returns this error to the client, the error goes away on retries. But IMO it points to a more fundamental problem with the server-side implementation.

In our scenario we run multiple instances of the server, so there are multiple gRPC servers behind a load balancer. It can therefore happen that a request to CompleteOrchestratorTask lands on a server where there is no "pending orchestrator" to serve that request.

Here is the series of steps I went through to come to that conclusion:

  • first you schedule a new orchestration
  • on the server side, the orchestration worker is eventually triggered and, on ProcessWorkItem, calls ExecuteOrchestrator
  • still on the server side, ExecuteOrchestrator (here: https://github.com/microsoft/durabletask-go/blob/main/backend/executor.go#L100) adds the instance ID to be executed into a pendingOrchestrators map, puts a work item into the work queue, and then waits for the execution to complete by expecting a signal on a channel attached to the entry stored in the pendingOrchestrators map
  • now on the client side, because of the work item added to the work queue, the client eventually receives the work item to execute the orchestrator and then calls CompleteOrchestratorTask: https://github.com/microsoft/durabletask-go/blob/main/backend/executor.go#L230
  • on the server side, if the call to CompleteOrchestratorTask is received by a different server instance than the one that originally put the instance ID into the pendingOrchestrators map, the unknown instance ID error happens (see the sketch after this list)
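
To make the failure mode concrete, here is a simplified, illustrative sketch of the pattern described above (not the exact code from backend/executor.go; names and signatures are approximations):

    // Illustrative fragment only, simplified from the pattern in backend/executor.go:
    // a response channel is registered per instance ID in an in-process map, so the
    // completion call must reach the same process that registered it.
    type grpcExecutor struct {
        mu                   sync.Mutex
        pendingOrchestrators map[api.InstanceID]chan *protos.OrchestratorResponse
        workQueue            chan *backend.OrchestrationWorkItem
    }

    func (g *grpcExecutor) executeOrchestrator(ctx context.Context, id api.InstanceID, wi *backend.OrchestrationWorkItem) (*protos.OrchestratorResponse, error) {
        respCh := make(chan *protos.OrchestratorResponse, 1)
        g.mu.Lock()
        g.pendingOrchestrators[id] = respCh // state lives only in this process
        g.mu.Unlock()

        g.workQueue <- wi // the client eventually receives this work item

        select {
        case resp := <-respCh: // can only be signaled by this same process
            return resp, nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    func (g *grpcExecutor) completeOrchestratorTask(id api.InstanceID, resp *protos.OrchestratorResponse) error {
        g.mu.Lock()
        respCh, ok := g.pendingOrchestrators[id]
        delete(g.pendingOrchestrators, id)
        g.mu.Unlock()
        if !ok {
            // The registration happened in a different server instance (or never happened),
            // so this lookup fails and the caller sees "unknown instance ID".
            return fmt.Errorf("unknown instance ID: %s", id)
        }
        respCh <- resp // buffered, so this never blocks
        return nil
    }

With a single server the lookup always succeeds; behind a load balancer the completion request can land on a process whose map never saw that instance ID, which produces the error above.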
@cgillum
Member

cgillum commented Apr 25, 2024

Thanks - I believe I understand the issue to be a limitation in how the server APIs are implemented.

For context, the server APIs were primarily designed for use in the context of a sidecar process, in which case there is a 1:1 relationship between workers (clients) and servers. I think in order for your client/server architecture to work, you'll need to implement some kind of sticky affinity between workers and the servers that they communicate with.

@famarting
Contributor Author

Thank you for your response

I was thinking about what would be needed to support a scenario where there can be many clients and many servers, and maybe it's not that big of a change...

The solution could consist of replacing the maps pendingOrchestrators and pendingActivities with new functions provided by the Backend.

	SetPendingOrchestrator(id api.InstanceID) error
	CompletePendingOrchestrator(id api.InstanceID, res *protos.OrchestratorResponse) error
	WaitForPendingOrchestrator(ctx context.Context, id api.InstanceID) (*protos.OrchestratorResponse, error)

	SetPendingActivity(id api.InstanceID) error
	CompletePendingActivity(id api.InstanceID, res *protos.ActivityResponse) error
	WaitForPendingActivity(ctx context.Context, id api.InstanceID) (*protos.ActivityResponse, error)

Then it becomes the responsibility of the backend to keep that state; for a quick implementation one could do the same as before and continue using maps to track the pending orchestrators and activities.
However, if the backend decides to use persistent storage for this, calls to CompleteOrchestratorTask and CompleteActivityTask could be received by any server instance, potentially providing horizontal scaling capabilities.

Then the functions WaitForPendingOrchestrator and WaitForPendingActivity would be used from ExecuteOrchestrator and ExecuteActivity respectively.
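
For illustration only, the orchestrator half of these proposed functions could be implemented in memory roughly like this (hypothetical code that simply mirrors today's map-based behavior; a backend aiming for horizontal scaling would back the same functions with shared storage or a notification mechanism instead):

    // Hypothetical in-memory implementation of the proposed functions. It keeps the
    // same single-process semantics as today; only a backend that externalizes this
    // state makes it safe for CompleteOrchestratorTask to land on any server instance.
    type inMemoryPending struct {
        mu      sync.Mutex
        pending map[api.InstanceID]chan *protos.OrchestratorResponse
    }

    func (b *inMemoryPending) SetPendingOrchestrator(id api.InstanceID) error {
        b.mu.Lock()
        defer b.mu.Unlock()
        if b.pending == nil {
            b.pending = make(map[api.InstanceID]chan *protos.OrchestratorResponse)
        }
        b.pending[id] = make(chan *protos.OrchestratorResponse, 1)
        return nil
    }

    func (b *inMemoryPending) CompletePendingOrchestrator(id api.InstanceID, res *protos.OrchestratorResponse) error {
        b.mu.Lock()
        ch, ok := b.pending[id]
        b.mu.Unlock()
        if !ok {
            return fmt.Errorf("unknown instance ID: %s", id)
        }
        ch <- res // buffered channel, does not block
        return nil
    }

    func (b *inMemoryPending) WaitForPendingOrchestrator(ctx context.Context, id api.InstanceID) (*protos.OrchestratorResponse, error) {
        b.mu.Lock()
        ch, ok := b.pending[id]
        b.mu.Unlock()
        if !ok {
            return nil, fmt.Errorf("unknown instance ID: %s", id)
        }
        select {
        case res := <-ch:
            return res, nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }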

@cgillum
Member

cgillum commented Jun 14, 2024

I think the approach of making work item tracking a responsibility of the backend makes sense. The tradeoff is that it makes backend development a bit more complex since backend implementations must now track work-item state themselves.

I didn't quite understand what SetPendingOrchestrator/Activity and WaitForPendingOrchestrator/Activity would be used for. Could you go into a bit more detail? Feel free to open a PR if that would help make it more clear. I was expecting we'd just need signature changes to CompleteXXXWorkItem and AbandonXXXWorkItem methods.

@famarting
Contributor Author

Here is a PR showing the proposal: #73. I hope it helps.

It may be a bit confusing, hence your comment, that the Backend interface would now have functions such as:

  • GetXXXWorkItem, CompleteXXXWorkItem and AbandonXXXWorkItem
  • SetPendingXXX, CompletePendingXXX and WaitForPendingXXX

But these functions serve different purposes:

The XXXWorkItem functions are called from backend/orchestration.go and backend/activity.go, which are implementations of TaskProcessor. A TaskProcessor appears to run as a single process in a single instance, so the XXXWorkItem calls into the backend interface can have stateful logic... because they are always going to be made from the same process.

However, the PendingXXX functions are called from backend/executor.go, which is the implementation of the gRPC service. These functions should not have stateful logic; they should externalize state, because as a gRPC service there is no guarantee that all calls will be made to the same process.

Does that make sense? Maybe to avoid confusion the Backend interface could be split into multiple interfaces... an in-process or internal interface, and another interface only meant to be used from the gRPC service, roughly as sketched below.
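
As a rough illustration of that split (hypothetical interface names and simplified signatures, not the actual ones in the repository):

    // Used by the in-process task processors (backend/orchestration.go,
    // backend/activity.go). These calls always come from the same process,
    // so implementations may keep in-process state.
    type ProcessorBackend interface {
        GetOrchestrationWorkItem(ctx context.Context) (*OrchestrationWorkItem, error)
        CompleteOrchestrationWorkItem(ctx context.Context, wi *OrchestrationWorkItem) error
        AbandonOrchestrationWorkItem(ctx context.Context, wi *OrchestrationWorkItem) error
        // ...plus the activity work-item equivalents...
    }

    // Used by the gRPC executor. Because requests may land on any server
    // instance behind the load balancer, implementations must externalize
    // this state rather than keep it in process memory.
    type ExecutorBackend interface {
        SetPendingOrchestrator(id api.InstanceID) error
        CompletePendingOrchestrator(id api.InstanceID, res *protos.OrchestratorResponse) error
        WaitForPendingOrchestrator(ctx context.Context, id api.InstanceID) (*protos.OrchestratorResponse, error)
        // ...plus the activity equivalents...
    }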

@famarting
Contributor Author

I've just found that this PR also fixes other issues I've had: #61

I'll re-think the proposal to try to accommodate the changes in that PR.

@famarting
Contributor Author

I wanted to discard my original proposal because it suggested having functions such as WaitForPendingOrchestrator which, depending on the backend implementation, could poll a database... That proposal would require very few changes to the existing implementation, but polling databases is something that we would prefer to avoid.

So then I started thinking about how we could decouple the functions ExecuteOrchestrator and CompleteOrchestratorTask so that it becomes technically possible to execute them in different server instances... however, breaking the current way of working of the gRPC executor and task worker so it can be distributed across multiple servers is not as straightforward as I expected 😄.

The functions ExecuteOrchestrator and ExecuteActivity are blocking functions that "dispatch" the work item to the work items channel and then wait for a response to be received via a callback channel. The statefulness of expecting to receive the orchestration or activity response via a callback channel is what I'm looking to get rid of, because receiving a signal via the callback channel means that the data originated in the same instance, and when running multiple servers there is no guarantee of that.

I already started a POC implementation where ExecuteOrchestrator and ExecuteActivity are broken into two phases, "dispatch" and "process result"... but the refactor is turning out to be too large, so before I continue I think it would be better to discuss this topic with others.

@famarting
Contributor Author

Well, I have iterated again with our team on the proposal made here: #69 (comment)

PR showing a rough version of it here #73

We have changed our mind and we think the idea of adding functions such as WaitForPendingXXXX is a necessary evil, as it will help keep the refactor contained and it will mean that we keep the contract of the functions ExecuteOrchestrator and ExecuteActivity intact.

From the dapr backend implementation POV we should be able to have an implementation that leverages reminders to notify the instance locked on WaitForPendingXXXX. This would solve the problem of running multiple instances of the server behind a load balancer and it would be an efficient implementation.

wdyt @cgillum

@cgillum
Member

cgillum commented Jun 28, 2024

Hi @famarting - it might help if we could create a simple visual sketch of the proposed interaction pattern between the engine and the backend (even better if we can show the before and after) just to make sure I'm correctly understanding the high-level design. There are a few details that I'm having trouble visualizing based on the text description and PR.

@famarting
Contributor Author

Hi @cgillum, sorry for the late response. I've been reviewing my proposal and trying a POC implementation of the proposed changes and their counterpart in Dapr.

btw I was initially confused by your comment

I didn't quite understand what SetPendingOrchestrator/Activity and WaitForPendingOrchestrator/Activity would be used for. Could you go into a bit more detail? Feel free to open a PR if that would help make it more clear. I was expecting we'd just need signature changes to CompleteXXXWorkItem and AbandonXXXWorkItem methods.

The proposal made here aims to abstract state away from the gRPC Executor implementation.

In the current system, the functions ExecuteOrchestrator and ExecuteActivity block on a channel receive, and that makes them stateful: a call to e.g. func (g *grpcExecutor) CompleteOrchestratorTask CANNOT be handled by a different server instance; the call to CompleteOrchestratorTask MUST be received by the same server instance in order for the gRPC Executor to function properly.
[diagram ("before"): ExecuteOrchestrator blocks on an in-process channel, so CompleteOrchestratorTask must reach the same server instance]

The proposed solution keeps the same behavior for ExecuteOrchestrator and ExecuteActivity, but instead of blocking on a channel, the waiting would be abstracted behind the Backend interface with a series of new functions.
[diagram ("after"): the wait is delegated to new Backend functions instead of an in-process channel, so CompleteOrchestratorTask can be received by any server instance]
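
In code, the proposed shape of the executor might look roughly like this (hypothetical fragment; the pending-work-item state moves behind the Backend interface):

    // Hypothetical "after" version: the executor no longer owns a pendingOrchestrators
    // map; it delegates registration, waiting, and completion to the Backend.
    func (g *grpcExecutor) executeOrchestrator(ctx context.Context, id api.InstanceID, wi *backend.OrchestrationWorkItem) (*protos.OrchestratorResponse, error) {
        if err := g.backend.SetPendingOrchestrator(id); err != nil {
            return nil, err
        }
        g.workQueue <- wi
        // Any server instance that receives CompleteOrchestratorTask can unblock this
        // wait, provided the backend externalizes the pending state or notifies the
        // waiting instance (e.g. via reminders, as mentioned earlier in the thread).
        return g.backend.WaitForPendingOrchestrator(ctx, id)
    }

    func (g *grpcExecutor) completeOrchestratorTask(id api.InstanceID, resp *protos.OrchestratorResponse) error {
        // No local map lookup anymore: the backend routes the response to the waiter.
        return g.backend.CompletePendingOrchestrator(id, resp)
    }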

@famarting
Contributor Author

@cgillum ping

@cgillum
Member

cgillum commented Sep 10, 2024

Revisiting this. I appreciate the diagrams! One thing I'm a bit confused about still is what "Client" refers to. Maybe I'm confused primarily because "Client" and "Worker" are essentially the same thing, but you've separated them out into two distinct things in the diagrams above?

The other thing that confuses me about the diagram above is that the "Before" diagram doesn't include anything about the "Backend", yet the "After" diagram does. I think the Backend should exist in both cases, correct? You're just adding a couple new methods onto the Backend interface?

I understand the desire to abstract state out of the Executor and agree that's probably a good thing to do.

From the dapr backend implementation POV we should be able to have an implementation that leverages reminders to notify the instance locked on WaitForPendingXXXX. This would solve the problem of running multiple instances of the server behind a load balancer and it would be an efficient implementation.

Can you go into more details here? I think of reminders as being associated with actors (in the context of Dapr), but didn't see any mention of actors in this proposal. I'm also curious what exactly would trigger these reminders.
