break out of infinite retry for submitted ops #1595

shorsher · 2024-10-30T03:12:10Z

Fix for #1594

I've noticed that if there's an error (timeout/disconnect/connection loss) during a BatchPin submission to the blockchain connector (EVMConnect in my case), it's possible for EVMConnect to successfully process the transaction but the FireFly operation will be Failed. Then, the batch processor resubmits the batch but EVMConnect returns a 409, which causes the processor to indefinitely retry and prevent new batches from occurring.

Example error message:

FF10458: Conflict from blockchain connector: FF21065: ID 'default:f5296ba7-1b23-4c7f-8612-620dafc0e40a' is not unique d=pinned_broadcast ns=default opcache=1UGYM3mn p=did:firefly:org/org_5452a6| pid=61421 role=batchmgr Currently, the only way to fix this is restarting FireFly.

Instead, I propose we break the retry loop by having the batch processor not retry on conflict errors.

EnriqueL8 · 2024-10-30T12:07:18Z

Thanks @shorsher - I see the problem now

I was thinking we could approach this one differently by using the backend ID that is included in the transaction to EVMConnect so that when the operation does go ahead and retries using the EVMConnect connector the connector should be able to handle the case where the transaction successfully worked. Instead of marking it as Failed or not returning an error.

EnriqueL8 · 2024-10-30T12:08:54Z

internal/blockchain/ethereum/ethereum.go

@@ -147,6 +148,7 @@ func (e *Ethereum) Init(ctx context.Context, cancelCtx context.CancelFunc, conf
 	e.capabilities = &blockchain.Capabilities{}
 	e.callbacks = common.NewBlockchainCallbacks()
 	e.subs = common.NewFireflySubscriptions()
+	e.count = 0


this count is global so this would only work the first time a TX is sent in the whole of FF

internal/blockchain/ethereum/ethereum.go

shorsher · 2024-10-30T13:10:52Z

I think the bug is that FireFly is not correctly handling the conflict error @EnriqueL8. It gets properly classified in the operations manager as a 409 (which it is), but just missing the step that unblocks the processor. EVMConnect receives and successfully handles the transaction, I don't see how using a different ID would make any difference. We would still be resubmitting the transaction.

codecov · 2024-10-30T16:06:10Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (f81d957) to head (a116123).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main     #1595   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          337       337           
  Lines        29492     29498    +6     
=========================================
+ Hits         29492     29498    +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

awrichar

I think this is the correct fix.

The only alternative that I was considering, is whether Operations Manager should swallow the error. We have very specific conflict handling added there in #1406 for HTTP 409, but even though it does some processing of the error, that code elects to still return the error upward. You could make the case that Operations Manager has handled the error and could clear it from bubbling up any further - but that would potentially have wider implications.

So all of that to say, I think re-parsing the conflict error up at this level feels like the safest solution to the bug.

EnriqueL8

Great stuff, and agree with @awrichar on the solution. The other approach we had discussed around checking the op being successful was more complicated.

One comment on the successful flow that might have been missed and missing codecov

EnriqueL8 · 2024-11-01T17:29:53Z

internal/batch/batch_processor.go

+				conflictErr, conflictTestOk := err.(operations.ConflictError)
+				if conflictTestOk && conflictErr.IsConflictError() {
+					// We know that the connector has received our batch, so we shouldn't need to retry
+					return true, nil
+				}


Agreed this is the right approach but we are missing the else section where we update the message state so in the issue the first time EVMConnect call you get EOF and the second time you get a conflict so it never updated that state for the case where it's a pinned batch tx

Updated to add the message state, thanks for catching.

If an operation receives a conflict error from the blockchain connector, currently it will continutally retry submitting that operation but it will never succeed. Instead, we shouldn't retry if we know the connector has received the submission. Signed-off-by: Alex Shorsher <[email protected]>

Signed-off-by: Alex Shorsher <[email protected]>

EnriqueL8 · 2024-11-04T10:43:43Z

internal/batch/batch_processor.go

+				conflictErr, conflictTestOk := err.(operations.ConflictError)
+				if conflictTestOk && conflictErr.IsConflictError() {
+					// We know that the connector has received our batch, so we shouldn't need to retry
+					payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)


I think we need the whole block and we can extract this to a function

if core.IsPinned(payload.Batch.TX.Type) { payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent) } else { payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateConfirmed) }

For pinned batches we have to wait for the blockchain TX that is why we set them as sent and you might have just assumed this was always the case in conflict mode. But this batch processor is completed agnostic to how the batch is handled, it could be a conflict from data exchange or any other plugin needed to dispatch that batch. In the case where it's not pinned, we don't need to wait for TX so we can set it as confirmed

Furthermore there is a snippet of code above that handles a cancelled batch pin and this is an edge case but if the last error before the cancel is a conflict we will update the state of the initial batch incorrectly from cancelled to sent

- don't infinitely retry for submitted operations

shorsher requested a review from a team as a code owner October 30, 2024 03:12

shorsher force-pushed the idempotent-retry-fix branch from 8d79de6 to f968a81 Compare October 30, 2024 03:12

shorsher marked this pull request as draft October 30, 2024 03:12

EnriqueL8 reviewed Oct 30, 2024

View reviewed changes

shorsher force-pushed the idempotent-retry-fix branch 2 times, most recently from 22629de to 68ad190 Compare October 30, 2024 16:00

shorsher marked this pull request as ready for review October 30, 2024 16:55

awrichar approved these changes Nov 1, 2024

View reviewed changes

EnriqueL8 reviewed Nov 1, 2024

View reviewed changes

shorsher force-pushed the idempotent-retry-fix branch 2 times, most recently from e4a24f3 to ef997cd Compare November 1, 2024 19:24

add message state on conflict error

a116123

Signed-off-by: Alex Shorsher <[email protected]>

shorsher force-pushed the idempotent-retry-fix branch from ef997cd to a116123 Compare November 1, 2024 20:32

shorsher merged commit 8951c92 into hyperledger:main Nov 1, 2024
17 checks passed

shorsher deleted the idempotent-retry-fix branch November 1, 2024 21:03

EnriqueL8 reviewed Nov 4, 2024

View reviewed changes

shorsher added the backport-candidate label Nov 4, 2024

shorsher added a commit to kaleido-io/firefly that referenced this pull request Nov 4, 2024

backport of hyperledger#1595

b2f48cc

- don't infinitely retry for submitted operations

shorsher added a commit to kaleido-io/firefly that referenced this pull request Nov 4, 2024

backport of hyperledger#1595

552ad1e

- don't infinitely retry for submitted operations

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

break out of infinite retry for submitted ops #1595

break out of infinite retry for submitted ops #1595

Uh oh!

shorsher commented Oct 30, 2024 •

edited

Loading

Uh oh!

EnriqueL8 commented Oct 30, 2024

Uh oh!

EnriqueL8 Oct 30, 2024

Uh oh!

Uh oh!

shorsher commented Oct 30, 2024

Uh oh!

codecov bot commented Oct 30, 2024 •

edited

Loading

Uh oh!

awrichar left a comment

Uh oh!

EnriqueL8 left a comment •

edited

Loading

Uh oh!

EnriqueL8 Nov 1, 2024 •

edited

Loading

Uh oh!

shorsher Nov 1, 2024

Uh oh!

Uh oh!

EnriqueL8 Nov 4, 2024

Uh oh!

EnriqueL8 Nov 4, 2024

Uh oh!

Uh oh!

break out of infinite retry for submitted ops #1595

break out of infinite retry for submitted ops #1595

Uh oh!

Conversation

shorsher commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

EnriqueL8 commented Oct 30, 2024

Uh oh!

EnriqueL8 Oct 30, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shorsher commented Oct 30, 2024

Uh oh!

codecov bot commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

awrichar left a comment

Choose a reason for hiding this comment

Uh oh!

EnriqueL8 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EnriqueL8 Nov 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shorsher Nov 1, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

EnriqueL8 Nov 4, 2024

Choose a reason for hiding this comment

Uh oh!

EnriqueL8 Nov 4, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shorsher commented Oct 30, 2024 •

edited

Loading

codecov bot commented Oct 30, 2024 •

edited

Loading

EnriqueL8 left a comment •

edited

Loading

EnriqueL8 Nov 1, 2024 •

edited

Loading