
break out of infinite retry for submitted ops #1595

Merged
merged 2 commits into hyperledger:main from the idempotent-retry-fix branch on Nov 1, 2024

Conversation

shorsher
Member

@shorsher shorsher commented Oct 30, 2024

Fix for #1594

I've noticed that if there's an error (timeout/disconnect/connection loss) during a BatchPin submission to the blockchain connector (EVMConnect in my case), it's possible for EVMConnect to successfully process the transaction while the FireFly operation is marked Failed. The batch processor then resubmits the batch, but EVMConnect returns a 409, which causes the processor to retry indefinitely and prevents new batches from being processed.

Example error message:

FF10458: Conflict from blockchain connector: FF21065: ID 'default:f5296ba7-1b23-4c7f-8612-620dafc0e40a' is not unique d=pinned_broadcast ns=default opcache=1UGYM3mn p=did:firefly:org/org_5452a6| pid=61421 role=batchmgr

Currently, the only way to fix this is restarting FireFly.

Instead, I propose we break the retry loop by having the batch processor not retry on conflict errors.
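
For context, a minimal sketch of the idea, sitting inside the batch processor's dispatch/retry path. The surrounding scaffolding (function call, return convention) is illustrative only; the conflict check itself mirrors the diff reviewed below:

// Sketch only: assume the enclosing attempt returns (done bool, err error).
err := bp.dispatchBatch(ctx, payload) // illustrative call site and name
if err != nil {
	conflictErr, ok := err.(operations.ConflictError)
	if ok && conflictErr.IsConflictError() {
		// The connector has already received this batch (HTTP 409), so
		// resubmitting the same op can never succeed - stop retrying.
		return true, nil
	}
	// Any other error keeps the existing retry behaviour.
	return false, err
}
return true, nil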

@shorsher shorsher requested a review from a team as a code owner October 30, 2024 03:12
@shorsher shorsher force-pushed the idempotent-retry-fix branch from 8d79de6 to f968a81 on October 30, 2024 03:12
@shorsher shorsher marked this pull request as draft October 30, 2024 03:12
@EnriqueL8
Contributor

Thanks @shorsher - I see the problem now

I was thinking we could approach this one differently, using the backend ID that is included in the transaction sent to EVMConnect: when the operation is retried through the EVMConnect connector, the connector should be able to handle the case where the transaction already succeeded, rather than the operation being marked as Failed or an error being returned.

@@ -147,6 +148,7 @@ func (e *Ethereum) Init(ctx context.Context, cancelCtx context.CancelFunc, conf
e.capabilities = &blockchain.Capabilities{}
e.callbacks = common.NewBlockchainCallbacks()
e.subs = common.NewFireflySubscriptions()
e.count = 0
Contributor

this count is global, so this would only work the first time a TX is sent in the whole of FF

internal/blockchain/ethereum/ethereum.go (outdated, resolved)
@shorsher
Member Author

I think the bug is that FireFly is not correctly handling the conflict error, @EnriqueL8. It gets properly classified in the operations manager as a 409 (which it is), but it's just missing the step that unblocks the processor. EVMConnect receives and successfully handles the transaction, so I don't see how using a different ID would make any difference: we would still be resubmitting the transaction.
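
For reference, a sketch of the conflict classification the batch processor can test against. The interface name and method match the type assertion used in this PR's diff; the definition below is paraphrased for illustration, not the literal source:

// Paraphrased sketch (not the literal FireFly source): an error returned from the
// operations manager can additionally implement this interface when the
// connector rejected the request with an HTTP 409.
type ConflictError interface {
	// IsConflictError reports whether the underlying failure was a conflict,
	// i.e. the connector has already received this submission.
	IsConflictError() bool
}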

@shorsher shorsher force-pushed the idempotent-retry-fix branch 2 times, most recently from 22629de to 68ad190 on October 30, 2024 16:00

codecov bot commented Oct 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (f81d957) to head (a116123).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main     #1595   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          337       337           
  Lines        29492     29498    +6     
=========================================
+ Hits         29492     29498    +6     


@shorsher shorsher marked this pull request as ready for review October 30, 2024 16:55
Contributor

@awrichar awrichar left a comment


I think this is the correct fix.

The only alternative I was considering is whether Operations Manager should swallow the error. We have very specific conflict handling added there in #1406 for HTTP 409, but even though it does some processing of the error, that code elects to still return the error upward. You could make the case that Operations Manager has handled the error and could stop it from bubbling up any further, but that would potentially have wider implications.

So all of that to say, I think re-parsing the conflict error up at this level feels like the safest solution to the bug.
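
To illustrate the trade-off being weighed here, a purely hypothetical sketch of the alternative (this is not what the PR does): swallowing the conflict inside Operations Manager would hide it from every caller, not just the batch processor:

// Hypothetical only - NOT what the PR does. Swallowing inside the operations
// manager's conflict handling would look roughly like this:
if conflictErr, ok := err.(ConflictError); ok && conflictErr.IsConflictError() {
	return nil // submission is already with the connector, so report success to all callers
}
// The behaviour from #1406, kept by this PR: classify the conflict but still
// return the error upward, and let the batch processor choose not to retry.
return err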

Contributor

@EnriqueL8 EnriqueL8 left a comment


Great stuff, and I agree with @awrichar on the solution. The other approach we had discussed, around checking whether the op had been successful, was more complicated.

One comment below on the successful flow that might have been missed, plus some missing codecov coverage.

Comment on lines 636 to 640
conflictErr, conflictTestOk := err.(operations.ConflictError)
if conflictTestOk && conflictErr.IsConflictError() {
// We know that the connector has received our batch, so we shouldn't need to retry
return true, nil
}
Contributor

@EnriqueL8 EnriqueL8 Nov 1, 2024


Agreed this is the right approach, but we are missing the else section where we update the message state. In the issue, the first call to EVMConnect gets an EOF and the second gets a conflict, so the message state is never updated for the case where it's a pinned batch TX.

Member Author


Updated to add the message state, thanks for catching.

If an operation receives a conflict error from the blockchain connector,
it will currently retry submitting that operation continually, but it will
never succeed. Instead, we shouldn't retry if we know the connector has
received the submission.

Signed-off-by: Alex Shorsher <[email protected]>
@shorsher shorsher force-pushed the idempotent-retry-fix branch 2 times, most recently from e4a24f3 to ef997cd on November 1, 2024 19:24
@shorsher shorsher force-pushed the idempotent-retry-fix branch from ef997cd to a116123 on November 1, 2024 20:32
@shorsher shorsher merged commit 8951c92 into hyperledger:main Nov 1, 2024
17 checks passed
@shorsher shorsher deleted the idempotent-retry-fix branch November 1, 2024 21:03
conflictErr, conflictTestOk := err.(operations.ConflictError)
if conflictTestOk && conflictErr.IsConflictError() {
	// We know that the connector has received our batch, so we shouldn't need to retry
	payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)
	return true, nil
}
Contributor


I think we need the whole block, and we can extract this to a function:

if core.IsPinned(payload.Batch.TX.Type) {
   payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)
} else {
   payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateConfirmed)
}

For pinned batches we have to wait for the blockchain TX; that is why we set them as sent, and you might have just assumed this was always the case in conflict mode. But this batch processor is completely agnostic to how the batch is handled: it could be a conflict from data exchange or any other plugin needed to dispatch that batch. In the case where it's not pinned, we don't need to wait for a TX, so we can set it as confirmed.
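
Since the suggestion is to extract this into a function, a possible shape for it (the function name and the receiver/payload type names are illustrative; the state transitions are exactly the ones in the snippet above):

// Illustrative extraction of the message-state update applied on dispatch.
func (bp *batchProcessor) updateMessageStateForDispatch(payload *dispatchPayload) {
	if core.IsPinned(payload.Batch.TX.Type) {
		// Pinned batches still wait for the blockchain TX, so messages only move to "sent"
		payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)
	} else {
		// Unpinned batches have nothing further to wait for, so move straight to "confirmed"
		payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateConfirmed)
	}
}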

Contributor


Furthermore, there is a snippet of code above that handles a cancelled batch pin. This is an edge case, but if the last error before the cancel is a conflict, we will incorrectly update the state of the initial batch from cancelled to sent.

shorsher added a commit to kaleido-io/firefly that referenced this pull request Nov 4, 2024
 - don't infinitely retry for submitted operations
shorsher added a commit to kaleido-io/firefly that referenced this pull request Nov 4, 2024
 - don't infinitely retry for submitted operations