feat(proof-data-handler): exclude batches without object file in GCS #2980
f1b8ad3 to 65cc26e
@popzxc, I remember you mentioned not to ask for code reviews this wave, but you're probably the most familiar with this code (along with @slowli). So, if you could make an exception this time, I’d really appreciate it. If you're busy, no worries – feel free to ignore, and I’ll ask @RomanBrodetski to find someone else. Thanks!
Kindly ping @slowli @RomanBrodetski. I need a reviewer.
@pbeza to be honest I don't fully follow this solution. I understand what we are trying to do (mark older unresolved jobs as skipped), but I'm not sure I understand the _why_ here. We can discuss over a huddle or async.
core/lib/dal/migrations/20240930110000_tee_add_permanently_ignored_state.down.sql
Outdated
4ee505b to bfeddc9
cf2cf1d to 6c7d879
I rebased this PR on the latest BTW, sorry for the force-push instead of merging
`/tee/proof_inputs` endpoint no longer returns batches that have no corresponding object file in Google Cloud Storage for an extended period.

Since the recent `mainnet`'s `24.25.0` redeployment, we've been [flooded with warnings][warnings] for the `proof-data-handler` on `mainnet` (the warnings are actually _not_ fatal in this context):

```
Failed request with a fatal error
(...)
Blobs for batch numbers 490520 to 490555 not found in the object store. Marked as unpicked.
```

The issue was caused [by the code][code] behind the `/tee/proof_inputs` [endpoint][endpoint_proof_inputs] (which is equivalent to the `/proof_generation_data` [endpoint][endpoint_proof_generation_data]) – it finds the next batch to send to the [requesting][requesting] `tee-prover` by looking for the first batch that has a corresponding object in the Google object store. As it skips over batches that don’t have the objects, [it logs][logging] `Failed request with a fatal error` for each one (unless the skipped batch was successfully proven, in which case it doesn’t log the error). This happens with every [request][request] the `tee-prover` sends, which is why we were getting so much noise in the logs.

One possible solution was to manually flag the problematic batches as `permanently_ignored`, like Thomas [did before][Thomas] on `mainnet`. It was a quick and dirty workaround, but now we have a more automated solution.

[warnings]: https://grafana.matterlabs.dev/goto/TjlaXQgHg?orgId=1
[code]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/tee_request_processor.rs#L35-L79
[endpoint_proof_inputs]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/lib.rs#L96
[endpoint_proof_generation_data]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/lib.rs#L67
[requesting]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/bin/zksync_tee_prover/src/tee_prover.rs#L93
[logging]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/lib/object_store/src/retries.rs#L56
[Thomas]: https://matter-labs-workspace.slack.com/archives/C05ANUCGCKV/p1725284962312929
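The selection behaviour described in the commit message can be illustrated with a minimal, in-memory sketch. This is not the code from the PR: `Batch`, `Status`, `next_provable_batch`, and the `ignore_after` parameter are hypothetical names, and the real handler queries Postgres and the object store rather than a map.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Hypothetical status values; only `permanently_ignored` is named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Unpicked,
    PermanentlyIgnored,
}

/// Hypothetical in-memory stand-in for one row of the TEE jobs table.
#[derive(Debug, Clone)]
pub struct Batch {
    /// Time elapsed since the batch was created.
    pub age: Duration,
    /// Whether the object store holds the input blob for this batch.
    pub has_blob: bool,
    pub status: Status,
}

/// Returns the lowest batch number whose blob exists.
/// Batches without a blob are skipped; those older than `ignore_after`
/// are flagged so that later requests no longer revisit them.
pub fn next_provable_batch(
    batches: &mut BTreeMap<u32, Batch>,
    ignore_after: Duration,
) -> Option<u32> {
    for (&number, batch) in batches.iter_mut() {
        if batch.status == Status::PermanentlyIgnored {
            continue;
        }
        if batch.has_blob {
            return Some(number);
        }
        if batch.age > ignore_after {
            batch.status = Status::PermanentlyIgnored;
        }
    }
    None
}
```

With this shape, a batch that has been missing its blob for longer than the timeout is visited (and logged) at most once more, instead of on every prover request.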
6c7d879 to 9188f8a
@slowli, kindly
@@ -47,49 +51,52 @@ impl TeeRequestProcessor {
    ) -> Result<Option<Json<TeeProofGenerationDataResponse>>, RequestProcessorError> {
        tracing::info!("Received request for proof generation data: {:?}", request);

        let mut min_batch_number = self.config.tee_config.first_tee_processed_batch;
        let mut missing_range: Option<(L1BatchNumber, L1BatchNumber)> = None;
        let batch_ignored_timeout = ChronoDuration::days(10);
Hardcode or config parameter?
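One way to answer this with a config parameter is sketched below. It is only an illustration: the struct, the field name `batch_permanently_ignored_timeout`, and the `from_days` helper are hypothetical and do not come from the PR; only the 10-day default mirrors the hard-coded value in the diff.

```rust
use std::time::Duration;

const SECS_PER_DAY: u64 = 24 * 60 * 60;

/// Hypothetical config section; the field name is illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct TeeConfig {
    /// How long a batch may lack its object-store blob before it is ignored.
    pub batch_permanently_ignored_timeout: Duration,
}

impl Default for TeeConfig {
    fn default() -> Self {
        // Same value as the hard-coded `ChronoDuration::days(10)` in the diff.
        Self {
            batch_permanently_ignored_timeout: Duration::from_secs(10 * SECS_PER_DAY),
        }
    }
}

impl TeeConfig {
    /// Builds the config from an optional override expressed in days.
    pub fn from_days(days: Option<u64>) -> Self {
        match days {
            Some(d) => Self {
                batch_permanently_ignored_timeout: Duration::from_secs(d * SECS_PER_DAY),
            },
            None => Self::default(),
        }
    }
}
```

Keeping the default equal to the current constant means existing deployments would behave the same unless an override is supplied.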
            .lock_batch_for_proving(request.tee_type, min_batch_number)
            .await?
        else {
            // No job available
            return Ok(None);
            return Ok(None); // no job available
nit: this can now be `break`, too... Either change all `break` to `return` or the other way round.
        f: F,
    ) -> Result<T, ObjectStoreError>
    where
        Fut: Future<Output = Result<T, ObjectStoreError>>,
        F: FnMut() -> Fut,
    {
        self.retry_internal(max_retries, f).await
    }

    async fn retry_internal<T, Fut, F>(
        &self,
        max_retries: u16,
JFYI: this is an artifact that I'm gonna revert.
zksync_contracts.workspace = true
zksync_basic_types.workspace = true
Nit: You don't usually need `zksync_basic_types` as a direct dep if you depend on `zksync_types`; the latter re-exports a substantial part of basic types.
                };
                self.unlock_batch(l1_batch_number, request.tee_type).await?;
                min_batch_number = l1_batch_number + 1;
                self.unlock_batch(batch_number, request.tee_type, status)
I feel we've already had this conversation: This looks like a "backend driven by frontend" anti-pattern; batches are only unlocked in response to client requests. I'd imagine that batches should be unlocked on a timeout (currently hard-coded as 10 days) with `PermanentlyIgnored` status, right? Or is there no harm if a batch is unlocked later?
There is also an `unlock` part in the SQL query for new jobs ...
OR ( tee.status = 'picked_by_prover' AND tee.prover_taken_at < NOW() - processing_timeout::INTERVAL )
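The timeout condition in that SQL fragment can be restated as a small Rust predicate. This is a sketch only: `Job`, `JobStatus`, and `is_pickable` are hypothetical names, and treating `Unpicked` jobs as always pickable is an assumption, since the comment quotes only the `OR` branch of the query.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical mirror of the job statuses mentioned in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Unpicked,
    PickedByProver,
    PermanentlyIgnored,
}

/// Hypothetical stand-in for a row of the TEE jobs table.
#[derive(Debug, Clone)]
pub struct Job {
    pub status: JobStatus,
    pub prover_taken_at: Option<SystemTime>,
}

/// Returns true if the job may be handed to a prover at time `now`.
pub fn is_pickable(job: &Job, now: SystemTime, processing_timeout: Duration) -> bool {
    match (job.status, job.prover_taken_at) {
        // Assumption: unpicked jobs are always eligible.
        (JobStatus::Unpicked, _) => true,
        // Equivalent of `prover_taken_at < NOW() - processing_timeout`:
        // the lock has expired, so the job is implicitly unlocked.
        (JobStatus::PickedByProver, Some(taken_at)) => now
            .duration_since(taken_at)
            .map_or(false, |elapsed| elapsed > processing_timeout),
        _ => false,
    }
}
```

Because eligibility is computed at query time, a stale lock expires without any client request having to trigger an explicit unlock.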
                }
                Err(err) => {
                    self.unlock_batch(l1_batch_number, request.tee_type).await?;
                    self.unlock_batch(
Dumb question: Why does this unlocking work like that? Suppose this server loses connection to Postgres for a moment, resulting in a `RequestProcessorError::Dal` error. IIUC, the batch will be marked as unlocked here, but there's seemingly no reason to do so.
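One possible reading of this concern is that the unlock decision could depend on the kind of error. The sketch below is an assumption, not what the PR does: `ProcessorError`, `Action`, and `action_for` are hypothetical, and `ProcessorError::Dal` merely stands in for `RequestProcessorError::Dal`.

```rust
/// Hypothetical, simplified error type.
#[derive(Debug)]
pub enum ProcessorError {
    /// Stand-in for an object-store "blob not found" error.
    ObjectMissing,
    /// Stand-in for `RequestProcessorError::Dal` (e.g. a lost Postgres connection).
    Dal(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    UnlockBatch,
    LeaveLocked,
}

/// Decides what to do with a locked batch after a failed request.
pub fn action_for(err: &ProcessorError) -> Action {
    match err {
        // The batch itself is the problem: release it so the next one can be tried.
        ProcessorError::ObjectMissing => Action::UnlockBatch,
        // Transient infrastructure failure: the batch is fine, and the unlock
        // write would likely fail too, so leave the lock to expire on its own.
        ProcessorError::Dal(_) => Action::LeaveLocked,
    }
}
```

Under this scheme a batch left locked after a database error would become available again once its processing timeout elapses.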
What ❔

`/tee/proof_inputs` endpoint no longer returns batches that have no corresponding object file in Google Cloud Storage for an extended period.

Why ❔

TEE's `proof-data-handler` on `mainnet` was flooded with warnings.

Since the recent `mainnet`'s `24.25.0` redeployment, we've been flooded with warnings for the `proof-data-handler` on `mainnet` (the warnings are actually not fatal in this context): `Failed request with a fatal error (...) Blobs for batch numbers 490520 to 490555 not found in the object store. Marked as unpicked.`

The issue is caused by the code behind the `/tee/proof_inputs` endpoint (which is equivalent to the `/proof_generation_data` endpoint) – it finds the next batch to send to the requesting `tee-prover` by looking for the first batch that has a corresponding object in the Google object store. As it skips over batches that don’t have the objects, it logs `Failed request with a fatal error` for each one (unless the skipped batch was successfully proven, in which case it doesn’t log the error). This happens with every request the `tee-prover` sends, which is why we're getting so much noise in the logs.

One possible solution is to flag the problematic batches as `permanently_ignored`, like Thomas did before on `mainnet`.

Checklist

`zk fmt` and `zk lint`.