Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(bugfix): reduce frequency of update requests for copied CSVs #3411

Closed

Conversation

everettraven
Copy link
Contributor

Description of the change:
Reduces the frequency of update requests for copied CSVs by:

  • Adding the olm.operatorframework.io/nonStatusCopyHash annotation to the copied CSVs, populated with a hash of all the non-status fields of the original CSV.
  • Adding the olm.operatorframework.io/statusCopyHash annotation to the copied CSVs, populated with a hash of all the status fields of the original CSV.
  • Updating the above annotations as necessary on the copied CSVs when changes are made to the original CSV

This appears to have been the desired behavior of this function, as evidenced by:

existingNonStatus := existing.Annotations["$copyhash-nonstatus"]
existingStatus := existing.Annotations["$copyhash-status"]

The problem with the previous implementation was that these annotations were never set on the copied CSVs.

Motivation for the change:

  • Reduce the frequency of UPDATE requests made to the Kubernetes API Server for copied CSVs

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

by adding annotations to copied CSVs that are populated with
hashes of the non-status fields and the status fields.

This seems to be how this was intended to work, but was not actually
working this way because the annotations never actually existed on the
copied CSV. This resulted in a hot loop of update requests being made
on all copied CSVs.

Signed-off-by: everettraven <[email protected]>
Signed-off-by: everettraven <[email protected]>
@benluddy
Copy link
Contributor

benluddy commented Oct 8, 2024

Those annotations were set on the CSVs in the informer's cache when the hash comparison was introduced (cf618d3). The observation was that CSV copies needed to be written if and only if the copy was out of sync with the original, so caching a pruned object with a hash was much cheaper than caching complete objects while still being able to stomp CSV copies at the appropriate times. This seems to have regressed as part of #3003.

Edit: Breaking the hash comparison would have masked this with the unnecessary update requests, but using a partialobjectmeta informer for CSV copies means that olm-operator doesn't have enough information to determine whether or not a copy is in sync with the original. Even if a hash of the original is included in an annotation when olm-operator writes a copy, if someone else makes a change to a copy's spec, that won't be observable by olm-operator.

Signed-off-by: everettraven <[email protected]>
@bentito
Copy link
Contributor

bentito commented Oct 9, 2024

@benluddy I think we want to push ahead with this PR for a couple reasons:

  1. reverting OCPBUGS-17157: improve watching semantics around CSVs #3003 would be difficult at this late date because it had follow-on work AND we need the memory reduction.
  2. there have not been any OLM users noticing the introduced problems from OCPBUGS-17157: improve watching semantics around CSVs #3003
  3. this PR suppresses sending anything to the API server when annotations appear, which does help out users who have noticed the increased activity.

@benluddy
Copy link
Contributor

benluddy commented Oct 9, 2024

there have not been any OLM users noticing the introduced problems from #3003

You should find with this PR that you can issue a write to a copied CSV that puts it out of sync with the original CSV, and this control loop will not bring the copy back into sync with the original.

Edit: I am a little confused by this response. Isn't this PR an attempt to fix a problem that was both reported by a user and introduced by the linked PR?

@everettraven
Copy link
Contributor Author

there have not been any OLM users noticing the introduced problems from #3003

You should find with this PR that you can issue a write to a copied CSV that puts it out of sync with the original CSV, and this control loop will not bring the copy back into sync with the original.

Edit: I am a little confused by this response. Isn't this PR an attempt to fix a problem that was both reported by a user and introduced by the linked PR?

@benluddy I think what @bentito is trying to say is that we have received no complaints from users about the bug you called out specifically related to writes to a copied CSV, without changing the annotation, would not be reverted.

My understand of the current plan, and I may be a tad out of the loop due to hurricane preparations, is:

  • Merge this PR to fix a critical bug causing users higher logging costs and load on the API server
  • If bug reports come in regarding copied CSV writes not being reverted, investigate how to resolve that

I agree that reverting #3003 is the right approach to fix both the critical bug and the bug you called out that was introduced by the use of the PartialObjectMetadata informer. There are folks with concerns about reverting that change because we needed the performance gain to enable folks running on the edge to be able to use OLM. Reverting #3003 is probably still the best thing to do and identify a way to still achieve enough performance gain for the edge use cases, but the priority is being placed on resolving the update 100% of the time ASAP and the consensus of the current maintainers seems to be that this PR is what allows us to achieve that and give us some breathing room to re-assess.

I know it isn't a great answer, but hopefully this gives some insight as to the decisions the team has made re: continuing with this PR.

@kevinrizza
Copy link
Member

@everettraven I don't think Ben is suggesting that we revert #3003, I think he is suggesting that we don't introduce another regression on top of the previous one as a fix. With this change, I think we are putting ourselves in a situation where we are going to ignore spec changes in the copied csvs, which I'm pretty sure are considered in resolution.

@everettraven
Copy link
Contributor Author

@kevinrizza The regression is technically caused by #3003 . We just would realize it in this PR because we add the annotations back that prevent updating 100% of the time.

That being said, maybe I'm missing that there is a different way to do this?

@everettraven
Copy link
Contributor Author

everettraven commented Oct 9, 2024

Thinking about this a bit more, maybe we can remove the hash annotation and always calculate the hash of the copied CSV and compare that?

EDIT: This isn't possible without reverting #3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information

EDIT EDIT: This could be possible, likely with significant refactoring. I haven't had time to dig into this more. @bentito is taking over this work since I'll be impacted by Hurricane Milton.

@benluddy
Copy link
Contributor

benluddy commented Oct 9, 2024

Seems likely that some of the observed memory reductions over caching pruned CSVs might have come from reducing the high watermark transient allocations from decoding a big list response (unlike CSVs, or any CRD-backed resource, a PartialObjectMetaList can be encoded as Protobuf and not only JSON).

EDIT: This isn't possible without reverting #3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information

Right, without the full object, the copy loop can't see all the fields that might have diverged from the original. The metadata.generation field is automatically incremented for all CRs on updates that change non-metadata fields (https://github.com/kubernetes/kubernetes/blob/d9c46d8ecb1ede9be30545c9803e17682fcc4b50/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/strategy.go#L185) so you might be able to keep track of the last observed generation per copy with only metadata available. The only problem would be that the status subresource is enabled on the CSV CRD, so status updates don't cause metadata.generation to be incremented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants