backend: Rewrite Cosmos DB scanning #1200
Conversation
Force-pushed 796f3da to f375ccf
Please rebase pull request.
Force-pushed f375ccf to 12617b6
Please rebase pull request.
Force-pushed 12617b6 to 64d6bc3
Please rebase pull request.
Force-pushed 64d6bc3 to 4db55d9
LGTM but I'm concerned we have no tests for the workers. What do you think?
We have unit tests for the "business logic" of a single worker -- everything that happens after the subscription is locked. True, we don't have any tests for the worker pool itself. Given the high degree of concurrency involved, I don't know how to mock the DB calls. The call sequence would be non-deterministic, I think. Open to suggestions.
Yeah. This is what I expected you to say and I agree. I guess what I would like to see is an integration test where we establish the pool and see some operations getting handled, but that's well outside the scope of this. I assume E2E-type tests are on some horizon and this is for then.
Agree. It would be good, actually, to set up a test to DoS this thing (like way too many subscriptions and operations for the number of workers) and make sure the emitted metrics (assuming we add metrics) indicate that. Then we could build an SRE alert around it.
In anticipation of the Cosmos DB containers being merged and partitioned by subscription ID, this commit fully rewrites how the backend finds and processes operation documents in Cosmos DB. Instead of querying for all operation documents across all Azure subscriptions and processing them serially, the backend now uses a goroutine-based "worker pool" where each worker is responsible for processing operation documents within a single subscription / (soon-to-be) Cosmos DB partition. This dramatically increases the parallelism within the backend without significantly changing the Cluster Service interaction logic. The polling intervals and worker pool size within the backend may require further tuning.
Operation documents have a limited time-to-live and are never explicitly deleted.
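For context, a rough sketch of how a per-item time-to-live can be expressed on a Cosmos DB operation document in Go; the struct and field names below are illustrative, not the repo's actual types. Once TTL is enabled on the container, Cosmos DB purges expired items automatically, which is why the backend never deletes operation documents explicitly.

```go
// Illustrative only: the real document type in this repo may differ.
// With TTL enabled on the container, Cosmos DB honors a per-item "ttl"
// property (in seconds) and purges expired items automatically.
type operationDocument struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	TTL    int    `json:"ttl,omitempty"` // e.g. a hypothetical 7-day TTL: 604800
}
```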
The Async Operation Callbacks protocol requires sending a status payload in the request body of the notification callback. https://eng.ms/docs/products/arm/api_contracts/asyncoperationcallback#callback-request-body
Force-pushed 4db55d9 to 86ffba2
I tacked on a bug fix that isn't really related to the purpose of the PR, but it's in the backend. Realized it today while re-reading Async Operation Callbacks. (Remains to be seen if we're even gonna switch this on, but it's implemented anyway.)
// postAsyncNotification submits a POST request with a status payload to the given URL.
func (s *OperationsScanner) postAsyncNotification(ctx context.Context, operationID string) error {
	// Refetch the operation document to provide the latest status.
	doc, err := s.dbClient.GetOperationDoc(ctx, operationID)
	if err != nil {
		return err
	}
In case you're wondering why this refetch is necessary, it's the same reason as in
#985 (comment)
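For reference, a rough sketch of how the rest of the notification POST could look, assuming the refetched status document is marshaled into the request body per the Async Operation Callbacks contract. The function name, parameters, and error handling below are illustrative, not the code actually added in this PR.

```go
import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// postStatus is a hypothetical helper showing the general shape of the callback:
// marshal the latest operation status and POST it to the notification URI.
func postStatus(ctx context.Context, notificationURI string, status any) error {
	body, err := json.Marshal(status) // ARM expects the status payload in the request body
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, notificationURI, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= http.StatusMultipleChoices {
		return fmt.Errorf("async notification to %s failed: %s", notificationURI, resp.Status)
	}
	return nil
}
```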
LGTM
What this PR does
In anticipation of the Cosmos DB containers being merged and partitioned by Azure subscription ID, this PR fully rewrites how the backend finds and processes operation documents in Cosmos DB.
Instead of querying for all operation documents across all Azure subscriptions and processing them serially, the backend now uses a goroutine-based "worker pool" where each worker is responsible for processing operation documents within a single Azure subscription / (soon-to-be) Cosmos DB partition.
Specifically, the backend periodically (every 10 minutes) reads all the items in the newly-added PartitionKeys container and builds an internal list of Azure subscription IDs. Then, on a more frequent cycle (every 10 seconds), it deals this list of subscription IDs out to the pool of worker goroutines through a channel.
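Roughly, the dispatch side looks like the sketch below; listPartitionKeys is a hypothetical helper standing in for the PartitionKeys read, and the intervals shown are the defaults described above rather than the actual configuration.

```go
// Sketch of the two polling cycles: a slow refresh of the subscription ID list
// and a fast dispatch of that list to the worker channel.
func dispatchLoop(ctx context.Context, subscriptionCh chan<- string) {
	refresh := time.NewTicker(10 * time.Minute)  // rebuild the subscription ID list
	dispatch := time.NewTicker(10 * time.Second) // deal subscription IDs to workers
	defer refresh.Stop()
	defer dispatch.Stop()

	subscriptions := listPartitionKeys(ctx) // hypothetical: read the PartitionKeys container

	for {
		select {
		case <-ctx.Done():
			return
		case <-refresh.C:
			subscriptions = listPartitionKeys(ctx)
		case <-dispatch.C:
			for _, subscriptionID := range subscriptions {
				subscriptionCh <- subscriptionID
			}
		}
	}
}
```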
Each worker goroutine then locks the Azure subscription ID it's been given, queries Cosmos DB for any operation items within that Azure subscription, calls Cluster Service for a cluster or (eventually) node pool status update on each operation, updates or deletes Cosmos DB items as necessary, unlocks the Azure subscription ID, and finally listens to the channel again for the next subscription ID to process. A sketch of this loop follows.
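In outline, each worker does something like this; lockSubscription, listOperations, pollClusterService, and applyStatus are hypothetical stand-ins for the real backend logic, not actual identifiers from this PR.

```go
// Illustrative worker loop: lock a subscription, process its operations,
// unlock, then wait on the channel for the next subscription ID.
func worker(ctx context.Context, subscriptionCh <-chan string) {
	for subscriptionID := range subscriptionCh {
		unlock, err := lockSubscription(ctx, subscriptionID)
		if err != nil {
			continue // another worker (or backend replica) holds the lock; retry next cycle
		}

		// Query Cosmos DB for operation items within this subscription/partition.
		for _, op := range listOperations(ctx, subscriptionID) {
			// Ask Cluster Service for the latest cluster (or, eventually, node pool) status.
			status := pollClusterService(ctx, op)
			// Update or delete the Cosmos DB item as needed.
			applyStatus(ctx, op, status)
		}

		unlock()
	}
}
```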
This dramatically increases the parallelism within the backend without significantly changing the Cluster Service interaction logic.
The polling intervals and worker pool size within the backend may require further tuning. I've exposed these "tuning knobs" as environment variables (though I haven't yet mapped them to a configmap). Also, if the worker pool channel (whose buffer grows with the worker pool size) gets full and starts blocking, we'll get log messages stating so as an indication to re-tune. (We'll probably want metrics for this as well.)
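As an example, the knobs and the "channel full" logging could look something like the fragment below; the environment variable name, default, and log wording are guesses rather than the actual values in this PR.

```go
// Hypothetical tuning knob; the real environment variable name may differ.
poolSize, err := strconv.Atoi(os.Getenv("BACKEND_WORKER_POOL_SIZE"))
if err != nil || poolSize <= 0 {
	poolSize = 10 // assumed default
}
subscriptionCh := make(chan string, poolSize) // buffer grows with the pool size

// Non-blocking send: a full buffer is logged instead of silently stalling dispatch.
// (subscriptionID comes from the dispatch loop sketched earlier.)
select {
case subscriptionCh <- subscriptionID:
default:
	slog.Warn("worker channel is full, consider re-tuning the pool size",
		"subscription_id", subscriptionID)
}
```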
Jira: ARO-14170 - Merge Cosmos DB containers
Special notes for your reviewer