
backend: Rewrite Cosmos DB scanning #1200

Merged
merged 4 commits into main from backend-rewrite on Feb 6, 2025
Conversation

@mbarnes (Collaborator) commented Jan 30, 2025

What this PR does

In anticipation of the Cosmos DB containers being merged and partitioned by Azure subscription ID, this PR fully rewrites how the backend finds and processes operation documents in Cosmos DB.

Instead of querying for all operation documents across all Azure subscriptions and processing them serially, the backend now uses a goroutine-based "worker pool" where each worker is responsible for processing operation documents within a single Azure subscription / (soon-to-be) Cosmos DB partition.

Specifically, the backend periodically (every 10 minutes) reads all the items in the newly-added PartitionKeys container and builds an internal list of Azure subscription IDs. Then, on a more frequent cycle (every 10 seconds), it deals this list of subscription IDs out to the pool of worker goroutines through a channel.
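
As a rough sketch of that dealer loop (placeholder names, not the actual identifiers in this PR; assumes only the standard context and time packages):

    // Minimal sketch; listPartitionKeys and the work channel are placeholder names.
    func dealSubscriptions(ctx context.Context, listPartitionKeys func(context.Context) []string, work chan<- string) {
        refresh := time.NewTicker(10 * time.Minute) // rebuild the subscription ID list from PartitionKeys
        deal := time.NewTicker(10 * time.Second)    // deal the current list out to the worker pool
        defer refresh.Stop()
        defer deal.Stop()

        subscriptionIDs := listPartitionKeys(ctx)

        for {
            select {
            case <-ctx.Done():
                return
            case <-refresh.C:
                subscriptionIDs = listPartitionKeys(ctx)
            case <-deal.C:
                for _, id := range subscriptionIDs {
                    work <- id // each ID is picked up by one worker goroutine
                }
            }
        }
    }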

Each worker goroutine then locks the Azure subscription ID it's been given, queries Cosmos DB for any operation items within that Azure subscription, calls Cluster Service for a cluster or (eventually) node pool status update on each operation, updates or deletes Cosmos DB items as necessary, unlocks the Azure subscription ID, and finally listens to the channel again for the next subscription ID to process.
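
And a correspondingly rough sketch of a single worker's loop, with lock and processSubscription standing in as placeholders for the subscription locking and the query / Cluster Service / update-or-delete logic:

    // Minimal sketch of one pool member; not the actual code in this PR.
    func worker(
        ctx context.Context,
        work <-chan string,
        lock func(ctx context.Context, subscriptionID string) (unlock func(), ok bool),
        processSubscription func(ctx context.Context, subscriptionID string),
    ) {
        for {
            select {
            case <-ctx.Done():
                return
            case subscriptionID := <-work:
                unlock, ok := lock(ctx, subscriptionID)
                if !ok {
                    continue // another worker already holds this subscription
                }
                processSubscription(ctx, subscriptionID) // query operations, call Cluster Service, update/delete items
                unlock()
            }
        }
    }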

This dramatically increases the parallelism within the backend without significantly changing the Cluster Service interaction logic.

The polling intervals and worker pool size within the backend may require further tuning. I've exposed these "tuning knobs" as environment variables (though I haven't yet mapped them to a configmap). Also, if the worker pool channel (whose buffer grows with the worker pool size) gets full and starts blocking, we'll get log messages saying so as an indication to re-tune. (We'll probably want metrics for this as well.)
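
For illustration, the knob-reading and full-channel logging could take roughly this shape (BACKEND_WORKER_COUNT and the default of 10 are placeholders, not necessarily the names or values this PR uses; assumes the standard log, os, and strconv packages):

    // intFromEnv reads a positive integer from the environment, falling back to a default.
    func intFromEnv(name string, fallback int) int {
        if v, err := strconv.Atoi(os.Getenv(name)); err == nil && v > 0 {
            return v
        }
        return fallback
    }

    // newWorkChannel sizes the channel buffer with the worker pool.
    func newWorkChannel() chan string {
        workers := intFromEnv("BACKEND_WORKER_COUNT", 10)
        return make(chan string, workers)
    }

    // deal logs when the buffer is full, then blocks until a worker frees up.
    func deal(work chan<- string, subscriptionID string) {
        select {
        case work <- subscriptionID:
        default:
            log.Print("worker pool channel is full; consider re-tuning the pool size")
            work <- subscriptionID
        }
    }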

Jira: ARO-14170 - Merge Cosmos DB containers

Special notes for your reviewer


github-actions bot commented Feb 3, 2025

Please rebase pull request.


github-actions bot commented Feb 3, 2025

Please rebase pull request.


github-actions bot commented Feb 5, 2025

Please rebase pull request.

@mociarain (Collaborator) left a comment

LGTM but I'm concerned we have no tests for the workers. What do you think?

@mbarnes (Collaborator, Author) commented Feb 5, 2025

LGTM but I'm concerned we have no tests for the workers. What do you think?

We have unit tests for the "business logic" of a single worker -- everything that happens after the subscription is locked.

True, we don't have any tests for the worker pool itself. Given the high degree of concurrency involved, I don't know how to mock the DB calls. The call sequence would be non-deterministic, I think. Open to suggestions.

@mociarain (Collaborator)

True, we don't have any tests for the worker pool itself. Given the high degree of concurrency involved, I don't know how to mock the DB calls. The call sequence would be non-deterministic, I think. Open to suggestions.

Yeah. This is what I expected you to say, and I agree. I guess what I would like to see is an integration test where we establish the pool and see some operations getting handled, but that's well outside the scope of this. I assume E2E-type tests are on the horizon somewhere, and this is for then.

@mbarnes (Collaborator, Author) commented Feb 5, 2025

I guess what I would like to see is an integration test where we establish the pool and see some operations getting handled, but that's well outside the scope of this. I assume E2E-type tests are on the horizon somewhere, and this is for then.

Agree. It would be good, actually, to set up a test to DoS this thing (like way too many subscriptions and operations for the number of workers) and make sure the emitted metrics – assuming we add metrics – indicate that. Then we could build an SRE alert around it.

Matthew Barnes added 4 commits February 5, 2025 11:53
In anticipation of the Cosmos DB containers being merged and
partitioned by subscription ID, this commit fully rewrites how
the backend finds and processes operation documents in Cosmos DB.

Instead of querying for all operation documents across all Azure
subscriptions and processing them serially, the backend now uses
a goroutine-based "worker pool" where each worker is responsible
for processing operation documents within a single subscription/
(soon-to-be) Cosmos DB partition.

This dramatically increases the parallelism within the backend
without significantly changing the Cluster Service interaction
logic.

The polling intervals and worker pool size within the backend
may require further tuning.
Operation documents have a limited time-to-live and are never
explicitly deleted.
Async Operation Callbacks protocol requires sending a status
payload in the request body of the notification callback.

https://eng.ms/docs/products/arm/api_contracts/asyncoperationcallback#callback-request-body
@mbarnes (Collaborator, Author) commented Feb 5, 2025

I tacked on a bug fix that isn't really related to the purpose of the PR, but it's in the backend. Realized it today while re-reading Async Operation Callbacks. (Remains to be seen if we're even gonna switch this on, but it's implemented anyway.)

@mbarnes mbarnes requested a review from mociarain February 5, 2025 17:06
Comment on lines +370 to +376
// postAsyncNotification submits a POST request with a status payload to the given URL.
func (s *OperationsScanner) postAsyncNotification(ctx context.Context, operationID string) error {
    // Refetch the operation document to provide the latest status.
    doc, err := s.dbClient.GetOperationDoc(ctx, operationID)
    if err != nil {
        return err
    }
@mbarnes (Collaborator, Author)

In case you're wondering why this refetch is necessary, it's the same reason as in
#985 (comment)
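
For anyone reading this thread without the full diff, the remainder of the function could look roughly like the sketch below. This is hypothetical, not the code in this PR: the field names doc.NotificationURI and doc.Status, the payload shape, and the use of http.DefaultClient are all assumptions (assumes bytes, encoding/json, fmt, and net/http).

        // Hypothetical continuation: POST the refetched status payload to the
        // stored notification URL. doc.NotificationURI and doc.Status are
        // assumed field names, not necessarily those in this PR.
        body, err := json.Marshal(doc.Status)
        if err != nil {
            return err
        }

        req, err := http.NewRequestWithContext(ctx, http.MethodPost, doc.NotificationURI, bytes.NewReader(body))
        if err != nil {
            return err
        }
        req.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        if resp.StatusCode >= 300 {
            return fmt.Errorf("async notification failed: %s", resp.Status)
        }
        return nil
    }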

@mociarain (Collaborator) left a comment

LGTM

@mbarnes mbarnes merged commit 871d333 into main Feb 6, 2025
20 checks passed
@mbarnes mbarnes deleted the backend-rewrite branch February 6, 2025 16:47