
backend: Rewrite Cosmos DB scanning #1200

Merged
merged 4 commits into main from backend-rewrite on Feb 6, 2025
Conversation

@mbarnes (Collaborator) commented Jan 30, 2025

What this PR does

In anticipation of the Cosmos DB containers being merged and partitioned by Azure subscription ID, this PR fully rewrites how the backend finds and processes operation documents in Cosmos DB.

Instead of querying for all operation documents across all Azure subscriptions and processing them serially, the backend now uses a goroutine-based "worker pool" where each worker is responsible for processing operation documents within a single Azure subscription / (soon-to-be) Cosmos DB partition.

Specifically, the backend periodically (every 10 minutes) reads all the items in the newly-added PartitionKeys container and builds an internal list of Azure subscription IDs. Then, on a more frequent cycle (every 10 seconds), it deals this list of subscription IDs out to the pool of worker goroutines through a channel.
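
As a rough sketch of that dealer loop (placeholder names, not the actual identifiers in this PR; assumes only the standard context and time packages):

    // Minimal sketch; listPartitionKeys and the work channel are placeholder names.
    func dealSubscriptions(ctx context.Context, listPartitionKeys func(context.Context) []string, work chan<- string) {
        refresh := time.NewTicker(10 * time.Minute) // rebuild the subscription ID list from PartitionKeys
        deal := time.NewTicker(10 * time.Second)    // deal the current list out to the worker pool
        defer refresh.Stop()
        defer deal.Stop()

        subscriptionIDs := listPartitionKeys(ctx)

        for {
            select {
            case <-ctx.Done():
                return
            case <-refresh.C:
                subscriptionIDs = listPartitionKeys(ctx)
            case <-deal.C:
                for _, id := range subscriptionIDs {
                    work <- id // each ID is picked up by one worker goroutine
                }
            }
        }
    }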

Each worker goroutine then locks the Azure subscription ID it's been given, queries Cosmos DB for any operation items within that Azure subscription, calls Cluster Service for a cluster or (eventually) node pool status update on each operation, updates or deletes Cosmos DB items as necessary, unlocks the Azure subscription ID, and finally listens to the channel again for the next subscription ID to process.
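
And a correspondingly rough sketch of a single worker's loop, with lock and processSubscription standing in as placeholders for the subscription locking and the query / Cluster Service / update-or-delete logic:

    // Minimal sketch of one pool member; not the actual code in this PR.
    func worker(
        ctx context.Context,
        work <-chan string,
        lock func(ctx context.Context, subscriptionID string) (unlock func(), ok bool),
        processSubscription func(ctx context.Context, subscriptionID string),
    ) {
        for {
            select {
            case <-ctx.Done():
                return
            case subscriptionID := <-work:
                unlock, ok := lock(ctx, subscriptionID)
                if !ok {
                    continue // another worker already holds this subscription
                }
                processSubscription(ctx, subscriptionID) // query operations, call Cluster Service, update/delete items
                unlock()
            }
        }
    }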

This dramatically increases the parallelism within the backend without significantly changing the Cluster Service interaction logic.

The polling intervals and worker pool size within the backend may require further tuning. I've exposed these "tuning knobs" as environment variables (though I haven't yet mapped them to a configmap). Also, if the worker pool channel (whose buffer grows with the worker pool size) gets full and starts blocking, we'll get log messages saying so as an indication to re-tune. (We'll probably want metrics for this as well.)
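
For illustration, the knob-reading and full-channel logging could take roughly this shape (BACKEND_WORKER_COUNT and the default of 10 are placeholders, not necessarily the names or values this PR uses; assumes the standard log, os, and strconv packages):

    // intFromEnv reads a positive integer from the environment, falling back to a default.
    func intFromEnv(name string, fallback int) int {
        if v, err := strconv.Atoi(os.Getenv(name)); err == nil && v > 0 {
            return v
        }
        return fallback
    }

    // newWorkChannel sizes the channel buffer with the worker pool.
    func newWorkChannel() chan string {
        workers := intFromEnv("BACKEND_WORKER_COUNT", 10)
        return make(chan string, workers)
    }

    // deal logs when the buffer is full, then blocks until a worker frees up.
    func deal(work chan<- string, subscriptionID string) {
        select {
        case work <- subscriptionID:
        default:
            log.Print("worker pool channel is full; consider re-tuning the pool size")
            work <- subscriptionID
        }
    }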

Jira: ARO-14170 - Merge Cosmos DB containers

Special notes for your reviewer


github-actions bot commented Feb 3, 2025

Please rebase pull request.


github-actions bot commented Feb 3, 2025

Please rebase pull request.


github-actions bot commented Feb 5, 2025

Please rebase pull request.

@mociarain (Collaborator) left a comment

LGTM but I'm concerned we have no tests for the workers. What do you think?

@mbarnes (Collaborator, Author) commented Feb 5, 2025

LGTM but I'm concerned we have no tests for the workers. What do you think?

We have unit tests for the "business logic" of a single worker -- everything that happens after the subscription is locked.

True, we don't have any tests for the worker pool itself. Given the high degree of concurrency involved, I don't know how to mock the DB calls. The call sequence would be non-deterministic, I think. Open to suggestions.

@mociarain (Collaborator)

True, we don't have any tests for the worker pool itself. Given the high degree of concurrency involved, I don't know how to mock the DB calls. The call sequence would be non-deterministic, I think. Open to suggestions.

Yeah. This is what I expected you to say, and I agree. I guess what I would like to see is an integration test where we establish the pool and see some operations getting handled, but that's well outside the scope of this. I assume E2E-type tests are on the horizon somewhere, and this is for then.

@mbarnes (Collaborator, Author) commented Feb 5, 2025

I guess what I would like to see is an integration test where we establish the pool and see some operations getting handled, but that's well outside the scope of this. I assume E2E-type tests are on the horizon somewhere, and this is for then.

Agree. It would be good, actually, to set up a test to DoS this thing (like way too many subscriptions and operations for the number of workers) and make sure the emitted metrics – assuming we add metrics – indicate that. Then we could build an SRE alert around it.

Matthew Barnes added 4 commits February 5, 2025 11:53
In anticipation of the Cosmos DB containers being merged and
partitioned by subscription ID, this commit fully rewrites how
the backend finds and processes operation documents in Cosmos DB.

Instead of querying for all operation documents across all Azure
subscriptions and processing them serially, the backend now uses
a goroutine-based "worker pool" where each worker is responsible
for processing operation documents within a single subscription/
(soon-to-be) Cosmos DB partition.

This dramatically increases the parallelism within the backend
without significantly changing the Cluster Service interaction
logic.

The polling intervals and worker pool size within the backend
may require further tuning.
Operation documents have a limited time-to-live and are never
explicitly deleted.
Async Operation Callbacks protocol requires sending a status
payload in the request body of the notification callback.

https://eng.ms/docs/products/arm/api_contracts/asyncoperationcallback#callback-request-body
@mbarnes (Collaborator, Author) commented Feb 5, 2025

I tacked on a bug fix that isn't really related to the purpose of the PR, but it's in the backend. Realized it today while re-reading Async Operation Callbacks. (Remains to be seen if we're even gonna switch this on, but it's implemented anyway.)

@mbarnes mbarnes requested a review from mociarain February 5, 2025 17:06
Comment on lines +370 to +376
// postAsyncNotification submits a POST request with a status payload to the given URL.
func (s *OperationsScanner) postAsyncNotification(ctx context.Context, operationID string) error {
    // Refetch the operation document to provide the latest status.
    doc, err := s.dbClient.GetOperationDoc(ctx, operationID)
    if err != nil {
        return err
    }
@mbarnes (Collaborator, Author)

In case you're wondering why this refetch is necessary, it's the same reason as in
#985 (comment)
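
For anyone reading this thread without the full diff, the remainder of the function could look roughly like the sketch below. This is hypothetical, not the code in this PR: the field names doc.NotificationURI and doc.Status, the payload shape, and the use of http.DefaultClient are all assumptions (assumes bytes, encoding/json, fmt, and net/http).

        // Hypothetical continuation: POST the refetched status payload to the
        // stored notification URL. doc.NotificationURI and doc.Status are
        // assumed field names, not necessarily those in this PR.
        body, err := json.Marshal(doc.Status)
        if err != nil {
            return err
        }

        req, err := http.NewRequestWithContext(ctx, http.MethodPost, doc.NotificationURI, bytes.NewReader(body))
        if err != nil {
            return err
        }
        req.Header.Set("Content-Type", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        if resp.StatusCode >= 300 {
            return fmt.Errorf("async notification failed: %s", resp.Status)
        }
        return nil
    }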

@mociarain (Collaborator) left a comment

LGTM

@mbarnes mbarnes merged commit 871d333 into main Feb 6, 2025
20 checks passed
@mbarnes mbarnes deleted the backend-rewrite branch February 6, 2025 16:47