From f78f0df7b30c3c41c5144b3fbcbcbc927c4e97e9 Mon Sep 17 00:00:00 2001
From: Casie Chen
Date: Fri, 27 Sep 2024 16:46:07 -0700
Subject: [PATCH] DESIGN.md suggestions

---
 pkg/scheduler/DESIGN.md | 124 +++++++++++++++++++++-------------------
 1 file changed, 64 insertions(+), 60 deletions(-)

diff --git a/pkg/scheduler/DESIGN.md b/pkg/scheduler/DESIGN.md
index c8f34e2520e..58476a32433 100644
--- a/pkg/scheduler/DESIGN.md
+++ b/pkg/scheduler/DESIGN.md
@@ -8,24 +8,27 @@ the queuing logic is isolated into a "tree queue" structure and its associated q

## Tree Queue: What and Why

-The "tree queue" structure serves the purpose of a discrete priority queue.
-The requests are split into many queues, each of which is located at a leaf node in the tree structure.
+The "tree queue" structure builds a discrete priority queue.
+Requests are split into many queues; each request queue is located at a leaf node in the tree structure.

-The tree structure enables some of the specific requirements of our queue selection algorithms:
+The prioritization decisions required for the request queue carry constraints which lend themselves to a
+search tree or decision tree structure:
+- Each decision is made independently of the other, based on its own state
+- The decisions are hierarchical, or ordered; the same set of request queues may present a different final decision
+  (request queue) depending on the order of decision execution
+- The second decision may return no valid queues. In this case, we defer back to the first decision for a different result
+- If there are any requests waiting to be served, and valid queriers available to serve those requests,
+  then these two decisions combined _must_ eventually produce a request queue.

-- we must select a queue to dequeue from based on two independent algorithms, each with their own state
-- there is a hierarchy of importance between the two algorithms - one is primary, the other secondary
-- one of the algorithms (tenant-querier shuffle shard) can reject all queue options presented to it,
-  requiring us to return back up to the previous level of queue selection to continue searching.
-
-These requirements lend themselves to a search tree or decision tree structure;
-the levels of the tree express a clear hierarchy of decisonmaking between the two algorithms,
-and the depth-first traversal provides a familiar pattern for searching for a leaf node to dequeue from.
+The ordered nature of the decision-making can be expressed as tree levels,
+and the end goal of producing a request queue can be modeled as a depth-first search.

### Diagram: Dequeue Decision Tree (Simplified)

-For diagrams in this doc, we omit the `unknown` query component node and its subtree to save space.
-The system treats `unknown` the same as `ingester-and-store-gateway`.
+> [!NOTE]
+> The system maintains a fourth query component, `unknown`, which is treated the same as `ingester-and-store-gateway`.
+For diagrams in this doc, we omit the `unknown` query component node and its subtree for visual simplicity.
+ ```mermaid --- @@ -105,7 +108,7 @@ graph TB ### Enqueuing to the Tree Queue -On enqueue, we partition requests into separate queues based on two static properties of the query request: +On enqueue, we partition requests into separate simple queues based on two static properties of the query request: - the "expected query component" - `ingester`, `store-gateway`, `ingester-and-store-gateway`, or `unknown` @@ -113,19 +116,19 @@ On enqueue, we partition requests into separate queues based on two static prope These properties are used to place the request into a queue at a leaf node. A request from `tenant-1` which is expected to only utilize ingesters -will be enqueued at the leaf node reached by the path `root -> ingester -> tenant-1`. +will be enqueued at the leaf node with path `root -> ingester -> tenant-1`. ### Dequeuing from the Tree Queue -On dequeue, we perform a depth-first search of the tree structure to select a leaf node to dequeue from. +On dequeue, we perform a depth-first search of the tree to select a leaf node to dequeue from. Each of the two non-leaf levels of the tree uses a different algorithm to select the next child node. 1. At the root node level, one algorithm selects one of four possible query component child nodes. -1. At query component level, the other algorithm attempts to select a tenant-specific child node. +2. At query component level, the other algorithm attempts to select a tenant-specific child node. 1. due to tenant-querier shuffle sharding, it is possible that none of the tenant nodes can be selected for dequeuing for the current querier. -1. If a tenant node is selected, the search dequeues from it as it has reached a leaf node. -1. If no tenant node is selected, the search returns back up to the root node level +3. If a tenant node is selected, the search dequeues from it, as it has reached a leaf node. +4. If no tenant node is selected, the search returns back up to the root node level and selects the next query component child node to continue the search from. ### Diagram: Dequeue Decision Tree (Full) @@ -242,57 +245,53 @@ graph TB ### Original State: Queue Splitting by Tenant -The `RequestQueue` originally utilized only a single dimension of queue splitting, by tenant. -This approach served two purposes: +The `RequestQueue` originally only split queues by tenant, with two goals in mind: 1. tenant fairness via a simple round-robin between all tenants with non-empty query request queues -1. rudimentary tenant isolation via shuffle-shard assignment of noisy tenants to only a subset of queriers +2. rudimentary tenant isolation via shuffle-shard assignment of noisy tenants to only a subset of queriers While this inter-tenant Quality-Of-Service approach has worked well, -other QOS issues have arisen from the varying characteristics of Mimir's two "query components" -utilized by the queriers to fetch TSDB data for executing queries: ingesters and store-gateways. +other QoS issues have arisen from the varying characteristics of Mimir's two "query components" -- +components that the queriers fetch TSDB data from in order to execute queries: ingesters and store-gateways. ### New Requirement: Queue Splitting by Query Component Ingesters serve requests for recent data, and store-gateways serve requests for older data. -While queries can overlap the time periods of data fetched by both query components, +While queries can span the time periods served by both query components, many requests are served by only one of the two components. 
Ingesters and store-gateways tend to experience issues independently of each other, but when one component was in a degraded state, _all_ queries would wait in the queue behind the slow queries, +regardless of which query component they required, causing high latency and timeouts for queries which could have been serviced by the non-degraded query component. -### Phase 1: Query Component Selection by Round-Robin - -In the first phase, we believed that it would be sufficient to duplicate the tenant queue splitting approach. -We split the tenant queues further by query component, so that each tenant could have up to four queues. - -To enable more clear management of the two dimensions of queue splitting rather than one, -we introduced the "tree queue" structure, inspired by Loki's implementation. - -For simplicity at this stage, the tenant selection algorithm was kept higher in the tree -and therefore took priority over the query component queue selection algorithm. -Additionally, the query component selection algorithm was a simple round-robin. -This phase was a failure due to both of those design decisions. +### New Requirement: Ordered Decision-Making -#### Failure 1: Tenant Selection Priority over Query Component Selection (minor) +Because all requests will now be split across two dimensions instead of one, +it matters which dimension is considered first. Earlier in this project, we optimized for simplicity +by implementing the additional query-component queue splitting as a decision taken after choosing a tenant. +As a result, because any given tenant in the queue was guaranteed to have some request(s) queued, +we always dequeued a request for the next tenant in the queue, +even if that tenant only had requests queued for a degraded component. -The fact that the tenant selection was given priority over query-component selection -meant that a tenant's query traffic profile could override the query component round-robin. +Thus, the decision of "which query component to dequeue a request for?" +must come _before_ the decision of which tenant to dequeue a request for. -If the tenant rotation had selected `tenant-1` which was only sending ingester queries at the time, -the round-robin algorithm could only select the ingester queue from the child queue nodes for `tenant-1`, -overriding the intended progression of the query component round-robin. - -#### Failure 2: Inability to Prevent Processing Time Dominance by Slow Queries (major) +### New Requirement: Query Component Prioritization A vanilla round-robin algorithm does not sufficiently guard against a high-latency component -saturating all or nearly all connections with requests in flight in the slow component. +saturating all or nearly all connections with inflight requests for that component. Despite rotating which query component is dequeued for, utilization of the querier-worker connection pool as measured by inflight query processing time will grow asymptotically to be dominated by the slow query component. +In some cases, we may reach this state even faster than in the simple (round-robin across tenant queues) case. +This is because every querier has, at worst, a 75% chance of dequeuing a request for a degraded component +while there are still queries available for non-degraded components. + +Therefore, we are required to make some prioritization decisions about query components +to keep dequeuing queries for non-degraded components wherever possible. 
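To make the ordered, two-level decision concrete, below is a minimal Go sketch of the dequeue search described above.
It is illustrative only: the type and function names (`treeQueue`, `dequeueFor`, `canServe`) are hypothetical, and the
real tree queue in `pkg/scheduler` keeps ordered child lists, rotation state, and node deletion rather than the plain
maps used here for brevity.

```go
// A minimal sketch of the ordered, two-level dequeue decision.
// Not the actual Mimir scheduler implementation.
package main

import "fmt"

type request struct {
	tenantID  string
	component string // "ingester", "store-gateway", "ingester-and-store-gateway", "unknown"
}

// queues[component][tenant] is the leaf queue at path root -> component -> tenant.
type treeQueue struct {
	componentOrder []string // decision 1: which query-component node to try first
	queues         map[string]map[string][]request
}

// querier models tenant-querier shuffle sharding: decision 2 may reject every
// tenant node under a component node for the querier asking to dequeue.
type querier struct {
	id      string
	tenants map[string]bool
}

func (q *querier) canServe(tenantID string) bool { return q.tenants[tenantID] }

// dequeueFor takes the two decisions in order: pick a query-component node first,
// then try to pick a tenant node the querier is allowed to serve. If no tenant
// under that component is eligible, fall back to the next component node --
// the depth-first search described above.
func (t *treeQueue) dequeueFor(q *querier) (request, bool) {
	for _, component := range t.componentOrder { // decision 1
		for tenantID, queue := range t.queues[component] { // decision 2
			if len(queue) == 0 || !q.canServe(tenantID) {
				continue
			}
			req := queue[0]
			t.queues[component][tenantID] = queue[1:]
			return req, true
		}
		// no eligible tenant under this component node; defer back to decision 1
	}
	return request{}, false
}

func main() {
	tq := &treeQueue{
		componentOrder: []string{"ingester", "store-gateway"},
		queues: map[string]map[string][]request{
			"ingester":      {"tenant-1": {{tenantID: "tenant-1", component: "ingester"}}},
			"store-gateway": {"tenant-2": {{tenantID: "tenant-2", component: "store-gateway"}}},
		},
	}
	// This querier is not in tenant-1's shuffle shard, so the search backtracks
	// from the ingester node and dequeues tenant-2's store-gateway request instead.
	q := &querier{id: "querier-1", tenants: map[string]bool{"tenant-2": true}}
	fmt.Println(tq.dequeueFor(q))
}
```

Note that the sketch makes the component decision strictly before the tenant decision, which is the ordering argued for
above; the following section describes how that first decision is made in practice.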
-### Phase 2: Query Component Selection to Solve Processing Time Dominance by Slow Queries +### Query Component Selection to Solve Processing Time Dominance by Slow Queries #### Modeling the Problem @@ -355,11 +354,12 @@ gantt ### Solution: Query Component Partitioning by Querier-Worker -This solution is inspired by a research paper from Brown University and Microsoft: +This solution is inspired by [Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services](https://people.mpi-sws.org/~jcmace/papers/mace20162dfq.pdf). -Querier-worker connections are partitioned evenly across up to four possible query-component nodes -with a modulo of the querier-worker connection ID across the ordered list of nodes. +Querier-worker connections are given IDs, +and partitioned evenly across up to four possible query-component nodes +via a modulo operation: `querier-worker connection ID % number of query-component nodes`. Ex: Assume a query component node order of `[ingester, store-gateway, ingester-and-store-gateway, unknown]`. @@ -369,14 +369,18 @@ Assume a query component node order of `[ingester, store-gateway, ingester-and-s - etc. for `ingester-and-store-gateway`, and `unknown` We conservatively expect degradation of the store-gateway query component will cause high latency -for the queries in the `store-gateway`, `ingester-and-store-gateway`, and `unknown` queues, -but by partitioning the querier-worker connections evenly across the four queues, -25% of connections remain reserved to process queries from the `ingester` queue. +for the queries in the `store-gateway`, `ingester-and-store-gateway`, and `unknown` queues. +By partitioning the querier-worker connections evenly across the four queues, +25% of connections remain "reserved" to process queries from the `ingester` queue. The primary measure of success is the servicing of the queries to the non-degraded query component, In real-world scenarios the slow queries are often slow enough to hit timeouts, and the majority of those queries will be expected to fail until the component recovers. +A secondary measure of success is the continued utilization of queriers while there are still any requests in the queue. +The modulo operation described above supports this; if, in the example above, +we exhaust the `ingester` queue, it will be removed and querier-worker connections will be distributed amongst the remaining three queues as they become available again. + #### Modeling the Solution Again we simplify the system to two query components and four querier connections. @@ -459,27 +463,27 @@ gantt #### Caveats: Corner Cases and Things to Know ##### Distribution of Querier-Worker Connections Across Query Component Nodes - -**At least 4 querier-worker connections per querier are required to avoid starving a query component node.** +**If there are fewer than 4 querier-worker connections to the request queue, a query-component +node can be starved of connections.** To prevent this, the querier has been updated to create at least 4 connections to each scheduler, ignoring any `-querier.max-concurrent` value below 4. **When the total number of querier-worker connections is not evenly divisible by the number of query component nodes, the modulo distribution will be uneven, with some nodes being assigned one extra connection**. -This is not considered to be an issue. -Queue nodes are deleted as queues are cleared, then recreated in whichever order the queries arrive in. 
-As the node count and order changes over time, it in turn shuffles which node(s) receive the extra connections. +This is not an issue. +Queue nodes are deleted as queues are cleared, then recreated in whichever order new queries arrive. +As the node count and order changes over time, the node(s) which receive the extra connections are naturally shuffled. ##### Empty Queue Node Deletion Can Cause Temporary Starvation -As mentioned above, when a queue node is emptied it is deleted from the tree structure +As mentioned above, when a queue node is emptied, it is deleted from the tree structure and cannot be selected by the queue selection algorithms. This can result in the following scenario: 1. Queries to store-gateways are experiencing high latency, causing backup in the `store-gateway`, `ingester-and store-gateway`, and `unknown queues`. 2. The ingester-only queries continue to be dequeued and processed by 1/4 of the querier-worker connections. -3. The ingester-only queue is emptied and the `ingester` node is deleted from the tree. +3. The ingester-only queue is exhausted and the `ingester` node is deleted from the tree. 4. The querier-worker connections are now evenly distributed across the remaining three nodes, and _all_ connections are now stuck working on slow queries touching the degraded store-gateways. 5. More ingester-only queries arrive and are enqueued at the `ingester` node,
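As a rough illustration of the partitioning rule described above
(`querier-worker connection ID % number of query-component nodes`), the sketch below shows how connections map onto the
ordered node list, and how the same modulo naturally redistributes connections across the remaining nodes once the
`ingester` node is deleted. The function name is hypothetical and this is not the actual `pkg/scheduler` API.

```go
// A minimal sketch of querier-worker partitioning by query-component node,
// assuming the scheduler keeps the component nodes in an ordered list.
package main

import "fmt"

// nodeForConnection maps a querier-worker connection to a query-component node
// via connection ID modulo the number of currently-existing nodes.
func nodeForConnection(workerID int, nodes []string) string {
	return nodes[workerID%len(nodes)]
}

func main() {
	// With all four nodes present, connections 0, 4, 8, ... are reserved for ingester-only queries.
	nodes := []string{"ingester", "store-gateway", "ingester-and-store-gateway", "unknown"}
	for workerID := 0; workerID < 8; workerID++ {
		fmt.Printf("querier-worker %d -> %s\n", workerID, nodeForConnection(workerID, nodes))
	}

	// When the ingester queue is exhausted its node is deleted; the same modulo
	// now spreads all connections across the remaining three nodes, so no
	// querier-worker sits idle while other queues still hold requests.
	nodes = []string{"store-gateway", "ingester-and-store-gateway", "unknown"}
	for workerID := 0; workerID < 8; workerID++ {
		fmt.Printf("querier-worker %d -> %s\n", workerID, nodeForConnection(workerID, nodes))
	}
}
```

This redistribution is what produces both the "continued utilization" benefit described earlier and the temporary
starvation caveat in the scenario above: once the `ingester` node is deleted, every connection can end up committed to
slow queries until one frees up after the node is recreated.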