[8.x] [Security Solution][Siem migrations] Implement rate limit backoff (#211469) #212178

semd · 2025-02-22T08:52:32Z

Backport

This will backport the following commits from main to 8.x:

[Security Solution][Siem migrations] Implement rate limit backoff (#211469)

Questions ?

Please refer to the Backport tool documentation

…astic#211469) ## Summary Implements an exponential backoff retry strategy when the LLM API throws rate limit (`429`) errors. ### Backoff implementation - The `run` method from the `RuleMigrationsTaskClient` has been moved to the new `RuleMigrationTaskRunner` class. - The settings for the backoff are defined in this class with: ```ts /** Exponential backoff configuration to handle rate limit errors */ const RETRY_CONFIG = { initialRetryDelaySeconds: 1, backoffMultiplier: 2, maxRetries: 8, // max waiting time 4m15s (1*2^8 = 256s) } as const; ``` - Only one rule will be retried at a time, the rest of the concurrent rule translations blocked by the rate limit will await for the API to recover before attempting the translation again. ```ts /** Executor sleep configuration * A sleep time applied at the beginning of each single rule translation in the execution pool, * The objective of this sleep is to spread the load of concurrent translations, and prevent hitting the rate limit repeatedly. * The sleep time applied is a random number between [0-value]. Every time we hit rate limit the value is increased by the multiplier, up to the limit. */ const EXECUTOR_SLEEP = { initialValueSeconds: 3, multiplier: 2, limitSeconds: 96, // 1m36s (5 increases) } as const; ``` ### Migration batching changes ```ts /** Number of concurrent rule translations in the pool */ const TASK_CONCURRENCY = 10 as const; /** Number of rules loaded in memory to be translated in the pool */ const TASK_BATCH_SIZE = 100 as const; ``` #### Before - Batches of 15 rules were retrieved and executed in a `Promise.all`, requiring all of them to be completed before proceeding to the next batch. - A "batch sleep" of 10s was executed at the end of each iteration. #### In this PR - Batches of 100 rules are retrieved and kept in memory. The execution is performed in a task pool with a concurrency of 10 rules. This ensures there are always 10 rules executing at a time. - The "batch sleep" has been removed in favour of an "execution sleep" of rand[1-3]s at the start of each single rule migration. This individual sleep serves two goals: - Spread the load when the migration is first launched. - Prevent hitting the rate limit consistently: The sleep duration is increased every time we hit a rate limit. --------- Co-authored-by: kibanamachine <[email protected]> (cherry picked from commit 64426b2)

elasticmachine · 2025-02-22T10:15:43Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 09529b9

Failed CI Steps

Detection Engine - Security Solution Cypress Tests #1

Test Failures

[job] [logs] Detection Engine - Security Solution Cypress Tests #1 / Rule actions during detection rule creation Forwards the correct rule type id to the Cases system action Forwards the correct rule type id to the Cases system action

Metrics [docs]

✅ unchanged

History

semd requested a review from kibanamachine as a code owner February 22, 2025 08:52

semd added the backport label Feb 22, 2025

semd enabled auto-merge (squash) February 22, 2025 08:52

semd mentioned this pull request Feb 22, 2025

[Security Solution][Siem migrations] Implement rate limit backoff #211469

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[8.x] [Security Solution][Siem migrations] Implement rate limit backoff (#211469) #212178

[8.x] [Security Solution][Siem migrations] Implement rate limit backoff (#211469) #212178

semd commented Feb 22, 2025

elasticmachine commented Feb 22, 2025 •

edited

Loading

[8.x] [Security Solution][Siem migrations] Implement rate limit backoff (#211469) #212178

Are you sure you want to change the base?

[8.x] [Security Solution][Siem migrations] Implement rate limit backoff (#211469) #212178

Conversation

semd commented Feb 22, 2025

Backport

Questions ?

elasticmachine commented Feb 22, 2025 • edited Loading

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

Metrics [docs]

History

elasticmachine commented Feb 22, 2025 •

edited

Loading