Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.x] [Security Solution][Siem migrations] Implement rate limit backoff (#211469) #212178

Open
wants to merge 1 commit into
base: 8.x
Choose a base branch
from

Conversation

semd
Copy link
Contributor

@semd semd commented Feb 22, 2025

Backport

This will backport the following commits from main to 8.x:

Questions ?

Please refer to the Backport tool documentation

…astic#211469)

## Summary

Implements an exponential backoff retry strategy when the LLM API throws
rate limit (`429`) errors.

### Backoff implementation

- The `run` method from the `RuleMigrationsTaskClient` has been moved to
the new `RuleMigrationTaskRunner` class.
- The settings for the backoff are defined in this class with:
```ts
/** Exponential backoff configuration to handle rate limit errors */
const RETRY_CONFIG = {
  initialRetryDelaySeconds: 1,
  backoffMultiplier: 2,
  maxRetries: 8,
  // max waiting time 4m15s (1*2^8 = 256s)
} as const;
```
- Only one rule will be retried at a time, the rest of the concurrent
rule translations blocked by the rate limit will await for the API to
recover before attempting the translation again.

```ts
/** Executor sleep configuration
 * A sleep time applied at the beginning of each single rule translation in the execution pool,
 * The objective of this sleep is to spread the load of concurrent translations, and prevent hitting the rate limit repeatedly.
 * The sleep time applied is a random number between [0-value]. Every time we hit rate limit the value is increased by the multiplier, up to the limit.
 */
const EXECUTOR_SLEEP = {
  initialValueSeconds: 3,
  multiplier: 2,
  limitSeconds: 96, // 1m36s (5 increases)
} as const;
```

### Migration batching changes

```ts
/** Number of concurrent rule translations in the pool */
const TASK_CONCURRENCY = 10 as const;
/** Number of rules loaded in memory to be translated in the pool */
const TASK_BATCH_SIZE = 100 as const;
```

#### Before

- Batches of 15 rules were retrieved and executed in a `Promise.all`,
requiring all of them to be completed before proceeding to the next
batch.
- A "batch sleep" of 10s was executed at the end of each iteration.

#### In this PR

- Batches of 100 rules are retrieved and kept in memory. The execution
is performed in a task pool with a concurrency of 10 rules. This ensures
there are always 10 rules executing at a time.
- The "batch sleep" has been removed in favour of an "execution sleep"
of rand[1-3]s at the start of each single rule migration. This
individual sleep serves two goals:
  - Spread the load when the migration is first launched.
- Prevent hitting the rate limit consistently: The sleep duration is
increased every time we hit a rate limit.

---------

Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 64426b2)
@elasticmachine
Copy link
Contributor

elasticmachine commented Feb 22, 2025

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Detection Engine - Security Solution Cypress Tests #1 / Rule actions during detection rule creation Forwards the correct rule type id to the Cases system action Forwards the correct rule type id to the Cases system action

Metrics [docs]

✅ unchanged

History

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants