
Consider consecutive estimator internal errors #3178

Merged
merged 12 commits into from
Jan 2, 2025

Conversation

@squadgazzz (Contributor) commented Dec 20, 2024

Description

From the original issue:

Native prices are an essential part of the auction. That's why the autopilot needs to have a valid native price for all tokens (sold and bought) in an auction. The NativePriceCache has a background task that continuously updates missing native prices.
It obviously caches valid native prices, but it also caches responses indicating that no solver has enough liquidity for the requested token. This is done to avoid fetching native prices over and over for tokens that no solver supports.
However, when a solver reports a different error (the best example being rate-limiting errors), we assume it is just intermittent and that re-requesting the native price might work in the future.

The problem is that this logic does not consider the case where a solver always returns an error for a given token while no other solver is able to produce a good result.
This happened recently when one solver kept returning 404 errors because it had been shut down on staging. Since no solver was able to produce a price, and the logic assumed these errors were intermittent, we queried the native price for a few tokens over and over again, which resulted in far more requests than usual.

This PR addresses the issue by introducing a counter of consecutive estimator internal errors. The consecutive error threshold is hardcoded at 5 (still subject to consideration). Until the threshold is reached, requests continue to be sent to the solver. Once it is reached, the cache serves the cached value until it expires via the existing timeout or until another result gets cached. Receiving recoverable errors does not reset the accumulated error counter.
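The gating described above can be sketched roughly as follows. This is a minimal, hypothetical model for illustration: the real cache in `native_price_cache.rs` tracks more state (timestamps, recoverable vs. internal errors), and the `CachedResult`/`CacheEntry` names here are invented.

```rust
/// Defines how many consecutive errors are allowed before the cache starts
/// returning the error to the user without trying to fetch the price from
/// the estimator (value taken from the PR; still subject to consideration).
const ACCUMULATIVE_ERRORS_THRESHOLD: u32 = 5;

#[derive(Clone, Debug, PartialEq)]
enum CachedResult {
    /// A valid native price was fetched.
    Price(f64),
    /// The estimator returned an internal error this many times in a row.
    Error { consecutive_errors: u32 },
}

struct CacheEntry {
    result: CachedResult,
}

impl CacheEntry {
    /// Records a fetch outcome. A success overwrites any error streak;
    /// another internal error extends the streak.
    fn update(&mut self, outcome: Result<f64, ()>) {
        // Read the current streak length before overwriting the entry.
        let streak = match &self.result {
            CachedResult::Error { consecutive_errors } => *consecutive_errors,
            _ => 0,
        };
        self.result = match outcome {
            Ok(price) => CachedResult::Price(price),
            Err(()) => CachedResult::Error {
                consecutive_errors: streak.saturating_add(1),
            },
        };
    }

    /// While the threshold is not reached, requests keep being sent to the
    /// estimator; afterwards the cached error is served until the entry
    /// expires via the regular timeout or another result gets cached.
    fn should_query_estimator(&self) -> bool {
        match self.result {
            CachedResult::Error { consecutive_errors } => {
                consecutive_errors < ACCUMULATIVE_ERRORS_THRESHOLD
            }
            CachedResult::Price(_) => true,
        }
    }
}
```

With this model, a token whose estimator keeps failing stops being queried after the fifth consecutive internal error, while a single success clears the streak.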

How to test

Added a unit test.

Related Issues

Fixes #3159

@squadgazzz squadgazzz marked this pull request as ready for review December 24, 2024 20:10
@squadgazzz squadgazzz requested a review from a team as a code owner December 24, 2024 20:10
/// Defines how many consecutive errors are allowed before the cache starts
/// returning the error to the user without trying to fetch the price from the
/// estimator.
const ACCUMULATIVE_ERRORS_THRESHOLD: u32 = 5;
@m-lord-renkse (Contributor) commented Dec 30, 2024

Wouldn't it be simpler and more robust to define a custom NoLiquidity response on the solver side? Then, if we receive NoLiquidity, there is no liquidity, and any other answer is some kind of error, like the server being down, rate limiting, etc.

@squadgazzz (Contributor, Author) commented:
We already have this error: https://github.com/cowprotocol/services/pull/3178/files#diff-50d12b254f11a969c7dc675e6f3181c39a3785e541dac0963f869d2d3ec9bacaR231
Not sure I am following the idea. The purpose of this PR is to detect whether the solver is offline.

@m-lord-renkse (Contributor) commented:
Sorry, I didn't explain myself properly. The idea would be to have known errors, like NoLiquidity, for which we could keep requesting the prices at a "normal" pace, and then a set of known (e.g., rate limiting) and unknown (e.g., server down) errors for which we would:

  • request at a greater time interval, so it is not so spammy
  • have an exponential backoff method with a maximum cap, so the spam is reduced significantly,
    but at the same time we keep requesting the token in case the price becomes available; just the load is significantly reduced.
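The backoff suggested in these bullets could look roughly like this. This is a hypothetical helper, not code from the PR; `base` and `cap` stand in for the configurable refresh interval and maximum cap mentioned above:

```rust
use std::time::Duration;

/// Returns how long to wait before re-requesting a price that keeps
/// failing: the base refresh interval is doubled per consecutive error,
/// capped so the token still gets retried occasionally.
fn backoff_interval(base: Duration, consecutive_errors: u32, cap: Duration) -> Duration {
    // Clamp the exponent so the multiplier cannot overflow u32.
    let factor = 2u32.saturating_pow(consecutive_errors.min(16));
    base.saturating_mul(factor).min(cap)
}
```

For example, with a 1-second base and a 5-minute cap, the interval grows 1s, 2s, 4s, 8s, ... and then plateaus at the cap, so persistently failing tokens generate only a trickle of requests.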

@squadgazzz (Contributor, Author) commented:

If I got it correctly, that means a custom cache entry expiration time depending on the cached value type. I will think about that. At first glance, it requires many more changes.

@m-lord-renkse (Contributor) commented Dec 30, 2024

that means a custom cache entry expiration time depending on the cached value type

We know time.now(), and we can derive how long the previous expiration was from requested_at (I may be wrong here, and we may need a new timestamp field). Then apply exponential backoff to the previous timeout with a (preferably configurable) maximum cap value. Pretty simple, maybe I am missing something 🤔

@MartinquaXD (Contributor) left a comment

Looks correct to me.
Nice test. 👍
A bit unfortunate that this component accumulates complexity. Hopefully this can be addressed when we build the separate pricing service.

crates/shared/src/price_estimation/native_price_cache.rs Outdated Show resolved Hide resolved
crates/shared/src/price_estimation/native_price_cache.rs Outdated Show resolved Hide resolved
}
None
}
}
}

fn get_ready_to_use_cached_price(
Contributor:

This function doesn't feel like it's carrying its own weight. I think inlining it in the 2 places where it's used is fine and similarly expressive.

@squadgazzz (Contributor, Author):

I just wanted to hide the filtering details there and avoid using a non-filtered cache anywhere else.

crates/shared/src/price_estimation/native_price_cache.rs Outdated Show resolved Hide resolved
@squadgazzz squadgazzz enabled auto-merge (squash) January 2, 2025 11:34
@squadgazzz squadgazzz merged commit 1294738 into main Jan 2, 2025
11 checks passed
@squadgazzz squadgazzz deleted the fix/3159 branch January 2, 2025 11:39
@github-actions github-actions bot locked and limited conversation to collaborators Jan 2, 2025
Successfully merging this pull request may close these issues.

chore: Cache persistent errors when computing native prices
3 participants