[Bug]: possible caching / rpm enforcement issue with usage-based-routing-v2 disabled #7395
Comments
This is not where usage-based routing runs its check @April-forever, it happens here -
If you have a specific test failing, please share it. I'm not sure I follow the caching point in part 2, but if you could share a test where you can trigger the failure, that would be great.
Hello @krrishdholakia, let us focus on the first issue for now, as it confuses me more. From my understanding, the relevant code is here:
Line 861 in 277c6e8
Line 5346 in 277c6e8
Let me provide an example to illustrate the issue. First, consider the following configuration:
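(A minimal sketch of such a configuration; the model names, endpoints, keys, and rpm values below are hypothetical, chosen only so that the two deployments share one model group and one of them has a much smaller rpm limit than the other.)

```yaml
model_list:
  - model_name: gpt-4o                     # shared model group (hypothetical)
    litellm_params:
      model: azure/gpt-4o-deployment-1
      api_base: https://example-endpoint-1.openai.azure.com
      api_key: os.environ/AZURE_API_KEY_1
      rpm: 3                               # deliberately small rpm limit
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment-2
      api_base: https://example-endpoint-2.openai.azure.com
      api_key: os.environ/AZURE_API_KEY_2
      rpm: 10                              # larger rpm limit

router_settings:
  routing_strategy: usage-based-routing-v2
```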
Then, start LiteLLM with the following command:
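(Assuming the sketch above is saved as config.yaml, the proxy is started in the usual way:)

```shell
litellm --config config.yaml
```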
When I sent requests to the model repeatedly, the first three requests were processed successfully (in some cases, the third request would fail). Starting from the fourth request, however, every request failed with the following error:
The reason for this error is that the deployment with the smaller rpm value was selected:
This deployment should not have been selected if the following code had worked as expected: litellm/litellm/router_strategy/lowest_tpm_rpm_v2.py, lines 405 to 408 in bd4ab14
To compare, I tested with the following configuration using usage-based-routing instead of usage-based-routing-v2:
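(Assuming the same model_list as in the sketch above, with only the routing strategy changed:)

```yaml
router_settings:
  routing_strategy: usage-based-routing    # v1 strategy instead of usage-based-routing-v2
```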
With this configuration, all requests were processed correctly until the 11th request, which failed. This behavior is handled correctly due to the following code: litellm/litellm/router_strategy/lowest_tpm_rpm.py, lines 235 to 238 in bd4ab14
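The referenced check is roughly of the following shape (a paraphrase for illustration, not the verbatim repository code):

```python
# Paraphrased shape of the v1 rpm check referenced above (illustration only, not
# verbatim repository code). item_rpm is the number of requests already counted
# for the deployment in the current minute; deployment_rpm is its configured limit.
deployment_rpm = 10
for item_rpm in range(12):          # hypothetical per-minute request counts
    if item_rpm + 1 >= deployment_rpm:
        continue                    # deployment skipped: no remaining rpm budget
    print(f"admitted with {item_rpm} prior requests this minute")
```

With >=, a deployment configured with rpm: 10 is skipped once 9 requests have been counted in the current minute (9 + 1 >= 10), so only 9 requests are admitted.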
(However, I believe the condition should use > instead of >=, as the RPM limit should be inclusive of the upper bound.) Thus, I believe this issue occurs because usage-based-routing-v2 reused the original code without adapting it properly.
What happened?
During my usage of LiteLLM, I noticed two issues:
In the usage-based-routing-v2 strategy, the RPM limit is not properly enforced. This causes deployments that have already reached their RPM limit to be incorrectly selected. Specifically, in the _common_checks_available_deployment method:
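(The shape of that check, paraphrased with hypothetical cache-key formats and values so it can be run standalone; not a verbatim copy of the repository code.)

```python
# Paraphrase of the availability check described below, with hypothetical cache-key
# formats and values (not verbatim repository code). tpm_dict and rpm_dict are keyed
# by full cache keys, but the loop truncates the key to the deployment id before
# looking it up in rpm_dict, so the rpm branch never fires.
tpm_dict = {"deployment-1:gpt-4o:tpm:05-30": 1200}   # tokens used this minute
rpm_dict = {"deployment-1:gpt-4o:rpm:05-30": 3}      # requests made this minute
_deployment_tpm, _deployment_rpm, input_tokens = 100_000, 3, 50

for item, item_tpm in tpm_dict.items():
    item = item.split(":")[0]                        # now just "deployment-1"
    if item_tpm + input_tokens > _deployment_tpm:
        continue
    elif item in rpm_dict and rpm_dict[item] + 1 >= _deployment_rpm:
        continue                                     # never reached: item is not a full rpm_dict key
    print("deployment still considered available:", item)
```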
Here, item is only part of the tpm_dict key (item.split(":")[0]), so it does not correspond to any key in the rpm dictionary. As a result, the condition never matches, and the RPM limit is effectively ignored.
When I disabled the usage-based-routing-v2 strategy and used enable_pre_call_checks, I observed that the router fetches data from two caches (current_request_cache_local and model_group_cache) and takes the maximum value:
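(Roughly, the logic being described, as a standalone paraphrase with hypothetical values rather than the verbatim router code:)

```python
# Standalone paraphrase of the pre-call check described above (hypothetical values,
# not verbatim router code). The request count used for the rpm check is the max of
# a per-instance local counter and the shared model-group cache, so a local entry
# that fails to expire keeps the count inflated into the next minute.
current_request_cache_local = 3                       # per-instance counter (stale)
model_group_cache = {"gpt-4o": 1}                     # shared counter for the model group
current_request = max(current_request_cache_local, model_group_cache.get("gpt-4o", 0))
print(current_request)                                # 3 -- the stale local value wins
```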
I tried to print these values to debug the issue. Initially, this worked as expected. For example:
However, over time, I frequently encountered issues where current_request_cache_local did not expire quickly enough. For example, after more than a minute had passed, current_request_cache_local still held data:
This causes issues in handling the RPM limitation for the current minute. I’m not sure if this is a usage issue on my part or an actual bug.
Relevant log output
Are you a ML Ops Team?
No
What LiteLLM version are you on?
v1.55.1-stable
Twitter / LinkedIn details
No response