
[Bug]: possible caching / rpm enforcement issue with usage-based-routing-v2 disabled #7395

Open
April-forever opened this issue Dec 24, 2024 · 2 comments
Labels: awaiting: user response, bug (Something isn't working)

Comments

@April-forever

What happened?

During my usage of LiteLLM, I noticed two issues:

  1. RPM Limitation Not Correctly Supported in usage-based-routing-v2

In the usage-based-routing-v2 strategy, the RPM limit is not enforced correctly, which allows deployments that have already reached their RPM limit to still be selected. Specifically, in the _common_checks_available_deployment method:

elif (rpm_dict is not None and item in rpm_dict) and (
    rpm_dict[item] + 1 >= _deployment_rpm
):

Here, item is only the leading portion of the tpm_dict key (item.split(":")[0]), so it does not correspond to the key format used for rpm_dict. As a result, the item in rpm_dict check never succeeds, and the RPM limit is effectively ignored.
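
To make the mismatch concrete, here is a small illustration (the exact key formats are my assumption for illustration only, not copied from the LiteLLM source):

# Illustrative sketch only: if rpm_dict is keyed by "<id>:<time>" while the
# lookup uses only the "<id>" prefix taken from the tpm key, the membership
# test can never succeed and the deployment is never skipped.
rpm_dict = {"deployment-1:10-20": 5}      # usage stored under a time-suffixed key (assumed format)
_deployment_rpm = 1

tpm_key = "deployment-1:10-20"
item = tpm_key.split(":")[0]              # "deployment-1" -- only the id portion

if (rpm_dict is not None and item in rpm_dict) and (
    rpm_dict[item] + 1 >= _deployment_rpm
):
    print("deployment skipped")           # never reached: "deployment-1" is not a key of rpm_dict
else:
    print("deployment still considered")  # always taken, so the RPM cap is ignored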

  2. Cache Behavior in enable_pre_call_checks

When I disable usage-based-routing-v2 and rely on enable_pre_call_checks instead, I observed that the router reads from two caches (current_request_cache_local and model_group_cache) and takes the maximum of the two values:

### get usage based cache ###
if (
    isinstance(model_group_cache, dict)
    and self.routing_strategy != "usage-based-routing-v2"
):
    model_group_cache[model_id] = model_group_cache.get(model_id, 0)

    current_request = max(
        current_request_cache_local, model_group_cache[model_id]
    )

I tried to print these values to debug the issue. Initially, this works as expected. For example:

08:05:10 - LiteLLM Router:DEBUG: router.py:5070 - current_request_cache_local=1
08:05:10 - LiteLLM Router:DEBUG: router.py:5071 - model_group_cache=1
08:05:10 - LiteLLM Router:DEBUG: router.py:5072 - current_request=1
08:05:10 - LiteLLM Router:DEBUG: router.py:5073 - rpm=1
08:05:10 - LiteLLM Router:DEBUG: router.py:5070 - current_request_cache_local=2
08:05:10 - LiteLLM Router:DEBUG: router.py:5071 - model_group_cache=2
08:05:10 - LiteLLM Router:DEBUG: router.py:5072 - current_request=2
08:05:10 - LiteLLM Router:DEBUG: router.py:5073 - rpm=3

However, over time, I frequently encountered issues where current_request_cache_local did not expire quickly enough. For example, after more than a minute had passed, current_request_cache_local still held data:

08:09:31 - LiteLLM Router:DEBUG: router.py:5070 - current_request_cache_local=0
08:09:31 - LiteLLM Router:DEBUG: router.py:5071 - model_group_cache=0
08:09:31 - LiteLLM Router:DEBUG: router.py:5072 - current_request=0
08:09:31 - LiteLLM Router:DEBUG: router.py:5073 - rpm=1
08:09:31 - LiteLLM Router:DEBUG: router.py:5070 - current_request_cache_local=3
08:09:31 - LiteLLM Router:DEBUG: router.py:5071 - model_group_cache=0
08:09:31 - LiteLLM Router:DEBUG: router.py:5072 - current_request=3
08:09:31 - LiteLLM Router:DEBUG: router.py:5073 - rpm=3

This throws off the RPM accounting for the current minute. I'm not sure whether this is a usage problem on my part or an actual bug.
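
For illustration, here is what I would expect from a per-minute counter (a sketch of the intended behavior as I understand it, not LiteLLM's actual code): if the key carries the current minute, counts from a previous minute are simply never read back.

import time
from collections import defaultdict

counters = defaultdict(int)  # hypothetical in-memory store, stands in for the cache

def record_request(deployment_id: str) -> int:
    minute = time.strftime("%H-%M")    # current minute window
    key = f"{deployment_id}:{minute}"  # minute-scoped key
    counters[key] += 1
    return counters[key]

def current_minute_usage(deployment_id: str) -> int:
    minute = time.strftime("%H-%M")
    return counters[f"{deployment_id}:{minute}"]  # counts from older minutes never leak in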

Relevant log output

Attached above

Are you a ML Ops Team?

No

What LiteLLM version are you on ?

v1.55.1-stable

Twitter / LinkedIn details

No response

@April-forever April-forever added the bug Something isn't working label Dec 24, 2024
@krrishdholakia
Contributor

Specifically, in the _common_checks_available_deployment method

this is not where usage based routing runs its check @April-forever

it happens here -

async def async_pre_call_check(

if you have a specific test failing please share it.

i'm not sure i follow the caching point in part 2, but if you could share a test where you can trigger the failure event that would be great

@krrishdholakia krrishdholakia self-assigned this Dec 24, 2024
@krrishdholakia krrishdholakia changed the title [Bug]: Problems with RPM Limitation and Cache Behavior in LiteLLM [Bug]: possible caching / rpm issue with usage-based-routing-v2 disabled Dec 24, 2024
@krrishdholakia krrishdholakia changed the title [Bug]: possible caching / rpm issue with usage-based-routing-v2 disabled [Bug]: possible caching / rpm enforcement issue with usage-based-routing-v2 disabled Dec 24, 2024
@April-forever
Author

Hello, let us focus on the first issue for now, as it confuses me more. @krrishdholakia

From my understanding, _common_checks_available_deployment is not used for the pre-call check, but it is called from the async_get_available_deployments method when selecting a deployment, and that is where the RPM limit should be checked:

deployment = await self.async_get_available_deployment(

await self.lowesttpm_logger_v2.async_get_available_deployments(

Let me provide an example to illustrate the issue.

First, consider the following configuration:

model_list:
  - model_name: model-test
    litellm_params:
      model: somemodel
      api_key: somekey
      api_base: somehost
      rpm: 1
  - model_name: model-test
    litellm_params:
      model: somemodel
      api_key: somekey
      api_base: somehost
      rpm: 10
router_settings:
  routing_strategy: usage-based-routing-v2
  disable_cooldowns: True

Then, start LiteLLM with the following command:

docker run -v /config.yaml:/etc/litellm/config.yaml \
           -p 4000:4000 \
           -it \
           litellm/litellm:v1.55.1-stable \
           --config /etc/litellm/config.yaml --detailed_debug
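
For reference, the repeated requests can be sent with a small loop like this (a sketch using the OpenAI Python client pointed at the local proxy; the api_key value is only a placeholder):

# Sketch of a reproduction loop (not part of the original report).
import openai

client = openai.OpenAI(base_url="http://localhost:4000", api_key="placeholder-key")

for i in range(1, 6):
    try:
        client.chat.completions.create(
            model="model-test",
            messages=[{"role": "user", "content": f"request {i}"}],
        )
        print(i, "ok")
    except Exception as e:  # per the behavior described below, failures start around the 4th request
        print(i, "failed:", e)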

When sending requests to the model repeatedly, I found that the first three requests were processed successfully (in some cases the third request would already fail). From the fourth request onward, every request fails with the following error:

An error occurred: litellm.RateLimitError: Deployment over defined rpm limit=1. current usage=1

The reason for this error is that the deployment with the smaller rpm value was selected:

10:20:47 - LiteLLM Router:DEBUG: lowest_tpm_rpm_v2.py:440 - get_available_deployments - Usage Based. model_group: model-test, healthy_deployments: [{'model_name': 'model-test', 'litellm_params': {'rpm': 1, 'api_key': 'somekey', 'api_base': 'somehost', 'model': 'somemodel'}, 'model_info': {'id': '7dad6c90ada0fe4774ad25e300359673b11ae7f6c993419aec3af', 'db_model': False}}, {'model_name': 'model-test', 'litellm_params': {'rpm': 10, 'api_key': 'somekey', 'api_base': 'somehost', 'model': 'somemodel'}, 'model_info': {'id': 'aaa246b6413b778068c7cc03910a27fefccddd73a0a16b905bddc90', 'db_model': False}}]
…something…
10:20:47 - LiteLLM:DEBUG: utils.py:266 - returning picked lowest tpm/rpm deployment.
10:20:47 - LiteLLM Router:INFO: router.py:5403 - get_available_deployment for model: model-test, Selected deployment: {'model_name': 'model-test', 'litellm_params': {'rpm': 1, 'api_key': 'somekey', 'api_base': 'somehost', 'model': 'somemodel'}, 'model_info': {'id': '7dad6c90ada0fe4774ad25e300359673b11ae7f6c993419aec3af', 'db_model': False}} for model: model-test

This deployment should not have been selected if the following code had worked as expected:

elif (rpm_dict is not None and item in rpm_dict) and (
    rpm_dict[item] + 1 >= _deployment_rpm
):
    continue

To compare, I tested with the following configuration using usage-based-routing instead of usage-based-routing-v2:

model_list:
  - model_name: model-test
    litellm_params:
      model: somemodel
      api_key: somekey
      api_base: somehost
      rpm: 1
  - model_name: model-test
    litellm_params:
      model: somemodel
      api_key: somekey
      api_base: somehost
      rpm: 10
router_settings:
  routing_strategy: usage-based-routing
  disable_cooldowns: True

With this configuration, the first ten requests are processed correctly and the 11th fails. This is handled correctly because of the following code:

elif (rpm_dict is not None and item in rpm_dict) and (
    rpm_dict[item] + 1 >= _deployment_rpm
):
    continue

(However, I believe the condition should use > instead of >=, as the RPM limit should be inclusive of the upper bound.)
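
A quick boundary check shows the difference between the two comparisons for a deployment configured with rpm: 10 (illustrative only):

_deployment_rpm = 10

for used in (8, 9, 10):
    skip_with_gte = used + 1 >= _deployment_rpm   # current condition
    skip_with_gt = used + 1 > _deployment_rpm     # proposed condition
    print(used, skip_with_gte, skip_with_gt)

# used=8:  neither condition skips, so the 9th request is allowed either way
# used=9:  >= skips the deployment, so the 10th request is never served;
#          >  still allows it, making the configured limit inclusive
# used=10: both conditions skip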

Thus, I believe the issue occurs because usage-based-routing-v2 reused this code from the original strategy without adapting it to its own key format, which introduced the bug.
