[Bug]: possible caching / rpm enforcement issue with usage-based-routing-v2 disabled #7395
Comments
This is not where usage-based routing runs its check @April-forever, it happens here -
If you have a specific test failing, please share it. I'm not sure I follow the caching point in part 2, but if you could share a test where you can trigger the failure, that would be great.
Hello @krrishdholakia, let us focus on the first issue for now, as it confuses me more. From my understanding, the relevant code is here:
Line 861 in 277c6e8
Line 5346 in 277c6e8
Let me provide an example to illustrate the issue. First, consider the following configuration:
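(A minimal sketch of such a configuration; the model names, endpoints, keys, and rpm values below are hypothetical, chosen only so that the two deployments share one model group and one of them has a much smaller rpm limit than the other.)

```yaml
model_list:
  - model_name: gpt-4o                     # shared model group (hypothetical)
    litellm_params:
      model: azure/gpt-4o-deployment-1
      api_base: https://example-endpoint-1.openai.azure.com
      api_key: os.environ/AZURE_API_KEY_1
      rpm: 3                               # deliberately small rpm limit
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-deployment-2
      api_base: https://example-endpoint-2.openai.azure.com
      api_key: os.environ/AZURE_API_KEY_2
      rpm: 10                              # larger rpm limit

router_settings:
  routing_strategy: usage-based-routing-v2
```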
Then, start LiteLLM with the following command:
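(Assuming the sketch above is saved as config.yaml, the proxy is started in the usual way:)

```shell
litellm --config config.yaml
```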
When I sent requests to the model repeatedly, the first three requests were processed successfully (in some cases, the third request would fail). Starting from the fourth request, however, every request failed with the following error:
The reason for this error is that the deployment with the smaller rpm value was selected:
This deployment should not have been selected if the following code had worked as expected: litellm/litellm/router_strategy/lowest_tpm_rpm_v2.py, lines 405 to 408 in bd4ab14
To compare, I tested with the following configuration using usage-based-routing instead of usage-based-routing-v2:
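(Assuming the same model_list as in the sketch above, with only the routing strategy changed:)

```yaml
router_settings:
  routing_strategy: usage-based-routing    # v1 strategy instead of usage-based-routing-v2
```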
With this configuration, all requests were processed correctly until the 11th request, which failed. This behavior is handled correctly due to the following code: litellm/litellm/router_strategy/lowest_tpm_rpm.py, lines 235 to 238 in bd4ab14
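The referenced check is roughly of the following shape (a paraphrase for illustration, not the verbatim repository code):

```python
# Paraphrased shape of the v1 rpm check referenced above (illustration only, not
# verbatim repository code). item_rpm is the number of requests already counted
# for the deployment in the current minute; deployment_rpm is its configured limit.
deployment_rpm = 10
for item_rpm in range(12):          # hypothetical per-minute request counts
    if item_rpm + 1 >= deployment_rpm:
        continue                    # deployment skipped: no remaining rpm budget
    print(f"admitted with {item_rpm} prior requests this minute")
```

With >=, a deployment configured with rpm: 10 is skipped once 9 requests have been counted in the current minute (9 + 1 >= 10), so only 9 requests are admitted.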
(However, I believe the condition should use > instead of >=, as the RPM limit should be inclusive of the upper bound.) Thus, I believe this issue occurs because usage-based-routing-v2 reused the original code without adapting it properly.
What happened?
During my usage of LiteLLM, I noticed two issues:
In the usage-based-routing-v2 strategy, the RPM limit is not properly enforced. This causes deployments that have already reached their RPM limit to be incorrectly selected. Specifically, in the _common_checks_available_deployment method:
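(The shape of that check, paraphrased with hypothetical cache-key formats and values so it can be run standalone; not a verbatim copy of the repository code.)

```python
# Paraphrase of the availability check described below, with hypothetical cache-key
# formats and values (not verbatim repository code). tpm_dict and rpm_dict are keyed
# by full cache keys, but the loop truncates the key to the deployment id before
# looking it up in rpm_dict, so the rpm branch never fires.
tpm_dict = {"deployment-1:gpt-4o:tpm:05-30": 1200}   # tokens used this minute
rpm_dict = {"deployment-1:gpt-4o:rpm:05-30": 3}      # requests made this minute
_deployment_tpm, _deployment_rpm, input_tokens = 100_000, 3, 50

for item, item_tpm in tpm_dict.items():
    item = item.split(":")[0]                        # now just "deployment-1"
    if item_tpm + input_tokens > _deployment_tpm:
        continue
    elif item in rpm_dict and rpm_dict[item] + 1 >= _deployment_rpm:
        continue                                     # never reached: item is not a full rpm_dict key
    print("deployment still considered available:", item)
```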
Here, item is only part of the tpm_dict key (item.split(":")[0]), so it does not correspond to any key in the rpm dictionary. As a result, the condition never matches, and the RPM limit is effectively ignored.
When I disabled the usage-based-routing-v2 strategy and used enable_pre_call_checks, I observed that the router fetches data from two caches (current_request_cache_local and model_group_cache) and takes the maximum value:
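(Roughly, the logic being described, as a standalone paraphrase with hypothetical values rather than the verbatim router code:)

```python
# Standalone paraphrase of the pre-call check described above (hypothetical values,
# not verbatim router code). The request count used for the rpm check is the max of
# a per-instance local counter and the shared model-group cache, so a local entry
# that fails to expire keeps the count inflated into the next minute.
current_request_cache_local = 3                       # per-instance counter (stale)
model_group_cache = {"gpt-4o": 1}                     # shared counter for the model group
current_request = max(current_request_cache_local, model_group_cache.get("gpt-4o", 0))
print(current_request)                                # 3 -- the stale local value wins
```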
I tried to print these values to debug the issue. Initially, this worked as expected. For example:
However, over time, I frequently encountered issues where current_request_cache_local did not expire quickly enough. For example, after more than a minute had passed, current_request_cache_local still held data:
This causes issues in handling the RPM limitation for the current minute. I’m not sure if this is a usage issue on my part or an actual bug.
Relevant log output
Are you a ML Ops Team?
No
What LiteLLM version are you on?
v1.55.1-stable
Twitter / LinkedIn details
No response