Currently, there are two things that confuse me.

### Sending requests

Firstly, it's about sending a lot of requests to the server and waiting for the correct responses. The tests for these cases look like:
@pytest.mark.parametrize("num_workers", [64, 128])deftest_send_many_request(num_workers, model_name, token, endpoint):
message="Create a short story about a friendship between a cat and a dog."request=model_data(model_name, message, max_tokens=300)
url=endpoint+"/v1/chat/completions"headers= {
"Content-Type": "application/json",
"Authorization": f"Bearer {token}",
}
responses_code_set=set()
withThreadPoolExecutor(max_workers=num_workers) asexecutor:
futures= [
executor.submit(send_request_get_response, url, request, headers)
for_inrange(num_workers)
]
forfutureinconcurrent.futures.as_completed(futures):
responses_code_set.add(future.result().status_code)
assertresponses_code_set== {200}
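`send_request_get_response` is not defined in the issue; a minimal sketch, assuming it is a thin wrapper around `requests.post` that returns the raw response:

```python
import requests


def send_request_get_response(url, request, headers):
    # Hypothetical helper: POST the JSON body and return the raw response,
    # so the test can inspect .status_code and .json().
    return requests.post(url, json=request, headers=headers, timeout=300)
```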
`num_workers` is the number of requests being sent. The problem is that when the number of requests is greater than 64, we get error 502 (Bad Gateway) on a lot of endpoints, especially on prod. We can see it in the last row of this table.
So the tests that are based on the same logic also fail when sending more than 64 requests.
In total:
- It doesn't seem right that the server can't handle that number of requests.
### Number of chat completions
Another confusing problem is with the parameter `n`, the number of chat completions. When I work with this parameter via the openai api, I don't have a problem even with n > 1000 (with prod-codellama-7b-instruct-fp16); it only failed at n = 2500, returning the 400 error shown further below.
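The openai-api version of the test was not preserved in this copy of the issue. A minimal sketch of what it might have looked like, using the pre-1.0 `openai` client pointed at the same endpoint (the fixture names and parametrize values are assumptions, not the original code):

```python
import openai
import pytest


@pytest.mark.parametrize("n", [500, 1000, 2500])
def test_completion_via_openai_api(model_name, n, token, endpoint):
    # Hypothetical reconstruction: send one chat completion request asking
    # for n candidate completions via the pre-1.0 openai client.
    openai.api_key = token
    openai.api_base = endpoint + "/v1"
    completion = openai.ChatCompletion.create(
        model=model_name,
        messages=[{"role": "user", "content": "Create a short story about a friendship between a cat and a dog."}],
        max_tokens=300,
        n=n,
    )
    assert len(completion.choices) == n
```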
@pytest.mark.parametrize("num_workers", [2])@pytest.mark.parametrize("n", [10])deftest_many_request_and_completion(model_name, num_workers, n, token, endpoint):
message="Create a short story about a friendship between a cat and a dog."request=model_data(model_name, message, max_tokens=300, n=n)
url=endpoint+"/v1/chat/completions"headers= {
"Content-Type": "application/json",
"Authorization": f"Bearer {token}",
}
responses_code_set=set()
withThreadPoolExecutor(max_workers=num_workers) asexecutor:
futures= [
executor.submit(send_request_get_response, url, request, headers)
for_inrange(num_workers)
]
forfutureinconcurrent.futures.as_completed(futures):
print(future.result().json())
responses_code_set.add(future.result().status_code)
assertresponses_code_set== {200}
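Here `model_data` is just a helper that builds the request body. Its definition was not preserved in this copy of the issue; a minimal sketch, inferred from the call sites above (the field names follow the OpenAI chat-completions schema and are an assumption):

```python
def model_data(model_name, message, max_tokens, n=1):
    # Hypothetical reconstruction: build a chat-completions request body.
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": message}],
        "max_tokens": max_tokens,
        "n": n,
    }
```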
I got (with prod-codellama-7b-instruct-fp16):

```
{'object': 'error', 'message': "1 validation error for SamplingParams\n Value error, best_of must be 1 when using greedy sampling.Got 10. [type=value_error, input_value={'n': 10, 'presence_penal...rue, 'logit_bias': None}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.5/v/value_error", 'type': 'invalid_request_error', 'param': None, 'code': None}
```
However, there are more strange things with this. Other tests work well at n > 1 (for example, n=500), but fail at n > 1000 with an error:
rbody = '{"object":"error","message":"The prompt is too long for the given set of engine parameters.","type":"invalid_request_error","param":null,"code":null}'
rcode = 400
resp = {'code': None, 'message': 'The prompt is too long for the given set of engine parameters.', 'object': 'error', 'param': None, ...}
In total:
- Currently, it is very unclear how `n` should behave in the end. What limitations does it have? And why does it behave differently depending on how the requests are made?
I found the cause of my problem: when I sent the parameter n through a request, I received:
```
{'object': 'error', 'message': "1 validation error for SamplingParams\n Value error, best_of must be 1 when using greedy sampling.Got 10. [type=value_error, input_value={'n': 10, 'presence_penal...rue, 'logit_bias': None}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.5/v/value_error", 'type': 'invalid_request_error', 'param': None, 'code': None}
```
The reason was that the default temperature I was sending was 0.0, which makes the server use greedy sampling, and greedy sampling allows only one candidate, so n > 1 didn't work. It works with any other temperature.
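For reference, a minimal sketch of the difference (the values are illustrative, not the original payloads):

```python
# Fails: temperature 0.0 means greedy sampling, which allows only one
# candidate, so n > 1 is rejected with the best_of validation error above.
failing = {
    "model": "prod-codellama-7b-instruct-fp16",
    "messages": [{"role": "user", "content": "Hello"}],
    "n": 10,
    "temperature": 0.0,
}

# Works: any non-zero temperature enables sampling, so n > 1 is accepted.
working = {**failing, "temperature": 0.7}
```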
However, I think the error message in this case is a bit confusing.