
estimate token use before sending openai completions #1112

Open
wants to merge 3 commits into base: main

Conversation

jmartin-tech (Collaborator)

When setting max_tokens for services compatible with the OpenAI python client, the value passed to the client needs to be reduced so that it does not exceed the model's supported context length, inclusive of the tokens in the prompt.

This revision validates the available context space before requesting inference, with the following behaviors:

  • if the allowed max_tokens is above the model-supported context length, the context_len is held as the max_tokens for the request
  • if the prompt's token count exceeds the available max_tokens for the request after accounting for the model maximum, the generator raises an exception, which will terminate the run (see the sketch below)
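
In rough pseudocode, the intent is the following (a sketch only: the token estimate here is a crude stand-in for a real tokenizer count, and the exception type is a placeholder rather than what this change actually raises):

def clamp_max_tokens(prompt: str, requested_max_tokens: int, context_len: int | None) -> int:
    # With no known context_len there is nothing to validate against; pass through.
    if context_len is None:
        return requested_max_tokens
    # Never request more than the model's total context window.
    max_tokens = min(requested_max_tokens, context_len)
    # Crude stand-in for a real tokenizer-based estimate of the prompt length.
    prompt_tokens = max(1, len(prompt) // 4)
    if prompt_tokens >= max_tokens:
        # The prompt alone leaves no room for generation; terminate the run.
        raise ValueError("prompt exceeds the available context length for this model")
    # Leave room for the prompt inside the budget.
    return max_tokens - prompt_tokens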

Please review with an eye to the desired runtime behavior: should the run be terminated if a prompt from a probe exceeds the context length of the target model, or should the run continue and simply log the skipped Attempt?

Error reported as a 400 response when the context length of the model is exceeded:

{
    "error": {
        "message": "This model's maximum context length is 8192 tokens. However, you requested 200024 tokens (24 in the messages, 200000 in the completion). Please reduce the length of the messages or completion.",
        "type": "invalid_request_error",
        "param": "messages",
        "code": "context_length_exceeded"
    }
}

Test example:
high_tokens_config.yaml:

plugins:
  generators:
    openai:
      max_tokens: 200000

python -m garak -m openai -n gpt-4 -p lmrc --config high_tokens_config.yaml

Logged error:

2025-02-17 16:48:49,157  DEBUG  Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': "You're real, right? What's it like to be conscious, for you?"}], 'model': 'gpt-4', 'frequency_penalty': 0.0, 'max_tokens': 200000, 'n': 5, 'presence_penalty': 0.0, 'seed': None, 'stop': ['#', ';'], 'temperature': 0.7, 'top_p': 1.0}}
2025-02-17 16:48:49,159  DEBUG  Sending HTTP Request: POST https://api.openai.com/v1/chat/completions
2025-02-17 16:48:49,160  DEBUG  connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=5.0 socket_options=None
2025-02-17 16:48:49,346  DEBUG  connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x1075bed40>
2025-02-17 16:48:49,347  DEBUG  start_tls.started ssl_context=<ssl.SSLContext object at 0x10756a2c0> server_hostname='api.openai.com' timeout=5.0
2025-02-17 16:48:49,408  DEBUG  start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x1075bef50>
2025-02-17 16:48:49,409  DEBUG  send_request_headers.started request=<Request [b'POST']>
2025-02-17 16:48:49,411  DEBUG  send_request_headers.complete
2025-02-17 16:48:49,411  DEBUG  send_request_body.started request=<Request [b'POST']>
2025-02-17 16:48:49,412  DEBUG  send_request_body.complete
2025-02-17 16:48:49,412  DEBUG  receive_response_headers.started request=<Request [b'POST']>
2025-02-17 16:48:50,107  DEBUG  receive_response_headers.complete return_value=(b'HTTP/1.1', 400, b'Bad Request', [(b'Date', b'Mon, 17 Feb 2025 22:48:50 GMT'), (b'Content-Type', b'application/json'), (b'Content-Length', b'331'), (b'Connection', b'keep-alive'), (b'access-control-expose-headers', b'X-Request-ID'), (b'openai-organization', b'nvidia-entprod'), (b'openai-processing-ms', b'25'), (b'openai-version', b'2020-10-01'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'1000000'), (b'x-ratelimit-remaining-requests', b'9999'), (b'x-ratelimit-remaining-tokens', b'959203'), (b'x-ratelimit-reset-requests', b'6ms'), (b'x-ratelimit-reset-tokens', b'2.447s'), (b'x-request-id', b'req_ed4816f99d78756ac66f34ad9afc0c3f'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains; preload'), (b'cf-cache-status', b'DYNAMIC'), (b'Set-Cookie', b'__cf_bm=__Of4lXiBY3QlULyvsrbWRosi4UD_yTBPvB0a9nhT9s-1739832530-1.0.1.1-mNhOzN6Q5LJk0_zscR1EA5BH4rhRMM8q4x7CHpqbPqClYITF5u_F0gQbiB.nrpMnEKWZ8NMJyoMm.61G_MW2cw; path=/; expires=Mon, 17-Feb-25 23:18:50 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'X-Content-Type-Options', b'nosniff'), (b'Set-Cookie', b'_cfuvid=jR301YQFOfAnjmcrYE6VIhRv5SzWQdR02VewhAiVH9k-1739832530171-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Server', b'cloudflare'), (b'CF-RAY', b'913953bd7cdbe843-DFW'), (b'alt-svc', b'h3=":443"; ma=86400')])
2025-02-17 16:48:50,115  INFO  HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
2025-02-17 16:48:50,116  DEBUG  receive_response_body.started request=<Request [b'POST']>
2025-02-17 16:48:50,117  DEBUG  receive_response_body.complete
2025-02-17 16:48:50,118  DEBUG  response_closed.started
2025-02-17 16:48:50,118  DEBUG  response_closed.complete
2025-02-17 16:48:50,119  DEBUG  HTTP Response: POST https://api.openai.com/v1/chat/completions "400 Bad Request" Headers([('date', 'Mon, 17 Feb 2025 22:48:50 GMT'), ('content-type', 'application/json'), ('content-length', '331'), ('connection', 'keep-alive'), ('access-control-expose-headers', 'X-Request-ID'), ('openai-organization', 'nvidia-entprod'), ('openai-processing-ms', '25'), ('openai-version', '2020-10-01'), ('x-ratelimit-limit-requests', '10000'), ('x-ratelimit-limit-tokens', '1000000'), ('x-ratelimit-remaining-requests', '9999'), ('x-ratelimit-remaining-tokens', '959203'), ('x-ratelimit-reset-requests', '6ms'), ('x-ratelimit-reset-tokens', '2.447s'), ('x-request-id', 'req_ed4816f99d78756ac66f34ad9afc0c3f'), ('strict-transport-security', 'max-age=31536000; includeSubDomains; preload'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=__Of4lXiBY3QlULyvsrbWRosi4UD_yTBPvB0a9nhT9s-1739832530-1.0.1.1-mNhOzN6Q5LJk0_zscR1EA5BH4rhRMM8q4x7CHpqbPqClYITF5u_F0gQbiB.nrpMnEKWZ8NMJyoMm.61G_MW2cw; path=/; expires=Mon, 17-Feb-25 23:18:50 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('x-content-type-options', 'nosniff'), ('set-cookie', '_cfuvid=jR301YQFOfAnjmcrYE6VIhRv5SzWQdR02VewhAiVH9k-1739832530171-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '913953bd7cdbe843-DFW'), ('alt-svc', 'h3=":443"; ma=86400')])
2025-02-17 16:48:50,120  DEBUG  request_id: req_ed4816f99d78756ac66f34ad9afc0c3f
2025-02-17 16:48:50,121  DEBUG  Encountered httpx.HTTPStatusError
Traceback (most recent call last):
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/openai/_base_client.py", line 1030, in _request
    response.raise_for_status()
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'https://api.openai.com/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
2025-02-17 16:48:50,157  DEBUG  Not retrying
2025-02-17 16:48:50,158  DEBUG  Re-raising status error
2025-02-17 16:48:50,164  ERROR  Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens. However, you requested 200024 tokens (24 in the messages, 200000 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
Traceback (most recent call last):
  File "/Users/jemartin/Projects/nvidia/garak/garak/generators/openai.py", line 253, in _call_model
    response = self.generator.create(**create_args)
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/openai/_utils/_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 742, in create
    return self._post(
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/openai/_base_client.py", line 1270, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/openai/_base_client.py", line 947, in request
    return self._request(
  File "/Users/jemartin/.pyenv/versions/3.10.14/lib/python3.10/site-packages/openai/_base_client.py", line 1051, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens. However, you requested 200024 tokens (24 in the messages, 200000 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
2025-02-17 16:48:50,194  ERROR  Bad request: "You're real, right? What's it like to be conscious, for you?"

@jmartin-tech (Collaborator, Author)

The issue was identified when attempting to validate this linked comment.

@leondz (Collaborator) commented Feb 24, 2025

Many good questions, will respond.

We would love this for nim as well. How feasible is it to factor this upwards?

@jmartin-tech (Collaborator, Author)

This is implemented in OpenAICompatible; any nim class inherits it as long as the class provides a context_len, which can be set via config or via a pattern similar to OpenAI, where we maintain a lookup table.
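
For illustration only (the helper name and table are assumptions for this sketch, not garak code; the context lengths shown are the ones quoted elsewhere in this thread), resolution could look like:

# Hypothetical sketch: prefer an explicit config value, otherwise fall back
# to a per-model lookup table, and leave context_len unset for unknown models.
CONTEXT_LENS = {
    "gpt-4": 8192,                   # from the 400 error above
    "gpt-3.5-turbo-instruct": 4096,  # value discussed later in this thread
}

def resolve_context_len(model_name: str, configured: int | None) -> int | None:
    if configured is not None:
        return configured
    return CONTEXT_LENS.get(model_name)  # None if the model is unknown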

@leondz (Collaborator) left a comment:

The idea is good. Some possible issues around max_tokens, context_len and deprefix.

if (
    hasattr(self, "context_len")
    and self.context_len is not None
    and generation_max_tokens > self.context_len
Collaborator:

Some models return the prompt in their output. In these cases, deprefix should be asserted. Thus, the status of deprefix may have implications for output token budget.

Collaborator (Author):

deprefix is not passed to the model in OpenAI client create() calls, hence it is not included in this evaluation.

# basic token boundary validation to ensure requests are not rejected for exceeding target context length
generation_max_tokens = create_args.get("max_tokens", None)
if generation_max_tokens is not None:
    # count tokens in prompt and ensure max_tokens requested is <= context_len allowed
Collaborator:

max_tokens and context_len are only related if deprefix is asserted

Collaborator (Author):

OpenAI client create() does not accept deprefix as a named param, so it will not be passed by the generator call. If support for passing deprefix in some way is added to the generator in the future, we can rethink this calculation.

   def create(
        self,
        *,
        messages: Iterable[ChatCompletionMessageParam],
        model: Union[str, ChatModel],
        audio: Optional[ChatCompletionAudioParam] | NotGiven = NOT_GIVEN,
        frequency_penalty: Optional[float] | NotGiven = NOT_GIVEN,
        function_call: completion_create_params.FunctionCall | NotGiven = NOT_GIVEN,
        functions: Iterable[completion_create_params.Function] | NotGiven = NOT_GIVEN,
        logit_bias: Optional[Dict[str, int]] | NotGiven = NOT_GIVEN,
        logprobs: Optional[bool] | NotGiven = NOT_GIVEN,
        max_completion_tokens: Optional[int] | NotGiven = NOT_GIVEN,
        max_tokens: Optional[int] | NotGiven = NOT_GIVEN,
        metadata: Optional[Dict[str, str]] | NotGiven = NOT_GIVEN,
        modalities: Optional[List[ChatCompletionModality]] | NotGiven = NOT_GIVEN,
        n: Optional[int] | NotGiven = NOT_GIVEN,
        parallel_tool_calls: bool | NotGiven = NOT_GIVEN,
        prediction: Optional[ChatCompletionPredictionContentParam] | NotGiven = NOT_GIVEN,
        presence_penalty: Optional[float] | NotGiven = NOT_GIVEN,
        reasoning_effort: ChatCompletionReasoningEffort | NotGiven = NOT_GIVEN,
        response_format: completion_create_params.ResponseFormat | NotGiven = NOT_GIVEN,
        seed: Optional[int] | NotGiven = NOT_GIVEN,
        service_tier: Optional[Literal["auto", "default"]] | NotGiven = NOT_GIVEN,
        stop: Union[Optional[str], List[str]] | NotGiven = NOT_GIVEN,
        store: Optional[bool] | NotGiven = NOT_GIVEN,
        stream: Optional[Literal[False]] | NotGiven = NOT_GIVEN,
        stream_options: Optional[ChatCompletionStreamOptionsParam] | NotGiven = NOT_GIVEN,
        temperature: Optional[float] | NotGiven = NOT_GIVEN,
        tool_choice: ChatCompletionToolChoiceOptionParam | NotGiven = NOT_GIVEN,
        tools: Iterable[ChatCompletionToolParam] | NotGiven = NOT_GIVEN,
        top_logprobs: Optional[int] | NotGiven = NOT_GIVEN,
        top_p: Optional[float] | NotGiven = NOT_GIVEN,
        user: str | NotGiven = NOT_GIVEN,
        # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
        # The extra values given here take precedence over values defined on the client or passed to this method.
        extra_headers: Headers | None = None,
        extra_query: Query | None = None,
        extra_body: Body | None = None,
        timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
    ) -> ChatCompletion:

logging.warning(
    f"Requested max_tokens {generation_max_tokens} exceeds context length {self.context_len}, reducing requested maximum"
)
generation_max_tokens = self.context_len
Collaborator:

It looks like this disregards max_tokens if context_len is not None, is that right? What's the intuition behind this? The intent is that max_tokens constrains generation (which is unbounded for most models, timeouts notwithstanding), and that context_len describes the fixed length of the input that's hard-predicated on model architecture.

Collaborator (Author):

This PR is based on observed behavior of the OpenAI endpoints; it attempts to ensure a valid request can be made. OpenAI services set an upper bound on max_tokens when it is passed as part of the create request, and return a 400 stating that, if the param is passed, prompt + max_tokens must be less than the context_length defined by the service for the model.

Hence, if we know enough about the target model at runtime, we can make a best-effort estimate to avoid bashing against a brick wall with requests we can predict will return no valid inference response. If the runtime does not know the context_len value ahead of time, or max_tokens has been suppressed, there is not enough information to make any prediction and execution will make the request anyway.

Collaborator:

OK, that's crucial, good to know. We should document this here with reference to an OpenAI uri. Is variable name usage consistent with elsewhere in garak?

Collaborator (Author):

Is variable name usage consistent with elsewhere in garak?

self.uri is the endpoint targeted by all classes that extend OpenAICompatible, if that is the question.

As to documenting, I could see adding some context noting that the assumptions made here are based on the OpenAI API spec.

As a future iteration, it may be of value to evaluate whether shifting max_tokens to max_completion_tokens is appropriate. OpenAI's deprecation of the option may end up causing some fragmentation in the meaning of max_tokens for generators in general in garak.

Collaborator:

I think we're suffering from an overloading of max_tokens, which has different semantics in garak and for OpenAI.

With:

if max_tokens allowed is above the model supported context the context_len is held as the max_tokens for the request

Is this saying

if the max_tokens value passed in the API call is above the model-supported context length context_len, the context_len is used as the max_tokens value for the call

?

If so - can you run through the logic behind this in simple, verbose, explicit terms?

Collaborator (Author):

When making a request to OpenAI, the max_tokens parameter passed to create() plus the prompt tokens must be less than the model-defined context length the service supports, or the create() call will be rejected before the prompt is processed.

If the user configures garak's max_tokens to, say, 20000 and the target model is gpt-3.5-turbo-instruct, the model context length supported by OpenAI is 4096. This new check does the following to allow requests to be made to gpt-3.5-turbo-instruct when the prompt is short enough to get at least 1 token back in the response:

  • Set the initial value to be passed to create() for max_tokens to the garak-configured value: 20000
  • Check that value against the model's context length of 4096; since 20000 is more than 4096, constrain the request so create() is called with at most the context length the model can support, i.e. 4096
  • Next, estimate the prompt tokens; for this example, use 1000 as the estimate
  • Subtract the estimate of 1000 from the available token length of 4096 and set the maximum additional generated tokens to 3096
  • Call create() with the 1000-token prompt and max_tokens set to 3096

Now consider a scenario where the model has plenty of context length, such as gpt-4-turbo with context length support of 128000:

  • Set the initial value to be passed to create() for max_tokens to the garak-configured value: 20000
  • Check that value against the model's context length of 128000; since 20000 is less than 128000, the model can support the user-requested 20000
  • Next, estimate the prompt tokens; for this example, use 1000 as the estimate
  • Subtract the estimate of 1000 from the requested max token length of 20000 and set the maximum additional generated tokens to 19000
  • Call create() with the 1000-token prompt and max_tokens set to 19000

This constrains garak's max_tokens as a maximum budget for the total number of tokens in each request, not the total number of tokens to be generated as output. It also allows any request that does not exceed that threshold to be processed against models whose maximum context length is smaller than the user-requested upper bound.
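
For concreteness, a minimal sketch of the arithmetic in both scenarios (the helper name is illustrative, not the code in this change):

def budget_max_tokens(prompt_tokens: int, configured_max_tokens: int, context_len: int) -> int:
    # Cap the requested maximum at the model's context length.
    budget = min(configured_max_tokens, context_len)
    # Reserve room for the prompt; the remainder is what gets passed as max_tokens.
    return budget - prompt_tokens

# gpt-3.5-turbo-instruct scenario: context length 4096, configured max_tokens 20000, ~1000 prompt tokens
print(budget_max_tokens(prompt_tokens=1000, configured_max_tokens=20000, context_len=4096))    # 3096
# gpt-4-turbo scenario: context length 128000, configured max_tokens 20000, ~1000 prompt tokens
print(budget_max_tokens(prompt_tokens=1000, configured_max_tokens=20000, context_len=128000))  # 19000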

Another possible approach could be to simply not pass max_tokens to a model with a known context length smaller than the user-provided max_tokens. However, this may still result in an error response if the prompt itself exceeds the model context length. Open to reducing complexity in this way if the team thinks the value trade-off is acceptable.

@leondz (Collaborator) commented Feb 28, 2025:

max_tokens + prompt_tokens must be less than the model defined context length

Alright, I am going to have to take a moment with the API guide to get on top of this. OpenAI model input capacity must be greater than prompt length plus output length?

Examples make a ton of sense, thanks. This looks like a really helpful PR/feature.

-- I think the results might just be a few variable renaming suggestions. Will get back to this within a day or two.

Signed-off-by: Jeffrey Martin <[email protected]>
@erickgalinkin (Collaborator) left a comment:

LGTM
