
Add Logits to OpenAI ChatCompletions model #1196

Closed
haileyschoelkopf opened this issue Dec 22, 2023 · 33 comments
Labels
declined: A proposed dataset or feature request that will not be implemented.
feature request: A feature that isn't implemented yet.
help wanted: Contributors and extra help welcome.

Comments

@haileyschoelkopf
Collaborator

OpenAI added logits back to their ChatCompletions API.

This means we can re-add support for all tasks to this LM type! See OpenAICompletionsLM for an example of how to do this.

Contributions on this would be very welcome if people have the bandwidth, as I may not get to this super soon but it's high priority.
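For reference, here is a minimal sketch of what the newly re-added ChatCompletions logprobs look like with the openai v1 Python client (the model name is just an example):

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=5,
    logprobs=True,
    top_logprobs=5,
)

# Per-token logprobs are attached to each generated (not prompt) token.
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob, [t.token for t in tok.top_logprobs])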

@haileyschoelkopf added the help wanted and feature request labels Dec 22, 2023
@haileyschoelkopf
Collaborator Author

I've been informed that returning the logprobs of the prompt / input, which we would need in order to support loglikelihood requests, is not included.

However, supporting logprobs for ChatCompletions-mirroring APIs that do return logprobs including for echo=True would still be very desirable.

@gmottajr

Hi Hailey,
I'm interested in contributing to this GitHub issue, but I would appreciate some additional context to better understand the task. Could you please provide more details? Specifically, it would be helpful if you could instruct me on which file(s) are affected and what specific changes are needed.
Actually, overall, I need more clarification.
Thank you very much for your attention to this matter,
looking forward to hearing from you!
Cheers!

@haileyschoelkopf
Collaborator Author

Language models return output in the form of "logits" / "logprobs" (log probabilities) over their vocabulary of possible next tokens. You can use these to sample text probabilistically or deterministically, or to get the model's estimated (log) probability of a given string.

However, many closed model providers such as OpenAI stopped providing this info to users and just returned generated text. They've now re-added that feature for their chat models. This will allow us to run the models on loglikelihood-based multiple-choice QA tasks like hellaswag, which are currently implemented by taking an LM, running it on each (input + answer), asking it to return the logprobs on not just the answer but also the input (echo=True allows this in OpenAI's Completions API), and then taking only the logprobs from the answer portion. We then compare the log probability of each multiple-choice answer and say the model chose the one it thinks is most likely. Currently, because logprobs weren't available in OpenAI's ChatCompletions, we couldn't evaluate on the tasks we defined this way, and would need to write a new task that scores by just asking the model to generate text and checking whether it matches the right answer.
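To make the scoring scheme concrete, here is a minimal sketch of that comparison; loglikelihood_of is a hypothetical helper (not part of the harness) standing in for "sum of the continuation's token logprobs given the context", e.g. obtained from an echo=True Completions call.

def pick_answer(context: str, choices: list[str], loglikelihood_of) -> int:
    """Return the index of the answer option the model considers most likely.

    loglikelihood_of(context, continuation) is a hypothetical helper that
    sums the logprobs of the continuation's tokens conditioned on the context.
    """
    scores = [loglikelihood_of(context, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])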

The steps to complete this feature would be:

@veekaybee
Contributor

veekaybee commented Dec 28, 2023

Hi all, happy to help with this since I'm more familiar with the OpenaiChatCompletionsLM class after working on it a bit. Let me know if you'd like support or scaffolding for this!

I started by checking to see if OpenAI exposes the echo parameter for their chat completion models:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
    echo=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

and received

File "pyenv/versions/3.10.0/lib/python3.10/site-packages/openai/_utils/_utils.py", line 272, in wrapper
    return func(*args, **kwargs)
TypeError: Completions.create() got an unexpected keyword argument 'echo'

which makes sense since the CLIChatCompletionCreateArgs doesn't allow for it

@veekaybee
Contributor

Something I'm missing in understanding: if in the task we only use the logprobs of the response for the evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?

taking an LM, running it on each (input + answer) and asking it to return the log probs on not just the answer, but also the input (echo=True allows this in OpenAI's Completions API), then taking only the logprobs from the answer portion.

@haileyschoelkopf
Collaborator Author

Something I'm missing in understanding: if in the task we only use the logprobs of the response for the evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?

For multi-token continuations/targets, to get the loglikelihood of the whole target, we need to get the logprob of token 0 conditioned on the input/context, the logprob of token 1 of the target conditioned on (context + token 0 of target), and so on.

With echo=True, we feed in (context + continuation) and get out all the logprobs which we can subset to the target string’s loglikelihood.

If we don’t have echo=True, then we can only feed in the inputs. Then, if the model does not output token 0, the logprob of token 1 at target position 1 does not correspond to the right thing, and we can’t accurately compute the loglikelihood in a single API call for multi-token continuations.
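As an illustration, here is a minimal sketch of the echo=True flow against the legacy Completions endpoint, roughly what OpenAICompletionsLM does internally; the model name is a placeholder (it must be a Completions model that still allows echo together with logprobs), and the response fields follow the openai v1 Python client.

from openai import OpenAI

client = OpenAI()

def target_loglikelihood(context: str, continuation: str) -> float:
    # Feed (context + continuation), request no new tokens, and echo the
    # prompt back with per-token logprobs attached.
    resp = client.completions.create(
        model="davinci-002",  # placeholder: must support echo together with logprobs
        prompt=context + continuation,
        max_tokens=0,
        echo=True,
        logprobs=1,
        temperature=0.0,
    )
    lp = resp.choices[0].logprobs
    # Keep only tokens whose character offset falls inside the continuation;
    # the very first prompt token has a logprob of None and is skipped.
    return sum(
        logprob
        for offset, logprob in zip(lp.text_offset, lp.token_logprobs)
        if logprob is not None and offset >= len(context)
    )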

@veekaybee
Contributor

Thanks, this is super helpful. I also just tested on Together and am getting the same issue, because it's OpenAICompletions compatible:

from openai import OpenAI

system_content = "You are a travel agent. Be descriptive and helpful."
user_content = "Tell me about San Francisco"

client = OpenAI(
    api_key="TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)

chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    temperature=0.7,
    max_tokens=1024,
    echo=True
)

response = chat_completion.choices[0].message.content
print("Together response:\n", response)

TypeError: Completions.create() got an unexpected keyword argument 'echo'

It sounds like, based on this, the next step is: investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them.

Where would we want this output to feed into?

It looks like this would be a good place to start to see how to get them in the completions call? And then we'd want to see how to pull them in here, the same way we do here?

@haileyschoelkopf
Collaborator Author

Darn, that makes sense! Thanks for checking on this.

It sounds like, based on this, the next step is: investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them.
Where would we want this output to feed into?

We could conceivably achieve measurement of target string loglikelihood in O(target len) API calls but this is quite expensive.

Further, since OpenAI only gives up to 5 token logprobs I think, we are limited by this—if the logprobs for our next desired target token do not appear in the top 5 we are out of luck.
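For illustration, a sketch of that O(target length) workaround using only what ChatCompletions exposes: one request per target token, reading the returned top-5 logprobs. It fails whenever the next target token is not among them, and the chat template means the model is not literally continuing the raw string, so this is an approximation at best.

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")  # assumption: tokenizer matches the API model

def approx_loglikelihood(context: str, target: str, model: str = "gpt-4") -> float | None:
    """One ChatCompletions call per target token; returns None as soon as a
    target token falls outside the top-5 logprobs the API will return."""
    total, prefix = 0.0, ""
    for tok_id in enc.encode(target):
        tok_str = enc.decode([tok_id])
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": context + prefix}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,
            temperature=0.0,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        match = next((t for t in top if t.token == tok_str), None)
        if match is None:
            return None  # desired token not in the top 5, out of luck
        total += match.logprob
        prefix += tok_str
    return total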

Given this, one option may be to add Chat Templating support for local OpenAI Completions once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step?

Orthogonal to this, we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple choice tasks to have both variants in some unified way.

@veekaybee
Contributor

Further, since OpenAI only gives up to 5 token logprobs I think, we are limited by this—if the logprobs for our next desired target token do not appear in the top 5 we are out of luck.

Ah, this is good to know, thanks!

Chat Templating support for local OpenAI Completions once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step?

Yep, local completions sounds good: want to make sure I'm thinking the right way:

  1. Add case statements in here for branching: if it's a local model, set echo=True and test.

For this piece:

we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple choice tasks to have both variants in some unified way.

Sounds like possibly a separate issue to file and link to this one?

@haileyschoelkopf
Collaborator Author

Yep, local completions sounds good: want to make sure I'm thinking the right way:

Add case statements in here for branching: if it's a local model, set echo=True and test

We would always need to use echo=True, for both remote and local models! The additions needed for local-completions would be roughly the same as those for local-chat-completions (tokenizer backend selectable between HF/openai, with the exact tokenizer specifiable by the user, plus a configurable base_url).
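For context, here is a minimal sketch of hitting such a local OpenAI-compatible server from the client side. The base_url, api_key, and model name are placeholders, and it assumes the server implements echo and logprobs on the legacy /v1/completions route (vLLM's OpenAI-compatible server, for example, aims to).

from openai import OpenAI

# Placeholders: a locally hosted OpenAI-compatible server and whatever model it serves.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",  # placeholder: the locally served model
    prompt="Question: What is the capital of France?\nAnswer: Paris",
    max_tokens=0,
    echo=True,    # assumption: the local server honors echo on /v1/completions
    logprobs=1,
)
print(resp.choices[0].logprobs.token_logprobs)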

Sounds like possibly a separate issue to file and link to this one?

Definitely, will open it for tracking! Just mentioning aloud that our best solution here may (unfortunately) be to forgo loglikelihood entirely.

@veekaybee
Contributor

Ah, ok, so these changes would be to enable local-completions, which is not the same as local-chat-completions but does have the ability to run echo=True (for legacy reasons), to also run locally (just clarifying for myself).

That makes sense, can start on those changes!

@haileyschoelkopf
Collaborator Author

That's correct!

@veekaybee
Contributor

Cool. Just started working on this, mirroring OpenaiChatCompletionsLM, and noticed we removed the tokenizer from that class.

Are we calling it a different way or excluding it entirely for OpenaiChatCompletionsLM?

@haileyschoelkopf
Collaborator Author

The OpenAI ChatCompletions API actually fully abstracts away its tokenizer, so we don't need a tokenizer for openai-chat-completions! The same isn't true for Completions, there we still need a tokenizer.
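To illustrate why the Completions path still needs one, here is a small sketch of using a tokenizer (tiktoken here; an HF tokenizer would play the same role for non-OpenAI local models) to find where the context ends, so that echoed logprobs can be split into a context part and a continuation part.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")

context = "Question: The capital of France is"
continuation = " Paris"

ctx_ids = enc.encode(context)
full_ids = enc.encode(context + continuation)
# Assumes the concatenation shares the context's prefix tokens (usually true).
cont_ids = full_ids[len(ctx_ids):]

# With echo=True, the response's token_logprobs align with full_ids, so the
# last len(cont_ids) entries are the ones summed for the continuation.
print(len(ctx_ids), cont_ids)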

@veekaybee
Contributor

Ok, so here is a fun dilemma:

In starting to implement the code (https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1), I found that, due to this issue,
I needed to update the default base model to gpt-3.5-turbo-instruct based on their suggested change, but that model doesn't support both logprobs and echo=True! 🤦‍♀️


A couple ideas:

  • I can focus on testing for the local case and set the OpenAI model to None
  • set echo but not logprobs

Or there might be something else that I'm not aware of that would be helpful here. Let me know what you think.

@haileyschoelkopf
Collaborator Author

sigh

I think in this case we should support local Completions models with echo=True (assuming that these will continue to support echo=True), and otherwise, for non-local / OpenAI native cases, raise an error saying logits of the form we need are not supported upon trying to run loglikelihood() or loglikelihood_rolling() methods.
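A minimal sketch of what such a guard could look like; the class and method names mirror the harness's LM interface discussed above, but this is illustrative rather than the actual implementation.

class OpenaiChatCompletionsLM:  # simplified, illustrative skeleton
    def loglikelihood(self, requests):
        raise NotImplementedError(
            "Loglikelihood (and therefore multiple-choice) tasks are not supported "
            "for chat-completions APIs that cannot return prompt logprobs (echo=True). "
            "Use a (local-)completions model, or a generation-based task variant."
        )

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError(
            "loglikelihood_rolling is not supported for chat-completions APIs "
            "without echo=True."
        )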

@gmottajr

gmottajr commented Jan 2, 2024

Hi Hailey and Vicki, I'm still interested in working on this issue with you.
Do you believe there is any room left for another developer to jump on it with Vicki?

@veekaybee
Contributor

Hey @gmottajr, absolutely! Does the conversation and code here so far make sense?

Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work?

Let me know if you need additional pointers or help (and for sure Hailey can also assist better than me in cases of "do we want to do X this way?").

@anjor
Contributor

anjor commented Jan 3, 2024

This has been a very interesting conversation, thank you @veekaybee @haileyschoelkopf .

On a separate but related note, @haileyschoelkopf do you know how exactly OpenAI runs these benchmarks? Surely they must need the same logprobs that we do. One possibility is that they are running against some internal API where they do have access to the logprobs, but it's just not opened up.

@haileyschoelkopf
Collaborator Author

OpenAI certainly has access to model logprobs if they desire them (and definitely uses them for things like evaluating perplexity).

As mentioned earlier in this thread, multiple choice benchmarks can be measured using generation and exact match, and we’ll probably want to support this — OpenAI’s evals framework likely to some extent exhibits how they do this. https://github.com/openai/evals

@gmottajr

gmottajr commented Jan 4, 2024

Awesome @veekaybee! 🥳

Hey @gmottajr, absolutely! Does the conversation and code here so far make sense?

Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work?

Let me know if you need additional pointers or help (and for sure Hailey can also assist better than me in cases of "do we want to do X this way?").

Thank you very much for your positive response, @veekaybee. 😄 🚀
That sounds like music to my ears! 🎼 🎵 🤩
I would like to fork the branch you mentioned, but could not really find it.
Yeah, for sure, I do need some additional pointers and help. How does it sound if we talk directly through Slack, Discord, MS Teams, or any other collaboration software more familiar to you?
Looking forward to hearing back from you (and jumping right into this code change). 👀
Best regards,
Gerson Jr.

@veekaybee
Contributor

The branch is called logprob-completions and you can find a reference of it here: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1

Feel free to find us on the EleutherAI Discord in the #lm-thunderdome channel, same username :)

@gmottajr

gmottajr commented Jan 4, 2024

Hi @veekaybee, I'm not sure why, but all of your hyperlinks are pointing to a PR instead of where you aim to point to. This link, for example, does not take me to any branch: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1.
The same thing is happening with your Discord link.
That is why it really looks like there is something weird happening when you send links, and I am wondering whether you might actually be trying to point to another URL.
Anyway, I'm guessing you are probably talking about this branch here, but I'm not sure: https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions.

Please let me know if I got it correct. I'm going to fork from there.
Could you please send me the Discord link again?
Thank you very much.

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented Jan 4, 2024

https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions yes, this link is correct!
You can visit the #lm-thunderdome channel on discord.gg/eleutherai to ask questions or discuss.

@haileyschoelkopf moved this from Backlog to In progress in LM-Eval Support and Development Jan 4, 2024
@veekaybee
Contributor

@gmottajr were you able to get started on this? Let me know if you need any help, or I'm happy to put together a sample PR for us to review.

@artemorloff
Contributor

Have you tried using the logit_bias param for openai.chat.completions (e.g. with gpt-4)? This param allows for generating only those tokens that have been previously biased (with a quite large value). The model runs a forward pass and gets logits for the current token, then some value is added to the particular logits (for example, to the logits of the tokens for the letters A, B, C, D for mmlu), and then log-probs are computed. gpt-4 does not return the probs for input tokens, but if you pass only the ctx and bias the possible continuation logits with a large enough value, the model will exclusively generate tokens with these biased logits. gpt-4 can return log-probs for generated tokens.

So it may be a way to make the OpenAI chat model generate only the tokens that we want and get their log-probs, which will still be comparable (as long as all candidate tokens are biased by the same value).
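For reference, a sketch of the suggested logit_bias approach (parameter names follow the openai v1 Python client; the question text is a placeholder). Note that a later comment in this thread reports that OpenAI's returned logprobs no longer reflect logit biases, so this workaround no longer works against the OpenAI API itself.

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")

letters = ["A", "B", "C", "D"]
# logit_bias keys are token-id strings; +100 is the maximum allowed bias.
bias = {str(enc.encode(letter)[0]): 100 for letter in letters}

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "<MMLU question and options here> Answer:"}],
    max_tokens=1,
    logit_bias=bias,
    logprobs=True,
    top_logprobs=4,
    temperature=0.0,
)

# The single generated token should be one of the biased letters; the letters'
# logprobs (where they appear in top_logprobs) remain comparable because every
# candidate received the same +100 bias.
for t in resp.choices[0].logprobs.content[0].top_logprobs:
    print(t.token, t.logprob)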

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented May 23, 2024

Had posted this on discord a while back, but cross-posting an update:

Unfortunately logits (with echo=True) are still not supported by the vast majority of API models, and now logit biases cannot be used to work around this: following https://arxiv.org/abs/2403.06634 and https://arxiv.org/abs/2403.09539, the logprobs returned on outputs no longer reflect logit biases.

There are probably other hacks that could be applied to "simulate" ranking-based multiple choice or to try to artificially, say, constrain an OAI chat model to only output A, B, C, or D by using logit biases, but these all seem worse than just using free-form generation to evaluate these "chatty" models.

What we'd like to move toward long-term, therefore, is to have tasks support multiple "presets" so that one can eval MMLU generatively or using loglikelihoods. The caveat here is that there is a tension between giving users more options to play with and keeping the advantages that one standardized task implementation gives. We haven't yet decided where the right point to strike is: certainly we need to allow at least a bit more configurability to make the currently loglikelihood-only tasks usable for API or local-server models, but we also don't want to move too far in the other direction in doing so.

If people do have feedback on this front, it is appreciated. I hope this provides context on the logits-for-chat-APIs front, though.

@github-project-automation bot moved this from In progress to Done in LM-Eval Support and Development May 23, 2024
@haileyschoelkopf added the declined label May 23, 2024
@djstrong
Contributor

Almost all the Polish tasks we have created come in two versions, multiple choice and generative, e.g. https://github.com/speakleash/lm-evaluation-harness/tree/polish2/lm_eval/tasks/polish_ppc
Results are published on https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard with task name suffixes _g or _mc. There is also a gpt-3 model tested, but only on _g tasks.

@binxuan

binxuan commented Jul 13, 2024

Could we set logprobs to a large number for vLLM and the OpenAI completion API so that we can do the multiple-choice task using one-token generation?

@monk1337

I am wondering if there is a solution for this. I am using an API provider other than OpenAI (with the OpenAI schema), but I'm getting the same error, 'No support for logits'. What's the solution?

@dimcall

dimcall commented Aug 6, 2024

Is there any solution to the "No support for logits" error when using the LM Harness for multiple choice tasks with chat.completions from OpenAI?

@haileyschoelkopf
Collaborator Author

Some "Completions" type endpoints return logits, but no, if the API does not expose this option there is not currently a way to access the required information for loglikelihood-based tasks.

@pasky
Contributor

pasky commented Dec 30, 2024

As mentioned earlier in this thread, multiple choice benchmarks can be measured using generation and exact match, and we’ll probably want to support this — OpenAI’s evals framework likely to some extent exhibits how they do this. https://github.com/openai/evals

I just wanna mention I took a stab at this in #2601 - it kinda works for me in simple scenarios, though I could certainly think of cleaner approaches!
