
Add Logits to OpenAI ChatCompletions model #1196

Closed
haileyschoelkopf opened this issue Dec 22, 2023 · 33 comments
Labels
declined: A proposed dataset or feature request that will not be implemented.
feature request: A feature that isn't implemented yet.
help wanted: Contributors and extra help welcome.

Comments

@haileyschoelkopf
Collaborator

OpenAI added logits back to their ChatCompletions API.

This means we can re-add support for all tasks to this LM type! See OpenAICompletionsLM for an example of how to do this.

Contributions on this would be very welcome if people have the bandwidth, as I may not get to this super soon but it's high priority.
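For reference, here is a minimal sketch of what the newly re-added ChatCompletions logprobs look like with the openai v1 Python client (the model name is just an example):

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=5,
    logprobs=True,
    top_logprobs=5,
)

# Per-token logprobs are attached to each generated (not prompt) token.
for tok in resp.choices[0].logprobs.content:
    print(tok.token, tok.logprob, [t.token for t in tok.top_logprobs])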

@haileyschoelkopf added the help wanted and feature request labels Dec 22, 2023
@haileyschoelkopf
Collaborator Author

I've been informed that returning the logprobs of the prompt / input, which we would need in order to support loglikelihood requests, is not included.

However, supporting logprobs for ChatCompletions-mirroring APIs that do return logprobs including for echo=True would still be very desirable.

@gmottajr

Hi Hailey,
I'm interested in contributing to this GitHub issue, but I would appreciate some additional context to better understand the task. Could you please provide more details? Specifically, it would be helpful if you could instruct me on which file(s) are affected and what specific changes are needed.
Actually, overall, I need more clarification.
Thank you very much for your attention to this matter,
looking forward to hearing from you!
Cheers!

@haileyschoelkopf
Collaborator Author

Language models return output in the form of "logits" / "logprobs" (log probabilities) over their vocabulary of possible next tokens. You can use these to sample text probabilistically or deterministically, or to get the model's estimated (log) probability of a given string.

However, many closed model providers such as OpenAI stopped providing this info to users and just returned generated text. They've now re-added that feature for their chat models. This will allow us to run the models on loglikelihood-based multiple-choice QA tasks like hellaswag, which are currently implemented by taking an LM, running it on each (input + answer), asking it to return the logprobs on not just the answer but also the input (echo=True allows this in OpenAI's Completions API), and then taking only the logprobs from the answer portion. We then compare the log probability of each multiple-choice answer and say the model chose the one it thinks is most likely. Currently, because logprobs weren't available in OpenAI's ChatCompletions, we couldn't evaluate on the tasks we defined this way, and would need to write a new task that scores by just asking the model to generate text and checking whether it matches the right answer.
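To make the scoring scheme concrete, here is a minimal sketch of that comparison; loglikelihood_of is a hypothetical helper (not part of the harness) standing in for "sum of the continuation's token logprobs given the context", e.g. obtained from an echo=True Completions call.

def pick_answer(context: str, choices: list[str], loglikelihood_of) -> int:
    """Return the index of the answer option the model considers most likely.

    loglikelihood_of(context, continuation) is a hypothetical helper that
    sums the logprobs of the continuation's tokens conditioned on the context.
    """
    scores = [loglikelihood_of(context, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])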

The steps to complete this feature would be:

@veekaybee
Contributor

veekaybee commented Dec 28, 2023

Hi all, happy to help with this since I'm more familiar with the OpenaiChatCompletionsLM class after working on it a bit. Let me know if you'd like support or scaffolding for this!

I started by checking to see if OpenAI exposes the echo parameter for their chat completion models:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
    echo=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

and received

File "pyenv/versions/3.10.0/lib/python3.10/site-packages/openai/_utils/_utils.py", line 272, in wrapper
    return func(*args, **kwargs)
TypeError: Completions.create() got an unexpected keyword argument 'echo'

which makes sense since the CLIChatCompletionCreateArgs doesn't allow for it

@veekaybee
Contributor

Something I'm missing in understanding: if in the task we only use the logprobs of the response for the evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?

taking an LM, running it on each (input + answer) and asking it to return the log probs on not just the answer, but also the input (echo=True allows this in OpenAI's Completions API), then taking only the logprobs from the answer portion.

@haileyschoelkopf
Collaborator Author

Something I'm missing in understanding: if in the task we only use the logprobs of the response for the evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?

For multi-token continuations/targets, to get the loglikelihood of the whole target, we need to get the logprob of token 0 conditioned on the input/context, the logprob of token 1 of the target conditioned on (context + token 0 of target), and so on.

With echo=True, we feed in (context + continuation) and get out all the logprobs which we can subset to the target string’s loglikelihood.

If we don’t have echo=True, then we can only feed in the inputs. Then, if the model does not output token 0, the logprob of token 1 at target position 1 does not correspond to the right thing, and we can’t accurately compute the loglikelihood in a single API call for multi-token continuations.
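As an illustration, here is a minimal sketch of the echo=True flow against the legacy Completions endpoint, roughly what OpenAICompletionsLM does internally; the model name is a placeholder (it must be a Completions model that still allows echo together with logprobs), and the response fields follow the openai v1 Python client.

from openai import OpenAI

client = OpenAI()

def target_loglikelihood(context: str, continuation: str) -> float:
    # Feed (context + continuation), request no new tokens, and echo the
    # prompt back with per-token logprobs attached.
    resp = client.completions.create(
        model="davinci-002",  # placeholder: must support echo together with logprobs
        prompt=context + continuation,
        max_tokens=0,
        echo=True,
        logprobs=1,
        temperature=0.0,
    )
    lp = resp.choices[0].logprobs
    # Keep only tokens whose character offset falls inside the continuation;
    # the very first prompt token has a logprob of None and is skipped.
    return sum(
        logprob
        for offset, logprob in zip(lp.text_offset, lp.token_logprobs)
        if logprob is not None and offset >= len(context)
    )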

@veekaybee
Contributor

Thanks, this is super helpful. I also just tested on Together and am getting the same issue, because it's OpenAICompletions compatible:

from openai import OpenAI

system_content = "You are a travel agent. Be descriptive and helpful."
user_content = "Tell me about San Francisco"

client = OpenAI(
    api_key="TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)

chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    temperature=0.7,
    max_tokens=1024,
    echo=True
)

response = chat_completion.choices[0].message.content
print("Together response:\n", response)

TypeError: Completions.create() got an unexpected keyword argument 'echo'

It sounds like, based on this, the next step is: investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them.

Where would we want this output to feed into?

It looks like this would be a good place to start to see how to get them in the completions call? And then we'd want to see how to pull them in here, the same way we do here?

@haileyschoelkopf
Collaborator Author

Darn, that makes sense! Thanks for checking on this.

It sounds like, based on this, the next step is: investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them.
Where would we want this output to feed into?

We could conceivably achieve measurement of target string loglikelihood in O(target len) API calls but this is quite expensive.

Further, since OpenAI only gives up to 5 token logprobs I think, we are limited by this—if the logprobs for our next desired target token do not appear in the top 5 we are out of luck.
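For illustration, a sketch of that O(target length) workaround using only what ChatCompletions exposes: one request per target token, reading the returned top-5 logprobs. It fails whenever the next target token is not among them, and the chat template means the model is not literally continuing the raw string, so this is an approximation at best.

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")  # assumption: tokenizer matches the API model

def approx_loglikelihood(context: str, target: str, model: str = "gpt-4") -> float | None:
    """One ChatCompletions call per target token; returns None as soon as a
    target token falls outside the top-5 logprobs the API will return."""
    total, prefix = 0.0, ""
    for tok_id in enc.encode(target):
        tok_str = enc.decode([tok_id])
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": context + prefix}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,
            temperature=0.0,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        match = next((t for t in top if t.token == tok_str), None)
        if match is None:
            return None  # desired token not in the top 5, out of luck
        total += match.logprob
        prefix += tok_str
    return total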

Given this, one option may be to add Chat Templating support for local OpenAI Completions once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step?

Orthogonal to this, we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple choice tasks to have both variants in some unified way.

@veekaybee
Contributor

Further, since OpenAI only gives up to 5 token logprobs I think, we are limited by this—if the logprobs for our next desired target token do not appear in the top 5 we are out of luck.

Ah, this is good to know, thanks!

Chat Templating support for local OpenAI Completions once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step?

Yep, local completions sounds good: want to make sure I'm thinking the right way:

  1. Add case statements in here for branching: if it's a local model, set echo=True and test.

For this piece:

we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple choice tasks to have both variants in some unified way.

Sounds like possibly a separate issue to file and link to this one?

@haileyschoelkopf
Collaborator Author

Yep, local completions sounds good: want to make sure I'm thinking the right way:

Add case statements in here for branching: if it's a local model, set echo=True and test

We would always need to use echo=True, for both remote and local models! The additions needed for local-completions would be roughly the same as those for local-chat-completions (tokenizer backend selectable between HF/openai, with the exact tokenizer specifiable by the user, plus a configurable base_url).
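For context, here is a minimal sketch of hitting such a local OpenAI-compatible server from the client side. The base_url, api_key, and model name are placeholders, and it assumes the server implements echo and logprobs on the legacy /v1/completions route (vLLM's OpenAI-compatible server, for example, aims to).

from openai import OpenAI

# Placeholders: a locally hosted OpenAI-compatible server and whatever model it serves.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",  # placeholder: the locally served model
    prompt="Question: What is the capital of France?\nAnswer: Paris",
    max_tokens=0,
    echo=True,    # assumption: the local server honors echo on /v1/completions
    logprobs=1,
)
print(resp.choices[0].logprobs.token_logprobs)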

Sounds like possibly a separate issue to file and link to this one?

Definitely, will open it for tracking! Just mentioning aloud that our best solution here may (unfortunately) be to forgo loglikelihood entirely.

@veekaybee
Contributor

Ah, ok, so these changes would be to enable local-completions, which is not the same as local-chat-completions but does have the ability to run echo=True (for legacy reasons), to also run locally (just clarifying for myself).

That makes sense, can start on those changes!

@haileyschoelkopf
Collaborator Author

That's correct!

@veekaybee
Contributor

Cool. Just started working on this, mirroring OpenaiChatCompletionsLM, and noticed we removed the tokenizer from that class.

Are we calling it a different way or excluding it entirely for OpenaiChatCompletionsLM?

@haileyschoelkopf
Collaborator Author

The OpenAI ChatCompletions API actually fully abstracts away its tokenizer, so we don't need a tokenizer for openai-chat-completions! The same isn't true for Completions, there we still need a tokenizer.
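To illustrate why the Completions path still needs one, here is a small sketch of using a tokenizer (tiktoken here; an HF tokenizer would play the same role for non-OpenAI local models) to find where the context ends, so that echoed logprobs can be split into a context part and a continuation part.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")

context = "Question: The capital of France is"
continuation = " Paris"

ctx_ids = enc.encode(context)
full_ids = enc.encode(context + continuation)
# Assumes the concatenation shares the context's prefix tokens (usually true).
cont_ids = full_ids[len(ctx_ids):]

# With echo=True, the response's token_logprobs align with full_ids, so the
# last len(cont_ids) entries are the ones summed for the continuation.
print(len(ctx_ids), cont_ids)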

@veekaybee
Contributor

Ok, so here is a fun dilemma:

In starting to implement the code (https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1), I found that, due to this issue,
I needed to update the default base model to gpt-3.5-turbo-instruct based on their suggested change, but that model doesn't support both logprobs and echo=True! 🤦‍♀️


A couple ideas:

  • I can focus on testing for the local case and set the OpenAI model to None
  • set echo but not logprobs

Or there might be something else that I'm not aware of that would be helpful here. Let me know what you think.

@haileyschoelkopf
Collaborator Author

sigh

I think in this case we should support local Completions models with echo=True (assuming that these will continue to support echo=True), and otherwise, for non-local / OpenAI native cases, raise an error saying logits of the form we need are not supported upon trying to run loglikelihood() or loglikelihood_rolling() methods.
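A minimal sketch of what such a guard could look like; the class and method names mirror the harness's LM interface discussed above, but this is illustrative rather than the actual implementation.

class OpenaiChatCompletionsLM:  # simplified, illustrative skeleton
    def loglikelihood(self, requests):
        raise NotImplementedError(
            "Loglikelihood (and therefore multiple-choice) tasks are not supported "
            "for chat-completions APIs that cannot return prompt logprobs (echo=True). "
            "Use a (local-)completions model, or a generation-based task variant."
        )

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError(
            "loglikelihood_rolling is not supported for chat-completions APIs "
            "without echo=True."
        )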

@gmottajr

gmottajr commented Jan 2, 2024

Hi Hailey and Vicki, I'm still interested in working on this issue with you.
Do you believe there is any room left for another developer to jump on it with Vicki?

@veekaybee
Contributor

Hey @gmottajr, absolutely! Does the conversation and code here so far make sense?

Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work?

Let me know if you need additional pointers or help (and for sure Hailey can also assist better than me in cases of "do we want to do X this way?").

@anjor
Contributor

anjor commented Jan 3, 2024

This has been a very interesting conversation, thank you @veekaybee @haileyschoelkopf .

On a separate but related note, @haileyschoelkopf do you know how exactly OpenAI runs these benchmarks? Surely they must need the same logprobs that we do. One possibility is that they are running against some internal API where they do have access to the logprobs, but it's just not opened up.

@haileyschoelkopf
Collaborator Author

OpenAI certainly has access to model logprobs if they desire them (and definitely uses them for things like evaluating perplexity).

As mentioned earlier in this thread, multiple choice benchmarks can be measured using generation and exact match, and we’ll probably want to support this — OpenAI’s evals framework likely to some extent exhibits how they do this. https://github.com/openai/evals

@gmottajr

gmottajr commented Jan 4, 2024

Awesome @veekaybee! 🥳

Hey @gmottajr, absolutely! Does the conversation and code here so far make sense?

Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work?

Let me know if you need additional pointers or help (and for sure Hailey can also assist better than me in cases of "do we want to do X this way?").

Thank you very much for your positive response, @veekaybee. 😄 🚀
That sounds like music to my ears! 🎼 🎵 🤩
I would like to fork the branch you mentioned, but could not really find it.
Yeah, for sure, I do need some additional pointers and help. How does it sound if we talk directly through Slack, Discord, MS Teams, or any other collaboration software more familiar to you?
Looking forward to hearing back from you (and jumping right into this code change). 👀
Best regards,
Gerson Jr.

@veekaybee
Contributor

The branch is called logprob-completions and you can find a reference of it here: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1

Feel free to find us on the EleutherAI Discord in the #lm-thunderdome channel, same username :)

@gmottajr

gmottajr commented Jan 4, 2024

Hi @veekaybee, I'm not sure why, but all of your hyperlinks are pointing to a PR instead of where you aim to point to. This link, for example, does not take me to any branch: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1.
The same thing is happening with your Discord link.
That is why it really looks like there is something weird happening when you send links, and I am wondering whether you might actually be trying to point to another URL.
Anyway, I'm guessing you are probably talking about this branch here, but I'm not sure: https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions.

Please let me know if I got it correct. I'm going to fork from there.
Could you please send me the Discord link again?
Thank you very much.

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented Jan 4, 2024

https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions yes, this link is correct!
You can visit the #lm-thunderdome channel on discord.gg/eleutherai to ask questions or discuss.

@haileyschoelkopf moved this from Backlog to In progress in LM-Eval Support and Development Jan 4, 2024
@veekaybee
Contributor

@gmottajr were you able to get started on this? Let me know if you need any help, or I'm happy to put together a sample PR for us to review.

@artemorloff
Contributor

Have you tried using the logit_bias param for openai.chat.completions (e.g. with gpt-4)? This param allows for generating only those tokens that have been previously biased (with a quite large value). The model runs a forward pass and gets logits for the current token, then some value is added to the particular logits (for example, to the logits of the tokens for the letters A, B, C, D for mmlu), and then log-probs are computed. gpt-4 does not return the probs for input tokens, but if you pass only the ctx and bias the possible continuation logits with a large enough value, the model will exclusively generate tokens with these biased logits. gpt-4 can return log-probs for generated tokens.

So it may be a way to make the OpenAI chat model generate only the tokens that we want and get their log-probs, which will still be comparable (as long as all candidate tokens are biased by the same value).
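For reference, a sketch of the suggested logit_bias approach (parameter names follow the openai v1 Python client; the question text is a placeholder). Note that a later comment in this thread reports that OpenAI's returned logprobs no longer reflect logit biases, so this workaround no longer works against the OpenAI API itself.

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")

letters = ["A", "B", "C", "D"]
# logit_bias keys are token-id strings; +100 is the maximum allowed bias.
bias = {str(enc.encode(letter)[0]): 100 for letter in letters}

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "<MMLU question and options here> Answer:"}],
    max_tokens=1,
    logit_bias=bias,
    logprobs=True,
    top_logprobs=4,
    temperature=0.0,
)

# The single generated token should be one of the biased letters; the letters'
# logprobs (where they appear in top_logprobs) remain comparable because every
# candidate received the same +100 bias.
for t in resp.choices[0].logprobs.content[0].top_logprobs:
    print(t.token, t.logprob)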

@haileyschoelkopf
Collaborator Author

haileyschoelkopf commented May 23, 2024

Had posted this on discord a while back, but cross-posting an update:

Unfortunately logits (with echo=True) are still not supported by the vast majority of API models, and now logit biases cannot be used to work around this: following https://arxiv.org/abs/2403.06634 and https://arxiv.org/abs/2403.09539, the logprobs returned on outputs no longer reflect logit biases.

There are probably other hacks that could be applied to "simulate" ranking-based multiple choice or to try to artificially, say, constrain an OAI chat model to only output A, B, C, or D by using logit biases, but these all seem worse than just using free-form generation to evaluate these "chatty" models.

What we'd like to move toward long-term, therefore, is to have tasks support multiple "presets" so that one can eval MMLU generatively or using loglikelihoods. The caveat here is that there is a tension between giving users more options to play with and keeping the advantages that one standardized task implementation gives. We haven't yet decided where the right point to strike is: certainly we need to allow at least a bit more configurability to make the currently loglikelihood-only tasks usable for API or local-server models, but we also don't want to move too far in the other direction in doing so.

If people do have feedback on this front, it is appreciated. I hope this provides context on the logits-for-chat-APIs front, though.

@github-project-automation bot moved this from In progress to Done in LM-Eval Support and Development May 23, 2024
@haileyschoelkopf added the declined label May 23, 2024
@djstrong
Contributor

Almost all the Polish tasks we have created come in two versions, multiple choice and generative, e.g. https://github.com/speakleash/lm-evaluation-harness/tree/polish2/lm_eval/tasks/polish_ppc
Results are published on https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard with task name suffixes _g or _mc. There is also a gpt-3 model tested, but only on _g tasks.

@binxuan

binxuan commented Jul 13, 2024

Could we set logprobs to a large number for vLLM and the OpenAI completion API so that we can do the multiple-choice task using one-token generation?

@monk1337

I am wondering if there is a solution for this. I am using an API provider other than OpenAI (with the OpenAI schema), but I'm getting the same error, 'No support for logits'. What's the solution?

@dimcall

dimcall commented Aug 6, 2024

Is there any solution to the "No support for logits" error when using the LM Harness for multiple choice tasks with chat.completions from OpenAI?

@haileyschoelkopf
Collaborator Author

Some "Completions" type endpoints return logits, but no, if the API does not expose this option there is not currently a way to access the required information for loglikelihood-based tasks.

@pasky
Contributor

pasky commented Dec 30, 2024

As mentioned earlier in this thread, multiple choice benchmarks can be measured using generation and exact match, and we’ll probably want to support this — OpenAI’s evals framework likely to some extent exhibits how they do this. https://github.com/openai/evals

I just wanna mention I took a stab at this in #2601 - it kinda works for me in simple scenarios, though I could certainly think of cleaner approaches!
