Add Logits to OpenAI ChatCompletions model #1196
I've been informed that returning the logprobs of the prompt / input (which we would want in order to fully support logprob-based tasks) is not included. However, supporting logprobs for ChatCompletions-mirroring APIs that do return them, including for the prompt, is still worth pursuing.
Hi Hailey,
Language models return output in the form of "logits" / "logprobs" (log probabilities) over their vocabulary of possible next tokens. You can use these to sample text, probabilistically or deterministically, or they allow you to get the model's estimated (log) probability of a string. However, many closed model providers such as OpenAI stopped providing this info to users and just gave generated text out. They've now re-added that feature for their chat models. This will allow us to run these models on loglikelihood-based multiple-choice QA tasks like MMLU. The steps to complete this feature would follow how we already handle logprobs in OpenAICompletionsLM.
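For concreteness, here's a minimal sketch, assuming the openai v1 Python SDK and an API key in the environment, of what pulling per-token logprobs out of ChatCompletions looks like (note these are logprobs for generated tokens only, which is exactly the limitation discussed below):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "2 + 2 ="}],
    max_tokens=1,
    logprobs=True,    # per-token logprobs for the completion
    top_logprobs=5,   # plus the 5 most likely alternatives at each position
)
first = resp.choices[0].logprobs.content[0]
print(first.token, first.logprob)
for alt in first.top_logprobs:
    print(" alt:", alt.token, alt.logprob)
```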
Hi all, happy to help with this since I'm more familiar with the OpenAI side of things. I started by checking to see if OpenAI exposes the prompt logprobs (via echo) on ChatCompletions, and received an error, which makes sense since the ChatCompletions API doesn't accept that parameter.
Something I'm missing in understanding: if in the task we only use the logprobs of the response for the task evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?
For multi-token continuations/targets, to get the loglikelihood of the whole target we need the logprob of token 0 of the target conditioned on the input/context, the logprob of token 1 of the target conditioned on (context + token 0 of the target), and so on. With echo=True, we feed in (context + continuation) and get back all the logprobs, which we can subset to recover the target string's loglikelihood. If we don't have echo=True, then we can only feed in the inputs; if the model does not output token 0, the logprob at target position 1 does not correspond to the right thing, and we can't accurately compute the loglikelihood in a single API call for multi-token continuations.
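A sketch of that echo=True flow, assuming a Completions-style endpoint that still honors echo together with logprobs (e.g. a local OpenAI-compatible server; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()
context = "Question: What is the capital of France?\nAnswer:"
continuation = " Paris"

resp = client.completions.create(
    model="davinci-002",
    prompt=context + continuation,
    max_tokens=0,   # generate nothing; we only want the prompt's logprobs
    echo=True,      # return logprobs for the prompt tokens themselves
    logprobs=1,
)
lp = resp.choices[0].logprobs
# Subset to the continuation's tokens using the returned text offsets.
target_logprobs = [
    token_lp
    for offset, token_lp in zip(lp.text_offset, lp.token_logprobs)
    if offset >= len(context)
]
print("loglikelihood of continuation:", sum(target_logprobs))
```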
Thanks, this is super helpful. I also just tested on Together and am getting the same issue, because it's OpenAI Completions-compatible:

```python
from openai import OpenAI

system_content = "You are a travel agent. Be descriptive and helpful."
user_content = "Tell me about San Francisco"

client = OpenAI(
    api_key="TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)
chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    temperature=0.7,
    max_tokens=1024,
    echo=True,  # the chat endpoint rejects this argument
)
response = chat_completion.choices[0].message.content
print("Together response:\n", response)
```

```
TypeError: Completions.create() got an unexpected keyword argument 'echo'
```

It sounds like, based on this, the next step is figuring out where we'd want this output to feed into. It looks like the OpenAI model code would be a good place to start to see how to get the logprobs in the completions call, and then we'd want to see how to pull them into the harness, the same way we do elsewhere.
Darn, that makes sense! Thanks for checking on this.
We could conceivably achieve measurement of target string loglikelihood in O(target len) API calls, but this is quite expensive. Further, since OpenAI only gives up to 5 token logprobs (I think), we are limited by this: if the logprobs for our next desired target token do not appear in the top 5, we are out of luck. Given this, one option may be to add chat templating for OpenAI Completions models once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step? Orthogonal to this, we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple-choice tasks, so that we have both variants in some unified way.
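For illustration, a sketch of that expensive O(target len) fallback: one ChatCompletions call per target token, reading each token's logprob out of the top-5 alternatives. This is only an approximation, since the chat API gives no way to prefill the assistant's reply; appending the partial target to the user message is a rough stand-in for true conditioning on (context + target-so-far).

```python
from openai import OpenAI

client = OpenAI()

def stepwise_loglikelihood(context: str, target_tokens: list[str]) -> float | None:
    total, prefix = 0.0, context
    for tok in target_tokens:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prefix}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        match = next((alt for alt in top if alt.token == tok), None)
        if match is None:
            return None  # desired token not in the top 5: out of luck
        total += match.logprob
        prefix += tok
    return total
```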
Ah, this is good to know, thanks!
Yep, local completions sounds good. Want to make sure I'm thinking the right way: add case statements in here for branching, and if it's a local model, set echo=True and test.
For the unified greedy-gen vs. multichoice loglikelihood piece: sounds like possibly a separate issue to file and link to this one?
> Add case statements in here for branching - if it's a local model, set echo=True and test

We should always use echo=True for remote and local models alike! The additions needed for local-completions would be roughly the same as those for local-chat-completions: tokenizer backend selectable between HF/OpenAI (with the exact tokenizer specifiable by the user), plus a configurable base_url.
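To make those knobs concrete, a hypothetical sketch of what the constructor surface could look like; the class and argument names here are illustrative, not the harness's actual API:

```python
from typing import Optional

import tiktoken
import transformers

class LocalCompletionsLM:
    def __init__(
        self,
        model: str,
        base_url: str = "http://localhost:8000/v1",  # user-configurable endpoint
        tokenizer_backend: str = "huggingface",      # "huggingface" or "tiktoken"
        tokenizer: Optional[str] = None,             # exact tokenizer, if it differs from `model`
    ) -> None:
        self.model = model
        self.base_url = base_url
        if tokenizer_backend == "huggingface":
            self.tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer or model)
        elif tokenizer_backend == "tiktoken":
            self.tokenizer = tiktoken.encoding_for_model(tokenizer or model)
        else:
            raise ValueError(f"unknown tokenizer_backend: {tokenizer_backend}")
```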
Definitely, will open it for tracking! Just mentioning aloud that our best solution here may (unfortunately) be to forgo loglikelihood entirely.
Ah, ok, so these changes would be to enable local-completions, which is not the same as local-chat-completions but does have the ability to run with echo=True (for legacy reasons), to also run locally (just clarifying for myself). That makes sense; I can start on those changes!
That's correct!
Cool. Just started working on this. One question on the tokenizer: are we calling it a different way, or excluding it entirely for local-chat-completions?
The OpenAI ChatCompletions API actually fully abstracts away its tokenizer, so we don't need a tokenizer for local-chat-completions at all.
Ok, so here is a fun dilemma: in starting to implement the code (https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1), I found that, due to an upstream issue, this doesn't work as hoped. A couple of ideas came to mind, or there may be something else that I'm not aware of that might be helpful here. Let me know what you think.
*Sigh.* I think in this case we should support local Completions models with echo=True, then.
Hi Hailey and Vicki, I'm still interested in working on this issue with you.
Hey @gmottajr, absolutely! Does the conversation and code here so far make sense? Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work? Let me know if you need additional pointers or help (and for sure Hailey can assist better than me in cases of "do we want to do X this way?").
This has been a very interesting conversation, thank you @veekaybee @haileyschoelkopf. On a separate but related note, @haileyschoelkopf, do you know how exactly OpenAI runs these benchmarks? Surely they must need the same logprobs that we do. One possibility is that they run against some internal API where they do have access to the logprobs; it's just not opened up.
OpenAI certainly has access to model logprobs if they desire them (and definitely uses them for things like evaluating perplexity). As mentioned earlier in this thread, multiple-choice benchmarks can be measured using generation and exact match, and we'll probably want to support this. OpenAI's evals framework likely exhibits, to some extent, how they do it: https://github.com/openai/evals
Awesome @veekaybee! 🥳
Thank you very much for your positive response, @veekaybee. 😄 🚀
The branch is called logprob-completions. Feel free to find us on the EleutherAI Discord in the #lm-thunderdome channel, same username :)
Hi @veekaybee, I'm not sure why, but all of your hyperlinks are pointing to a PR instead of where you aim to point. This link, for example, does not take me to any branch: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1. Please let me know if I got it correct; I'm going to fork from there.
Yes, this link is correct: https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions
@gmottajr were you able to get started on this? Let me know if you need any help, or I'm happy to put together a sample PR for us to review.
Have you tried using the logit_bias param for openai.chat.completions, e.g. with gpt-4? This param makes it possible to generate only those tokens that have previously been biased (with a large enough value): the model does a forward pass and gets logits for the current token, then some value is added to particular logits (for example, to the logits of the tokens for the letters A, B, C, D for MMLU), and then log-probs are computed. gpt-4 does not return the probs for input tokens. But if you pass only the context and bias the possible continuation logits with a large enough value, the model will exclusively generate tokens with those biased logits, and gpt-4 can return log-probs for generated tokens. So this may be a way to make the OpenAI chat model generate only the tokens we want and get their log-probs, which will still be comparable (as long as we bias all tokens by the same value).
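A sketch of that logit_bias trick, assuming the openai v1 SDK and tiktoken; gpt-4 uses the cl100k_base vocabulary, so the answer letters' token IDs are looked up there, and the question text is a placeholder:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

# Single-token IDs for the answer letters we want to allow.
choice_ids = [enc.encode(letter)[0] for letter in ("A", "B", "C", "D")]

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Which is prime?\nA. 4\nB. 6\nC. 7\nD. 8\nAnswer:"}],
    max_tokens=1,
    # Bias all four answer tokens equally and heavily so the model can
    # effectively emit only one of them.
    logit_bias={str(tid): 100 for tid in choice_ids},
    logprobs=True,
    top_logprobs=4,
)
# Because every choice received the same bias, the returned logprobs for
# the generated token remain comparable across A/B/C/D.
for alt in resp.choices[0].logprobs.content[0].top_logprobs:
    print(alt.token, alt.logprob)
```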
Had posted this on Discord a while back, but cross-posting an update: unfortunately, logprobs with echo (i.e., scoring prompt tokens) aren't going to be available via the ChatCompletions API, as far as we can tell.

There are probably other hacks that could be applied to "simulate" ranking-based multiple choice, or to try to artificially constrain an OAI chat model to only output A, B, C, or D by using logit biases, but these all seem worse than just using free-form generation to evaluate these "chatty" models.

What we'd like to move toward long-term, therefore, is to have tasks support multiple "presets" so that one can eval MMLU generatively or using loglikelihoods. The caveat here is that there is a tension between giving users more options to play with and keeping the advantages that one standardized task implementation gives. We haven't yet decided where the right point to strike is; we certainly need to allow at least a bit more configurability to make the currently loglikelihood-only tasks usable for API or local server models, but we also don't want to move too far in the other direction in doing so.

If people have feedback on this front, it's appreciated. Hope this provides context on the logits-for-chat-APIs front, though!
Almost all of the Polish tasks we have created come in two versions, multiple choice and generative, e.g. https://github.com/speakleash/lm-evaluation-harness/tree/polish2/lm_eval/tasks/polish_ppc
Could we set …?
I am wondering if there is a solution for this. I am using an API provider other than OpenAI (via the OpenAI schema) but am getting the same error, "No support for logits". What's the solution?
Is there any solution to the "No support for logits" error when using the LM Harness for multiple-choice tasks with chat.completions from OpenAI?
Some "Completions" type endpoints return logits, but no, if the API does not expose this option there is not currently a way to access the required information for loglikelihood-based tasks. |
I just wanna mention I took a stab at this in #2601. It kinda works for me in simple scenarios, though I could certainly think of cleaner approaches!
OpenAI added logits back to their ChatCompletions API. This means we can re-add support for all tasks to this LM type! See OpenAICompletionsLM for an example of how to do this. Contributions on this would be very welcome if people have the bandwidth, as I may not get to this super soon, but it's high priority.