feat(llm): convert function call request for non-funcall OSS model #4711
Conversation
Force-pushed from d6e1920 to 239ebc4.
Co-authored-by: Calvin Smith <[email protected]>
This is a great idea and a great PR. We should be doing this; it will be better to clear out that FC/non-FC code... However, I do have some (unbaked) thoughts. For the sake of clarity I'll express them a bit roughly, even though it's hard to be sure of this kind of stuff. I think those results are interesting because they ... don't match expectations. My intuitions were that:
I think a possible explanation of these results, in part, is that we are now doing a fairly extreme version of optimizing prompts for Claude. We're not just "not helping" these other LLMs; we might be prompting in some ways that are "bad" for them. Just for illustration, a few examples of what I mean:
I feel like we may need to consider prompting better, and testing/responding better, for like 3-4 LLMs, of which at least one is OSS... But I don't know if/how we may want to square this circle, to both:
Other note:
@enyst great questions!
One potential reason I'm seeing is that most OSS LLMs that support function calling use, under the hood, a JSON format: https://docs.together.ai/docs/llama-3-function-calling -- and you know what happens when you ask an LLM to produce code escaped inside JSON :) https://aider.chat/2024/08/14/code-in-json.html
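For a concrete illustration of the escaping problem (this example is mine, not from the PR; the `write_file` tool name is made up): code carried inside a JSON string must have every newline and quote escaped, and the model has to emit those escapes token by token.

```python
import json

# A small Python snippet as it would appear in a file.
code = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# The same snippet as a JSON tool-call argument: newlines and quotes
# must all be escaped, which is exactly where models tend to slip.
payload = {"name": "write_file", "arguments": {"content": code}}
print(json.dumps(payload))
# {"name": "write_file", "arguments": {"content": "def greet(name):\n    print(f\"Hello, {name}!\")\n"}}
```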
For Gemini, the bad function calling result is more likely a bug/artifact. It is able to call tools correctly early in the interaction, but it starts to add a weird "fields" key to the tool calls later in the same trajectory 😓
We could craft different prompts for different models now, but it feels to me that this is (1) time-consuming and (2) hard to guarantee stability -- new models are coming out all the time, and it is really hard to craft ONE prompt that works well on all of them. My inclination now is that we are optimizing for ... But @Jiayi-Pan and I are working on a research project, due to be released in the next month, that would allow us to train OSS models on arbitrary OpenHands prompts, specialized for OpenHands tasks - this will likely be a more fundamental solution for OSS models, IMO.
Hey @xingyaoww! Have you seen this? https://x.com/ollama/status/1861442333385892026
I looked into the Gemini native API, and its docs mention 'fields'. I guess the model must have been trained with ...
So what can be done to make FC-free tooling for open-weight models (Llama, DeepSeek, Qwen) as good as the big APIs with function calls? Also, regarding prompt strategies, why not use AutoPrompt methods to tweak the system prompt so that each model can adapt with variation? P.S. Thanks to @SmartManoj for relating this to the other stuff in SmartManoj#134 (reply in thread)
Also, a bit of a self-note regarding function call benchmarks:
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds a general utility that automatically converts function-calling LLM requests to non-function-calling LLM requests under the hood.
The set of models this is enabled for in llm.py was chosen based on our evaluation results below.

Evaluation results so far:
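For illustration only, here is a minimal sketch of what such a conversion could look like; the function name, prompt wording, and tag format below are my assumptions, and the actual implementation in this PR is the source of truth:

```python
import json

# Hypothetical in-context tool-calling prompt (the prompt the PR
# actually uses may differ).
IN_CONTEXT_TOOL_PROMPT = (
    "You have access to the following tools:\n{tool_descriptions}\n\n"
    "To call a tool, respond with:\n"
    "<function=TOOL_NAME>\n"
    "<parameter=PARAM_NAME>VALUE</parameter>\n"
    "</function>"
)


def convert_fncall_request(messages: list[dict], tools: list[dict]) -> list[dict]:
    """Inline the tool definitions into the system prompt instead of
    passing them through the provider's native `tools` parameter."""
    tool_descriptions = "\n".join(
        f"- {t['function']['name']}: {t['function']['description']}\n"
        f"  parameters: {json.dumps(t['function']['parameters'])}"
        for t in tools
    )
    suffix = IN_CONTEXT_TOOL_PROMPT.format(tool_descriptions=tool_descriptions)
    converted = [dict(m) for m in messages]
    if converted and converted[0]["role"] == "system":
        # Append the tool instructions to the existing system prompt.
        converted[0]["content"] += "\n\n" + suffix
    else:
        converted.insert(0, {"role": "system", "content": suffix})
    return converted
```

The model's plain-text response then has to be parsed back into structured tool calls (e.g. by matching the `<function=...>` blocks) so the rest of the agent loop can remain unchanged.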
Link of any specific issues this addresses
Should fix #4865
To run this PR locally, use the following command: