feat(llm): convert function call request for non-funcall OSS model #4711
Conversation
Force-pushed from d6e1920 to 239ebc4.
Co-authored-by: Calvin Smith <[email protected]>
This is a great idea and a great PR. We should be doing this; it will be better to clear out that FC/non-FC code... However, I do have some (unbaked) thoughts. For the sake of clarity I'll express them a bit roughly, even though it's hard to be sure of this kind of stuff. I think those results are interesting because they ... don't match expectations. My intuitions were that:
I think a possible explanation of these results, in part, is that we are now doing a fairly extreme version of optimizing prompts for Claude. We're not just "not helping" these other LLMs; we might be prompting in some ways that are "bad" for them. Just for illustration, a few examples of what I mean:
I feel like we may need to consider prompting better, and testing/responding better, for like 3-4 LLMs, of which at least one is OSS... But I don't know if/how we may want to square this circle, to both:
Other note:
@enyst great questions!
One potential reason I'm seeing is that most OSS LLMs that support function calling use, under the hood, a JSON format: https://docs.together.ai/docs/llama-3-function-calling -- and you know what happens when you ask an LLM to produce code escaped inside JSON :) https://aider.chat/2024/08/14/code-in-json.html
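For a concrete illustration of the escaping problem (this example is mine, not from the PR; the `write_file` tool name is made up): code carried inside a JSON string must have every newline and quote escaped, and the model has to emit those escapes token by token.

```python
import json

# A small Python snippet as it would appear in a file.
code = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# The same snippet as a JSON tool-call argument: newlines and quotes
# must all be escaped, which is exactly where models tend to slip.
payload = {"name": "write_file", "arguments": {"content": code}}
print(json.dumps(payload))
# {"name": "write_file", "arguments": {"content": "def greet(name):\n    print(f\"Hello, {name}!\")\n"}}
```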
For Gemini, the bad function calling result is more likely a bug/artifact. It is able to call tools correctly early in the interaction, but it starts to add a weird "fields" key to the tool calls later in the same trajectory 😓
We could craft different prompts for different models now, but it feels to me that this is (1) time-consuming and (2) hard to guarantee stability -- new models are coming out all the time, and it is really hard to craft ONE prompt that works well on all of them. My inclination now is that we are optimizing for ... But @Jiayi-Pan and I are working on a research project, due to be released in the next month, that would allow us to train OSS models on arbitrary OpenHands prompts, specialized for OpenHands tasks - this will likely be a more fundamental solution for OSS models, IMO.
Hey @xingyaoww! Have you seen this? https://x.com/ollama/status/1861442333385892026
I looked into the Gemini native API, and its docs mention 'fields'. I guess the model must have been trained with ...
So what can be done to make FC-free tooling for open-weight models (Llama, DeepSeek, Qwen) as good as the big APIs with function calls? Also, regarding prompt strategies, why not use AutoPrompt methods to tweak the system prompt so that each model can adapt with variation? P.S. Thanks to @SmartManoj for relating this to the other stuff in SmartManoj#134 (reply in thread)
Also, a bit of a self-note regarding function call benchmarks:
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR adds a general utility that automatically converts function-calling LLM requests to non-function-calling LLM requests under the hood.
The set of models this is enabled for in llm.py was chosen based on our evaluation results below.

Evaluation results so far:
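For illustration only, here is a minimal sketch of what such a conversion could look like; the function name, prompt wording, and tag format below are my assumptions, and the actual implementation in this PR is the source of truth:

```python
import json

# Hypothetical in-context tool-calling prompt (the prompt the PR
# actually uses may differ).
IN_CONTEXT_TOOL_PROMPT = (
    "You have access to the following tools:\n{tool_descriptions}\n\n"
    "To call a tool, respond with:\n"
    "<function=TOOL_NAME>\n"
    "<parameter=PARAM_NAME>VALUE</parameter>\n"
    "</function>"
)


def convert_fncall_request(messages: list[dict], tools: list[dict]) -> list[dict]:
    """Inline the tool definitions into the system prompt instead of
    passing them through the provider's native `tools` parameter."""
    tool_descriptions = "\n".join(
        f"- {t['function']['name']}: {t['function']['description']}\n"
        f"  parameters: {json.dumps(t['function']['parameters'])}"
        for t in tools
    )
    suffix = IN_CONTEXT_TOOL_PROMPT.format(tool_descriptions=tool_descriptions)
    converted = [dict(m) for m in messages]
    if converted and converted[0]["role"] == "system":
        # Append the tool instructions to the existing system prompt.
        converted[0]["content"] += "\n\n" + suffix
    else:
        converted.insert(0, {"role": "system", "content": suffix})
    return converted
```

The model's plain-text response then has to be parsed back into structured tool calls (e.g. by matching the `<function=...>` blocks) so the rest of the agent loop can remain unchanged.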
Link of any specific issues this addresses
Should fix #4865
To run this PR locally, use the following command: