
If the LLM generates multiple final tool calls, but one fails validation, the wrong messages are generated #672

Open
jlowin opened this issue Jan 13, 2025 · 1 comment · May be fixed by #926
Labels
bug Something isn't working

jlowin (Collaborator) commented Jan 13, 2025

If an LLM generates two parallel tool calls for a final result tool, and the first one fails validation, there are two problems:

  1. The second "good" call is ignored (arguably fine from a design standpoint).
  2. More critically, because the good call is ignored, no tool result is generated for it, so the message history ends up incomplete. This is problematic.

I have a unit test that demonstrates the issue, but I'm not sure how the maintainers would prefer to solve it. Currently, the design implicitly assumes that the LLM never generates two final tool calls: `result_schema.find_tool(tool_calls)` returns the first tool call that matches a result tool, and any later matches are never examined.
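For illustration, the first-match behavior described above might look like the following minimal sketch. The `find_tool` name comes from the issue; the signature and the `ToolCallPart` stand-in here are simplified assumptions, not the real pydantic-ai types:

```python
from typing import NamedTuple, Optional

class ToolCallPart(NamedTuple):
    """Simplified stand-in for a model tool call (assumed shape)."""
    tool_name: str
    args: dict

def find_tool(
    tool_calls: list[ToolCallPart], result_tool_names: set[str]
) -> Optional[ToolCallPart]:
    """Hypothetical simplification of result_schema.find_tool: returns the
    first call whose name matches a result tool, so a second "good"
    final_result call is never consulted."""
    for call in tool_calls:
        if call.tool_name in result_tool_names:
            return call
    return None
```

Under this first-match rule, a failing first `final_result` call shadows a valid second one, which is exactly the situation the test below constructs.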

```python
# tests/test_agent.py

# inside class TestMultipleToolCalls

    def test_multiple_final_result_are_validated_correctly(self):
        """Tests that if multiple final results are returned, but one fails validation, the other is used."""

        def return_model(_: list[ModelMessage], info: AgentInfo) -> ModelResponse:
            assert info.result_tools is not None
            return ModelResponse(
                parts=[
                    ToolCallPart.from_raw_args('final_result', {'bad_value': 'first'}),
                    ToolCallPart.from_raw_args('final_result', {'value': 'second'}),
                ]
            )

        agent = Agent(FunctionModel(return_model), result_type=self.ResultType, end_strategy='early')
        result = agent.run_sync('test multiple final results')

        # Verify the result came from the second final tool
        assert result.data.value == 'second'

        # Verify we got appropriate tool returns
        assert result.new_messages()[-1].parts == snapshot(
            [
                ToolReturnPart(
                    tool_name='final_result', content='Final result processed.', timestamp=IsNow(tz=timezone.utc)
                ),
                ToolReturnPart(
                    tool_name='final_result',
                    content='Result tool not used - a final result was already processed.',
                    timestamp=IsNow(tz=timezone.utc),
                ),
            ]
        )
```
@samuelcolvin added the bug (Something isn't working) label on Jan 16, 2025
dmontagu (Contributor) commented:
Sorry for the delay in resolving this.

I think what we should do in the short term is ensure that all result tools still get called, but use the first one as the final result.

I'm open to bigger changes to the behavior in the long term, especially if there's an obviously better way to handle it, but calling all of the provided tools (and adding the results to the message history) seems like a straightforward improvement. I'll try to do that today.
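One possible shape for that short-term behavior is sketched below. This is not the actual pydantic-ai implementation: `ToolCall`, `ToolReturn`, and `process_final_calls` are hypothetical stand-ins, assuming a validator that raises `ValueError` on bad arguments:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToolCall:
    """Hypothetical stand-in for a final-result tool call."""
    tool_name: str
    args: dict

@dataclass
class ToolReturn:
    """Hypothetical stand-in for a tool-return message part."""
    tool_name: str
    content: str

def process_final_calls(
    calls: list[ToolCall], validate: Callable[[dict], object]
) -> tuple[Optional[object], list[ToolReturn]]:
    """Use the first call that validates as the final result, but emit a
    ToolReturn for every call so the message history stays complete."""
    final_result: Optional[object] = None
    returns: list[ToolReturn] = []
    for call in calls:
        if final_result is not None:
            # A result was already accepted; still acknowledge this call
            # so every tool call gets a matching tool return.
            returns.append(ToolReturn(
                call.tool_name,
                'Result tool not used - a final result was already processed.',
            ))
            continue
        try:
            final_result = validate(call.args)
            returns.append(ToolReturn(call.tool_name, 'Final result processed.'))
        except ValueError:
            returns.append(ToolReturn(call.tool_name, 'Validation failed.'))
    return final_result, returns
```

The key property is that the number of tool returns always equals the number of tool calls, which keeps the message history well-formed regardless of which call supplies the final result.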

@dmontagu dmontagu linked a pull request Feb 13, 2025 that will close this issue