-
Hey @rawwerks -- thanks for giving TextGrad a try!! For code improvement, we did not want to give ourselves an unfair advantage by using the actual runtime in the objective: since the baseline, Reflexion, does not use it, we wanted a fair comparison. Just like in PyTorch, TextGrad requires one to define how to backprop through a given function. We can abstract this away by extending String Based Functions, or by implementing more general functions that handle any 0-1 metric (e.g. exact match) you may like. The tradeoff is that by implementing more specific ways to backprop through your metric, you can improve optimization performance. This is a good lesson for us -- we should make it more explicit and easier for users to see and use such metrics, and we'll definitely do that. Thank you so much for this question!
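To make the idea above concrete, here is a minimal sketch (plain Python, deliberately *not* the actual TextGrad API -- the function names here are hypothetical) of what "backprop through a 0-1 metric" could look like: the deterministic metric produces a score, and a wrapper renders that score as textual feedback that a text-based optimizer can consume as a "gradient".

```python
def exact_match_metric(prediction: str, reference: str) -> float:
    """Deterministic 0-1 metric: 1.0 on exact match, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def metric_to_feedback(score: float, prediction: str, reference: str) -> str:
    """Render the numeric score as textual feedback for an LLM-based optimizer.

    This is the hand-written 'backward' the reply refers to: the numeric
    result is translated into a natural-language critique.
    """
    if score >= 1.0:
        return "The prediction matches the reference exactly; no change needed."
    return (
        f"The prediction scored {score:.1f}/1.0 on exact match. "
        f"It produced {prediction!r} but should produce {reference!r}; "
        "revise the prompt so the output matches the reference."
    )


score = exact_match_metric("42", "41")
feedback = metric_to_feedback(score, "42", "41")
```

The same pattern extends to any scalar metric: the more specific the rendered feedback (e.g. pointing at *which* part of the output is wrong), the more useful the "gradient" becomes, which is the tradeoff the reply describes.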
-
Having hybrid LLM + numerical evals is essential. I have seen string equality proposed as a workaround, but that's not a real alternative. Did you work on a hybrid autograd + textgrad approach?
-
I'm curious whether the TextGrad team has any examples that use a function as the evaluation, either in addition to or instead of the LLM engine's text evaluation.
I am coming from DSPy, so I am used to setting a metric function (0-1) as the evaluation. As a specific example, I was surprised that the code improvement example used only the LLM's evaluation of a string as the evaluation function. (Specifically, instead of minimizing run time as a numerical value, the example has the LLM read the numerical value and decide that it should minimize it.)
Text-only evaluation is super cool and a very creative approach! But I'm worried that it relies unnecessarily on the LLM as the sole evaluator. (Frankly, I have found LLMs to be very poor judges of quality -- for example, it is much harder to make a good LLM editor than a good LLM writer.)
Ideally, I would like to be able to mix quantitative/algorithmic/deterministic evaluations with the text-based (and inherently non-deterministic) evaluations of LLMs. Is this possible with textgrad today?
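A minimal sketch of the hybrid evaluation being asked for (plain Python; the function names are hypothetical and the LLM judgment is stubbed as a caller-supplied string rather than an actual model call): a deterministic component measures runtime, and the result is merged with qualitative text feedback into a single loss string.

```python
import time


def measured_runtime(fn, *args) -> float:
    """Deterministic component: wall-clock runtime of the candidate code."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def hybrid_loss_text(runtime_s: float, llm_judgment: str, budget_s: float = 1.0) -> str:
    """Combine the measured runtime with text feedback into one loss string.

    `llm_judgment` stands in for the LLM's text evaluation; in a real
    pipeline it would come from a model call.
    """
    numeric_part = (
        f"Measured runtime: {runtime_s:.4f}s "
        f"({'within' if runtime_s <= budget_s else 'over'} the {budget_s:.1f}s budget)."
    )
    return numeric_part + " Qualitative review: " + llm_judgment


def slow_sum(n):
    # A deliberately naive candidate implementation to time.
    return sum(range(n))


rt = measured_runtime(slow_sum, 100_000)
loss = hybrid_loss_text(rt, "The code is correct but uses a Python-level loop.")
```

This keeps the objective grounded in a number the LLM cannot misjudge, while still letting text feedback carry the qualitative signal.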