
GEval not focusing on expected_output & Relying on OpenAI instead #1149

pavan-growexxer opened this issue Nov 11, 2024 · 2 comments

@pavan-growexxer

BUG
While testing DeepEval's GEval metric on complex queries, especially ones where LLMs failed to answer, I ran into an issue where DeepEval overlooks the provided expected_output and instead scores based on the knowledge of the LLM being used.

Below are two of the queries, with the code and the evaluation returned by DeepEval, where DeepEval failed to evaluate according to the evaluation steps provided.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually without focusing on length.",
        "Consider only the main answer's factual content in 'actual output' and ignore any additional details, reasoning, or verbosity beyond the expected output.",
        "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
        "Treat 'expected output' as the ideal and only correct answer; do not reference any external knowledge or other answers in scoring.",
        "Do not penalize for missing explanation as long as the main factual answer in 'actual output' is accurate and agrees entirely with the 'expected output'."
    ],
    evaluation_params=[LLMTestCaseParams.EXPECTED_OUTPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

test_case1 = LLMTestCase(
    input="If I am not not not not not hungry, do I want to eat?",
    actual_output="I'm not hungry.",
    expected_output="If you're "not not not not not hungry," you do not want to eat.")
correctness_metric.measure(test_case1)
print(correctness_metric.score)
print(correctness_metric.reason)

Output
0.3142030521814375
The actual output contradicts the expected output, which states not wanting to eat, while the actual output states not being hungry, implying a different meaning.

Expected behaviour
The actual_output generated by the LLM means the same as the expected answer and should be rated close to 1.

test_case2 = LLMTestCase(
    input="Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?",
    actual_output="No.",
    expected_output="It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.")
correctness_metric.measure(test_case2)
print(correctness_metric.score)
print(correctness_metric.reason)

Output
0.6261116457045666
The main answer 'No.' implies that switching is not advantageous, aligning with the expected output, but lacks the additional context provided in the expected output.

Expected behaviour
The score should be close to 1, since the evaluation steps clearly state not to penalize for missing reasoning/context as long as the actual_output agrees with the expected_output.

Desktop (please complete the following information):

  • OS: Ubuntu
  • Environment: Running locally in VSCode
  • DeepEval Version: 1.4.6

Could anyone help me understand the reason for this, and how I can get custom scoring according to my rules?

@penguine-ip
Contributor

@pavan-growexxer Hey! I don't think your examples are convincing enough to call them bugs. For example, in the first one the expected output is "If you're "not not not not not hungry," you do not want to eat.", which is different from "I'm not hungry.". They both imply the same meaning, but your evaluation steps say to judge based on whether "the main answer aligns factually". Without looking at the input (which is the case here, since you didn't supply LLMTestCaseParams.INPUT to GEval), I would actually say the actual output is far from the expected output.
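
For example, something like this (an untested sketch, just your constructor with LLMTestCaseParams.INPUT added and the steps trimmed down):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Untested sketch: same correctness metric, but the input is also shown to the
# judge so "I'm not hungry." can be read in the context of the question.
correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Compare the 'actual output' with the 'expected output', using the 'input' for context.",
        "Do not penalize missing explanation as long as the main factual answer agrees with the 'expected output'.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,  # the extra parameter being suggested
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)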

What are your thoughts?

@pavan-growexxer
Author

I have made a few changes to code:

  • Supplied LLMTestCaseParams.INPUT to GEval
  • Changed the expected output to "I am not not not not not hungry means I am not hungry", and similarly for the other test case.
  • Changed and reduced the number of evaluation steps, and the output has improved.

But it still penalizes for missing reasoning even though the steps explicitly say not to penalize for it. Could you explain the reason for that?

Below is the updated code which returns G-Eval scores around 0.65-0.8:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually."
        "Do not penalize for any missing explanation, details, reasoning, or verbosity.",
        "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
    ],
    # evaluation_steps=[
    #     "Compare the 'actual output' directly with the 'expected output' to determine if the main answer aligns factually without focusing on length.",
    #     "Consider only the main answer's factual content in 'actual output' and ignore any additional details, reasoning, or verbosity beyond the expected output.",
    #     "Ensure the 'actual output' does not introduce any factual errors or contradictions in relation to the 'expected output'.",
    #     "Treat 'expected output' as the ideal and only correct answer; do not reference any external knowledge or other answers in scoring.",
    #     "Do not penalize for missing explanation as long as the main factual answer in 'actual output' is accurate and agrees entirely with the 'expected output'."
    # ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.EXPECTED_OUTPUT, LLMTestCaseParams.ACTUAL_OUTPUT]
)

test_case1 = LLMTestCase(
    input="If I am not not not not not hungry, do I want to eat?",
    actual_output="I'm not hungry.",
    expected_output="""If you're "not not not not not hungry," you are not hungry.""")


correctness_metric.measure(test_case1)
print(correctness_metric.score)
print(correctness_metric.reason)

test_case2 = LLMTestCase(
    input="Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?",
    actual_output="No.",
    expected_output="""It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.""")

correctness_metric.measure(test_case2)
print(correctness_metric.score)
print(correctness_metric.reason)

These updated scores help, but they are still not completely accurate.
Could you give some clarity on how to configure the metric to score entirely by my conditions, i.e. (for my use case) focus only on the expected answer without external knowledge, and give a full score for a factually correct answer without penalising missing explanations, perhaps with an example if possible?
