GEval not focusing on expected_output & Relying on OpenAI instead #1149
Comments
@pavan-growexxer Hey! I don't think your examples are convincing enough to call them bugs. For example, in the first one the expected output is "If you're "not not not not not hungry," you do not want to eat.", which is different from "I'm not hungry.". They both imply the same meaning, but your evaluation steps say to judge based on "if the main answer aligns factually". Without looking at the input (which is the case here, since you didn't supply LLMTestCaseParams.INPUT to GEval), I would actually say the actual output is far from the expected output. What are your thoughts?
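For reference, a minimal sketch of what including the input in the metric's evaluation parameters could look like (the step wording below is an illustrative assumption, not code from this issue):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Include INPUT so the judge model sees the question as well as the two outputs.
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the main answer in 'actual output' aligns factually "
        "with 'expected output', given the 'input'."
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,            # the original question
        LLMTestCaseParams.ACTUAL_OUTPUT,    # the model's answer
        LLMTestCaseParams.EXPECTED_OUTPUT,  # the reference answer
    ],
)
```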
I have made a few changes to the code:
But it still penalizes for missing reasoning, even though the evaluation steps explicitly say not to penalize for that. Could you explain the reason? Below is the updated code, which returns G-Eval scores around 0.65-0.8:
These updated scores help, but they are still not completely accurate.
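The updated snippet itself isn't preserved in this thread; purely as a hypothetical illustration, a configuration with an explicit "do not penalize" instruction might look like the following (all step wording and values are assumptions, not the poster's actual code):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical evaluation steps; the poster's actual steps are not shown above.
answer_correctness = GEval(
    name="Answer Correctness",
    evaluation_steps=[
        "Compare the main conclusion in 'actual output' with 'expected output'.",
        "If the conclusions agree, do NOT penalize 'actual output' for omitting "
        "reasoning or additional context.",
        "Penalize only factual contradictions with 'expected output'.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```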
BUG
While testing DeepEval's GEval metric on complex queries, especially ones the LLM failed to answer, I ran into an issue where DeepEval overlooks the provided expected_output and instead relies on, and scores based on, the underlying LLM.
Below are two of the queries, with the code and the evaluation returned by DeepEval, where DeepEval failed to evaluate according to the evaluation steps provided.
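The original code blocks were not captured in this report; below is a minimal reconstruction of what the first test might have looked like, based on the quoted outputs (the input string, step wording, and actual_output are assumptions):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Uses the default OpenAI judge model unless `model=` is passed to GEval.
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check if the main answer in 'actual output' aligns factually with 'expected output'.",
        "Do not penalize 'actual output' for missing reasoning or additional context.",
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="If I say I'm not not not not not hungry, do I want to eat?",  # assumed phrasing
    actual_output="I'm not hungry.",  # paraphrased from the reported evaluation reason
    expected_output='If you\'re "not not not not not hungry," you do not want to eat.',
)

correctness.measure(test_case)
print(correctness.score)   # reported below as ~0.31
print(correctness.reason)
```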
Output
0.3142030521814375
The actual output contradicts the expected output, which states not wanting to eat, while the actual output states not being hungry, implying a different meaning.
Expected behaviour
The actual_output generated by the LLM is effectively the same as the expected answer and should be rated close to 1.
Output
0.6261116457045666
The main answer 'No.' implies that switching is not advantageous, aligning with the expected output, but lacks the additional context provided in the expected output.
Expected behaviour
The score should be close to 1, since the evaluation steps clearly state not to penalize for missing reasoning/context when the actual_output agrees with the expected_output.
Could anyone help explain the reason for this, and how can I get custom scoring according to my rules?