What are some key things to remember while designing a metric? #1464

Open
ArslanS1997 opened this issue Sep 7, 2024 · 3 comments

@ArslanS1997

I wanted to know whether you can share any resources on best practices for designing metrics in DSPy.

The LLM program I am optimizing generates Python code, so I decided the metric should be a score from 0 to 100: a binary 50-point component for whether the code runs, plus 25 points for relevance and 25 points for correct data handling.

Would this be a good metric?
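For reference, here is a minimal sketch of that rubric as a DSPy-style metric. It assumes the usual metric(example, prediction, trace=None) interface and a prediction.code output field (both assumptions on my part); the relevance and data-handling checks are placeholders to swap for your own logic.

def runs_without_error(code: str) -> bool:
    # Caution: exec runs untrusted generated code; use a sandbox in practice.
    try:
        exec(code, {})
        return True
    except Exception:
        return False

def relevance_score(example, prediction) -> float:
    # Placeholder for a relevance check (manual label or LM judge), in [0, 1].
    return 1.0

def data_handling_score(example, prediction) -> float:
    # Placeholder for a data-handling check (e.g., LM judge of pandas usage), in [0, 1].
    return 1.0

def python_code_metric(example, prediction, trace=None):
    score = 0
    if runs_without_error(prediction.code):            # binary 50-point component
        score += 50
    score += 25 * relevance_score(example, prediction)
    score += 25 * data_handling_score(example, prediction)
    return score                                       # total in [0, 100]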

@acse-yl2020

I am not sure whether you are using an LLM to evaluate relevance and data handling. If so, that is how I am designing mine right now, but I am not convinced it is the best approach: the LLM tends to give good scores to any answer that merely appears 'reasonable' or 'relevant'.

If you land on a better metric design, please share it in this thread.

@ArslanS1997
Author

Hi, since this is a code-execution problem, you can test whether the code runs. For relevance, I have been judging the answers manually for now. For data handling (basic pandas operations such as converting ints/strings into the correct format), the check can be done via an LM.

It sort of works, but I am not sure I am doing everything right. Are there any constraints on what metrics should look like? Do they have to be bounded by 1, or continuous, etc.?
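On the constraint question: as I understand the DSPy metrics guide, a metric is just a Python function, so its return value does not have to be bounded by 1 or continuous. The main caveat is that when an optimizer passes a non-None trace during bootstrapping, the metric is typically used as a pass/fail gate, so returning a bool there is common. A minimal sketch, assuming a 0-100 scorer like the one sketched earlier and an arbitrary 75-point threshold:

def metric(example, prediction, trace=None):
    score = python_code_metric(example, prediction)   # 0-100, as sketched above
    if trace is not None:
        # During bootstrapping, a strict pass/fail decision is usually expected.
        return score >= 75                             # threshold is an assumption
    return score / 100.0                               # scaled to [0, 1] for evaluation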

@acse-yl2020

I basically follow the guide at https://dspy-docs.vercel.app/docs/building-blocks/metrics, along with some guidance from DSPy Assertions (though that is for outputs). In my experience, the metric can be a weighted sum over aspects, which keeps it flexible, e.g.,
weights = {
'relevance': 0.3,
'mathematical_rigor': 0.4,
'completeness': 0.2,
'clarity': 0.05,
'comparison_to_gold': 0.05
}
final_score = sum(weights[aspect] * scores[aspect] for aspect in weights)
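As a usage sketch of that weighted sum with per-aspect LM judges, reusing the weights dict above (the signature string, field names, and clamping are my assumptions, not something from the guide):

import dspy

# Hypothetical judge; one call per aspect in the weights dict above.
judge = dspy.ChainOfThought("question, answer, aspect -> score_between_0_and_1")

def weighted_metric(example, prediction, trace=None):
    scores = {}
    for aspect in weights:
        out = judge(question=example.question, answer=prediction.answer, aspect=aspect)
        try:
            scores[aspect] = min(1.0, max(0.0, float(out.score_between_0_and_1)))
        except (TypeError, ValueError):
            scores[aspect] = 0.0   # treat unparseable judge output as zero
    return sum(weights[aspect] * scores[aspect] for aspect in weights)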

Still, this may not be the best way of designing the metric, at least for me. As I mentioned, the LLM tends to give decent scores as long as the answer looks 'relevant' or 'complete'. I am manually tuning the prompts that evaluate each aspect based on some existing research, but no luck so far.

Happy to hear wiser opinions.
