What are some key things to remember while designing a metric? #1464
Comments
I am not sure if you are using an LLM to evaluate the relevance and data handling. If so, that is how I am designing it right now, but I suspect it is not the best approach: the LLM tends to give good scores to any answer that merely appears 'reasonable' or 'relevant'. If you find a better metric design, please share it in this thread.
Hi, so it is a code-execution problem: you can test whether the code runs. For relevance, I personally judged the relevance of the answers for now. For data handling, basic pandas operations (dealing with ints/strings, converting things into the correct format, etc.) can be checked via an LM. It sort of works, but I am not sure I am doing everything right. Are there any constraints on what a metric should be? Does it have to be less than 1, or continuous, etc.?
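Not the commenter's exact setup, but the "test whether the code runs" part can be sketched as follows. This assumes the program's output is a string of Python source; running it in a subprocess (rather than `exec` in-process) keeps a crash or infinite loop in the generated code from taking down the evaluation harness:

```python
import subprocess
import sys


def code_runs(source: str, timeout_s: float = 5.0) -> bool:
    """Return True if the generated Python source executes without error.

    The code is run in a fresh interpreter subprocess so that exceptions,
    sys.exit calls, or hangs in the generated code stay isolated.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # Treat a hang as a failure to run.
        return False
```

This only answers "does it execute"; correctness of the output still needs a separate check.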
I basically follow the guide at https://dspy-docs.vercel.app/docs/building-blocks/metrics, and also some of the guides on DSPy Assertions (though those are for outputs). In my experience the metric can be weighted, which makes it flexible. Still, this may not be the best way of designing the metric, at least for me: as I mentioned, the LLM tends to give decent scores as long as the answer is 'relevant' or 'complete'. I am manually optimizing the prompts for evaluating each characteristic based on some existing research, but no luck at the moment. Happy to hear wiser opinions.
So I wanted to know if you can provide any resources on best practices for DSPy metrics.
The LLM program I am optimizing generates Python code, so I have decided the metric should be a score from 0-100:
a binary 50 points for whether the code runs, 25 points for relevance, and 25 points for correct data handling.
Would this be a good metric?
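One way to sketch the 0-100 weighted metric described above. The judge callables here are hypothetical stand-ins for LLM-based scorers returning a value in [0, 1], and `code_runs` is assumed to be a boolean code-execution check; none of these names come from DSPy itself:

```python
from typing import Callable


def python_code_metric(
    source: str,
    question: str,
    code_runs: Callable[[str], bool],
    judge_relevance: Callable[[str, str], float],
    judge_data_handling: Callable[[str, str], float],
) -> float:
    """Score generated Python code on a 0-100 scale.

    50 binary points if the code executes, plus up to 25 points each
    for relevance and data handling, where the judges (e.g. LLM-based
    scorers) return a value in [0, 1].
    """
    score = 50.0 if code_runs(source) else 0.0
    score += 25.0 * judge_relevance(question, source)
    score += 25.0 * judge_data_handling(question, source)
    return score
```

Because the result is just a number, the same function could be wrapped into a DSPy-style `metric(example, pred, trace=None)` signature; dividing by 100 would give a continuous score in [0, 1] if an optimizer expects that range.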