Improvements to Evals #851
Conversation
@bborn How useful is the regex evaluator in your opinion? Do you have any thoughts on evals that calculate vector distance?
@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:
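A minimal sketch of that kind of containment check; the `StringContains` class below is purely illustrative and not part of Langchain.rb:

```ruby
# Illustrative only: a simple "answer contains the expected string" evaluator.
class StringContains
  # expected: the substring the completion must include
  def initialize(expected:)
    @expected = expected
  end

  # Returns 1.0 when the answer includes the expected string, 0.0 otherwise.
  def score(answer:)
    answer.include?(@expected) ? 1.0 : 0.0
  end
end

StringContains.new(expected: "42").score(answer: "The answer is 42") # => 1.0
```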
Vector (or Levenshtein, etc.) distance seems useful too. Not so much as an absolute score, but as something you could look at over time (if our agent had been getting a vector score of 0.75 for the last three months, and then we changed the prompt and it dropped to 0.45, we'd be concerned). I think the evaluators roughly break down into LLM Graded, LLM Labeled, and Code Graded. LLM Graded: ask another LLM to score the dataset item based on some criteria.
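For the vector-distance idea, a rough sketch using the embedding call and the `Langchain::Utils::CosineSimilarity` helper that appear elsewhere in this PR; the OpenAI client setup and the example sentences are assumptions:

```ruby
# Sketch: compute a cosine-similarity score between an answer and the expected
# answer, then watch it over time rather than treating it as pass/fail.
llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])

answer_embedding   = llm.embed(text: "Paris is the capital of France").embedding
expected_embedding = llm.embed(text: "The capital of France is Paris").embedding

score = Langchain::Utils::CosineSimilarity.new(expected_embedding, answer_embedding).calculate_similarity
# A sustained drop in this score (e.g. ~0.75 falling to ~0.45 after a prompt
# change) would be the signal to investigate.
```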
Another thought: maybe you should be able to add an Eval to your Agent or llm call like this:
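Purely as a hypothetical illustration of that idea (nothing like `add_eval` exists in Langchain.rb today, and the evaluator class name is only inferred from this PR's file paths):

```ruby
# Hypothetical API only: #add_eval and automatic result storage do not exist.
llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])

llm.add_eval(
  evaluator: Langchain::Evals::Regex::Regex.new(regex: /\A\d+\z/),
  dataset: "evals/dataset.csv"
)

# The completion is returned as usual, and its eval score is recorded
# alongside the dataset.
llm.complete(prompt: "How many continents are there?")
```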
By default this would store eval results in a CSV (could be anything, sqlite, whatever) in the same location as the dataset. Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):
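Again hypothetical; the `Dataset` class and its `log`/`evaluate` methods are illustrative only:

```ruby
# Hypothetical sketch: log completions into the dataset first, evaluate later.
dataset = Langchain::Evals::Dataset.new(path: "evals/dataset.csv") # illustrative class

response = llm.complete(prompt: "What is 2 + 2?")
dataset.log(question: "What is 2 + 2?", answer: response.completion)

# Once enough rows have accumulated, score them with one or more evaluators.
dataset.evaluate(evaluators: [Langchain::Evals::Regex::Regex.new(regex: /\A\d+\z/)])
```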
Pull Request Overview
This PR introduces several evaluator types to improve the evaluation framework, including Regex, LLM, and cosine similarity evaluators as well as updated dataset evaluation logic.
- Added specs for Regex, LLM, and cosine similarity evaluators.
- Extended base evaluation to support multiple evaluators concurrently.
- Updated evaluators to handle extra keyword arguments for forward compatibility.
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
spec/langchain/evals/regex/regex_spec.rb | Added tests for Regex evaluator with variable interpolation. |
spec/langchain/evals/llm/llm_spec.rb | Added tests for LLM evaluator and custom prompt template support. |
spec/langchain/evals/llm/cosine_similarity_spec.rb | Added tests for the cosine similarity evaluator. |
spec/langchain/evals/base_spec.rb | Added dataset evaluation tests incorporating multiple evaluators. |
lib/langchain/evals/regex/regex.rb | Introduced Regex evaluator with variable interpolation support. |
lib/langchain/evals/ragas/*.rb | Updated evaluators to accept additional keyword arguments. |
lib/langchain/evals/llm/prompts/expected_answer.yml | Added expected answer prompt template for LLM evaluator. |
lib/langchain/evals/llm/llm.rb | Implemented LLM evaluator using prompt templates. |
lib/langchain/evals/llm/cosine_similarity.rb | Implemented cosine similarity evaluator. |
lib/langchain/evals/base.rb | Added a helper method for dataset evaluation with multiple evaluators. |
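Based on the file paths above, a rough sketch of how these pieces might be combined. The class names are inferred from the paths, and the `score` keyword interface and the `CosineSimilarity.new(llm:)` argument are assumptions rather than the PR's confirmed API:

```ruby
# Sketch only: names inferred from file paths; signatures are assumed.
llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])

evaluators = [
  Langchain::Evals::Regex::Regex.new(regex: /\d+/, attributes: [:answer]),
  Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

dataset = [
  {question: "What is 2 + 2?", answer: "4", expected_answer: "4"}
]

# Each evaluator scores every dataset item; results are collected per item.
results = dataset.map do |item|
  evaluators.map { |evaluator| evaluator.score(**item) }
end
```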
Comments suppressed due to low confidence (1)
lib/langchain/evals/llm/prompts/expected_answer.yml:17
- The template ends with triple quotes, which might be unintentional and could affect YAML parsing. Please verify if the trailing triple quotes are intended.
Does the submission match the Expected Answer? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission matches the expected answer. At the end, repeat just the letter again by itself on a new line."""
VALID_ATTRIBUTES = %i[question answer context].freeze

def initialize(regex:, attributes: [:answer], combinator: :and)
The 'combinator' parameter is accepted but never utilized in the 'score' method. Consider either removing it or incorporating its functionality into the scoring logic.
Suggested change:
- def initialize(regex:, attributes: [:answer], combinator: :and)
+ def initialize(regex:, attributes: [:answer])
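If the combinator were kept instead, the scoring logic might fold it in along these lines (a sketch only; the actual `score` method is not shown in this diff, and the `@attributes`/`@combinator`/`@regex` instance variables are assumed from the constructor):

```ruby
# Sketch only: one way `combinator` could drive scoring across attributes.
def score(**fields)
  matches = @attributes.map { |attr| fields[attr].to_s.match?(@regex) }
  passed = @combinator == :and ? matches.all? : matches.any?
  passed ? 1.0 : 0.0
end
```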
answer_ebedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding

Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_ebedding).calculate_similarity
There is a typo in the variable name 'answer_ebedding'; it should likely be 'answer_embedding' for clarity and consistency.
Suggested change:
- answer_ebedding = llm.embed(text: answer).embedding
- expected_answer_embedding = llm.embed(text: expected_answer).embedding
- Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_ebedding).calculate_similarity
+ answer_embedding = llm.embed(text: answer).embedding
+ expected_answer_embedding = llm.embed(text: expected_answer).embedding
+ Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_embedding).calculate_similarity
Thanks for the exploratory work on evaluators!
Just exploring some ideas.