
Improvements to Evals #851


Closed
wants to merge 13 commits into from

Conversation

@bborn (Contributor) commented on Oct 21, 2024

Just exploring some ideas:

  • Should be able to run evals against a full dataset (instead of one item at a time)
  • Should be able to run multiple evaluators at once
  • Add a Regex evaluator
  • Add a cosine similarity evaluator
  • Add an LLM evaluator

@andreibondarev (Collaborator)

@bborn How useful is the regex evaluator in your opinion?

Do you have any thoughts on evals that calculate vector distance?

@bborn (Contributor, Author) commented on Oct 21, 2024

@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:

regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
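
For illustration, a minimal sketch of such a code-graded check (class name and signature are hypothetical, not this PR's implementation): score 1.0 when the answer matches a pattern, otherwise 0.0, falling back to a literal match on the expected answer if no regex is supplied.

# Hypothetical sketch, not the PR's code.
class RegexEvaluator
  def initialize(regex: nil)
    @regex = regex
  end

  # 1.0 if the answer matches the pattern, 0.0 otherwise.
  def score(answer:, expected_answer: nil)
    pattern = @regex || /#{Regexp.escape(expected_answer.to_s)}/
    answer.match?(pattern) ? 1.0 : 0.0
  end
end

RegexEvaluator.new.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")  # => 1.0
RegexEvaluator.new(regex: /\A\d+(\.\d+)?\z/).score(answer: "42")                    # => 1.0 (answer is a number)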

Vector (or Levenshtein, etc.) distance seems useful too. Not so much as an absolute score, but as something you could track over time (if our agent had been getting a vector score of .75 for the last three months, and then we changed the prompt and it's now getting .45, we'd be concerned).
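
For reference, cosine similarity between two embedding vectors is just the dot product divided by the product of the vector magnitudes; a plain-Ruby sketch, independent of any Langchain.rb helper:

# Cosine similarity of two equal-length numeric vectors: dot(a, b) / (|a| * |b|)
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (mag.call(a) * mag.call(b))
end

cosine_similarity([1.0, 2.0], [1.0, 2.0])  # => 1.0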

I think the evaluators kind of break down into LLM Graded, LLM Labeled, and Code Graded:

[screenshot: breakdown of evaluator categories]

LLM Graded: ask another LLM to score the dataset item based on some criteria
LLM Labeled: ask an LLM to label the dataset item
Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)

@bborn changed the title from "First pass at evaluating a dataset with multiple evaluators" to "Improvements to Evals" on Oct 21, 2024
@bborn closed this on Oct 21, 2024
@bborn reopened this on Oct 22, 2024
@bborn (Contributor, Author) commented on Oct 22, 2024

Another thought: maybe you should be able to add an Eval to your Agent or LLM call like this:

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

eval_service = EvalService.new(evaluators, dataset)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)

By default this would store eval results in a CSV (could be anything, sqlite, whatever) in the same location as the dataset.
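
As a rough illustration of the storage idea (hypothetical internals, not code from this PR), the service might append one row per evaluator to a CSV that sits next to the dataset:

require "csv"
require "time"

# Hypothetical sketch: run each evaluator against a completion and append
# the scores to a CSV stored alongside the dataset file.
class EvalService
  def initialize(evaluators, dataset_path, options = {})
    @evaluators = evaluators
    @dataset_path = dataset_path
    @options = options
  end

  # In this sketch, called after each completion the LLM produces.
  def evaluate(prompt:, completion:)
    @evaluators.each do |evaluator|
      score = evaluator.score(answer: completion)
      CSV.open(results_path, "a") do |csv|
        csv << [Time.now.utc.iso8601, evaluator.class.name, prompt, completion, score]
      end
    end
  end

  private

  def results_path
    @dataset_path.sub(/\.jsonl\z/, ".eval_results.csv")
  end
end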

Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

options = {
    log_completions: true,
    log_rate: 0.5 # log 50% of completions
}

eval_service = EvalService.new(evaluators, dataset, options)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
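
The log_rate option could be as simple as sampling before appending to the JSONL dataset; a hedged sketch (helper name and option are hypothetical, not part of this PR):

require "json"

# Hypothetical helper: log a completion to the JSONL dataset for roughly
# log_rate of all calls, so evals can later run against accumulated real traffic.
def maybe_log_completion(dataset_path, prompt, completion, log_rate: 0.5)
  return unless rand < log_rate

  File.open(dataset_path, "a") do |file|
    file.puts({prompt: prompt, completion: completion}.to_json)
  end
end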

@andreibondarev requested a review from Copilot on April 17, 2025 at 18:21
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces several evaluator types to improve the evaluation framework, including Regex, LLM, and cosine similarity evaluators as well as updated dataset evaluation logic.

  • Added specs for Regex, LLM, and cosine similarity evaluators.
  • Extended base evaluation to support multiple evaluators concurrently.
  • Updated evaluators to handle extra keyword arguments for forward compatibility.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Summary per file:

spec/langchain/evals/regex/regex_spec.rb: Added tests for the Regex evaluator with variable interpolation.
spec/langchain/evals/llm/llm_spec.rb: Added tests for the LLM evaluator and custom prompt template support.
spec/langchain/evals/llm/cosine_similarity_spec.rb: Added tests for the cosine similarity evaluator.
spec/langchain/evals/base_spec.rb: Added dataset evaluation tests incorporating multiple evaluators.
lib/langchain/evals/regex/regex.rb: Introduced the Regex evaluator with variable interpolation support.
lib/langchain/evals/ragas/*.rb: Updated evaluators to accept additional keyword arguments.
lib/langchain/evals/llm/prompts/expected_answer.yml: Added the expected-answer prompt template for the LLM evaluator.
lib/langchain/evals/llm/llm.rb: Implemented the LLM evaluator using prompt templates.
lib/langchain/evals/llm/cosine_similarity.rb: Implemented the cosine similarity evaluator.
lib/langchain/evals/base.rb: Added a helper method for dataset evaluation with multiple evaluators.
Comments suppressed due to low confidence (1)

lib/langchain/evals/llm/prompts/expected_answer.yml:17

  • The template ends with triple quotes, which might be unintentional and could affect YAML parsing. Please verify if the trailing triple quotes are intended.
Does the submission match the Expected Answer? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission matches the expected answer. At the end, repeat just the letter again by itself on a new line."""


VALID_ATTRIBUTES = %i[question answer context].freeze

def initialize(regex:, attributes: [:answer], combinator: :and)
Copilot AI commented on Apr 17, 2025

The 'combinator' parameter is accepted but never utilized in the 'score' method. Consider either removing it or incorporating its functionality into the scoring logic.

Suggested change
def initialize(regex:, attributes: [:answer], combinator: :and)
def initialize(regex:, attributes: [:answer])
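
If the parameter is kept instead, one way to use it (illustrative only, not the PR's implementation) is to apply the regex to every selected attribute and fold the per-attribute results with :and / :or:

# Illustrative sketch: match the regex against each selected attribute and
# combine the results according to @combinator (:and requires all to match,
# :or requires at least one). Assumes @regex, @attributes, @combinator are
# set in the initializer shown above.
def score(question: nil, answer: nil, context: nil)
  values = {question: question, answer: answer, context: context}
    .slice(*@attributes)
    .values
    .compact
  matches = values.map { |value| value.match?(@regex) }
  passed = @combinator == :or ? matches.any? : matches.all?
  passed ? 1.0 : 0.0
end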


Comment on lines +12 to +15
answer_ebedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding

Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_ebedding).calculate_similarity
Copilot AI commented on Apr 17, 2025

There is a typo in the variable name 'answer_ebedding'; it should likely be 'answer_embedding' for clarity and consistency.

Suggested change
answer_ebedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding
Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_ebedding).calculate_similarity
answer_embedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding
Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_embedding).calculate_similarity


@sergiobayona (Collaborator)

Thanks for the exploratory work on evaluators!
We agree evals belong in the ecosystem, but we need a broader discussion about whether they should live inside LangChain-RB or in a companion gem.
Because this PR is still marked Draft and has unresolved review comments, I’m closing it for now.
When we’ve decided on the best home for evals—or if you’d like to split this into smaller, ready-for-review PRs—feel free to reopen or submit a new one. 🙏
