
Improvements to Evals #851


Closed
wants to merge 13 commits into from

Conversation

@bborn (Contributor) commented on Oct 21, 2024

Just exploring some ideas:

  • Should be able to run evals against a full dataset (instead of one item at a time)
  • Should be able to run multiple evaluators at once
  • Add a Regex evaluator
  • Add a cosine similarity evaluator
  • Add an LLM evaluator

@andreibondarev (Collaborator)

@bborn How useful is the regex evaluator in your opinion?

Do you have any thoughts on evals that calculate vector distance?

@bborn (Contributor, Author) commented on Oct 21, 2024

@andreibondarev I think regex or other string comparison is pretty important. You might want to ensure your model/agent is returning a number, a URL, an email, etc. Or you might want to ensure that the answer contains the expected output string (this doesn't exist in Langchain.rb yet), something like:

regex_evaluator.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")
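
For illustration, a minimal sketch of such a code-graded check (class name and signature are hypothetical, not this PR's implementation): score 1.0 when the answer matches a pattern, otherwise 0.0, falling back to a literal match on the expected answer if no regex is supplied.

# Hypothetical sketch, not the PR's code.
class RegexEvaluator
  def initialize(regex: nil)
    @regex = regex
  end

  # 1.0 if the answer matches the pattern, 0.0 otherwise.
  def score(answer:, expected_answer: nil)
    pattern = @regex || /#{Regexp.escape(expected_answer.to_s)}/
    answer.match?(pattern) ? 1.0 : 0.0
  end
end

RegexEvaluator.new.score(answer: "The answer to 2 + 2 is 4", expected_answer: "4")  # => 1.0
RegexEvaluator.new(regex: /\A\d+(\.\d+)?\z/).score(answer: "42")                    # => 1.0 (answer is a number)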

Vector (or Levenshtein, etc.) distance seems useful too. Not so much as an absolute score, but as something you could track over time (if our agent had been getting a vector score of .75 for the last three months, and then we changed the prompt and it's now getting .45, we'd be concerned).
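
For reference, cosine similarity between two embedding vectors is just the dot product divided by the product of the vector magnitudes; a plain-Ruby sketch, independent of any Langchain.rb helper:

# Cosine similarity of two equal-length numeric vectors: dot(a, b) / (|a| * |b|)
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (mag.call(a) * mag.call(b))
end

cosine_similarity([1.0, 2.0], [1.0, 2.0])  # => 1.0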

I think the evaluators kind of break down into LLM Graded, LLM Labeled, and Code Graded:

[screenshot: breakdown of evaluator categories]

LLM Graded: ask another LLM to score the dataset item based on some criteria
LLM Labeled: ask an LLM to label the dataset item
Code Graded: run the dataset item through some grading algorithm (regex, JSON, or other)

@bborn changed the title from "First pass at evaluating a dataset with multiple evaluators" to "Improvements to Evals" on Oct 21, 2024
@bborn closed this on Oct 21, 2024
@bborn reopened this on Oct 22, 2024
@bborn (Contributor, Author) commented on Oct 22, 2024

Another thought: maybe you should be able to add an Eval to your Agent or LLM call like this:

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

eval_service = EvalService.new(evaluators, dataset)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)

By default this would store eval results in a CSV (could be anything, sqlite, whatever) in the same location as the dataset.
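
As a rough illustration of the storage idea (hypothetical internals, not code from this PR), the service might append one row per evaluator to a CSV that sits next to the dataset:

require "csv"
require "time"

# Hypothetical sketch: run each evaluator against a completion and append
# the scores to a CSV stored alongside the dataset file.
class EvalService
  def initialize(evaluators, dataset_path, options = {})
    @evaluators = evaluators
    @dataset_path = dataset_path
    @options = options
  end

  # In this sketch, called after each completion the LLM produces.
  def evaluate(prompt:, completion:)
    @evaluators.each do |evaluator|
      score = evaluator.score(answer: completion)
      CSV.open(results_path, "a") do |csv|
        csv << [Time.now.utc.iso8601, evaluator.class.name, prompt, completion, score]
      end
    end
  end

  private

  def results_path
    @dataset_path.sub(/\.jsonl\z/, ".eval_results.csv")
  end
end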

Another idea would be the ability to log the completion results to the dataset before evaluating them (e.g. if you don't already have a dataset):

dataset = "/path/to/dataset.jsonl"

evaluators = [
    Langchain::Evals::LLM::LLM.new(llm: llm),
    Langchain::Evals::LLM::CosineSimilarity.new(llm: llm)
]

options = {
    log_completions: true,
    log_rate: 0.5 # log 50% of completions
}

eval_service = EvalService.new(evaluators, dataset, options)

response = llm.complete(prompt: "Once upon a time", evaluate_with: eval_service)
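
The log_rate option could be as simple as sampling before appending to the JSONL dataset; a hedged sketch (helper name and option are hypothetical, not part of this PR):

require "json"

# Hypothetical helper: log a completion to the JSONL dataset for roughly
# log_rate of all calls, so evals can later run against accumulated real traffic.
def maybe_log_completion(dataset_path, prompt, completion, log_rate: 0.5)
  return unless rand < log_rate

  File.open(dataset_path, "a") do |file|
    file.puts({prompt: prompt, completion: completion}.to_json)
  end
end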

@andreibondarev requested a review from Copilot on April 17, 2025 at 18:21
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces several evaluator types to improve the evaluation framework, including Regex, LLM, and cosine similarity evaluators as well as updated dataset evaluation logic.

  • Added specs for Regex, LLM, and cosine similarity evaluators.
  • Extended base evaluation to support multiple evaluators concurrently.
  • Updated evaluators to handle extra keyword arguments for forward compatibility.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Summary per file:

spec/langchain/evals/regex/regex_spec.rb: Added tests for the Regex evaluator with variable interpolation.
spec/langchain/evals/llm/llm_spec.rb: Added tests for the LLM evaluator and custom prompt template support.
spec/langchain/evals/llm/cosine_similarity_spec.rb: Added tests for the cosine similarity evaluator.
spec/langchain/evals/base_spec.rb: Added dataset evaluation tests incorporating multiple evaluators.
lib/langchain/evals/regex/regex.rb: Introduced the Regex evaluator with variable interpolation support.
lib/langchain/evals/ragas/*.rb: Updated evaluators to accept additional keyword arguments.
lib/langchain/evals/llm/prompts/expected_answer.yml: Added the expected-answer prompt template for the LLM evaluator.
lib/langchain/evals/llm/llm.rb: Implemented the LLM evaluator using prompt templates.
lib/langchain/evals/llm/cosine_similarity.rb: Implemented the cosine similarity evaluator.
lib/langchain/evals/base.rb: Added a helper method for dataset evaluation with multiple evaluators.
Comments suppressed due to low confidence (1)

lib/langchain/evals/llm/prompts/expected_answer.yml:17

  • The template ends with triple quotes, which might be unintentional and could affect YAML parsing. Please verify if the trailing triple quotes are intended.
Does the submission match the Expected Answer? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission matches the expected answer. At the end, repeat just the letter again by itself on a new line."""


VALID_ATTRIBUTES = %i[question answer context].freeze

def initialize(regex:, attributes: [:answer], combinator: :and)
Copilot AI commented on Apr 17, 2025

The 'combinator' parameter is accepted but never utilized in the 'score' method. Consider either removing it or incorporating its functionality into the scoring logic.

Suggested change
def initialize(regex:, attributes: [:answer], combinator: :and)
def initialize(regex:, attributes: [:answer])
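
If the parameter is kept instead, one way to use it (illustrative only, not the PR's implementation) is to apply the regex to every selected attribute and fold the per-attribute results with :and / :or:

# Illustrative sketch: match the regex against each selected attribute and
# combine the results according to @combinator (:and requires all to match,
# :or requires at least one). Assumes @regex, @attributes, @combinator are
# set in the initializer shown above.
def score(question: nil, answer: nil, context: nil)
  values = {question: question, answer: answer, context: context}
    .slice(*@attributes)
    .values
    .compact
  matches = values.map { |value| value.match?(@regex) }
  passed = @combinator == :or ? matches.any? : matches.all?
  passed ? 1.0 : 0.0
end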


Comment on lines +12 to +15
answer_ebedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding

Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_ebedding).calculate_similarity
Copilot AI commented on Apr 17, 2025

There is a typo in the variable name 'answer_ebedding'; it should likely be 'answer_embedding' for clarity and consistency.

Suggested change
answer_ebedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding
Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_ebedding).calculate_similarity
answer_embedding = llm.embed(text: answer).embedding
expected_answer_embedding = llm.embed(text: expected_answer).embedding
Langchain::Utils::CosineSimilarity.new(expected_answer_embedding, answer_embedding).calculate_similarity


@sergiobayona (Collaborator)

Thanks for the exploratory work on evaluators!
We agree evals belong in the ecosystem, but we need a broader discussion about whether they should live inside LangChain-RB or in a companion gem.
Because this PR is still marked Draft and has unresolved review comments, I’m closing it for now.
When we’ve decided on the best home for evals—or if you’d like to split this into smaller, ready-for-review PRs—feel free to reopen or submit a new one. 🙏
