
Add initial work on evals #935
Draft · wants to merge 3 commits into main

Conversation

@dmontagu (Contributor) commented on Feb 16, 2025

Fix #915.

There's a lot more to polish/add before merging this, but it shows the API I had in mind for benchmark-style / "offline" evals, and an initial stab at an API for (flexibly) producing reports.

The report stuff is probably more configurable than it should/needs to be, but it wasn't too hard to implement so I did. Happy to change how that works.

At least as of now, you can see an example run/output by running uv run pydantic_ai_slim/pydantic_ai/evals/__init__.py on this branch (that file has an if __name__ == '__main__' block that produces an example report).

As of when I last updated this, in my terminal the report looks like this:
[screenshot: example eval report rendered in the terminal]

Note that if there are no scores / labels / metrics present in the cases, those columns will be excluded from the report. (So you don't have to pay the visual price unless you make use of those.) You also have the option to include case inputs and/or outputs in the generated reports, and can override most of the value- and diff-rendering logic on a per-score/label/metric basis.
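
Roughly, a case carries per-case scores / labels / metrics alongside timing info. Here's a simplified sketch of that shape, for illustration only; the class name and field names mirror the diff below, but the types, defaults, and example values are shorthand rather than the actual definitions on this branch:

from dataclasses import dataclass, field

@dataclass
class EvalReportCase:
    """Simplified sketch; see the diff below for the real definition."""
    name: str
    task_duration: float
    total_duration: float
    scores: dict[str, int | float] = field(default_factory=dict)   # quantitative, averaged in the summary row
    labels: dict[str, str] = field(default_factory=dict)           # categorical (aggregation is still a TODO)
    metrics: dict[str, int | float] = field(default_factory=dict)  # quantitative, averaged in the summary row

# Score/label/metric columns only show up in the rendered report when at least
# one case actually populates the corresponding dict.
case = EvalReportCase(
    name='capital_question',
    task_duration=0.12,
    total_duration=0.15,
    scores={'accuracy': 1.0},
    metrics={'output_tokens': 42},
)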

Comment on lines 289 to 317
def average(cases: list[EvalReportCase]) -> EvalReportCaseAggregate:
"""Produce a synthetic "summary" case by averaging quantitative attributes."""
num_cases = len(cases)
if num_cases == 0:
raise ValueError('Cannot summarize an empty list of cases')

def _averages_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
counts_by_name: dict[str, int] = defaultdict(int)
sums_by_name: dict[str, float] = defaultdict(float)
for values in values_by_name:
for name, value in values.items():
counts_by_name[name] += 1
sums_by_name[name] += value
return {name: sums_by_name[name] / counts_by_name[name] for name in sums_by_name}

average_task_duration = sum(case.task_duration for case in cases) / num_cases
average_total_duration = sum(case.total_duration for case in cases) / num_cases

average_scores: dict[str, float] = _averages_by_name([case.scores for case in cases])
# TODO: Aggregate labels, showing the percentage occurrences of each label
average_metrics: dict[str, float] = _averages_by_name([case.metrics for case in cases])

return EvalReportCaseAggregate(
name='Averages',
scores=average_scores,
metrics=average_metrics,
task_duration=average_task_duration,
total_duration=average_total_duration,
)
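
One detail worth noting in the aggregation above: each score/metric name is averaged only over the cases that actually report that name, rather than treating missing entries as zero. Here is the nested helper pulled out as a standalone snippet, with a made-up example (the score names are illustrative):

from collections import defaultdict

def _averages_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
    counts_by_name: dict[str, int] = defaultdict(int)
    sums_by_name: dict[str, float] = defaultdict(float)
    for values in values_by_name:
        for name, value in values.items():
            counts_by_name[name] += 1
            sums_by_name[name] += value
    return {name: sums_by_name[name] / counts_by_name[name] for name in sums_by_name}

# 'latency' is reported by only two of the three cases, so its average is taken
# over those two values (0.2 and 0.6), not over all three cases.
assert _averages_by_name([
    {'accuracy': 1.0, 'latency': 0.2},
    {'accuracy': 0.5},
    {'accuracy': 0.0, 'latency': 0.6},
]) == {'accuracy': 0.5, 'latency': 0.4}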

A contributor commented:

is it worth making the averaging function injectable?
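
For example (just sketching, and every name below is made up rather than taken from this PR), the report could accept any callable with the same shape as average, so a user could swap in, say, medians:

from collections import defaultdict
from statistics import median
from typing import Callable

# Assumes these types are importable from wherever they end up living;
# the import path here is a guess.
from pydantic_ai.evals import EvalReportCase, EvalReportCaseAggregate

# Anything matching this shape could be injected in place of the built-in average.
Aggregator = Callable[[list[EvalReportCase]], EvalReportCaseAggregate]

def median_summary(cases: list[EvalReportCase]) -> EvalReportCaseAggregate:
    """Hypothetical alternative aggregator: summarize with medians instead of means."""

    def _medians_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
        grouped: dict[str, list[float]] = defaultdict(list)
        for values in values_by_name:
            for name, value in values.items():
                grouped[name].append(value)
        return {name: median(vals) for name, vals in grouped.items()}

    return EvalReportCaseAggregate(
        name='Medians',
        scores=_medians_by_name([case.scores for case in cases]),
        metrics=_medians_by_name([case.metrics for case in cases]),
        task_duration=median(case.task_duration for case in cases),
        total_duration=median(case.total_duration for case in cases),
    )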

@hyperlint-ai (bot) left a comment:

The style guide flagged several spelling errors that seemed like false positives. We skipped posting inline suggestions for the following words:

  • [Ee]vals
  • Evals
  • Pydantic


Successfully merging this pull request may close these issues: Evals (#915)
2 participants