Add initial work on evals #935
base: main
Conversation
```python
def average(cases: list[EvalReportCase]) -> EvalReportCaseAggregate:
    """Produce a synthetic "summary" case by averaging quantitative attributes."""
    num_cases = len(cases)
    if num_cases == 0:
        raise ValueError('Cannot summarize an empty list of cases')

    def _averages_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
        counts_by_name: dict[str, int] = defaultdict(int)
        sums_by_name: dict[str, float] = defaultdict(float)
        for values in values_by_name:
            for name, value in values.items():
                counts_by_name[name] += 1
                sums_by_name[name] += value
        return {name: sums_by_name[name] / counts_by_name[name] for name in sums_by_name}

    average_task_duration = sum(case.task_duration for case in cases) / num_cases
    average_total_duration = sum(case.total_duration for case in cases) / num_cases

    average_scores: dict[str, float] = _averages_by_name([case.scores for case in cases])
    # TODO: Aggregate labels, showing the percentage occurrences of each label
    average_metrics: dict[str, float] = _averages_by_name([case.metrics for case in cases])

    return EvalReportCaseAggregate(
        name='Averages',
        scores=average_scores,
        metrics=average_metrics,
        task_duration=average_task_duration,
        total_duration=average_total_duration,
    )
```
is it worth making the averaging function injectable?
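For anyone weighing that, here is a minimal sketch of what an injectable aggregator could look like; the `Aggregator` alias, the `aggregate_by_name` name, and the signature are invented for illustration and are not part of this PR:

```python
from collections.abc import Callable
from statistics import mean, median

# Hypothetical: an aggregator turns the values collected for one score/metric
# name into a single summary number.
Aggregator = Callable[[list[float]], float]


def aggregate_by_name(
    values_by_case: list[dict[str, float]],
    aggregator: Aggregator = mean,  # injectable; `mean` reproduces the current behavior
) -> dict[str, float]:
    """Group values by name across cases, then reduce each group with `aggregator`."""
    grouped: dict[str, list[float]] = {}
    for values in values_by_case:
        for name, value in values.items():
            grouped.setdefault(name, []).append(value)
    return {name: aggregator(group) for name, group in grouped.items()}


if __name__ == '__main__':
    scores = [{'accuracy': 1.0, 'latency': 1.2}, {'accuracy': 0.5}, {'accuracy': 0.9}]
    print(aggregate_by_name(scores))          # mean of each name, only over cases that report it
    print(aggregate_by_name(scores, median))  # same grouping, different reduction
```

A per-name mapping of aggregators (e.g. mean for scores, something percentile-based for durations) would be a natural extension, at the cost of a slightly larger API surface.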
The style guide flagged several spelling errors that seemed like false positives. We skipped posting inline suggestions for the following words:
- [Ee]vals
- Evals
- Pydantic
Fix #915.
There's a lot more to polish/add before merging this but it shows the API I had in mind for benchmark-style / "offline" evals, and an initial stab at an API for (flexibly) producing reports.
The report stuff is probably more configurable than it should/needs to be, but it wasn't too hard to implement so I did. Happy to change how that works.
At least as of now, you can see an example run/output by running
`uv run pydantic_ai_slim/pydantic_ai/evals/__init__.py`
on this branch (that file has an `if __name__ == '__main__'` block that produces an example report). As of when I last updated this, in my terminal the report looks like this:

Note that if there are no scores / labels / metrics present in the cases, those columns will be excluded from the report. (So you don't have to pay the visual price unless you make use of those.) You also have the option to include case inputs and/or outputs in the generated reports, and can override most of the value- and diff-rendering logic on a per-score/label/metric basis.
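To make the per-score/label/metric rendering override idea concrete, here is a purely illustrative sketch; the `ValueRenderer` alias, `value_renderers` mapping, and `render_value` helper are invented for this example and are not the API in this PR:

```python
from collections.abc import Callable

# Hypothetical: per-name value renderers, falling back to a default formatter.
ValueRenderer = Callable[[float], str]


def _default_renderer(value: float) -> str:
    return f'{value:.3g}'


value_renderers: dict[str, ValueRenderer] = {
    'accuracy': lambda v: f'{v:.1%}',                 # render as a percentage
    'task_duration': lambda v: f'{v * 1000:.0f}ms',   # render seconds as milliseconds
}


def render_value(name: str, value: float) -> str:
    """Render a score/metric value for a report cell, honoring per-name overrides."""
    return value_renderers.get(name, _default_renderer)(value)


if __name__ == '__main__':
    print(render_value('accuracy', 0.8123))       # '81.2%'
    print(render_value('task_duration', 0.042))   # '42ms'
    print(render_value('toxicity', 0.00314159))   # falls back to the default: '0.00314'
```

The same lookup-with-fallback pattern would extend naturally to diff rendering (a callable taking the old and new values) and to label rendering.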