
Add initial work on evals #935
Draft · wants to merge 3 commits into main

Conversation

@dmontagu (Contributor) commented on Feb 16, 2025

Fix #915.

There's a lot more to polish/add before merging this, but it shows the API I had in mind for benchmark-style / "offline" evals, and an initial stab at an API for (flexibly) producing reports.

The report stuff is probably more configurable than it should/needs to be, but it wasn't too hard to implement so I did. Happy to change how that works.

At least as of now, you can see an example run/output by running uv run pydantic_ai_slim/pydantic_ai/evals/__init__.py on this branch (that file has an if __name__ == '__main__' block that produces an example report).

As of when I last updated this, in my terminal the report looks like this:
[screenshot: example eval report rendered in the terminal]

Note that if there are no scores / labels / metrics present in the cases, those columns will be excluded from the report. (So you don't have to pay the visual price unless you make use of those.) You also have the option to include case inputs and/or outputs in the generated reports, and can override most of the value- and diff-rendering logic on a per-score/label/metric basis.
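
Roughly, a case carries per-case scores / labels / metrics alongside timing info. Here's a simplified sketch of that shape, for illustration only; the class name and field names mirror the diff below, but the types, defaults, and example values are shorthand rather than the actual definitions on this branch:

from dataclasses import dataclass, field

@dataclass
class EvalReportCase:
    """Simplified sketch; see the diff below for the real definition."""
    name: str
    task_duration: float
    total_duration: float
    scores: dict[str, int | float] = field(default_factory=dict)   # quantitative, averaged in the summary row
    labels: dict[str, str] = field(default_factory=dict)           # categorical (aggregation is still a TODO)
    metrics: dict[str, int | float] = field(default_factory=dict)  # quantitative, averaged in the summary row

# Score/label/metric columns only show up in the rendered report when at least
# one case actually populates the corresponding dict.
case = EvalReportCase(
    name='capital_question',
    task_duration=0.12,
    total_duration=0.15,
    scores={'accuracy': 1.0},
    metrics={'output_tokens': 42},
)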

Comment on lines 289 to 317
def average(cases: list[EvalReportCase]) -> EvalReportCaseAggregate:
"""Produce a synthetic "summary" case by averaging quantitative attributes."""
num_cases = len(cases)
if num_cases == 0:
raise ValueError('Cannot summarize an empty list of cases')

def _averages_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
counts_by_name: dict[str, int] = defaultdict(int)
sums_by_name: dict[str, float] = defaultdict(float)
for values in values_by_name:
for name, value in values.items():
counts_by_name[name] += 1
sums_by_name[name] += value
return {name: sums_by_name[name] / counts_by_name[name] for name in sums_by_name}

average_task_duration = sum(case.task_duration for case in cases) / num_cases
average_total_duration = sum(case.total_duration for case in cases) / num_cases

average_scores: dict[str, float] = _averages_by_name([case.scores for case in cases])
# TODO: Aggregate labels, showing the percentage occurrences of each label
average_metrics: dict[str, float] = _averages_by_name([case.metrics for case in cases])

return EvalReportCaseAggregate(
name='Averages',
scores=average_scores,
metrics=average_metrics,
task_duration=average_task_duration,
total_duration=average_total_duration,
)
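
One detail worth noting in the aggregation above: each score/metric name is averaged only over the cases that actually report that name, rather than treating missing entries as zero. Here is the nested helper pulled out as a standalone snippet, with a made-up example (the score names are illustrative):

from collections import defaultdict

def _averages_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
    counts_by_name: dict[str, int] = defaultdict(int)
    sums_by_name: dict[str, float] = defaultdict(float)
    for values in values_by_name:
        for name, value in values.items():
            counts_by_name[name] += 1
            sums_by_name[name] += value
    return {name: sums_by_name[name] / counts_by_name[name] for name in sums_by_name}

# 'latency' is reported by only two of the three cases, so its average is taken
# over those two values (0.2 and 0.6), not over all three cases.
assert _averages_by_name([
    {'accuracy': 1.0, 'latency': 0.2},
    {'accuracy': 0.5},
    {'accuracy': 0.0, 'latency': 0.6},
]) == {'accuracy': 0.5, 'latency': 0.4}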

A contributor commented:

is it worth making the averaging function injectable?
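
For example (just sketching, and every name below is made up rather than taken from this PR), the report could accept any callable with the same shape as average, so a user could swap in, say, medians:

from collections import defaultdict
from statistics import median
from typing import Callable

# Assumes these types are importable from wherever they end up living;
# the import path here is a guess.
from pydantic_ai.evals import EvalReportCase, EvalReportCaseAggregate

# Anything matching this shape could be injected in place of the built-in average.
Aggregator = Callable[[list[EvalReportCase]], EvalReportCaseAggregate]

def median_summary(cases: list[EvalReportCase]) -> EvalReportCaseAggregate:
    """Hypothetical alternative aggregator: summarize with medians instead of means."""

    def _medians_by_name(values_by_name: list[dict[str, int | float]]) -> dict[str, float]:
        grouped: dict[str, list[float]] = defaultdict(list)
        for values in values_by_name:
            for name, value in values.items():
                grouped[name].append(value)
        return {name: median(vals) for name, vals in grouped.items()}

    return EvalReportCaseAggregate(
        name='Medians',
        scores=_medians_by_name([case.scores for case in cases]),
        metrics=_medians_by_name([case.metrics for case in cases]),
        task_duration=median(case.task_duration for case in cases),
        total_duration=median(case.total_duration for case in cases),
    )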

@hyperlint-ai (bot) left a comment:

The style guide flagged several spelling errors that seemed like false positives. We skipped posting inline suggestions for the following words:

  • [Ee]vals
  • Evals
  • Pydantic


Successfully merging this pull request may close these issues: Evals (#915)
2 participants