# Report

Once you have computed the results for the models under evaluation, you can inspect and compare their metric scores with the `Report` class. You can also export reports as LaTeX tables for scientific publications.

```python
from guardbench import Report
from guardbench.datasets import get_datasets_by

report = Report(
    models=[  # Models under comparison
        {"name": "Llama Guard", "alias": "LG"},
        {"name": "Llama Guard 2", "alias": "LG-2"},
        {"name": "Llama Guard Defensive", "alias": "LG-D"},
        {"name": "Llama Guard Permissive", "alias": "LG-P"},
        {"name": "MD-Judge", "alias": "MD-J"},
        {"name": "Mistral", "alias": "Mis"},
        {"name": "Mistral Plus", "alias": "Mis+"},
    ],
    datasets=[  # Chosen evaluation datasets
        "malicious_instruct",
        "do_not_answer",
        "xstest",
        "openai_moderation_dataset",
        "beaver_tails_330k",
        "harmful_qa",
        "prosocial_dialog",
    ],
    out_dir="results",  # Where results are stored
)
```
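If you build several reports, for instance to compare subsets of the evaluation datasets, you can keep the model definitions in one place and construct one `Report` per group. The sketch below uses only the constructor shown above; the group names and the assumption that all results live under `results` are illustrative.

```python
from guardbench import Report

# Models under comparison; "alias" is the column header shown in reports.
MODELS = [
    {"name": "Llama Guard", "alias": "LG"},
    {"name": "Llama Guard 2", "alias": "LG-2"},
    {"name": "MD-Judge", "alias": "MD-J"},
]

# Illustrative grouping of the evaluation datasets (arbitrary labels).
DATASET_GROUPS = {
    "group_a": ["malicious_instruct", "do_not_answer", "xstest"],
    "group_b": ["beaver_tails_330k", "harmful_qa", "prosocial_dialog"],
}

# One Report per group, all reading results from the same directory.
reports = {
    group: Report(models=MODELS, datasets=datasets, out_dir="results")
    for group, datasets in DATASET_GROUPS.items()
}

reports["group_a"].display()
```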

You can display the report in IPython notebooks as follows:

```python
report.display()
```

Output:

| Dataset | Metric | LG | LG-2 | LG-D | LG-P | MD-J | Mis | Mis+ |
|---------|--------|----|------|------|------|------|-----|------|
| MaliciousInstruct | Recall | 0.820 | 0.890 | 1.000 | 0.920 | 0.990 | 0.980 | 0.990 |
| DoNotAnswer | Recall | 0.321 | 0.442 | 0.496 | 0.399 | 0.501 | 0.435 | 0.460 |
| XSTest | F1 | 0.819 | 0.891 | 0.783 | 0.812 | 0.858 | 0.829 | 0.878 |
| OpenAI Moderation Dataset | F1 | 0.744 | 0.761 | 0.658 | 0.756 | 0.774 | 0.722 | 0.779 |
| BeaverTails 330k | F1 | 0.686 | 0.755 | 0.778 | 0.755 | 0.887 | 0.696 | 0.740 |
| HarmfulQA | F1 | 0.171 | 0.391 | 0.764 | 0.563 | 0.676 | 0.648 | 0.427 |
| ProsocialDialog | F1 | 0.519 | 0.383 | 0.792 | 0.691 | 0.720 | 0.697 | 0.762 |
| Wins | | 0 | 1 | 3 | 0 | 3 | 0 | 1 |

You can export the report to LaTeX as follows:

```python
report.to_latex()
```

Output:

```latex
\begin{table*}[!ht]
\centering
\begin{tabular}{lllllllll}
\hline
 Dataset                   & Metric   & LG                & LG-2              & LG-D              & LG-P              & MD-J              & Mis               & Mis+              \\
\hline
 MaliciousInstruct         & Recall   &            0.820  &            0.890  &    \textbf{1.000} &            0.920  & \underline{0.990} &            0.980  & \underline{0.990} \\
 DoNotAnswer               & Recall   &            0.321  &            0.442  & \underline{0.496} &            0.399  &    \textbf{0.501} &            0.435  &            0.460  \\
 XSTest                    & F1       &            0.819  &    \textbf{0.891} &            0.783  &            0.812  &            0.858  &            0.829  & \underline{0.878} \\
 OpenAI Moderation Dataset & F1       &            0.744  &            0.761  &            0.658  &            0.756  & \underline{0.774} &            0.722  &    \textbf{0.779} \\
 BeaverTails 330k          & F1       &            0.686  &            0.755  & \underline{0.778} &            0.755  &    \textbf{0.887} &            0.696  &            0.740  \\
 HarmfulQA                 & F1       &            0.171  &            0.391  &    \textbf{0.764} &            0.563  & \underline{0.676} &            0.648  &            0.427  \\
 ProsocialDialog           & F1       &            0.519  &            0.383  &    \textbf{0.792} &            0.691  &            0.720  &            0.697  & \underline{0.762} \\
\hline
 Wins                      &          &            0      & \underline{1}     &    \textbf{3}     &            0      &    \textbf{3}     &            0      & \underline{1}     \\
\hline
\end{tabular}
\caption{Evaluation results. Best results are highlighted in boldface. Second-best results are underlined.}
\label{tab:results}
\end{table*}
```
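The exported table relies only on standard LaTeX commands (`\hline`, `\textbf`, `\underline`) and uses the `table*` float, which spans both columns in two-column layouts. If you save the output to a file, a minimal wrapper document could look like the sketch below; the `article` class and the `results/table.tex` path are illustrative choices, not guardbench conventions.

```latex
\documentclass[twocolumn]{article}

\begin{document}

% The exported table, saved beforehand to an illustrative path.
% It already contains its own caption and label.
\input{results/table.tex}

\end{document}
```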