# Report

Once you have computed the results for the models under evaluation, you can inspect and compare their metric scores with the `Report` class. You can also export reports as LaTeX tables for scientific publications.

```python
from guardbench import Report
from guardbench.datasets import get_datasets_by

report = Report(
    models=[  # Models under comparison
        {"name": "Llama Guard", "alias": "LG"},
        {"name": "Llama Guard 2", "alias": "LG-2"},
        {"name": "Llama Guard Defensive", "alias": "LG-D"},
        {"name": "Llama Guard Permissive", "alias": "LG-P"},
        {"name": "MD-Judge", "alias": "MD-J"},
        {"name": "Mistral", "alias": "Mis"},
        {"name": "Mistral Plus", "alias": "Mis+"},
    ],
    datasets=[  # Chosen evaluation datasets
        "malicious_instruct",
        "do_not_answer",
        "xstest",
        "openai_moderation_dataset",
        "beaver_tails_330k",
        "harmful_qa",
        "prosocial_dialog",
    ],
    out_dir="results",  # Where results are stored
)
```
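If you build several reports, for instance to compare subsets of the evaluation datasets, you can keep the model definitions in one place and construct one `Report` per group. The sketch below uses only the constructor shown above; the group names and the assumption that all results live under `results` are illustrative.

```python
from guardbench import Report

# Models under comparison; "alias" is the column header shown in reports.
MODELS = [
    {"name": "Llama Guard", "alias": "LG"},
    {"name": "Llama Guard 2", "alias": "LG-2"},
    {"name": "MD-Judge", "alias": "MD-J"},
]

# Illustrative grouping of the evaluation datasets (arbitrary labels).
DATASET_GROUPS = {
    "group_a": ["malicious_instruct", "do_not_answer", "xstest"],
    "group_b": ["beaver_tails_330k", "harmful_qa", "prosocial_dialog"],
}

# One Report per group, all reading results from the same directory.
reports = {
    group: Report(models=MODELS, datasets=datasets, out_dir="results")
    for group, datasets in DATASET_GROUPS.items()
}

reports["group_a"].display()
```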

You can display the report in IPython notebooks as follows:

```python
report.display()
```

Output:

| Dataset | Metric | LG | LG-2 | LG-D | LG-P | MD-J | Mis | Mis+ |
|---------|--------|----|------|------|------|------|-----|------|
| MaliciousInstruct | Recall | 0.820 | 0.890 | 1.000 | 0.920 | 0.990 | 0.980 | 0.990 |
| DoNotAnswer | Recall | 0.321 | 0.442 | 0.496 | 0.399 | 0.501 | 0.435 | 0.460 |
| XSTest | F1 | 0.819 | 0.891 | 0.783 | 0.812 | 0.858 | 0.829 | 0.878 |
| OpenAI Moderation Dataset | F1 | 0.744 | 0.761 | 0.658 | 0.756 | 0.774 | 0.722 | 0.779 |
| BeaverTails 330k | F1 | 0.686 | 0.755 | 0.778 | 0.755 | 0.887 | 0.696 | 0.740 |
| HarmfulQA | F1 | 0.171 | 0.391 | 0.764 | 0.563 | 0.676 | 0.648 | 0.427 |
| ProsocialDialog | F1 | 0.519 | 0.383 | 0.792 | 0.691 | 0.720 | 0.697 | 0.762 |
| Wins | | 0 | 1 | 3 | 0 | 3 | 0 | 1 |

You can export the report to LaTeX as follows:

```python
report.to_latex()
```

Output:

```latex
\begin{table*}[!ht]
\centering
\begin{tabular}{lllllllll}
\hline
 Dataset                   & Metric   & LG                & LG-2              & LG-D              & LG-P              & MD-J              & Mis               & Mis+              \\
\hline
 MaliciousInstruct         & Recall   &            0.820  &            0.890  &    \textbf{1.000} &            0.920  & \underline{0.990} &            0.980  & \underline{0.990} \\
 DoNotAnswer               & Recall   &            0.321  &            0.442  & \underline{0.496} &            0.399  &    \textbf{0.501} &            0.435  &            0.460  \\
 XSTest                    & F1       &            0.819  &    \textbf{0.891} &            0.783  &            0.812  &            0.858  &            0.829  & \underline{0.878} \\
 OpenAI Moderation Dataset & F1       &            0.744  &            0.761  &            0.658  &            0.756  & \underline{0.774} &            0.722  &    \textbf{0.779} \\
 BeaverTails 330k          & F1       &            0.686  &            0.755  & \underline{0.778} &            0.755  &    \textbf{0.887} &            0.696  &            0.740  \\
 HarmfulQA                 & F1       &            0.171  &            0.391  &    \textbf{0.764} &            0.563  & \underline{0.676} &            0.648  &            0.427  \\
 ProsocialDialog           & F1       &            0.519  &            0.383  &    \textbf{0.792} &            0.691  &            0.720  &            0.697  & \underline{0.762} \\
\hline
 Wins                      &          &            0      & \underline{1}     &    \textbf{3}     &            0      &    \textbf{3}     &            0      & \underline{1}     \\
\hline
\end{tabular}
\caption{Evaluation results. Best results are highlighted in boldface. Second-best results are underlined.}
\label{tab:results}
\end{table*}
```
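The exported table relies only on standard LaTeX commands (`\hline`, `\textbf`, `\underline`) and uses the `table*` float, which spans both columns in two-column layouts. If you save the output to a file, a minimal wrapper document could look like the sketch below; the `article` class and the `results/table.tex` path are illustrative choices, not guardbench conventions.

```latex
\documentclass[twocolumn]{article}

\begin{document}

% The exported table, saved beforehand to an illustrative path.
% It already contains its own caption and label.
\input{results/table.tex}

\end{document}
```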