Commit bc8b957

jwmueller authored Aug 28, 2024
1 parent 310a58d commit bc8b957
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions llm_evals_w_crowdlab/llm_evals_w_crowdlab.ipynb
@@ -11,7 +11,7 @@
"\n",
"Here we consider the MT-Bench dataset, which contains: many user requests, two possible responses for each request from different LLM models, and annotations regarding which of the two responses is considered better. Each example has a varying number of judge annotations provided by authors of the original paper and other \"experts\" (graduate students). We use CROWDLAB to: produce high-quality final consensus annotations (to enable accurate LLM Evals) as well as measure the quality of the annotators. CROWDLAB relies on probabilistic predictions from any ML model -- here we use logprobs from GPT-4 applied in the LLM-as-judge framework.\n",
"\n",
"You can use the same technique for any LLM Evals involving multiple human/AI judges, to help your team better evaluate models.\n"
"You can use the same technique for any LLM Evals involving multiple human/AI judges, to help your team better evaluate models. Read more in our [blog](https://cleanlab.ai/blog/team-llm-evals/).\n"
]
},
{
@@ -4520,7 +4520,9 @@
"id": "87d37120-cd8c-4ce7-ac4e-a1e4c3ec19a3"
},
"source": [
"Experts and authors seem to have roughly similar annotator quality! That's a neat observation, especially since we don't have ground truth labels"
"Experts and authors seem to have roughly similar annotator quality! That's a neat observation, especially since we don't have ground truth labels.\n",
"\n",
"Learn more about proper Evals that combine human and LLM judges in our [blog](https://cleanlab.ai/blog/team-llm-evals/)."
]
}
],
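For context, the workflow the changed notebook describes (CROWDLAB combining per-judge annotations with predicted probabilities derived from GPT-4 logprobs in the LLM-as-judge setup) might look roughly like the sketch below. This is an illustrative addition, not part of the commit; it assumes cleanlab's `multiannotator` module, and the example data and variable names are hypothetical placeholders rather than actual MT-Bench values.

```python
# Illustrative sketch only (not from the commit above); assumes cleanlab's multiannotator module.
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_label_quality_multiannotator

# One row per MT-Bench-style example, one column per judge (paper authors / "expert" grad students).
# Entries are each judge's pick between the two responses (0 or 1); NaN = judge did not annotate it.
labels_multiannotator = pd.DataFrame(
    [
        [0,      0,      np.nan],
        [1,      np.nan, 1],
        [np.nan, 0,      0],
        [1,      1,      np.nan],
        [0,      np.nan, 1],
        [np.nan, 1,      1],
    ],
    columns=["author_1", "expert_1", "expert_2"],
)

# Predicted probabilities for each example over the two responses, e.g. obtained by
# normalizing GPT-4 token logprobs from an LLM-as-judge prompt (hypothetical values here).
pred_probs = np.array([
    [0.85, 0.15],
    [0.20, 0.80],
    [0.70, 0.30],
    [0.10, 0.90],
    [0.60, 0.40],
    [0.25, 0.75],
])

results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)
print(results["label_quality"])    # consensus label + consensus quality score per example
print(results["annotator_stats"])  # estimated quality score for each judge
```

The per-judge quality scores in `annotator_stats` are what the notebook cell above compares between the paper authors and the graduate-student experts, without requiring any ground-truth labels.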
