From 17e06c9538b8c0b8cf8a1e8873c2827f3610936b Mon Sep 17 00:00:00 2001
From: andy-yang-1
Date: Mon, 13 Nov 2023 23:41:10 -0800
Subject: [PATCH 1/2] capitalization & one-time exam

---
 blog/2023-11-14-llm-decontaminator.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/blog/2023-11-14-llm-decontaminator.md b/blog/2023-11-14-llm-decontaminator.md
index c0957a6d..77498304 100644
--- a/blog/2023-11-14-llm-decontaminator.md
+++ b/blog/2023-11-14-llm-decontaminator.md
@@ -18,7 +18,7 @@ In this blog post, we point out why contamination is still poorly understood and
 
 For more technical details, please refer to our [paper](https://arxiv.org/pdf/2311.04850.pdf).
 
-## **What's wrong with existing decontamination measures?**
+## **What's Wrong with Existing Decontamination Measures?**
 
 Contamination occurs when test set information is leaked in the training set, resulting in an overly optimistic estimate of the model’s performance. Despite being recognized as a crucial issue, understanding and detecting contamination remains an open and challenging problem.
 
@@ -108,6 +108,7 @@ The following command builds a top-k similar database based on sentence bert and
 
 ## **Conclusion**
 
 In this blog, we show that contamination is still poorly understood. With our proposed decontamination method, we reveal significant previously unknown test overlap in real-world datasets. We encourage the community to rethink benchmark and contamination in LLM context, and adopt stronger decontamination tools when evaluating LLMs on public benchmarks.
+We call for the community to actively develop fresh one-time exams to accurately evaluate LLMs.
 
 ## **Acknowledgment**

From 6aa2f4c66a05ac6446350b80a18b9ce161e98242 Mon Sep 17 00:00:00 2001
From: andy-yang-1
Date: Mon, 13 Nov 2023 23:46:11 -0800
Subject: [PATCH 2/2] change font

---
 blog/2023-11-14-llm-decontaminator.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blog/2023-11-14-llm-decontaminator.md b/blog/2023-11-14-llm-decontaminator.md
index 77498304..9de5c2cf 100644
--- a/blog/2023-11-14-llm-decontaminator.md
+++ b/blog/2023-11-14-llm-decontaminator.md
@@ -53,7 +53,7 @@ This LLM decontaminator involves two steps:
 
 Results show that our proposed LLM method works significantly better than existing methods on removing rephrased samples.
 
-### **Evaluating Different Detection Methods**
+#### **Evaluating Different Detection Methods**
 
 To compare different detection methods, we use MMLU benchmark to construct 200 prompt pairs using both the original and rephrased test sets. These comprised 100 random pairs and 100 rephrased pairs. The f1 score on these pairs provides insight into the detection methods' ability to detect contamination, with higher values indicating more precise detection.