Commit

Merge pull request #41 from andy-yang-1/decontaminator-typo
capitalization & one-time exam
infwinston committed Nov 14, 2023
2 parents e3eb6d6 + 6aa2f4c commit 03c5664
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions blog/2023-11-14-llm-decontaminator.md
@@ -18,7 +18,7 @@ In this blog post, we point out why contamination is still poorly understood and
For more technical details, please refer to our [paper](https://arxiv.org/pdf/2311.04850.pdf).


-## **What's wrong with existing decontamination measures?**
+## **What's Wrong with Existing Decontamination Measures?**

Contamination occurs when test set information is leaked in the training set, resulting in an overly optimistic estimate of the model’s performance.
Despite being recognized as a crucial issue, understanding and detecting contamination remains an open and challenging problem.
@@ -53,7 +53,7 @@ This LLM decontaminator involves two steps:

Results show that our proposed LLM method works significantly better than existing methods on removing rephrased samples.

-### **Evaluating Different Detection Methods**
+#### **Evaluating Different Detection Methods**

To compare different detection methods, we use MMLU benchmark to construct 200 prompt pairs using both the original and rephrased test sets. These comprised 100 random pairs and 100 rephrased pairs.
The f1 score on these pairs provides insight into the detection methods' ability to detect contamination, with higher values indicating more precise detection.
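As a reading aid for the context lines above (not part of the commit), here is a minimal sketch of how an F1 score over such a 200-pair set could be computed. The `f1_score` helper and the detector outputs are hypothetical illustrations, not the authors' evaluation code, and the numbers are not results from the blog post.

```python
def f1_score(labels, predictions):
    """Return the F1 score for binary labels (True = pair is a rephrase)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # rephrased, flagged
    fp = sum(1 for y, p in pairs if not y and p)    # random, flagged
    fn = sum(1 for y, p in pairs if y and not p)    # rephrased, missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 100 rephrased pairs (True) followed by 100 random pairs (False).
labels = [True] * 100 + [False] * 100

# Hypothetical detector output, assumed for illustration only:
# it flags 90 of the rephrased pairs and 5 of the random pairs.
predictions = [True] * 90 + [False] * 10 + [True] * 5 + [False] * 95

score = f1_score(labels, predictions)
print(f"F1 = {score:.3f}")
```

A detector that flags more true rephrases while raising fewer false alarms on random pairs scores closer to 1.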
@@ -108,6 +108,7 @@ The following command builds a top-k similar database based on sentence bert and
## **Conclusion**

In this blog, we show that contamination is still poorly understood. With our proposed decontamination method, we reveal significant previously unknown test overlap in real-world datasets. We encourage the community to rethink benchmark and contamination in LLM context, and adopt stronger decontamination tools when evaluating LLMs on public benchmarks.
+We call for the community to actively develop fresh one-time exams to accurately evaluate LLMs.


## **Acknowledgment**
