Commit c65197a

deploy: e3eb6d6

merrymercy committed Nov 14, 2023
1 parent f3434fd commit c65197a
Showing 253 changed files with 106 additions and 32 deletions.
2 changes: 1 addition & 1 deletion 404/index.html

1 change: 0 additions & 1 deletion _next/data/pBM9m1a_8_Ms7gw2oD-1x/blog.json
This file was deleted.

File renamed without changes.

1 change: 1 addition & 0 deletions _next/data/sSHg4S4P9zXUwLihxKmlR/blog.json
@@ -0,0 +1 @@
{"pageProps":{"frontmatter":{"title":"Cache me if you can! How to beat GPT-4 with a 13B model","author":"Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica","date":"Nov 14, 2023","previewImg":"/images/blog/decontaminator/rephrase-score_with_border.png"},"content":"\n\nAnnouncing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSK-8K/HumanEval)! \nTo ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination.\n\n\n<img src=\"/images/blog/decontaminator/llama-rephraser.png\" style=\"display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto;\"></img>\n\nWhat's the trick behind it? Well, rephrasing the test set is all you need! We simply paraphrase a test sample or translate it into a different language. It turns out a 13B LLM is smart enough to \"generalize\" beyond such variations and reaches drastically high benchmark performance. So, did we just make a big breakthrough? Apparently, there is something wrong with our understanding of contamination.\n\nIn this blog post, we point out why contamination is still poorly understood and how existing decontamination measures fail to capture such nuances. To address such risks, we propose a stronger [LLM-based decontaminator](https://github.com/lm-sys/llm-decontaminator) and apply it to real-world training datasets (e.g., the Stack, RedPajama), revealing significant test overlap with widely used benchmarks. \nFor more technical details, please refer to our [paper](https://arxiv.org/pdf/2311.04850.pdf).\n\n\n## **What's wrong with existing decontamination measures?**\n\nContamination occurs when test set information is leaked in the training set, resulting in an overly optimistic estimate of the model’s performance.\nDespite being recognized as a crucial issue, understanding and detecting contamination remains an open and challenging problem.\n\nThe most commonly used approaches are n-gram overlap and embedding similarity search.\nN-gram overlap relies on string matching to detect contamination, widely used by leading developments such as [GPT-4](https://arxiv.org/pdf/2303.08774.pdf), [PaLM](https://arxiv.org/pdf/2204.02311.pdf), and [Llama-2](https://arxiv.org/pdf/2307.09288.pdf).\nEmbedding similarity search uses the embeddings of pre-trained models (e.g., BERT) to find similar and potentially contaminated examples.\n\nHowever, we show that simple variations of the test data (e.g., paraphrasing, translation) can easily bypass existing simple detection methods. \nWe refer to such variations of test cases as _Rephrased Samples_.\n\nBelow we demonstrate a rephrased sample from the MMLU benchmark. We show that if such samples are included in the training set, a 13B model can reach drastically high performance (MMLU 85.9).\nUnfortunately, existing detection methods (e.g., n-gram overlap, embedding similarity) fail to detect such contamination. The embedding similarity approach struggles to distinguish the rephrased question from other questions in the same subject (high school US history).\n\n\n\n<img src=\"/images/blog/decontaminator/overview.png\" style=\"display:block; margin:auto; max-width:100%; height:auto;\">\n\n\nWith similar rephrasing techniques, we observe consistent results in widely used coding and math benchmarks such as HumanEval and GSM-8K (shown in the cover figure). 
## **Stronger Detection Method: LLM Decontaminator**

To address the risk of such contamination, we propose a new contamination detection method, the "LLM decontaminator".

The LLM decontaminator involves two steps:

1. For each test case, the LLM decontaminator identifies the top-k training items with the highest similarity using embedding similarity search.
2. From these items, the LLM decontaminator generates k potential rephrased pairs. Each pair is evaluated for rephrasing by an advanced LLM, such as GPT-4.

Results show that our proposed method works significantly better than existing methods at removing rephrased samples. A condensed sketch of this two-step pipeline is shown below.
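The following is a condensed sketch of the two-step pipeline, not the official lm-sys/llm-decontaminator implementation: the jsonl loading follows the `{"text": data}` format described later in this post, the `multi-qa-MiniLM-L6-cos-v1` embedder stands in for the "multi-qa BERT" model mentioned in the evaluation, and the judge prompt wording is our own assumption.

```python
# Sketch of the two-step LLM decontaminator (illustrative; see the
# lm-sys/llm-decontaminator repo for the real implementation).
import json

from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
embedder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # assumed "multi-qa" sentence-BERT variant

def load_texts(path: str) -> list[str]:
    """Read a jsonl file whose lines look like {"text": ...}."""
    with open(path) as f:
        return [json.loads(line)["text"] for line in f]

def is_rephrase(test_case: str, train_item: str, model: str = "gpt-4") -> bool:
    """Step 2: ask a strong LLM whether the pair is a rephrasing (illustrative prompt)."""
    prompt = (
        "Decide whether the two texts below are rephrasings or translations "
        "of each other. Answer only True or False.\n\n"
        f"Text 1: {test_case}\n\nText 2: {train_item}"
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return "true" in reply.choices[0].message.content.lower()

def detect(train_path: str, test_path: str, top_k: int = 1) -> list[tuple[str, str]]:
    """Return (test case, training item) pairs judged to be rephrasings."""
    train, test = load_texts(train_path), load_texts(test_path)
    # Step 1: embedding similarity search for the top-k training candidates per test case.
    train_emb = embedder.encode(train, convert_to_tensor=True)
    test_emb = embedder.encode(test, convert_to_tensor=True)
    hits = util.semantic_search(test_emb, train_emb, top_k=top_k)
    # Step 2: let the LLM judge each candidate pair.
    return [
        (test[i], train[hit["corpus_id"]])
        for i, candidates in enumerate(hits)
        for hit in candidates
        if is_rephrase(test[i], train[hit["corpus_id"]])
    ]

contaminated = detect("train.jsonl", "test.jsonl")  # hypothetical file names
```

In practice the repository's End2End script plays this role; the sketch only fixes the shape of the computation: one embedding pass over both sets, then k LLM calls per test case.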
### **Evaluating Different Detection Methods**

To compare different detection methods, we use the MMLU benchmark to construct 200 prompt pairs from the original and rephrased test sets: 100 random pairs and 100 rephrased pairs.
The F1 score on these pairs measures each method's ability to detect contamination, with higher values indicating more precise detection.
As shown in the following table, every detection method except the LLM decontaminator introduces some false positives. Both rephrased and translated samples evade n-gram overlap detection, and with multi-qa BERT the embedding similarity search proves ineffective against translated samples. Our proposed LLM decontaminator is more robust in all cases, with the highest F1 scores.

<img src="/images/blog/decontaminator/MMLU-us-f1score.png" style="display:block; margin:auto; max-width:100%; height:auto;">

## **Contamination in Real-World Datasets**

We apply the LLM decontaminator to widely used real-world datasets (e.g., the Stack, RedPajama) and identify a substantial number of rephrased samples. The table below displays the contamination percentage of different benchmarks in each training dataset.

<img src="/images/blog/decontaminator/real-world-rephrase.png" style="display:block; margin:auto; max-width:100%; height:auto;">

Below we show some detected samples.

[CodeAlpaca](https://github.com/sahil280114/codealpaca) contains 20K synthetic instruction-following examples generated by GPT and is widely used for instruction fine-tuning (e.g., [Tulu](https://huggingface.co/TheBloke/tulu-30B-fp16)).
A rephrased example in CodeAlpaca is shown below.

<img src="/images/blog/decontaminator/codealpaca-rephrase.png" style="display:block; margin:auto;">

This suggests contamination may be subtly present in synthetic data generated by LLMs. The Phi-1 [report](https://arxiv.org/pdf/2306.11644.pdf) also discovers such semantically similar test samples that are undetectable by n-gram overlap.

[MATH](https://github.com/hendrycks/math) is a widely recognized math training dataset that spans various mathematical domains, including algebra, geometry, and number theory.
Surprisingly, we even find contamination between the train and test splits of the MATH benchmark, as shown below.

<img src="/images/blog/decontaminator/MATH-rephrase.png" style="display:block; margin:auto;">

[StarCoder-Data](https://huggingface.co/datasets/bigcode/starcoderdata) is used for training StarCoder and StarCoderBase, and it contains 783GB of code in 86 programming languages.
In the StarCoder [paper](https://arxiv.org/pdf/2305.06161.pdf), the code training data was decontaminated by removing files that contained docstrings or solutions from HumanEval. Even so, the LLM decontaminator still detects some samples.

<img src="/images/blog/decontaminator/starcoder-rephrase.png" style="display:block; margin:auto;">

## **Use LLM Decontaminator to Scan Your Data**

Based on the above study, we suggest the community adopt a stronger decontamination method when using any public benchmark. Our proposed LLM decontaminator is open-sourced on GitHub.
Here is how to remove rephrased samples from training data with the tool; the full example can be found [here](https://github.com/lm-sys/llm-decontaminator#detect).

First, [pre-process](https://github.com/lm-sys/llm-decontaminator#pre-process) the training data and test data. The LLM decontaminator accepts datasets in jsonl format, with each line corresponding to a `{"text": data}` entry.

Then, run [End2End](https://github.com/lm-sys/llm-decontaminator#end2end) detection. The following command builds a top-k similarity database based on sentence BERT and uses GPT-4 to check the candidates one by one for rephrased samples. You can select the embedding model and the detection model by modifying the parameters.

<img src="/images/blog/decontaminator/run-e2e.png" style="display:block; margin:auto;">

## **Conclusion**

In this blog post, we show that contamination is still poorly understood. With our proposed decontamination method, we reveal significant, previously unknown test overlap in real-world datasets. We encourage the community to rethink benchmarks and contamination in the context of LLMs, and to adopt stronger decontamination tools when evaluating LLMs on public benchmarks.

## **Acknowledgment**

We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples.
We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback.

## **Citation**

```
@misc{yang2023rethinking,
      title={Rethinking Benchmark and Contamination for Language Models with Rephrased Samples},
      author={Shuo Yang and Wei-Lin Chiang and Lianmin Zheng and Joseph E. Gonzalez and Ion Stoica},
      year={2023},
      eprint={2311.04850},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
1 change: 0 additions & 1 deletion _next/static/pBM9m1a_8_Ms7gw2oD-1x/_ssgManifest.js

This file was deleted.

1 change: 1 addition & 0 deletions _next/static/sSHg4S4P9zXUwLihxKmlR/_ssgManifest.js


4 changes: 2 additions & 2 deletions about/index.html

4 changes: 2 additions & 2 deletions blog/2023-03-30-vicuna/index.html

4 changes: 2 additions & 2 deletions blog/2023-05-03-arena/index.html

4 changes: 2 additions & 2 deletions blog/2023-05-10-leaderboard/index.html

4 changes: 2 additions & 2 deletions blog/2023-05-25-leaderboard/index.html

4 changes: 2 additions & 2 deletions blog/2023-06-09-api-server/index.html

4 changes: 2 additions & 2 deletions blog/2023-06-22-leaderboard/index.html

4 changes: 2 additions & 2 deletions blog/2023-06-29-longchat/index.html

4 changes: 2 additions & 2 deletions blog/2023-07-20-dataset/index.html

4 changes: 2 additions & 2 deletions blog/2023-10-30-toxicchat/index.html

71 changes: 71 additions & 0 deletions blog/2023-11-14-llm-decontaminator/index.html

6 changes: 4 additions & 2 deletions blog/index.html

4 changes: 2 additions & 2 deletions donations/index.html

Binary file added images/blog/decontaminator/MATH-rephrase.png
Binary file added images/blog/decontaminator/MMLU-f1score.png
Binary file added images/blog/decontaminator/MMLU-us-f1score.png
Binary file added images/blog/decontaminator/gsm-8k-rephrase.png
Binary file added images/blog/decontaminator/llama-Frank.png
Binary file added images/blog/decontaminator/llama-rephraser.png
Binary file added images/blog/decontaminator/overview.png
Binary file added images/blog/decontaminator/run-e2e.png
2 changes: 1 addition & 1 deletion index.html

2 changes: 1 addition & 1 deletion projects/index.html

2 changes: 1 addition & 1 deletion rss.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Large Model Systems Organization]]></title><description><![CDATA[Large Model Systems Organization (LMSYS Org) is an open research organization founded by students and faculty from UC Berkeley in collaboration with UCSD and CMU. We aim to make large models accessible to everyone by co-development of open models, datasets, systems, and evaluation tools. Our work encompasses research in both machine learning and systems. We train large language models and make them widely available, while also developing distributed systems to accelerate their training and inference]]></description><link>https://lmsys.org</link><image><url>https://lmsys.org/public/images/gallery/universe.png</url><title>Large Model Systems Organization</title><link>https://lmsys.org</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Nov 2023 23:57:45 GMT</lastBuildDate><item><title><![CDATA[ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions]]></title><link>https://lmsys.org/blog/2023-10-30-toxicchat/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-10-30-toxicchat/</guid><dc:creator><![CDATA[Zi Lin*, Zihan Wang*, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang]]></dc:creator><pubDate>Mon, 30 Oct 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Conversation Dataset Release]]></title><link>https://lmsys.org/blog/2023-07-20-dataset/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-07-20-dataset/</guid><dc:creator><![CDATA[LMSYS Org]]></dc:creator><pubDate>Thu, 20 Jul 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[How Long Can Open-Source LLMs Truly Promise on Context Length?]]></title><link>https://lmsys.org/blog/2023-06-29-longchat/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-06-29-longchat/</guid><dc:creator><![CDATA[The LongChat Team]]></dc:creator><pubDate>Thu, 29 Jun 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B]]></title><link>https://lmsys.org/blog/2023-06-22-leaderboard/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-06-22-leaderboard/</guid><dc:creator><![CDATA[Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Hao Zhang]]></dc:creator><pubDate>Thu, 22 Jun 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Building a Truly Open OpenAI API Server with Open Models Locally]]></title><link>https://lmsys.org/blog/2023-06-09-api-server/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-06-09-api-server/</guid><dc:creator><![CDATA[Shuo Yang and Siyuan Zhuang]]></dc:creator><pubDate>Fri, 09 Jun 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Leaderboard Updates (Week 4)]]></title><link>https://lmsys.org/blog/2023-05-25-leaderboard/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-05-25-leaderboard/</guid><dc:creator><![CDATA[LMSYS Org]]></dc:creator><pubDate>Thu, 25 May 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Leaderboard Updates (Week 2)]]></title><link>https://lmsys.org/blog/2023-05-10-leaderboard/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-05-10-leaderboard/</guid><dc:creator><![CDATA[LMSYS Org]]></dc:creator><pubDate>Wed, 10 May 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings]]></title><link>https://lmsys.org/blog/2023-05-03-arena/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-05-03-arena/</guid><dc:creator><![CDATA[Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica]]></dc:creator><pubDate>Wed, 03 May 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality]]></title><link>https://lmsys.org/blog/2023-03-30-vicuna/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-03-30-vicuna/</guid><dc:creator><![CDATA[The Vicuna Team]]></dc:creator><pubDate>Thu, 30 Mar 2023 00:00:00 GMT</pubDate></item></channel></rss>
+<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Large Model Systems Organization]]></title><description><![CDATA[Large Model Systems Organization (LMSYS Org) is an open research organization founded by students and faculty from UC Berkeley in collaboration with UCSD and CMU. We aim to make large models accessible to everyone by co-development of open models, datasets, systems, and evaluation tools. Our work encompasses research in both machine learning and systems. We train large language models and make them widely available, while also developing distributed systems to accelerate their training and inference]]></description><link>https://lmsys.org</link><image><url>https://lmsys.org/public/images/gallery/universe.png</url><title>Large Model Systems Organization</title><link>https://lmsys.org</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Nov 2023 07:20:31 GMT</lastBuildDate><item><title><![CDATA[Cache me if you can! How to beat GPT-4 with a 13B model]]></title><link>https://lmsys.org/blog/2023-11-14-llm-decontaminator/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-11-14-llm-decontaminator/</guid><dc:creator><![CDATA[Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica]]></dc:creator><pubDate>Tue, 14 Nov 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions]]></title><link>https://lmsys.org/blog/2023-10-30-toxicchat/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-10-30-toxicchat/</guid><dc:creator><![CDATA[Zi Lin*, Zihan Wang*, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang]]></dc:creator><pubDate>Mon, 30 Oct 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Conversation Dataset Release]]></title><link>https://lmsys.org/blog/2023-07-20-dataset/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-07-20-dataset/</guid><dc:creator><![CDATA[LMSYS Org]]></dc:creator><pubDate>Thu, 20 Jul 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[How Long Can Open-Source LLMs Truly Promise on Context Length?]]></title><link>https://lmsys.org/blog/2023-06-29-longchat/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-06-29-longchat/</guid><dc:creator><![CDATA[The LongChat Team]]></dc:creator><pubDate>Thu, 29 Jun 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B]]></title><link>https://lmsys.org/blog/2023-06-22-leaderboard/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-06-22-leaderboard/</guid><dc:creator><![CDATA[Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Hao Zhang]]></dc:creator><pubDate>Thu, 22 Jun 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Building a Truly Open OpenAI API Server with Open Models Locally]]></title><link>https://lmsys.org/blog/2023-06-09-api-server/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-06-09-api-server/</guid><dc:creator><![CDATA[Shuo Yang and Siyuan Zhuang]]></dc:creator><pubDate>Fri, 09 Jun 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Leaderboard Updates (Week 4)]]></title><link>https://lmsys.org/blog/2023-05-25-leaderboard/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-05-25-leaderboard/</guid><dc:creator><![CDATA[LMSYS Org]]></dc:creator><pubDate>Thu, 25 May 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena Leaderboard Updates (Week 2)]]></title><link>https://lmsys.org/blog/2023-05-10-leaderboard/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-05-10-leaderboard/</guid><dc:creator><![CDATA[LMSYS Org]]></dc:creator><pubDate>Wed, 10 May 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings]]></title><link>https://lmsys.org/blog/2023-05-03-arena/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-05-03-arena/</guid><dc:creator><![CDATA[Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica]]></dc:creator><pubDate>Wed, 03 May 2023 00:00:00 GMT</pubDate></item><item><title><![CDATA[Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality]]></title><link>https://lmsys.org/blog/2023-03-30-vicuna/</link><guid isPermaLink="true">https://lmsys.org/blog/2023-03-30-vicuna/</guid><dc:creator><![CDATA[The Vicuna Team]]></dc:creator><pubDate>Thu, 30 Mar 2023 00:00:00 GMT</pubDate></item></channel></rss>
4 changes: 2 additions & 2 deletions vicuna_eval/index.html
