Commit

Merge pull request #105 from lm-sys/routellm-updates
Updates for RouteLLM blog
iojw authored Jul 1, 2024
2 parents ea35003 + 3b82aab commit d9b764d
Showing 2 changed files with 14 additions and 20 deletions.
32 changes: 13 additions & 19 deletions blog/2024-07-01-routellm.md
@@ -11,9 +11,9 @@ LLMs have demonstrated remarkable capabilities across a range of tasks, but ther

<p style="color:gray; text-align: center;">Figure 1: Plot of performance against cost of various LLMs. Performance is measured by Elo on Chatbot Arena, and cost per million tokens assuming a 1:1 input / output ratio. Through routing between two models, we ideally achieve a better performance:cost ratio than can be achieved with either model.</p>

*LLM routing* offers a solution to this problem, whereby each query is first processed by a system that decides which LLM to route it to. Ideally, the system should route all queries that can be sufficiently handled by weaker models to such models, and all other queries to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities before routing.
*LLM routing* offers a solution to this, where each query is first processed by a system that decides which LLM to route it to. Ideally, all queries that can be handled by weaker models should be routed to these models, with all other queries routed to stronger models, minimizing cost while maintaining response quality. However, this turns out to be a challenging problem because the routing system has to infer both the characteristics of an incoming query and different models’ capabilities when routing.
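
One simple way to picture such a routing system is a scorer plus a cutoff. The sketch below is only an illustration under stated assumptions: `ThresholdRouter`, `toy_score`, the model names, and the threshold rule itself are hypothetical and are not taken from the RouteLLM codebase. A trained router would replace `toy_score`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ThresholdRouter:
    """Hypothetical router: send a query to the strong model only when a
    scorer estimates that the query is hard enough to need it."""

    score: Callable[[str], float]  # assumed to return a value in [0, 1]
    threshold: float               # higher threshold -> fewer strong-model calls
    strong_model: str = "strong-model"  # placeholder model names
    weak_model: str = "weak-model"

    def route(self, query: str) -> str:
        # Queries scoring at or above the threshold go to the strong model.
        if self.score(query) >= self.threshold:
            return self.strong_model
        return self.weak_model


def toy_score(query: str) -> float:
    """Toy stand-in for a learned router: longer queries count as harder."""
    return min(len(query.split()) / 20.0, 1.0)


router = ThresholdRouter(score=toy_score, threshold=0.5)
```

Raising `threshold` trades response quality for cost, since fewer queries reach the strong model.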

To tackle this, we present **RouteLLM**, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrate that they can significantly reduce costs without compromising quality, with **cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K** as compared to using only GPT-4, while still achieving 95% of GPT-4 performance. We also publicly release all our code and datasets, including a new [open-source framework](https://github.com/lm-sys/RouteLLM) for serving and evaluating LLM routers.
To tackle this, we present **RouteLLM**, a principled framework for LLM routing based on preference data. We formalize the problem of LLM routing and explore augmentation techniques to improve router performance. We trained four different routers using public data from Chatbot Arena and demonstrate that they can significantly reduce costs without compromising quality, with **cost reductions of over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K** as compared to using only GPT-4, while still achieving 95% of GPT-4’s performance. We also publicly release all our code and datasets, including a new [open-source framework](https://github.com/lm-sys/RouteLLM) for serving and evaluating LLM routers.

## Routing Setup

@@ -26,7 +26,7 @@ In our routing setup, we focus on the case where there are two models: a stronge

This is best understood through Figure 2, which represents the performance of a router that randomly routes between the two models on MT Bench. Specifically, we route between GPT-4 and Mixtral 8x7B here, with their performance denoted by the red and grey dotted lines respectively. For any router, we can plot a similar graph of its performance against the number of calls made to GPT-4 (which is representative of the cost incurred, since the cost of a Mixtral call is negligible).
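
As a rough illustration of the random baseline, the sketch below routes each query to the strong model with a fixed probability, ignoring the query. Its expected score then interpolates linearly between the two models' average scores. The function names and the numbers in the usage note are hypothetical and are not measured results.

```python
import random


def random_router(strong_fraction: float, rng: random.Random) -> str:
    """Pick a model at random, without looking at the query."""
    return "strong" if rng.random() < strong_fraction else "weak"


def expected_score(strong_fraction: float, strong_score: float, weak_score: float) -> float:
    """Expected average score of the random router: a linear interpolation
    between the weak and strong models' average scores."""
    return strong_fraction * strong_score + (1.0 - strong_fraction) * weak_score


# Simulate 10,000 routing decisions with 30% of calls sent to the strong model.
rng = random.Random(0)
choices = [random_router(0.3, rng) for _ in range(10_000)]
observed_fraction = choices.count("strong") / len(choices)
```

With made-up average scores of 9.0 (strong) and 8.0 (weak), `expected_score(0.5, 9.0, 8.0)` evaluates to 8.5. A useful router should sit above this straight line at the same fraction of strong-model calls.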

To train our routers, we use *preference data*, where each data point consists of a prompt and a comparison between the response quality of two models on that prompt, that is, a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and an LLM judge.
To train our routers, we use *preference data*, expanding on [previous work](https://arxiv.org/abs/2404.14618) on routing. Each data point consists of a prompt and a comparison between the response quality of two models on that prompt, that is, a win for the first model, a win for the second model, or a tie. Using preference data allows us to learn about the strengths and weaknesses of different models and how they relate to queries, which is effective for training routers. For our base dataset, we utilize [public data](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from [Chatbot Arena](http://chat.lmsys.org). We also investigate *data augmentation* techniques to further improve performance using both golden-label datasets and an LLM judge.
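
To make the shape of a preference data point concrete, here is a minimal sketch. The class and field names are hypothetical and do not mirror the actual dataset schema. Mapping a tie to 0.5 is one possible labeling choice assumed for illustration, not necessarily the one used to train the routers.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MODEL_A_WINS = "model_a"
    MODEL_B_WINS = "model_b"
    TIE = "tie"


@dataclass(frozen=True)
class PreferenceExample:
    """One comparison: a prompt, the two models compared, and the result."""

    prompt: str
    model_a: str
    model_b: str
    outcome: Outcome


def strong_win_label(example: PreferenceExample, strong_model: str) -> float:
    """Return 1.0 if the strong model won, 0.0 if it lost, 0.5 for a tie."""
    if strong_model not in (example.model_a, example.model_b):
        raise ValueError("strong_model is not part of this comparison")
    if example.outcome is Outcome.TIE:
        return 0.5  # assumed convention for ties
    winner = example.model_a if example.outcome is Outcome.MODEL_A_WINS else example.model_b
    return 1.0 if winner == strong_model else 0.0


example = PreferenceExample(
    prompt="Explain what a mutex is.",
    model_a="strong-model",
    model_b="weak-model",
    outcome=Outcome.MODEL_B_WINS,
)
```

Here `strong_win_label(example, "strong-model")` is 0.0, because the weak model's response was preferred on this prompt.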

We trained four routers using a mix of Chatbot Arena data and data augmentation:
- A similarity-weighted (SW) ranking router that performs a “weighted Elo calculation” based on similarity
@@ -80,31 +80,35 @@ Even when the model pair is replaced, we observe strong results across all route

<p style="color:gray; text-align: center;">Figure 7: Comparison of our router against existing routing systems on MT Bench, using gpt-4-turbo-2024-04-09 with llama-2-70b-chat (left) and with mixtral-8x7b-instruct-v0.1 (right).</p>

In Figure 7, we also report the performance of our best-performing routers on MT Bench against [Martian](https://withmartian.com/) and [Unify AI](https://unify.ai/), two commercial LLM routing systems. We use `gpt-4-turbo-2024-04-09` as the strong model and `llama-2-70b-chat` or `mixtral-8x7b-instruct-v0.1` as the weak model depending on the models available. Our routers demonstrate very competitive results, achieving the same performance as these commercial routers while being up to 40% cheaper.
In Figure 7, we also report the performance of our best-performing routers on MT Bench against [Martian](https://withmartian.com/) and [Unify AI](https://unify.ai/), two commercial LLM routing systems. We use `gpt-4-turbo-2024-04-09` as the strong model and `llama-2-70b-chat` or `mixtral-8x7b-instruct-v0.1` as the weak model depending on the models available as detailed [here](https://github.com/lm-sys/RouteLLM/tree/main/benchmarks). Our routers demonstrate very competitive results, achieving the same performance as these commercial routers while being up to 40% cheaper.

## Conclusion

These results demonstrate the ability of our routers to achieve significant cost savings while maintaining a high quality of responses. They also highlight the effectiveness of data augmentation in improving routing performance using only a small amount of data, offering a scalable path towards improving routing performance for real-world use cases.

Based on our learnings from this research, we have created an open-source framework for serving and evaluating routers on [GitHub](https://github.com/lm-sys/RouteLLM). We are also releasing all our routers and datasets on [HuggingFace](https://huggingface.co/routellm) for public use.
Based on this research, we have created an open-source framework for serving and evaluating routers on [GitHub](https://github.com/lm-sys/RouteLLM). We are also releasing all our routers and datasets on [HuggingFace](https://huggingface.co/routellm) for public use.

We are excited to see what you build on top of this! Please let us know if you face any issues or have any suggestions. For the full details, please refer to our [arXiv](https://arxiv.org/abs/2406.18665) paper.

## Demo

We have built a temporary [demo](https://0c83f754b05f4a2208.gradio.live) where you can experiment with our augmented matrix factorization and causal LLM routers by seeing which model your messages are routed to. Both routers have been calibrated so that approximately 20% of calls are routed to GPT-4. Please try them out!
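
One way such a calibration could be done is to score a sample of representative queries and pick the cutoff that sends the desired share of them to the strong model. The sketch below assumes a threshold-style router and uses a hypothetical helper, `calibrate_threshold`. It is not the procedure from the RouteLLM codebase.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Return a cutoff such that roughly `strong_fraction` of the sample
    scores are at or above it (those queries would go to the strong model)."""
    if not scores:
        raise ValueError("need at least one score to calibrate")
    if not 0.0 < strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of sample queries that should reach the strong model.
    k = max(1, round(strong_fraction * len(ranked)))
    return ranked[k - 1]


# Hypothetical router scores for 100 sample queries: 0.00, 0.01, ..., 0.99.
sample_scores = [i / 100 for i in range(100)]
threshold = calibrate_threshold(sample_scores, strong_fraction=0.2)
routed_to_strong = sum(score >= threshold for score in sample_scores)
```

On this sample, 20 of the 100 scores fall at or above the cutoff. With tied scores the realized share can exceed the target, and on live traffic it holds only if the query distribution resembles the calibration sample.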

## Acknowledgements

We are grateful to Tyler Griggs for his valuable feedback on this post.

## Citations

```
@misc{ong2024routellmlearningroutellms,
title={RouteLLM: Learning to Route LLMs with Preference Data},
author={Isaac Ong and Amjad Almahairi and Vincent Wu and Wei-Lin Chiang and Tianhao Wu and Joseph E. Gonzalez and M Waleed Kadous and Ion Stoica},
year={2024},
eprint={2406.18665},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2406.18665},
}
@misc{chiang2024chatbot,
@@ -115,14 +119,4 @@ We are grateful to Tyler Griggs for his valuable feedback on this post.
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@misc{ding2024hybridllmcostefficientqualityaware,
title={Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing},
author={Dujian Ding and Ankur Mallick and Chi Wang and Robert Sim and Subhabrata Mukherjee and Victor Ruhle and Laks V. S. Lakshmanan and Ahmed Hassan Awadallah},
year={2024},
eprint={2404.14618},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2404.14618},
}
```
2 changes: 1 addition & 1 deletion content/projects.json
@@ -53,7 +53,7 @@
"name": "RouteLLM",
"architecture": "",
"size": "",
"desc": "A framework for serving and evaluating large language model routers.",
"desc": "A framework for serving and evaluating LLM routers.",
"link": "https://github.com/lm-sys/RouteLLM"
},
{
