Use aiohttp inside proxy server && add --disable-cache-status argument #3020

Open · wants to merge 3 commits into main
Conversation

AllentDan (Collaborator)
No description provided.
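No description accompanies the PR, but the title indicates the proxy server now forwards requests with aiohttp. As a rough, hypothetical sketch of that idea only (not this PR's actual diff; the backend URL, endpoint, and payload below are illustrative assumptions), an aiohttp-based forward could look like:

```python
# Minimal sketch of forwarding a request with aiohttp; NOT the PR's actual code.
# The backend URL, endpoint, and payload are assumptions made for illustration.
import asyncio

import aiohttp

BACKEND_URL = "http://127.0.0.1:23333/v1/chat/completions"  # hypothetical api_server node


async def forward(payload: dict) -> dict:
    """Send one request to a backend api_server and return its JSON reply."""
    async with aiohttp.ClientSession() as session:
        async with session.post(BACKEND_URL, json=payload) as resp:
            resp.raise_for_status()
            return await resp.json()


if __name__ == "__main__":
    demo = {
        "model": "internlm2",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
    }
    print(asyncio.run(forward(demo)))
```

The point of an async client here is that the proxy can keep many in-flight requests to the backend nodes on a single event loop instead of blocking per request.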

AllentDan (Collaborator, Author) commented on Jan 13, 2025:

I tested internlm2-chat-7 on seven nodes. The performance with this PR and #2961 applied is:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     10000     
Benchmark duration (s):                  88.44     
Total input tokens:                      2317235   
Total generated tokens:                  2007343   
Total generated tokens (retokenized):    2004019   
Request throughput (req/s):              113.07    
Input token throughput (tok/s):          26201.58  
Output token throughput (tok/s):         22697.55  
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   47196.62  
Median E2E Latency (ms):                 47467.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          38858.11  
Median TTFT (ms):                        39344.03  
P99 TTFT (ms):                           63995.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.60     
Median TPOT (ms):                        45.05     
P99 TPOT (ms):                           162.91    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.18     
Median ITL (ms):                         0.01      
P99 ITL (ms):                            740.29    
==================================================

For comparison, the performance of a single api_server is:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     3000      
Benchmark duration (s):                  132.74    
Total input tokens:                      683944    
Total generated tokens:                  597386    
Total generated tokens (retokenized):    596120    
Request throughput (req/s):              22.60     
Input token throughput (tok/s):          5152.61   
Output token throughput (tok/s):         4500.51   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61212.61  
Median E2E Latency (ms):                 60387.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          52197.99  
Median TTFT (ms):                        50757.01  
P99 TTFT (ms):                           107224.86 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.10     
Median TPOT (ms):                        47.26     
P99 TPOT (ms):                           195.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.13     
Median ITL (ms):                         44.34     
P99 ITL (ms):                            325.68    
==================================================
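For reference, 113.07 req/s across seven nodes is roughly 5.0× the single-server 22.60 req/s, i.e. about 71% of linear scaling (113.07 / (7 × 22.60) ≈ 0.71).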
