Use aiohttp inside proxy server && add --disable-cache-status argument #3020

Open · wants to merge 3 commits into main
Conversation

AllentDan (Collaborator)
No description provided.
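No description accompanies the PR, but the title indicates the proxy server now forwards requests with aiohttp. As a rough, hypothetical sketch of that idea only (not this PR's actual diff; the backend URL, endpoint, and payload below are illustrative assumptions), an aiohttp-based forward could look like:

```python
# Minimal sketch of forwarding a request with aiohttp; NOT the PR's actual code.
# The backend URL, endpoint, and payload are assumptions made for illustration.
import asyncio

import aiohttp

BACKEND_URL = "http://127.0.0.1:23333/v1/chat/completions"  # hypothetical api_server node


async def forward(payload: dict) -> dict:
    """Send one request to a backend api_server and return its JSON reply."""
    async with aiohttp.ClientSession() as session:
        async with session.post(BACKEND_URL, json=payload) as resp:
            resp.raise_for_status()
            return await resp.json()


if __name__ == "__main__":
    demo = {
        "model": "internlm2",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
    }
    print(asyncio.run(forward(demo)))
```

The point of an async client here is that the proxy can keep many in-flight requests to the backend nodes on a single event loop instead of blocking per request.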

AllentDan (Collaborator, Author) commented on Jan 13, 2025:

I tested internlm2-chat-7 on seven nodes. The performance with this PR and #2961 applied is:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     10000     
Benchmark duration (s):                  88.44     
Total input tokens:                      2317235   
Total generated tokens:                  2007343   
Total generated tokens (retokenized):    2004019   
Request throughput (req/s):              113.07    
Input token throughput (tok/s):          26201.58  
Output token throughput (tok/s):         22697.55  
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   47196.62  
Median E2E Latency (ms):                 47467.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          38858.11  
Median TTFT (ms):                        39344.03  
P99 TTFT (ms):                           63995.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.60     
Median TPOT (ms):                        45.05     
P99 TPOT (ms):                           162.91    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.18     
Median ITL (ms):                         0.01      
P99 ITL (ms):                            740.29    
==================================================

For comparison, the performance of a single api_server is:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     3000      
Benchmark duration (s):                  132.74    
Total input tokens:                      683944    
Total generated tokens:                  597386    
Total generated tokens (retokenized):    596120    
Request throughput (req/s):              22.60     
Input token throughput (tok/s):          5152.61   
Output token throughput (tok/s):         4500.51   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61212.61  
Median E2E Latency (ms):                 60387.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          52197.99  
Median TTFT (ms):                        50757.01  
P99 TTFT (ms):                           107224.86 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.10     
Median TPOT (ms):                        47.26     
P99 TPOT (ms):                           195.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.13     
Median ITL (ms):                         44.34     
P99 ITL (ms):                            325.68    
==================================================
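For reference, 113.07 req/s across seven nodes is roughly 5.0× the single-server 22.60 req/s, i.e. about 71% of linear scaling (113.07 / (7 × 22.60) ≈ 0.71).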
