LMDeploy Release V0.2.2
Highlight
English version
- The allocation strategy for the k/v cache is changed. The parameter cache_max_entry_count now defaults to 0.8 and denotes the proportion of FREE GPU memory rather than TOTAL memory, which helps prevent OOM issues. A configuration sketch follows this list.
- The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
- Add api key and ssl to api_server; see the launch sketch below
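A minimal configuration sketch for the new k/v cache default, assuming the TurbomindEngineConfig backend_config argument of the pipeline API; the model name and the 0.5 ratio are placeholders, adjust them to your deployment:
from lmdeploy import pipeline, TurbomindEngineConfig
# cache_max_entry_count: fraction of FREE GPU memory reserved for the k/v cache
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('hi, please intro yourself'))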
Chinese version
- The TurboMind engine changes its GPU memory allocation strategy. The default value of the k/v cache ratio parameter cache_max_entry_count is changed to 0.8. It now denotes the proportion of free GPU memory rather than total GPU memory.
- The pipeline supports a streaming inference interface. You can try the following code:
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
- api_server adds api_key support to its interface; a launch sketch follows
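A minimal launch sketch for the api key and SSL options, assuming api_server accepts --api-keys and --ssl flags and reads the certificate paths from the SSL_KEYFILE and SSL_CERTFILE environment variables; the paths and the key value are placeholders:
# certificate and private key for HTTPS (assumed environment variable names)
export SSL_KEYFILE=/path/to/server.key
export SSL_CERTFILE=/path/to/server.crt
# clients must then pass the configured api key when calling the server
lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333 --api-keys sk-demo-key --ssl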
What's Changed
🚀 Features
- add alignment tools by @grimoire in #1004
- support min_length for turbomind backend by @irexyc in #961
- Add stream mode function to pipeline by @AllentDan in #974
- [Feature] Add api key and ssl to http server by @AllentDan in #1048
💥 Improvements
- hide stop-words in output text by @grimoire in #991
- optimize sleep by @grimoire in #1034
- set example values to /v1/chat/completions in swagger UI by @AllentDan in #984
- Update adapters cli argument by @RunningLeon in #1039
- Fix turbomind end session bug. Add huggingface demo document by @AllentDan in #1017
- Support linking the custom built mpi by @lvhan028 in #1025
- sync mem size for tp by @lzhangzz in #1053
- Remove model name when loading hf model by @irexyc in #1022
- support internlm2-1_8b by @lvhan028 in #1073
- Update chat template for internlm2 base model by @lvhan028 in #1079
🐞 Bug fixes
- fix TorchEngine stuck when benchmarking with tp>1 by @grimoire in #942
- fix module mapping error of baichuan model by @grimoire in #977
- fix import error for triton server by @RunningLeon in #985
- fix qwen-vl example by @irexyc in #996
- fix missing init file in modules by @RunningLeon in #1013
- fix tp mem usage by @grimoire in #987
- update indexes_containing_token function by @AllentDan in #1050
- fix flash kernel on sm 70 by @grimoire in #1027
- Fix baichuan2 lora by @grimoire in #1042
- Fix modelconfig in pytorch engine, support YI. by @grimoire in #1052
- Fix repetition penalty for long context by @irexyc in #1037
- [Fix] Support QLinear in rowwise_parallelize_linear_fn and colwise_parallelize_linear_fn by @HIT-cwh in #1072
📚 Documentations
- add docs for evaluation with opencompass by @RunningLeon in #995
- update docs for kvint8 by @RunningLeon in #1026
- [doc] Introduce project OpenAOE by @JiaYingLii in #1049
- update pipeline guide and FAQ about OOM by @lvhan028 in #1051
- docs update cache_max_entry_count for turbomind config by @zhyncs in #1067
🌐 Other
- update ut ci to new server node by @RunningLeon in #1024
- Ete testcase update by @zhulinJulia24 in #1023
- fix OOM in BlockManager by @zhyncs in #973
- fix use engine_config.tp when tp is None by @zhyncs in #1057
- Fix serve api by moving logger inside process for turbomind by @AllentDan in #1061
- bump version to v0.2.2 by @lvhan028 in #1076
New Contributors
- @zhyncs made their first contribution in #973
- @JiaYingLii made their first contribution in #1049
Full Changelog: v0.2.1...v0.2.2