LMDeploy Release V0.2.2
Highlight
English version
- The allocation strategy for the k/v cache is changed. The parameter cache_max_entry_count now defaults to 0.8 and denotes the proportion of FREE GPU memory rather than TOTAL memory, which helps prevent OOM issues. A configuration sketch follows this list.
- The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
- Add api key and ssl to api_server; see the launch sketch below
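A minimal configuration sketch for the new k/v cache default, assuming the TurbomindEngineConfig backend_config argument of the pipeline API; the model name and the 0.5 ratio are placeholders, adjust them to your deployment:
from lmdeploy import pipeline, TurbomindEngineConfig
# cache_max_entry_count: fraction of FREE GPU memory reserved for the k/v cache
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('hi, please intro yourself'))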
Chinese version
- The TurboMind engine changes its GPU memory allocation strategy. The default value of the k/v cache ratio parameter cache_max_entry_count is changed to 0.8. It now denotes the proportion of free GPU memory rather than total GPU memory.
- The pipeline supports a streaming inference interface. You can try the following code:
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
- api_server adds api_key support to its interface; a launch sketch follows
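A minimal launch sketch for the api key and SSL options, assuming api_server accepts --api-keys and --ssl flags and reads the certificate paths from the SSL_KEYFILE and SSL_CERTFILE environment variables; the paths and the key value are placeholders:
# certificate and private key for HTTPS (assumed environment variable names)
export SSL_KEYFILE=/path/to/server.key
export SSL_CERTFILE=/path/to/server.crt
# clients must then pass the configured api key when calling the server
lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333 --api-keys sk-demo-key --ssl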
What's Changed
🚀 Features
- add alignment tools by @grimoire in #1004
- support min_length for turbomind backend by @irexyc in #961
- Add stream mode function to pipeline by @AllentDan in #974
- [Feature] Add api key and ssl to http server by @AllentDan in #1048
💥 Improvements
- hide stop-words in output text by @grimoire in #991
- optimize sleep by @grimoire in #1034
- set example values to /v1/chat/completions in swagger UI by @AllentDan in #984
- Update adapters cli argument by @RunningLeon in #1039
- Fix turbomind end session bug. Add huggingface demo document by @AllentDan in #1017
- Support linking the custom built mpi by @lvhan028 in #1025
- sync mem size for tp by @lzhangzz in #1053
- Remove model name when loading hf model by @irexyc in #1022
- support internlm2-1_8b by @lvhan028 in #1073
- Update chat template for internlm2 base model by @lvhan028 in #1079
🐞 Bug fixes
- fix TorchEngine stuck when benchmarking with tp>1 by @grimoire in #942
- fix module mapping error of baichuan model by @grimoire in #977
- fix import error for triton server by @RunningLeon in #985
- fix qwen-vl example by @irexyc in #996
- fix missing init file in modules by @RunningLeon in #1013
- fix tp mem usage by @grimoire in #987
- update indexes_containing_token function by @AllentDan in #1050
- fix flash kernel on sm 70 by @grimoire in #1027
- Fix baichuan2 lora by @grimoire in #1042
- Fix modelconfig in pytorch engine, support YI. by @grimoire in #1052
- Fix repetition penalty for long context by @irexyc in #1037
- [Fix] Support QLinear in rowwise_parallelize_linear_fn and colwise_parallelize_linear_fn by @HIT-cwh in #1072
📚 Documentations
- add docs for evaluation with opencompass by @RunningLeon in #995
- update docs for kvint8 by @RunningLeon in #1026
- [doc] Introduce project OpenAOE by @JiaYingLii in #1049
- update pipeline guide and FAQ about OOM by @lvhan028 in #1051
- docs update cache_max_entry_count for turbomind config by @zhyncs in #1067
🌐 Other
- update ut ci to new server node by @RunningLeon in #1024
- Ete testcase update by @zhulinJulia24 in #1023
- fix OOM in BlockManager by @zhyncs in #973
- fix use engine_config.tp when tp is None by @zhyncs in #1057
- Fix serve api by moving logger inside process for turbomind by @AllentDan in #1061
- bump version to v0.2.2 by @lvhan028 in #1076
New Contributors
- @zhyncs made their first contribution in #973
- @JiaYingLii made their first contribution in #1049
Full Changelog: v0.2.1...v0.2.2