
LMDeploy Release V0.2.2

Released by @lvhan028 on 31 Jan 09:57 · 717 commits to main since this release · commit 4a28f12

Highlights

  • The TurboMind engine's k/v cache allocation strategy has changed: cache_max_entry_count now specifies the proportion of free GPU memory rather than total GPU memory, and it defaults to 0.8. This helps prevent OOM issues. A configuration sketch follows this list.
  • The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')
# stream_infer yields response chunks as they are generated
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
  • Added API key authentication and SSL support to api_server. See the client sketch after this list.
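
To adjust the new k/v cache budget, the 0.2 pipeline API accepts a backend config. A minimal sketch, assuming TurbomindEngineConfig and its cache_max_entry_count field are exposed at the top level as in the 0.2 series; the model name is just the one from the example above:

from lmdeploy import pipeline, TurbomindEngineConfig

# Reserve 50% of *free* GPU memory for the k/v cache instead of the
# 0.8 default; lower this if the GPU is shared with other processes.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe('hi, please intro yourself'))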

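With an API key configured, clients authenticate against the server's OpenAI-compatible endpoint via a Bearer token. A minimal client-side sketch, assuming the server runs at the default port behind SSL; the URL, key, and model name below are placeholders for your deployment:

import requests

API_BASE = 'https://localhost:23333'   # placeholder: your api_server address
API_KEY = 'sk-your-key'                # placeholder: the key set at launch

resp = requests.post(
    f'{API_BASE}/v1/chat/completions',
    headers={'Authorization': f'Bearer {API_KEY}'},
    json={
        'model': 'internlm2-chat-7b',
        'messages': [{'role': 'user', 'content': 'hi, please intro yourself'}],
    },
    verify=False,  # only for self-signed certificates during testing
)
print(resp.json())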

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.1...v0.2.2