modify disable_sliding_window version to 0.30.0
hommayushi3 committed Aug 25, 2024
1 parent 33d1dff commit e07c9a0
Showing 2 changed files with 2 additions and 2 deletions.
serving/docs/lmi/user_guides/lmi-dist_user_guide.md (2 changes: 1 addition & 1 deletion)
@@ -154,4 +154,4 @@ Here are the advanced parameters that are available when using LMI-Dist.
| option.enable_chunked_prefill | \>= 0.29.0 | Pass Through | This config enables chunked prefill support. With chunked prefill, longer prompts are chunked and batched with decode requests to reduce inter-token latency. This option is EXPERIMENTAL and has been tested for llama and falcon models only. It does not work with LoRA or speculative decoding yet. | Default: `False` |
| option.cpu_offload_gb_per_gpu | \>= 0.29.0 | Pass Through | This config allows offloading model weights to CPU memory so that large models can run with limited GPU memory. | Default: `0` |
| option.enable_prefix_caching | \>= 0.29.0 | Pass Through | This config allows the engine to cache the context memory and reuse it to speed up inference. | Default: `False` |
- | option.disable_sliding_window | \>= 0.29.0 | Pass Through | This config disables sliding window, capping inference to the sliding window size. | Default: `False` |
+ | option.disable_sliding_window | \>= 0.30.0 | Pass Through | This config disables sliding window, capping inference to the sliding window size. | Default: `False` |
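
For orientation, the options in this table are supplied through the container's serving.properties file. The sketch below is illustrative only: the model id, tensor parallel degree, and engine lines are assumptions rather than part of this change, and option.disable_sliding_window requires an LMI container at or above the 0.30.0 version documented above.

```properties
# serving.properties, minimal sketch for the lmi-dist backend.
# The model id and tensor_parallel_degree values are placeholders.
engine=MPI
option.rolling_batch=lmi-dist
option.model_id=mistralai/Mistral-7B-Instruct-v0.2
option.tensor_parallel_degree=1

# Pass-through options from the table above
option.enable_prefix_caching=true
# Requires LMI >= 0.30.0 per this commit
option.disable_sliding_window=true
```

Disabling the sliding window is mainly relevant for models that use sliding-window attention (Mistral is used here only as an example); per the description above, the usable context is then capped at the sliding window size.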
serving/docs/lmi/user_guides/vllm_user_guide.md (2 changes: 1 addition & 1 deletion)
@@ -141,4 +141,4 @@ In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
| option.enable_chunked_prefill | \>= 0.29.0 | Pass Through | This config enables chunked prefill support. With chunked prefill, longer prompts are chunked and batched with decode requests to reduce inter-token latency. This option is EXPERIMENTAL and has been tested for llama and falcon models only. It does not work with LoRA or speculative decoding yet. | Default: `False` |
| option.cpu_offload_gb_per_gpu | \>= 0.29.0 | Pass Through | This config allows offloading model weights to CPU memory so that large models can run with limited GPU memory. | Default: `0` |
| option.enable_prefix_caching | \>= 0.29.0 | Pass Through | This config allows the engine to cache the context memory and reuse it to speed up inference. | Default: `False` |
- | option.disable_sliding_window | \>= 0.29.0 | Pass Through | This config disables sliding window, capping inference to the sliding window size. | Default: `False` |
+ | option.disable_sliding_window | \>= 0.30.0 | Pass Through | This config disables sliding window, capping inference to the sliding window size. | Default: `False` |
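
The same pass-through option applies when the vLLM backend is selected, and per the table it likewise requires 0.30.0 there. A minimal sketch, again with placeholder values, differs from the lmi-dist example above mainly in the engine selection:

```properties
# serving.properties, minimal sketch for the vLLM backend (placeholder model id).
engine=Python
option.rolling_batch=vllm
option.model_id=mistralai/Mistral-7B-Instruct-v0.2
# Requires LMI >= 0.30.0 per this commit
option.disable_sliding_window=true
```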
