[Neo][vLLM] Accept quant options for awq, fp8 #2382
Merged
Description
This PR adds additional pass-through options for configuring AWQ and FP8 quantization.
For AWQ, we add the following options. These map to the options defined here.

- `option.awq_zero_point`: toggles zero-point quantization
- `option.awq_block_size`: (existing field) group/block size for AWQ quantization
- `option.awq_weight_bit_width`: bit width for quantization; currently only 4 is supported
- `option.awq_mm_version`: AWQ matmul implementation
- `option.awq_ignore_layers`: layers to ignore during quantization
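As an illustration only, these AWQ options might appear together in `serving.properties` as sketched below. The values shown (zero point enabled, group size 128, the `GEMM` kernel, `lm_head` as an ignored layer) are assumptions for the sketch, not defaults confirmed by this PR.

```properties
# Hypothetical serving.properties snippet for AWQ quantization.
# All values are illustrative assumptions, not confirmed defaults.
option.awq_zero_point=true
option.awq_block_size=128
option.awq_weight_bit_width=4
option.awq_mm_version=GEMM
# List-of-strings field; the comma-separated form is an assumption
# (see the list-handling note in the FP8 section below).
option.awq_ignore_layers=lm_head
```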
For FP8, we add the following options. These are defined here.

- `option.fp8_activation_scheme`: static or dynamic activation scaling factors
- `option.fp8_kv_cache_quant_targets`: modules to target for kv cache quantization (currently unused)
- `option.fp8_ignore_patterns`: layers to ignore
- `option.calib_size`: (existing field) number of samples for activation-scale calibration

For fields that are read by the underlying library as a list of strings, we accept them as shown in the sketch below.
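A minimal sketch of the FP8 options in `serving.properties`, also illustrating list-valued fields. The comma-separated encoding and the specific values (`k_proj`/`v_proj` targets, the `lm_head` pattern, a calibration size of 512) are assumptions for this sketch, not formats or defaults confirmed by the PR.

```properties
# Hypothetical serving.properties snippet for FP8 quantization.
# All values are illustrative assumptions, not confirmed defaults.
option.fp8_activation_scheme=static
# List-of-strings fields, assumed here to be comma-separated:
option.fp8_kv_cache_quant_targets=k_proj,v_proj
option.fp8_ignore_patterns=lm_head
option.calib_size=512
```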
This PR also removes the FP8 configuration options that were previously added in the original PR (#2272).