# 3. Engine Options

## Server options

Below is the full list of available options for the Aphrodite Engine API server.

### Model Options

| Flag | Description |
|------|-------------|
| `--model MODEL` | Name or path of the Hugging Face model to use. |
| `--tokenizer TOKENIZER` | Name or path of the Hugging Face tokenizer to use. Defaults to the model. |
| `--revision REVISION` | The specific model version to use. Can be a branch, tag, or commit ID. Defaults to `main`. |
| `--code-revision REVISION` | The code revision to use for models with remote code. |
| `--tokenizer-revision REVISION` | The tokenizer revision to use. |
| `--tokenizer-mode {auto,slow}` | Tokenizer mode. `auto` will use the fast tokenizer if available. |
| `--trust-remote-code` | Trust the model's remote code from Hugging Face. |
| `--download-dir DIRECTORY` | The directory to download the model to. |
| `--load-format {auto,pt,safetensors,npcache,dummy}` | The format of the model weights. Defaults to `auto`. |
| `--dtype {auto,float16,bfloat16,float32}` | The data type to use. `auto` will use FP16 for FP32/FP16 models. |
| `--max-model-len LENGTH` | The model context size. Defaults to the model's original value. If set to a higher value, automatic RoPE scaling is used. |
| `--guided-decoding-backend {outlines,lm-format-enforcer}` | The engine to use for guided decoding (JSON Schema, regex, etc.). |
| `--enforce-eager {true,false}` | If True, disable CUDA graphs. Defaults to True. |
| `--max-context-len-to-capture LENGTH` | Maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, we fall back to eager mode. |
| `--max-log-probs NUM` | The maximum number of logprobs to output for requests. Defaults to 10. |
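
For example, a minimal launch that combines several of these flags might look like the following. The model name and download directory are placeholders, and the `python -m aphrodite.endpoints.openai.api_server` entrypoint is assumed:

```bash
# Illustrative launch; substitute your own model and paths.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --download-dir /data/models
```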

### Parallel Options

| Flag | Description |
|------|-------------|
| `--worker-use-ray` | Use Ray for distributed serving. Will be set automatically when using more than 1 GPU. |
| `--tensor-parallel-size SIZE` (`-tp`) | The number of GPUs to use for loading the model. |
| `--pipeline-parallel-size SIZE` (`-pp`) | The number of pipeline stages to use for loading the model. Currently unsupported. |
| `--ray-workers-use-nsight` | Use nsight for profiling Ray workers. |
| `--disable-custom-all-reduce` | If set, disables the custom all-reduce kernels. Set this if you experience instabilities with multi-GPU setups. |
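
As a sketch, serving a larger model across two GPUs with tensor parallelism only requires adding `-tp` to the launch command (the model name is a placeholder; add `--disable-custom-all-reduce` only if you hit multi-GPU instabilities):

```bash
# Illustrative multi-GPU launch with 2-way tensor parallelism.
python -m aphrodite.endpoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  -tp 2
```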

### Quantization Options

| Flag | Description |
|------|-------------|
| `--kv-cache-dtype {auto,fp8}` | The data type for the KV cache. If `auto`, will use the model's data type. `fp8` will quantize it to 8 bits, offering lower memory usage and improved throughput. |
| `--quantization {aqlm,awq,bnb,eetq,exl2,gguf,gptq,quip,squeezellm,marlin}` (`-q`) | The method used to quantize the weights you're loading. Will try to infer it automatically from the model; if unsuccessful, pass this flag manually. |
| `--load-in-4bit` | Load the 32/16-bit or AWQ model in 4-bit, using SmoothQuant+. |
| `--load-in-smooth` | Load the 32/16-bit model in 8-bit, using SmoothQuant+. |
| `--load-in-8bit` | Load the 32/16-bit model in 8-bit, using BitsAndBytes. |
| `--quantization-param-path PATH` | Path to the JSON file containing the KV cache scaling factors. Applicable to FP8 quantization on AMD GPUs only. |
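
For instance, loading a GPTQ checkpoint while also quantizing the KV cache to FP8 could look like this (the model name is a placeholder; `-q gptq` can usually be omitted since the method is inferred from the model):

```bash
# Illustrative quantized launch: GPTQ weights + FP8 KV cache.
python -m aphrodite.endpoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
  -q gptq \
  --kv-cache-dtype fp8
```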

### KV Cache Options

| Flag | Description |
|------|-------------|
| `--block-size {8,16,32}` | The token block size. Defaults to 16. |
| `--context-shift` | Enable context shifting. Caches previously processed prompts so they can be reused later. |
| `--num-gpu-blocks-override NUM` | If specified, ignore the profiling result and use this number of GPU blocks. |
| `--swap-space SIZE` | The amount of CPU swap space to use, in GiB. |
| `--gpu-memory-utilization FRACTION` (`-gmu`) | The fraction of VRAM to use per GPU. Set to 0.9 (90%) by default. |
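
A sketch of tuning the cache-related flags, e.g. leaving some VRAM free for other processes and allowing 8 GiB of CPU swap (values are illustrative, not recommendations):

```bash
# Illustrative cache tuning: 85% VRAM budget, 8 GiB swap, context shifting on.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  -gmu 0.85 \
  --swap-space 8 \
  --context-shift
```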

### Tokenizer Options

| Flag | Description |
|------|-------------|
| `--tokenizer-pool-size SIZE` | Size of the tokenizer pool to use for asynchronous tokenization. If 0, synchronous tokenization is used. |
| `--tokenizer-pool-type {ray}` | The type of tokenizer pool to use for asynchronous tokenization. Only `ray` is supported for now. |
| `--tokenizer-pool-extra-config CONFIG` | Extra config for the tokenizer pool. Should be a JSON string that will be parsed into a dictionary. |
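
For example, offloading tokenization to a small Ray-backed pool might look like this (the pool size is illustrative, and Ray must be installed for this to work):

```bash
# Illustrative async tokenization with a Ray tokenizer pool of 4 workers.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tokenizer-pool-size 4 \
  --tokenizer-pool-type ray
```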

### Scheduler Options

| Flag | Description |
|------|-------------|
| `--max-num-batched-tokens NUM` | The maximum number of tokens to be processed in a single iteration. |
| `--max-num-seqs NUM` | The maximum number of sequences to be processed in a single iteration. Defaults to 256. |
| `--use-v2-block-manager` | Whether to use BlockSpaceManagerV2. |
| `--delay-factor FACTOR` | Apply a delay (the delay factor multiplied by the previous prompt's latency) before scheduling the next prompt. |
| `--policy {fcfs}` | The scheduling policy to use. |
| `--enable-chunked-prefill` | If set, prefill requests can be chunked based on the remaining `max_num_batched_tokens`. Greatly reduces memory usage for GQA models. |
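
As a sketch, a throughput-oriented configuration might cap the batch sizes and enable chunked prefill (the numbers are illustrative, not tuned recommendations):

```bash
# Illustrative scheduler tuning with chunked prefill enabled.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill
```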

### Device Options

| Flag | Description |
|------|-------------|
| `--device {auto,cuda,neuron,cpu}` | The device to use for the engine. |

### Speculative Decoding Options

| Flag | Description |
|------|-------------|
| `--num-lookahead-slots NUM` | The number of slots to allocate per sequence per step, beyond the known token IDs. Used to store KV activations of tokens which may or may not be accepted. |
| `--speculative-model {MODEL,"[ngram]"}` | The name or path of the draft model to use. This can either be a Hugging Face model, or just `"[ngram]"` to use ngram prompt-lookup decoding. |
| `--num-speculative-tokens NUM` | The number of speculative tokens to sample from the draft model. |
| `--speculative-max-model-len LEN` | The maximum sequence length supported by the draft model. Sequences over this length will skip speculation. |
| `--ngram-prompt-lookup-max NUM` | Maximum window size for ngram prompt lookup. |
| `--ngram-prompt-lookup-min NUM` | Minimum window size for ngram prompt lookup. |
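
For example, ngram prompt-lookup speculation can be enabled without a separate draft model (the token counts are illustrative):

```bash
# Illustrative ngram prompt-lookup speculative decoding.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4
```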

### LoRA Options

| Flag | Description |
|------|-------------|
| `--enable-lora` | Enable loading LoRA adapter weights. |
| `--max-loras NUM` | The maximum number of LoRAs in a single batch. |
| `--max-lora-rank {8,16,32,64}` | The maximum LoRA rank. Defaults to 16. |
| `--lora-extra-vocab-size NUM` | Maximum size of the extra vocabulary that can be present in a LoRA adapter (added to the base model vocabulary). |
| `--lora-dtype {auto,float16,bfloat16,float32}` | Data type for LoRA. If `auto`, defaults to the base model dtype. |
| `--max-cpu-loras NUM` | Maximum number of LoRAs to store in CPU memory. |
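
A sketch of enabling LoRA support and registering an adapter at startup; the adapter name and path are placeholders, and the `name=path` form for `--lora-modules` (documented under API Options below) is assumed:

```bash
# Illustrative LoRA-enabled launch with one adapter registered at startup.
python -m aphrodite.endpoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --max-lora-rank 32 \
  --lora-modules my-adapter=/path/to/adapter
```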

### API Options

| Flag | Description |
|------|-------------|
| `--host HOST` | The host name. Defaults to localhost. |
| `--port PORT` | The port number. Defaults to 2242. |
| `--allowed-credentials STR` | The allowed credentials. |
| `--allowed-origins ALLOWED_ORIGINS` | The allowed origins. |
| `--allowed-methods ALLOWED_METHODS` | The allowed methods. |
| `--allowed-headers ALLOWED_HEADERS` | The allowed headers. |
| `--api-keys API_KEY` | The API key to use, for securing the endpoint. |
| `--admin-key ADMIN_KEY` | The admin API key to use, for admin operations. |
| `--launch-kobold-api` | Launch the Kobold API server in addition to the OpenAI one. |
| `--max-length LENGTH` | The maximum output length. Used for Horde; otherwise has no effect. |
| `--served-model-name NAME` | The model name to use in the API. If unspecified, uses `--model`. |
| `--lora-modules LORA_MODULES [LORA_MODULES ...]` | The individual LoRA modules to load for the API server. Can also be handled on the fly using the `/v1/lora` endpoint. |
| `--chat-template PATH` | The file path to the chat template. By default, attempts to extract it from the model if available. |
| `--response-role` | The role name to return if `add_generation_prompt=True`. |
| `--ssl-keyfile PATH` | The file path to the SSL key file. |
| `--ssl-certfile PATH` | The file path to the SSL cert file. |
| `--root-path PATH` | FastAPI `root_path` when the app is behind a path-based routing proxy. |
| `--middleware MIDDLEWARE` | Additional ASGI middleware to apply to the app. Multiple `--middleware` arguments are accepted. The value should be an import path. If a function is provided, Aphrodite adds it to the server using `@app.middleware('http')`. If a class is provided, it is added with `app.add_middleware()`. |
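
For example, exposing the server on all interfaces with an API key and a friendlier model name (the key and name are placeholders):

```bash
# Illustrative API configuration; use a real secret for --api-keys.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --host 0.0.0.0 \
  --port 2242 \
  --api-keys sk-example-key \
  --served-model-name mistral-7b
```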

### Vision Language Options

| Flag | Description |
|------|-------------|
| `--image-input-type {pixel_values,image_features}` | The image input type passed to Aphrodite. |
| `--image-token-id ID` | Input ID for the image token. |
| `--image-input-shape SHAPE` | The biggest image input shape (worst case for memory footprint) for the given input type. Only used for Aphrodite's `profile_run`. |
| `--image-feature-size SIZE` | The image feature size along the context dimension. |
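
As a rough sketch, the values below mirror what a LLaVA-1.5-style model commonly uses, but they are illustrative assumptions; check your model's configuration for the correct token ID, input shape, and feature size:

```bash
# Illustrative vision-language launch (LLaVA-1.5-style values; verify for your model).
python -m aphrodite.endpoints.openai.api_server \
  --model llava-hf/llava-1.5-7b-hf \
  --image-input-type pixel_values \
  --image-token-id 32000 \
  --image-input-shape 1,3,336,336 \
  --image-feature-size 576
```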

## Sampler options

These are per-request sampling parameters; an example request using several of them is shown after the list.

- `n`: The number of output sequences to return for a prompt.
- `best_of`: The number of output sequences to generate from the prompt. From these `best_of` sequences, the top `n` sequences are returned. By default it is set equal to `n`. If `use_beam_search` is True, it is treated as the beam width.
- `seed`: The random seed to use for generation.
- `presence_penalty`: Penalizes new tokens based on whether they appear in the generated text so far. Values higher than 0 encourage the model to use new tokens, and values lower than 0 encourage it to repeat tokens. Disabled: 0.
- `frequency_penalty`: Penalizes new tokens based on their frequency in the generated text so far. Values higher than 0 encourage the model to use new tokens, and values lower than 0 encourage it to repeat tokens. Disabled: 0.
- `repetition_penalty`: Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values higher than 1 encourage the model to use new tokens, while values lower than 1 encourage it to repeat tokens. Disabled: 1.
- `temperature`: Controls the randomness of the output. Lower values make the model more deterministic, while higher values make it more random. Disabled: 1.
- `dynatemp_range`: Enables Dynamic Temperature, which scales the temperature based on the entropy of the token probabilities (normalized by the maximum possible entropy for the distribution so it scales well across different k values), controlling the variability of token probabilities. Dynamic Temperature takes a minimum and maximum temperature: the minimum is calculated as `temperature - dynatemp_range`, and the maximum as `temperature + dynatemp_range`. Disabled: 0.
- `dynatemp_exponent`: The exponent value for dynamic temperature. Defaults to 1. Higher values trend towards lower temperatures, lower values towards higher temperatures.
- `smoothing_factor`: The smoothing factor to use for Quadratic Sampling. Disabled: 0.0.
- `smoothing_curve`: The smoothing curve to use for Cubic Sampling. Disabled: 1.0.
- `top_p`: Controls the cumulative probability of the top tokens to consider. Disabled: 1.
- `top_k`: Controls the number of top tokens to consider. Disabled: -1.
- `top_a`: Controls the threshold probability for tokens, reducing randomness when the model's certainty is high. Does not significantly affect output creativity. Disabled: 0.
- `min_p`: Controls the minimum probability for a token to be considered, relative to the probability of the most likely token. Disabled: 0.
- `tfs`: Tail-Free Sampling. Eliminates low-probability tokens after identifying a plateau in the sorted token probabilities. It minimally affects the creativity of the output and is best used for longer texts. Disabled: 1.
- `eta_cutoff`: Used in Eta sampling; adapts the cutoff threshold based on the entropy of the token probabilities, optimizing token selection. Value is in units of 1e-4. Disabled: 0.
- `epsilon_cutoff`: Used in Epsilon sampling; sets a simple probability threshold for token selection. Value is in units of 1e-4. Disabled: 0.
- `typical_p`: Regulates the information content of the generated text by sorting tokens based on the sum of the entropy and the natural logarithm of the token probability. It has a strong effect on output content but still maintains creativity even at low settings. Disabled: 1.
- `mirostat_mode`: The Mirostat mode to use. Only 2 is currently supported. Mirostat is an adaptive decoding algorithm that generates text with a predetermined perplexity value, providing control over repetition and thus ensuring high-quality, coherent, and fluent text. Disabled: 0.
- `mirostat_tau`: The target "surprise" value that Mirostat works towards. Range: 0 to infinity.
- `mirostat_eta`: The learning rate at which Mirostat updates its internal surprise value. Range: 0 to infinity.
- `use_beam_search`: Whether to use beam search instead of normal sampling.
- `length_penalty`: Penalizes sequences based on their length. Used in beam search.
- `early_stopping`: Controls the stopping condition for beam search. Accepts the following values: `True`, where generation stops as soon as there are `best_of` complete candidates; `False`, where a heuristic is applied and generation stops when it is very unlikely that better candidates will be found; and `"never"`, where beam search only stops when there cannot be better candidates (the canonical beam search algorithm).
- `stop`: List of strings that stop generation when they are generated. The returned output will not contain the stop strings.
- `stop_token_ids`: List of token IDs that stop generation when they are generated. The returned output will contain the stop tokens unless they are special tokens (e.g. EOS).
- `include_stop_str_in_output`: Whether to include the stop strings in the output text. Default: False.
- `ignore_eos`: Whether to ignore the EOS token and continue generating tokens after it is generated.
- `max_tokens`: The maximum number of tokens to generate per output sequence.
- `logprobs`: Number of log probabilities to return per output token. The implementation follows the OpenAI API: the result includes the log probabilities of the `logprobs` most likely tokens, as well as the chosen tokens. The API always returns the log probability of the sampled token, so there may be up to `logprobs + 1` elements in the response.
- `prompt_logprobs`: Number of log probabilities to return per prompt token.
- `custom_token_bans`: List of token IDs to ban from being generated.
- `skip_special_tokens`: Whether to skip special tokens in the output. Default: True.
- `spaces_between_special_tokens`: Whether to add spaces between special tokens in the output. Default: True.
- `logits_processors`: List of LogitsProcessors to change the probability of token prediction at runtime. Aliased to `logit_bias` in the API request body.
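
As an example, these options are sent per request in the JSON body of the OpenAI-compatible endpoints. The sketch below assumes a server running locally on the default port, and that the extra (non-OpenAI) fields such as `min_p` and `repetition_penalty` are accepted directly in the request body:

```bash
# Illustrative completion request mixing standard and extended sampler options.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Once upon a time",
    "max_tokens": 256,
    "temperature": 0.8,
    "min_p": 0.05,
    "repetition_penalty": 1.1,
    "stop": ["\n\n"]
  }'
```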