
server: Experimental new speculative decoding algorithm #14132


Closed
wants to merge 6 commits

Conversation


@jukofyork jukofyork commented Jun 11, 2025

How to use this PR

1. Edit the parameters of the llama-batched-bench call in the BENCHMARK_COMMAND variable below to match how you intend to use the model, eg:

#!/bin/bash

# Environment variables
#export CUDA_VISIBLE_DEVICES=0

# User-defined parameters
MAX_DRAFT_BATCH_SIZE=32
NUM_SAMPLES_PER_BATCH=10

BENCHMARK_COMMAND="~/llama.cpp/build/bin/llama-batched-bench \
    --model ./qwen-2.5-coder-Q6_K.gguf \
    --n-gpu-layers 99 \
    --flash-attn"

# Temporary files
TEMP_FILE=$(mktemp)
JSONL_FILE=$(mktemp)

# Function to generate comma-separated PP and NPL lists
generate_lists() {
    local max_pp=$1
    local num_repeats=$2
    pp_list=$(seq -s ',' 1 $max_pp)
    npl_list=$(printf '1%.0s,' $(seq 1 $num_repeats) | sed 's/,$//')
}

# Function to run the benchmark
run_benchmark() {
    echo "Running benchmark..."
    $BENCHMARK_COMMAND \
        -npp "$pp_list" \
        -ntg 0 \
        -npl "$npl_list" \
        --output-format jsonl | tee "$TEMP_FILE"
}

# Function to extract and process results
process_results() {
    echo -n "Extracting results..."
    count=$(grep '^{' "$TEMP_FILE" | tail -n +2 | tee "$JSONL_FILE" | wc -l)
    echo " Done ($count results extracted)"

    echo "Processing results:"
    jq -s --raw-output '
        (map(.pp) | max) as $max_pp |
        reduce .[] as $item (
            {};
            ($item.pp | tostring) as $pp |
            .[$pp].sum = (.[$pp].sum + $item.speed) |
            .[$pp].count = (.[$pp].count + 1)
        ) |
        [range(1; $max_pp + 1) as $pp |
            (.[($pp|tostring)]).sum / .[($pp|tostring)].count
        ] as $averages |
        $averages[0] as $base |
        [1] + [$averages[1:][] | $base / . ] |
        map(. * 1000 | round | . / 1000) |
        "y_data = np.array([ " + join(", ") + " ])"
    ' "$JSONL_FILE"
}

# Main script execution
generate_lists $MAX_DRAFT_BATCH_SIZE $NUM_SAMPLES_PER_BATCH
run_benchmark
process_results

# Clean up
rm "$TEMP_FILE" "$JSONL_FILE"

NOTE: You will need the jq tool installed for the results processing.

NOTE: Not all llama-server parameters are available for use with llama-batched-bench.
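
NOTE: If you'd rather not use jq, the post-processing can also be done with a rough Python equivalent of the pipeline above (sketch only; it assumes the extracted JSONL lines have been saved to a hypothetical results.jsonl file):

import json
from collections import defaultdict

# Rough Python equivalent of the jq pipeline above (sketch only):
# average the reported speed per PP value, then express each batch size's
# per-token cost relative to the PP=1 case.
sums, counts = defaultdict(float), defaultdict(int)
with open("results.jsonl") as f:  # hypothetical path to the extracted JSONL lines
    for line in f:
        rec = json.loads(line)
        sums[rec["pp"]] += rec["speed"]
        counts[rec["pp"]] += 1

averages = [sums[pp] / counts[pp] for pp in sorted(sums)]
y_data = [1.0] + [round(averages[0] / s, 3) for s in averages[1:]]
print("y_data = np.array([ " + ", ".join(map(str, y_data)) + " ])")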

2. Run this and it will produce a line of Python that looks like this:

y_data = np.array([ 1, 0.526, 0.352, 0.269, 0.229, 0.226, 0.217, 0.209, 0.137, 0.123, 0.112, 0.103, 0.095, 0.089, 0.083, 0.078, 0.075, 0.071, 0.068, 0.064, 0.061, 0.059, 0.056, 0.054, 0.051, 0.049, 0.047, 0.046, 0.044, 0.043, 0.042, 0.04 ])

3. Copy that line into this Python program, replacing the line under the # Your data points comment:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Your data points
y_data = np.array([ 1, 0.526, 0.352, 0.269, 0.229, 0.226, 0.217, 0.209, 0.137, 0.123, 0.112, 0.103, 0.095, 0.089, 0.083, 0.078, 0.075, 0.071, 0.068, 0.064, 0.061, 0.059, 0.056, 0.054, 0.051, 0.049, 0.047, 0.046, 0.044, 0.043, 0.042, 0.04 ])

x_data = np.arange(len(y_data))

# Find the first value less than 1
n_skipped = 0
for i, val in enumerate(y_data):
    if val < 1:
        n_skipped = i
        break

# Get the base value (first value less than 1)
base_value = y_data[n_skipped]

# Define the power decay function using the base value
def power_decay(x, b):
    return base_value * (x + 1)**(-b)

# Adjust the data to start from the first value < 1
x_fit = x_data[n_skipped:] - n_skipped  # Shift x to start at 0
y_fit = y_data[n_skipped:]

try:
    # Fit the function
    popt_power, _ = curve_fit(power_decay, x_fit, y_fit, p0=[0.5])
    power = popt_power[0]
    
    # Calculate fitted values
    y_fitted = power_decay(x_fit, power)
    
    # Plot results
    plt.figure(figsize=(10, 6))
    plt.scatter(x_data, y_data, label='Original Data')
    plt.scatter(x_fit + n_skipped, y_fit, color='red', label='Data used for fitting')
    plt.plot(x_fit + n_skipped, y_fitted, label=f'Power fit: {base_value:.2f}*x^(-{power:.2f})')
    plt.legend()
    plt.xlabel('Drafted Tokens')
    plt.ylabel('Relative Cost')
    plt.title('Power Law Fitting')
    plt.show()
    
    # Calculate and print RMSE
    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred)**2))
    
    print("RMSE for power fit:", rmse(y_fit, y_fitted))
    
    # Print the actual line and rounded version
    print(f"\nActual fit line: {base_value}*x^(-{power})")
    print(f"Rounded fit line: {base_value:.2f}*x^(-{power:.2f})")
    
    # Print suggested PR parameters for llama-server
    print("\nSuggested PR parameters for llama-server:\n")
    print(f"--draft-min {n_skipped}")
    print(f"--draft-max {len(y_data)}")
    print(f"--draft-p-min 0.{100.0*base_value:.0f}{100.0*power:.0f}{n_skipped} (NOTE: Encoded as 0.{{base}}{{power}}{{min}} for use with this PR only!)")

except Exception as e:
    print("Error during fitting:", e)

4. Run this (eg: online here: https://python-fiddle.com/examples/matplotlib).

and it will produce some output like this:

RMSE for power fit: 0.02035426400115545

Actual fit line: 0.526*x^(-0.654553089044972)
Rounded fit line: 0.53*x^(-0.65)

Suggested PR parameters for llama-server:

--draft-min 1
--draft-max 32
--draft-p-min 0.53651 (NOTE: Encoded as 0.{base}{power}{min} for use with this PR only!)

and a graph:

[image: power-law fit plot]

5. Then run your llama-server using the draft parameters it has generated, eg:

#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Run the main command
~/llama.cpp/build/bin/llama-server \
        --host "$host_address" \
        --port "$port_number" \
        --alias "qwen-2.5-coder" \
        --chat-template chatml \
        --model ~/models/gguf/qwen-2.5-coder-Q6_K.gguf \
        --n-gpu-layers 99 \
        --flash-attn \
        --ctx_size 16384 \
        --model-draft ~/models/gguf/draft_models/Qwen2.5-Coder-DRAFT-0.6B-Q4_0.gguf \
        --top-k 1 \
        --samplers "top_k" \
        --gpu-layers-draft 99 \
        --draft-min 1 \
        --draft-max 32 \
        --draft-p-min 0.53651

You can also manually set the parameters, eg:

--draft-p-min $${\color{white}0.\color{red}53\color{lightblue}65\color{orange}1}$$

translates to this power-law formula:

$${\color{red}0.53}$$ * (x - $${\color{orange}1}$$)^(- $${\color{lightblue}0.65}$$)

and with the last digit always set to match your --draft-min (1 here).

NOTE: Don't change the --draft-min without also changing this last digit or the formula will be completely wrong!

So, in general you can set:

--draft-p-min 0.{base}{power}{min}

where base and power are always 2 digits and min is always 1 digit.

(sorry it's such a crappy way of doing this, but if this shows more promise I will add the proper command line arg(s) later...).
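
For reference, here is a small Python sketch of how I read that packed encoding (illustration only; the actual parsing is done in this PR's C++ changes):

# Sketch only: unpack a --draft-p-min value encoded as 0.{base}{power}{min}
# into the pieces of the power-law formula above.
def decode_draft_p_min(packed: float):
    digits = f"{packed:.5f}".split(".")[1]  # e.g. 0.53651 -> "53651"
    base  = int(digits[0:2]) / 100.0        # first 2 digits -> 0.53
    power = int(digits[2:4]) / 100.0        # next 2 digits  -> 0.65
    dmin  = int(digits[4])                  # last digit     -> 1 (must match --draft-min)
    return base, power, dmin

print(decode_draft_p_min(0.53651))  # (0.53, 0.65, 1)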


The discussion that led to the idea of this PR starts here:

#10466 (comment)

and the basic idea is that the marginal cost of adding 1 more token to a batch goes way down as you add more and more tokens, but different models have very different cost profiles.
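
To make the break-even logic concrete, here is a minimal Python sketch (my reading of the scheme, with a hypothetical helper name; not code from this PR) of how the fitted power law turns a drafted token's position into an acceptance-probability threshold:

# Sketch only: under the power-law model, the i-th drafted token (counting from
# --draft-min) is worth adding when its draft probability exceeds its relative
# marginal cost, i.e. base * (i - draft_min + 1)^(-power).
def break_even_threshold(i, base=0.53, power=0.65, draft_min=1):
    if i < draft_min:
        return 1.0  # below --draft-min there is no break-even point
    return base * (i - draft_min + 1) ** (-power)

# Thresholds for the qwen-2.5-coder fit above (0.53 * x^(-0.65), --draft-min 1):
print([round(break_even_threshold(i), 3) for i in range(1, 9)])
# roughly [0.53, 0.338, 0.26, 0.215, 0.186, 0.165, 0.15, 0.137]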

For example here I repeat the above for deepseek-v3-0324:

BENCHMARK_COMMAND="~/build/bin/llama-batched-bench \
    --model ./deepseek-v3-0324-Q4_K_XL.gguf \
    --n-gpu-layers 99 \
    --flash-attn \
    --numa distribute \
    --threads 80 \
    --override-tensor exps=CPU"

and we get a completely different set of values:

y_data = np.array([ 1, 1.544, 1.045, 0.823, 0.695, 0.604, 0.544, 0.494, 0.472, 0.439, 0.413, 0.391, 0.374, 0.35, 0.342, 0.33, 0.323, 0.316, 0.309, 0.301, 0.295, 0.29, 0.285, 0.281, 0.277, 0.274, 0.27, 0.268, 0.264, 0.261, 0.259, 0.257 ])

which give a different set of optimal values:

RMSE for power fit: 0.016165745898928052

Actual fit line: 0.823*x^(-0.3431006496784628)
Rounded fit line: 0.82*x^(-0.34)


Suggested PR parameters for llama-server:

--draft-min 3
--draft-max 32
--draft-p-min 0.82343 (NOTE: Encoded as 0.{base}{power}{min} for use with this PR only!)

[image: power-law fit plot for deepseek-v3-0324]

So basically drafts of fewer than 3 tokens always have negative expectation here!

(also note, for qwen-2.5-coder:32b, the effect the flash attention kernels have at the small batch sizes)


One final thing to note: the linked discussion shows that a rational approximation fits the data much better, but the power-law fit is actually better at modelling the marginal cost here, as it tends to under-estimate the costs (ie: the gradient of the fitted line is usually steeper than the data shows), and this in turn reduces the need to recalibrate the draft model's output.

It's also:

  • Much clearer for non-technical people to see what is happening and adjust manually.
  • Should fit better with the existing --draft-p-min if it ever makes it into the code.

I'm keen to get some feedback on this: for my use cases and models it looks to be quite a big improvement, and it also seems to work really well across different levels of "draftability" without the need to reload the model for refactoring tasks, etc.


jukofyork commented Jun 11, 2025

Qwen3-235B-A22B drafted by Qwen3-0.6B example:

BENCHMARK_COMMAND="~/llama.cpp/build/bin/llama-batched-bench \
    --model ./qwen-3-Q4_K_XL.gguf \
    --n-gpu-layers 99 \
    --flash-attn \
    --numa distribute \
    --threads 80 \
    --override-tensor exps=CPU"

produces:

y_data = np.array([ 1, 0.652, 0.469, 0.363, 0.312, 0.278, 0.252, 0.235, 0.222, 0.213, 0.205, 0.201, 0.197, 0.196, 0.196, 0.194, 0.191, 0.189, 0.188, 0.186, 0.185, 0.183, 0.181, 0.18, 0.179, 0.179, 0.179, 0.178, 0.178, 0.177, 0.177, 0.176 ])
RMSE for power fit: 0.025337802840378353

Actual fit line: 0.652*x^(-0.44928510632787855)
Rounded fit line: 0.65*x^(-0.45)

Suggested PR parameters for llama-server:

--draft-min 1
--draft-max 32
--draft-p-min 0.65451 (NOTE: Encoded as 0.{base}{power}{min} for use with this PR only!)

[image: power-law fit plot (MAX_DRAFT_BATCH_SIZE=32)]

Simple "refactor the bash script" test for comparison with better fit (see below):

prompt eval time =   37672.82 ms /  1642 tokens (   22.94 ms per token,    43.59 tokens per second)
       eval time =  134418.90 ms /  1784 tokens (   75.35 ms per token,    13.27 tokens per second)
      total time =  172091.72 ms /  3426 tokens
draft acceptance rate = 0.56556 ( 1376 accepted /  2433 generated)

This isn't such a good fit, so let's retry with MAX_DRAFT_BATCH_SIZE=16 to bias the least-squares fit towards the earlier values:

y_data = np.array([ 1, 0.638, 0.459, 0.36, 0.309, 0.279, 0.254, 0.236, 0.225, 0.214, 0.208, 0.203, 0.2, 0.198, 0.198, 0.196 ])
RMSE for power fit: 0.012670542841567161

Actual fit line: 0.638*x^(-0.4865547601674651)
Rounded fit line: 0.64*x^(-0.49)

Suggested PR parameters for llama-server:

--draft-min 1
--draft-max 16
--draft-p-min 0.64491 (NOTE: Encoded as 0.{base}{power}{min} for use with this PR only!)

[image: power-law fit plot (MAX_DRAFT_BATCH_SIZE=16)]

prompt eval time =   36642.58 ms /  1642 tokens (   22.32 ms per token,    44.81 tokens per second)
       eval time =  138647.57 ms /  1823 tokens (   76.05 ms per token,    13.15 tokens per second)
      total time =  175290.15 ms /  3465 tokens
draft acceptance rate = 0.65340 ( 1346 accepted /  2060 generated)

and using --draft-max 32 (which is fine to alter after the fit, unlike --draft-min, which should be left alone!):

prompt eval time =   36051.84 ms /  1642 tokens (   21.96 ms per token,    45.55 tokens per second)
       eval time =  143748.35 ms /  1851 tokens (   77.66 ms per token,    12.88 tokens per second)
      total time =  179800.19 ms /  3493 tokens
draft acceptance rate = 0.53316 ( 1415 accepted /  2654 generated)

and after tweaking the decay steepness manually from 0.49 to 0.55:

--draft-min 1
--draft-max 32
--draft-p-min 0.64551

we get:

prompt eval time =   36064.83 ms /  1642 tokens (   21.96 ms per token,    45.53 tokens per second)
       eval time =  122079.05 ms /  1761 tokens (   69.32 ms per token,    14.43 tokens per second)
      total time =  158143.88 ms /  3403 tokens
draft acceptance rate = 0.56506 ( 1420 accepted /  2513 generated)

There is a lot of potential for tweaking here, and hopefully this shows that the numbers from the least-squares fit should only really be used as a starting point!

I'm very interested to see some hard results from other people comparing it to the existing algorithm - I've avoided providing my own numbers here, as my use cases and setup are probably quite unique (especially the large MoE models I have offloaded to RAM, etc).

jukofyork commented:

> One final thing to note: the linked discussion shows that a rational approximation fits the data much better, but the power-law fit is actually better at modelling the marginal cost here, as it tends to under-estimate the costs (ie: the gradient of the fitted line is usually steeper than the data shows), and this in turn reduces the need to recalibrate the draft model's output.

I had this kinda backwards: it's the break-even probabilities (ie: the reciprocals of the relative costs) that are getting under-estimated (at the tail), and that's not really what we want (but I'm not sure it will matter that much...).

We could add an extra parameter like so:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Your data points
#y_data = np.array([ 1, 0.526, 0.352, 0.269, 0.229, 0.226, 0.217, 0.209, 0.137, 0.123, 0.112, 0.103, 0.095, 0.089, 0.083, 0.078, 0.075, 0.071, 0.068, 0.064, 0.061, 0.059, 0.056, 0.054, 0.051, 0.049, 0.047, 0.046, 0.044, 0.043, 0.042, 0.04 ])
#y_data = np.array([ 1, 1.544, 1.045, 0.823, 0.695, 0.604, 0.544, 0.494, 0.472, 0.439, 0.413, 0.391, 0.374, 0.35, 0.342, 0.33, 0.323, 0.316, 0.309, 0.301, 0.295, 0.29, 0.285, 0.281, 0.277, 0.274, 0.27, 0.268, 0.264, 0.261, 0.259, 0.257 ])
y_data = np.array([ 1, 0.652, 0.469, 0.363, 0.312, 0.278, 0.252, 0.235, 0.222, 0.213, 0.205, 0.201, 0.197, 0.196, 0.196, 0.194, 0.191, 0.189, 0.188, 0.186, 0.185, 0.183, 0.181, 0.18, 0.179, 0.179, 0.179, 0.178, 0.178, 0.177, 0.177, 0.176 ])
#y_data = np.array([ 1, 0.638, 0.459, 0.36, 0.309, 0.279, 0.254, 0.236, 0.225, 0.214, 0.208, 0.203, 0.2, 0.198, 0.198, 0.196 ])

x_data = np.arange(len(y_data))

# Find the first value less than 1
n_skipped = 0
for i, val in enumerate(y_data):
    if val < 1:
        n_skipped = i
        break

# Get the base value (first value less than 1)
base_value = y_data[n_skipped]

# Define the modified power decay function with asymptotic value c
def power_decay(x, b, c):
    return (base_value - c) * (x + 1)**(-b) + c

# Adjust the data to start from the first value < 1
x_fit = x_data[n_skipped:] - n_skipped  # Shift x to start at 0
y_fit = y_data[n_skipped:]

try:
    # Fit the function with initial guesses for b and c
    # c should be between 0 and base_value, b positive
    popt_power, _ = curve_fit(power_decay, x_fit, y_fit, p0=[0.5, 0.0])
    power = popt_power[0]
    offset = popt_power[1]
    
    # Calculate fitted values
    y_fitted = power_decay(x_fit, power, offset)
    
    # Plot results
    plt.figure(figsize=(10, 6))
    plt.scatter(x_data, y_data, label='Original Data')
    plt.scatter(x_fit + n_skipped, y_fit, color='red', label='Data used for fitting')
    plt.plot(x_fit + n_skipped, y_fitted, label=f'Modified Power fit: ({base_value:.2f}-{offset:.2f})*x^(-{power:.2f}) + {offset:.2f}')
    plt.legend()
    plt.xlabel('Drafted Tokens')
    plt.ylabel('Relative Cost')
    plt.title('Modified Power Law Fitting')
    plt.show()
    
    # Calculate and print RMSE
    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred)**2))
    
    print("RMSE for modified power fit:", rmse(y_fit, y_fitted))
    
    # Print the actual line and rounded version
    print(f"\nActual fit line: ({base_value} - {offset})*x^(-{power}) + {offset}")
    print(f"Rounded fit line: ({base_value:.2f} - {offset:.2f})*x^(-{power:.2f}) + {offset:.2f}")
    
    # Print suggested PR parameters for llama-server
    print("\nSuggested PR parameters for llama-server:\n")
    print(f"--draft-min {n_skipped}")
    print(f"--draft-max {len(y_data)}")
    print(f"--draft-p-min 0.{100.0*base_value:.0f}{100.0*power:.0f}{100.0*offset:.0f}{n_skipped} (NOTE: Encoded as 0.{{base}}{{power}}{{offset}}{{min}} for use with this PR only!)")

except Exception as e:
    print("Error during fitting:", e)

and the fit is much better:

[image: modified power-law fit plot]

but it's starting to get much more complex and unintuitive to set by hand... :/

It may not matter that much in practice, though:

  1. We will probably rarely draft anywhere near deep enough to reach the region where the under-estimates occur.
  2. It would then take a fairly pathological set of near-1.0 outputs to not drop below the break-even threshold 1-2 tokens later anyway.


jukofyork commented Jun 12, 2025

Just tested on an M1 Studio Ultra and it looks like an extremely poor fit when running with -fa:

Qwen-2.5-Coder:32B

y_data = np.array([ 1, 0.92, 0.882, 0.52, 0.48, 0.465, 0.488, 0.44, 0.585, 0.528, 0.481, 0.441, 0.408, 0.38, 0.355, 0.333, 0.314, 0.297, 0.283, 0.268, 0.255, 0.244, 0.234, 0.224, 0.216, 0.208, 0.2, 0.193, 0.187, 0.181, 0.176, 0.17 ])
RMSE for power fit: 0.07071359778694299

Actual fit line: 0.92*x^(-0.39133878295452745)
Rounded fit line: 0.92*x^(-0.39)

Suggested PR parameters for llama-server:

--draft-min 1
--draft-max 32
--draft-p-min 0.92391 (NOTE: Encoded as 0.{base}{power}{min} for use with this PR only!)

[image: power-law fit plot for Qwen-2.5-Coder:32B on the M1 Ultra]

Qwen-3:32B

y_data = np.array([ 1, 0.918, 0.881, 0.527, 0.483, 0.465, 0.488, 0.44, 0.574, 0.517, 0.471, 0.433, 0.401, 0.372, 0.349, 0.327, 0.308, 0.292, 0.278, 0.262, 0.25, 0.239, 0.229, 0.22, 0.212, 0.204, 0.197, 0.19, 0.184, 0.178, 0.173, 0.167 ])
RMSE for power fit: 0.06880300235722625

Actual fit line: 0.918*x^(-0.39562070439580604)
Rounded fit line: 0.92*x^(-0.40)

Suggested PR parameters for llama-server:

--draft-min 1
--draft-max 32
--draft-p-min 0.92401 (NOTE: Encoded as 0.{base}{power}{min} for use with this PR only!)

[image: power-law fit plot for Qwen-3:32B on the M1 Ultra]


This actually suggests to me that it would be much better to pass a vector of probability thresholds (eg: like my first attempt at this) rather than attempt any type of fit...

The potential gains for models like this could be huge, as it should be clear that using a fixed p_min = 0.75 for a model with this sort of cost profile is going to be giving up a ton of +EV drafts!

I'll try and see if I can find a way to pass the vector more cleanly than my const std::vector<double> p_mins = {... hack I tried yesterday...
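
As a rough illustration of what the calibration side of that could look like (sketch only, and it assumes the per-position threshold is just the measured relative cost, clamped at 1):

import numpy as np

# Sketch only: skip the curve fit entirely and turn the measured relative costs
# into a per-position p_min vector (values >= 1 mean "never worth drafting").
# The numbers are the first 16 values of the Qwen-2.5-Coder data above.
y_data = np.array([1, 0.92, 0.882, 0.52, 0.48, 0.465, 0.488, 0.44, 0.585, 0.528,
                   0.481, 0.441, 0.408, 0.38, 0.355, 0.333])
p_mins = np.clip(y_data, None, 1.0)
print("const std::vector<double> p_mins = {"
      + ", ".join(f"{p:.3f}" for p in p_mins) + "};")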

jukofyork commented:

It also shows that the const int max_lookahead = 5; from my original code is probably needed again too, as the non-monotonic / "jumpy" nature means it could actually be +EV to try a larger batch with a significantly lower break-even threshold!
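
A rough sketch of that look-ahead idea (hypothetical helper, not the original code):

# Sketch only: with a jagged threshold curve, don't stop drafting just because
# the current position's break-even threshold isn't met - check whether any of
# the next few positions is cheap enough to still be worth reaching.
def should_continue(draft_p, pos, thresholds, max_lookahead=5):
    window = thresholds[pos:pos + max_lookahead]
    return any(draft_p >= t for t in window)

# e.g. in the jumpy Metal curve above, position 8 costs 0.585 but position 10
# is back down to 0.481, so a 0.5-probability draft can still be worth continuing:
thresholds = [1, 0.92, 0.882, 0.52, 0.48, 0.465, 0.488, 0.44, 0.585, 0.528,
              0.481, 0.441, 0.408]
print(should_continue(0.5, 8, thresholds, max_lookahead=1))  # False
print(should_continue(0.5, 8, thresholds, max_lookahead=5))  # True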


jukofyork commented Jun 12, 2025

I've figured out what's causing the extreme jumps for the CUDA and Metal tests:

  • For my large MoE models the main bottleneck is the experts' tensors offloaded to RAM, and the previous context makes almost no difference.
  • For the fully offloaded CUDA/Metal models the main bottleneck becomes the attention calculation over the previous context!

So rerunning the tests now to see what comes out with -npp 512 and using -npl 1,2,3,... with -pps to try to simulate this effect (not 100% sure this is the same as calculating PP after a certain amount of context has been added, but it's the only set of options in llama-batched-bench I can see for doing this sort of test).

There are 2 modes of operation:

  • prompt not shared - each batch has a separate prompt of size PP (i.e. N_KV = B*(PP + TG))
  • prompt is shared - there is a common prompt of size PP used by all batches (i.e. N_KV = PP + B*TG)
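
For example (just plugging numbers into the formulas above): with PP=512, TG=32 and B=4, the non-shared mode needs N_KV = 4*(512+32) = 2176 cells, while the shared mode only needs 512 + 4*32 = 640.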


jukofyork commented Jun 12, 2025

[image: relative cost curves rerun with a shared prompt via -pps]

Surprisingly, increasing the existing context doesn't really change the jaggedness and just moves up the asymptote due to the extra constant overhead.

We could easily add another parameter to model the overhead:

def power_decay(x, b, c, base_value):  # base_value now fitted as an extra parameter
    return (base_value - c) * (x + 1)**(-b) + c

but the jagged line really kills the whole idea, so I'm going to close this and have a rethink over the weekend about whether there is anything better we can do... It definitely looks like there are some serious gains to be made here, but having to run llama-batched-bench and/or ending up with 2-3 new variables to tune isn't really worth it IMO.

@jukofyork jukofyork closed this Jun 12, 2025