This repository contains code to train and evaluate 'token importance' predictors.
We support the following models directly through huggingface-transformers:
- DeepSeek-R1-Distill-Llama-8B-Butler
- Llama-3.1-8B-Butler
- Llama-2-7b-hf-Butler
- Llama-3.2-3B-Butler
- Llama-3.2-1B-Butler
The collection of models can be found here.
To try them, simply run `test_hf.py`, or use the following snippet:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

question = "If millionaires have butlers, why don't million dollar language models have a butler too? I think its because "

model_name = "akhauriyash/DeepSeek-R1-Distill-Llama-8B-Butler"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

response = generator(question, max_new_tokens=200, do_sample=True, top_p=0.95, temperature=0.7)
print(response[0]['generated_text'][len(question):])
```
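If you prefer to manage device placement and decoding yourself, the Butler checkpoints load like any other causal LM. The following is only a minimal sketch, assuming a CUDA GPU with bfloat16 support and the `accelerate` package for `device_map="auto"`; it mirrors the pipeline example above rather than adding anything new.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "akhauriyash/DeepSeek-R1-Distill-Llama-8B-Butler"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# device_map="auto" needs `accelerate`; drop it (and .to(model.device)) to stay on CPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

question = "If millionaires have butlers, why don't million dollar language models have a butler too?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.95, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```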
Note that the default configured sparsity is 50%, with a sliding window of 128 tokens and 8 anchor tokens. To change the sparsity, use the following function after loading the model (it can also be found in `test_hf.py`). Please note that `fixed` is the only supported strategy at the moment; it fixes the sparsity of each layer (except the first) at the given percentage (`pc`). The sliding window and anchor tokens can be changed in a similar manner; see the sketch after the snippet below.
```python
def set_sparsity(model, sparsity):
    for module in model.modules():
        if "AttentionExperimental" in module.__class__.__name__:
            module.token_sparse_method = sparsity
            module.set_token_sparsity()
    return model

model = set_sparsity(model, "fixed_60pc")
```
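The sliding window and anchor-token counts live on the same attention modules. The helper below is only a sketch: `sliding_window` and `num_anchor_tokens` are assumed attribute names (they are not confirmed by this README), so check the `AttentionExperimental` class for the exact names before relying on it.

```python
def set_window_and_anchors(model, sliding_window=128, num_anchor_tokens=8):
    # Sketch only: `sliding_window` and `num_anchor_tokens` are assumed attribute
    # names; verify them against the AttentionExperimental implementation.
    for module in model.modules():
        if "AttentionExperimental" in module.__class__.__name__:
            module.sliding_window = sliding_window
            module.num_anchor_tokens = num_anchor_tokens
    return model

model = set_window_and_anchors(model, sliding_window=128, num_anchor_tokens=8)
```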
All of our results and traces from experiments are located in `ablation_results/`.
Note: Our predictor design has improved since the arXiv paper release (we added a layer-norm to stabilize training). Further, to keep the focus on the main predictor design and the training/eval scripts, we have removed the ablation scripts. To reproduce the original results and predictor models, please check out commit `0412fc24a3b770e4d82e6d7064a8172f24c5fcd3` and download the old models from the Drive Link.
For the latest models, use the Hugging Face integration described above. Wandb logs for the trained models are also available.
```bash
conda create --name TokenButler python=3.10
conda activate TokenButler
python -m pip install -r requirements.txt
```
Please download our trained (old) TokenButler predictor models from this Drive Link. To evaluate them, example scripts are provided in `scripts/eval_scan.sh`; first check out commit `0412fc24a3b770e4d82e6d7064a8172f24c5fcd3` (decode-generation may not work at this commit).
```bash
bash eval_scan.sh L3_3B_2k_1PC.csv L3_3B_2k_1PC ExpPred meta-llama/Llama-3.2-3B 1024 16 "<PATH TO CHECKPOINT>"
```
- TokenButler: `ExpPred`
- Oracle: `oracle`
- H2O: `h2o_true` (Generation not supported, prefill is 'decode simulated')
- SnapKV: `snapkv` (Generation not supported, prefill is 'decode simulated')
- Quest: `quest` (Generation not supported)
Important note on our evaluation strategy: To properly test token-eviction/selection methods in a longer decode setting, we simulate token eviction based strategies (SnapKV and H2O) in a purely decode setting. For example, with a 50% token budget, we simulate the entire input sequence as if it were fully decoded, allocating tokens proportionally to the sequence length at each decode step. This approach helps profile prefill-eviction-based methods by accurately emulating their token eviction policies across the full input sequence. Unfortunately, this also makes accuracy evaluation slower.
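For intuition, the arithmetic behind the simulated decode looks roughly like the sketch below. This is illustrative only, not the repository's implementation: the function name, the `min_tokens` floor, and the exact rounding are made up for the example; the point is simply that the token budget at decode step t grows proportionally with t.

```python
def per_step_budget(seq_len, sparsity=0.5, min_tokens=1):
    # Illustrative only: at decode step t, a method evaluated at 50% sparsity is
    # allowed to keep roughly half of the t tokens seen so far.
    return [max(min_tokens, int((1.0 - sparsity) * t)) for t in range(1, seq_len + 1)]

print(per_step_budget(10))  # [1, 1, 1, 2, 2, 3, 3, 4, 4, 5]
```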
To change the downstream evaluation tasks, modify `task_list`; to evaluate on a smaller subset, modify `eval_subset`. To change the token sparsities being evaluated, modify `{10..60..10}` as desired.
Our training scripts are located in `scripts/train_predictors.sh`. We provide scripts for the following models:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- meta-llama/Llama-3.2-1B
- meta-llama/Llama-3.2-3B
- meta-llama/Llama-2-7b-hf
- meta-llama/Llama-3.1-8B
- mistralai/Mistral-7B-v0.1
- Qwen/Qwen2.5-3B
- Qwen/Qwen2.5-7B
- microsoft/Phi-3.5-mini-instruct
- microsoft/Phi-3-mini-4k-instruct
Training requires one A6000 GPU for these variants. Longer-context training is possible using `--model_parallelism`.
Model: DeepSeek-R1-Distill-Llama-8B-Butler
| Method | Sparsity (%) | Perplexity | BBH Causal Judgement | MMLU-Pro |
|---|---|---|---|---|
| Dense | 0 | 15.87 | 0.55 | 0.274 |
| TokenButler | 12.2 | 15.90 | 0.56 | 0.275 |
| TokenButler | 31.0 | 15.99 | 0.55 | 0.273 |
| TokenButler | 49.8 | 16.22 | 0.56 | 0.273 |
| TokenButler | 68.2 | 16.99 | 0.55 | 0.263 |
| Oracle | 12.2 | 15.85 | 0.56 | 0.273 |
| Oracle | 31.0 | 15.76 | 0.55 | 0.273 |
| Oracle | 49.8 | 15.66 | 0.54 | 0.271 |
| Oracle | 68.3 | 15.71 | 0.51 | 0.271 |
```bibtex
@misc{akhauri2025tokenbutlertokenimportancepredictable,
      title={TokenButler: Token Importance is Predictable},
      author={Yash Akhauri and Ahmed F AbouElhamayed and Yifei Gao and Chi-Chih Chang and Nilesh Jain and Mohamed S. Abdelfattah},
      year={2025},
      eprint={2503.07518},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.07518},
}
```