MaskBench

The primary focus of the top-level repository is end-to-end performance and accuracy of JSON Schema-constrained generation. See the paper for the general methodology, a description of the data, and the results.

This folder, however, contains scripts and results dedicated to benchmarking mask computation in isolation, without involving an LLM. By isolating mask computation, this benchmark assesses its standalone performance, which is particularly relevant for server-side scenarios with large batch sizes.
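As a rough illustration of what is being timed here, the sketch below uses a hypothetical engine interface (compile, compute_mask, and consume are stand-ins, not the actual maskbench adapter API): grammar compilation corresponds to TTFM, and each per-token bitmask over the vocabulary to TBM (both defined under Measurements below).

```python
import time

def bench_schema(engine, schema, token_ids):
    """Hypothetical harness: time grammar compilation (TTFM) and each
    per-token mask computation (TBM) for one pre-tokenized valid instance."""
    t0 = time.monotonic()
    matcher = engine.compile(schema)      # stand-in: build a grammar from the JSON Schema
    ttfm = time.monotonic() - t0

    tbm = []
    for tok in token_ids:
        t0 = time.monotonic()
        matcher.compute_mask()            # stand-in: compute a bitmask over the whole vocabulary
        tbm.append(time.monotonic() - t0)
        matcher.consume(tok)              # stand-in: advance the matcher by one token
    return ttfm, tbm
```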

News

  • 2025-03-27: re-run tests with latest versions of engines; fairer timeout accounting
  • 2025-03-26: added testcases from NousResearch/json-mode-eval
  • 2025-03-26: added (quite easy) testcases from Gorilla BFCL v3 using improved version of script by @zanderjiang
  • 2025-03-21: reordered properties in a few objects, to follow stable property order
  • 2025-02-17: re-run tests with latest versions of engines
  • 2025-01-20: initial release of the benchmark

Data Overview

  • Data Folder (data/): Contains ~10k schemas, with 13k valid and 23k invalid instances (total: ~2M tokens). About 1.5k schemas lack tests.
  • Schema Instances: Each schema includes valid and invalid examples for benchmarking and correctness testing. See Testcases for details on data generation and stats on the testcases.

Benchmark Results

Engines Benchmarked

  1. LLGuidance
  2. llama.cpp grammars, using json_schema_to_grammar.py with the whitespace regex modified to /[ \t\n\r]*/ to match the JSON definition (by default the script limits the amount of whitespace, which slows down the engine).
  3. XGrammar in default configuration.
  4. "XGrammar.cpp": XGrammar with the llama.cpp script above.
  5. Outlines Core

Test Environment

  • Hardware: Azure NC96ads_A100_v4 with 96 threads (48 cores), 880 GiB RAM, 4x A100 GPUs (GPUs not utilized).

  • Constraints:

    • Time: 15 minutes per schema.
    • Memory: 40 GiB resident set size.
    • Threads: 40-thread limit.
  • Engines were executed single-threaded to emulate large batch scenarios (where batch size is larger than the number of available cores).

  • XGrammar was set to use only a single thread per sequence; the other engines always do that.

  • Outlines normally uses several threads per sequence; it was therefore run with 90 parallel threads so that it does not get more CPU time than the other engines.

Approximate times to run the benchmark with 40-way parallelism:

  • LLGuidance: ~45 seconds (Python overhead; Rust benchmark takes ~7 seconds)
  • llama.cpp: ~20 minutes
  • XGrammar: ~80 minutes
  • XGrammar.cpp: ~70 minutes
  • Outlines: ~130 minutes

Measurements

TTFM (time to first mask) p75 is computed as follows: take all grammar compilation times for a given engine that did not raise an exception (with timeouts counted at the timeout value of 900 s), sort them, and take the 75th percentile.

TBM (time between masks) p75 is computed as follows: take all successful mask computation times for a given engine (across all schemas), sort them, and take the 75th percentile.

Note that an engine that only supports "easy" schemas may have artificially good scores.
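A small Python sketch of how these two percentiles can be derived from raw timings; the actual aggregation lives in scripts/maskbench_results.py and may differ in detail (e.g., in the exact percentile method):

```python
def percentile(times, q):
    """Nearest-rank percentile; the scripts may use a different method."""
    xs = sorted(times)
    idx = min(len(xs) - 1, int(q / 100 * len(xs)))
    return xs[idx]

def ttfm_p75(compile_results, timeout_s=900.0):
    """compile_results: hypothetical (seconds, timed_out) pairs for compilations
    that did not raise an exception; timeouts count at the 900 s timeout value."""
    times = [timeout_s if timed_out else t for t, timed_out in compile_results]
    return percentile(times, 75)

def tbm_p75(mask_times):
    """mask_times: all successful per-token mask times, pooled across all schemas."""
    return percentile(mask_times, 75)
```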

Key Findings

  • Grammar Compilation Time (TTFM):

    • LLGuidance and llama.cpp had near-instantaneous compilation (llama.cpp's average is skewed by its single timeout).
    • XGrammar is now the slowest (unlike in previous runs of this benchmark), likely because it now supports more JSON Schema features.
    • Outlines has improved dramatically, and is now slightly faster than XGrammar (though the feature comparison is unclear).
    • XGrammar.cpp is now somewhat faster than regular XGrammar for the more complex schemas.
  • Mask Computation Time (TBM):

    • XGrammar outperforms LLGuidance on simple cases (p25–p90), but becomes 4x and 11x slower at p95 and p99, respectively.
    • LLGuidance is thus 11x faster on average due to better tail performance. The average increased slightly in 0.7.10 because computing the mask now releases the GIL, which adds a few microseconds.
    • Outlines fully pre-computes its masks during grammar compilation, so it's unclear why it's so slow at p75 and above.
    • XGrammar.cpp lags significantly (about 35x slower than LLGuidance at p50 and roughly 3 orders of magnitude slower from p75 onward).

Random notes

  • for TBM, with a batch size of 100 and a forward pass time of 20 ms, a p99 mask computation happens 50 times per second and a p99.9 one 5 times per second (see the arithmetic sketch after this list); unless handled specially, these mask computations will hold up the entire batch
  • the TTFM is cut off at 900s due to timeout
  • while LLGuidance has the largest number of compile errors, it has almost no validation errors or crashes; in other words, it's upfront about what it cannot do
  • the "invalidation errors" are cases where a generation should be rejected, but wasn't; these are clear bugs
  • "validation errors" may be more tricky due to object property ordering; however, all engines stick to definition order in properties (except for llama.cpp, which puts required properties first), and engines other than LLGuidance don't support allOf and sibling properties (which introduces complications to the ordering)
  • by default, XGrammar uses fixed white-space and assumes "additionalProperties": false; it also ignores keywords like minItems and maxItems, as well as allOf, sibling properties, etc.; this all significantly reduces complexity of the grammar
  • OTOH, grammars used in XGrammar.cpp configuration are similar in feature coverage and flexibility to the ones used in LLGuidance; thus it provides a more apples-to-apples comparison of the grammar engines (as opposed to grammars)
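The arithmetic behind the first note, spelled out with the numbers given there:

```python
batch_size = 100        # sequences decoded in parallel
forward_pass_s = 0.020  # 20 ms per forward pass

masks_per_s = batch_size / forward_pass_s  # 100 / 0.02 = 5,000 mask computations per second
print(masks_per_s * 0.01)                  # p99 tail:   50 slow masks per second
print(masks_per_s * 0.001)                 # p99.9 tail:  5 slow masks per second
```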

Performance Metrics

All TBM and TTFM values are in microseconds (entries of 900,000,000 correspond to the 900 s timeout); the remaining rows are token and schema counts.

| metric | LLGuidance | XGrammar | llama.cpp | XGrammar.cpp | Outlines |
|---|---:|---:|---:|---:|---:|
| TBM avg | 64 | 728 | 16,565 | 42,158 | 1,858 |
| TBM p25 | 29 | 3 | 11,032 | 8 | 2 |
| TBM p50 | 46 | 9 | 17,021 | 1,677 | 34 |
| TBM p75 | 55 | 39 | 22,656 | 74,994 | 4,110 |
| TBM p90 | 77 | 99 | 26,818 | 106,677 | 6,043 |
| TBM p95 | 119 | 499 | 29,722 | 137,862 | 6,572 |
| TBM p99 | 533 | 5,687 | 60,371 | 381,153 | 7,184 |
| TBM p99.9 | 1,695 | 100,267 | 297,552 | 2,043,383 | 7,820 |
| TBM p100 | 33,055 | 6,283,834 | 1,129,560 | 7,416,822 | 1,121,673 |
| TTFM avg | 1,850 | 27,062,414 | 92,371 | 7,040,547 | 18,351,800 |
| TTFM p25 | 800 | 683,313 | 220 | 434,100 | 300,638 |
| TTFM p50 | 1,037 | 919,459 | 312 | 856,231 | 587,900 |
| TTFM p75 | 1,534 | 2,399,718 | 621 | 1,445,446 | 4,281,360 |
| TTFM p90 | 3,030 | 11,645,219 | 1,744 | 4,099,282 | 26,703,408 |
| TTFM p95 | 5,457 | 35,854,415 | 3,649 | 9,323,880 | 58,638,471 |
| TTFM p99 | 17,988 | 900,000,000 | 28,558 | 65,476,826 | 413,111,222 |
| TTFM p99.9 | 40,092 | 900,000,000 | 112,446 | 900,000,000 | 900,000,000 |
| TTFM p100 | 229,367 | 900,000,000 | 900,000,000 | 900,000,000 | 900,000,000 |
| tokens | 2,607,859 | 2,929,227 | 2,069,143 | 1,511,558 | 2,349,083 |
| schemas | 11,306 | 11,306 | 11,306 | 11,306 | 11,306 |
| passing | 8,909 | 8,122 | 6,505 | 6,387 | 7,050 |
| compile error | 2,377 | 90 | 1,301 | 1,718 | 1,773 |
| segmentation fault | 0 | 37 | 0 | 1 | 0 |
| out of memory | 0 | 0 | 0 | 0 | 188 |
| timeout | 0 | 241 | 1 | 38 | 60 |
| validation error | 20 | 1,408 | 2,850 | 2,837 | 1,415 |
| invalidation error | 0 | 1,368 | 649 | 325 | 822 |

Versions

  • llguidance: 0.7.10
  • xgrammar: 0.1.17
  • llama-cpp-python: 0.3.8
  • outlines-core: 0.2.3

Reproducing Results

  • Run Masks: Use scripts/run_maskbench.py. Example:
    ./scripts/run_maskbench.py --xgr-compliant data/
    Results are saved in tmp/out--xgr-compliant. See ./scripts/run_maskbench.py --help for more options, in particular resource limits.

  • Analyze Results: Generate tables and plots with
    ./scripts/maskbench_results.py.

Debugging engine

Run python -m maskbench.runner data/Github_easy---o13947.json --debug --llg. Replace --llg with other engine options as needed.

Testcases

The schema instances were generated using the Meta Llama 3.1 70B instruct model. The output was constrained to produce valid JSON, though not strictly conforming to the schema. For valid instances, the model was further prompted to modify them into invalid ones.

Prompts for invalid instances were adjusted to emphasize specific schema features (e.g., maxItems, pattern, minLength, if, etc.), while some instances were generated without such focus. The data generation scripts are located in the creation/ folder.

Both valid and invalid instances were validated using Python and Rust jsonschema libraries.
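For the Python side, this boils down to a check along the lines of the sketch below (file names are illustrative; the actual generation and validation code is in the creation/ folder):

```python
import json
import jsonschema

# Illustrative paths; the real schemas and instances live under data/.
with open("schema.json") as f:
    schema = json.load(f)
with open("valid_instance.json") as f:
    valid_instance = json.load(f)
with open("invalid_instance.json") as f:
    invalid_instance = json.load(f)

# A valid instance must pass validation...
jsonschema.validate(valid_instance, schema)

# ...and an invalid instance must be rejected.
try:
    jsonschema.validate(invalid_instance, schema)
    raise AssertionError("invalid instance unexpectedly passed validation")
except jsonschema.ValidationError:
    pass
```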

Tests are categorized by origin and complexity. The table below summarizes the number of schemas, the percentage with generated instances, and the count of valid and invalid instances (some schemas have multiple valid/invalid instances).

For valid instances only (since invalid instances are not generated in production), the following metrics are computed:

  • Average tokens per instance (as counted by the Llama3 tokenizer).
  • Fast-forward token share, measured for two cases:
    1. Regular JSON (allowing whitespace wherever permitted by the spec).
    2. Compacted JSON (no whitespace anywhere).

Compacted JSON is preferred unless the model is fine-tuned on indented JSON.

Fast-forward tokens are additional tokens that can be appended to the model's context window after sampling. These tokens are 3-10x faster to compute than regular tokens. For example, a 15% share of fast-forward tokens corresponds to a 10-13% increase in throughput.
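One back-of-envelope way to arrive at numbers in that ballpark, assuming 15% of tokens are fast-forwarded and that an FF token costs 1/3 to 1/10 of a regular token to compute (the paper's exact accounting may differ):

```python
ff_share = 0.15                            # fraction of tokens that are fast-forwarded
for speedup in (3.0, 10.0):                # FF tokens assumed 3-10x cheaper to compute
    saved = ff_share * (1 - 1 / speedup)   # share of per-token compute avoided
    gain = saved / (1 - saved)             # corresponding increase in tokens/s
    print(f"{speedup:.0f}x: {saved:.1%} compute saved, {gain:.1%} more tokens/s")
# 3x:  10.0% compute saved, 11.1% more tokens/s
# 10x: 13.5% compute saved, 15.6% more tokens/s
```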

| split | schemas | has tests | valid inst. | invalid inst. | tok/inst. | FF | FF compact |
|---|---:|---:|---:|---:|---:|---:|---:|
| Github_trivial | 444 | 73% | 460 | 771 | 41 | 3% | 5% |
| Github_easy | 1943 | 87% | 2641 | 4611 | 46 | 11% | 14% |
| Github_hard | 1240 | 68% | 1493 | 3405 | 339 | 16% | 19% |
| Github_medium | 1976 | 87% | 3091 | 6119 | 141 | 11% | 13% |
| Github_ultra | 164 | 54% | 160 | 302 | 768 | 19% | 21% |
| Glaiveai2K | 1707 | 61% | 1634 | 1104 | 30 | 21% | 25% |
| Kubernetes | 1064 | 89% | 1680 | 2908 | 86 | 9% | 10% |
| Snowplow | 403 | 95% | 670 | 1730 | 142 | 9% | 11% |
| WashingtonPost | 125 | 78% | 146 | 330 | 95 | 12% | 14% |
| MCPspec | 45 | 78% | 44 | 44 | 45 | 20% | 29% |
| JsonSchemaStore | 492 | 73% | 679 | 1405 | 295 | 7% | 7% |
| TOTAL | 10163 | 75% | 12821 | 23047 | 133 | 13% | 15% |