support for llguidance grammars #10224
Conversation
Awesome. Is it a llama.cpp implementation of dottxt-ai/outlines?
No. The approach of llguidance is more like the current llama.cpp grammars, in that both compute the token mask on the fly. Outlines pre-computes the token masks for all states of the automaton resulting from compiling the constraint. This limits the expressiveness of constraints and has high startup costs (though of course near-zero sampling costs). I added some notes to the llguidance readme.
Really excited to see this -- great work!
Out of curiosity, how does llguidance compare to llama.cpp? Current JSON schema -> GBNF limitations are documented here: https://github.com/ggerganov/llama.cpp/tree/master/grammars#json-schemas--gbnf

I would love to do some speed tests with similarly complex grammars to see how llguidance compares against our stock implementation. Have you happened to run any of those tests yourself?
I believe llguidance JSON schema coverage will be more comprehensive once guidance-ai/llguidance#48 is merged. We should have more data on this soon. As for performance, on the example I ran there doesn't seem to be much difference between llama.cpp grammar, llguidance, and unconstrained. I think this is due to llama.cpp grammars using rejection sampling (resampling). Llguidance currently always computes the full mask (though I want to also allow rejection sampling). @HanClinto can you point me to some examples where the current llama.cpp grammars have performance issues?
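To make the distinction concrete, here is a minimal C++ sketch of the two strategies. The grammar check and the sampler are toy stand-ins (an "accepts even token ids" grammar and greedy argmax), not actual llama.cpp or llguidance code:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Toy stand-ins, for illustration only: a "grammar" that accepts even
// token ids, and greedy (argmax) "sampling".
static bool grammar_accepts(int32_t token) { return token % 2 == 0; }

static int32_t sample_from(const std::vector<float> & logits) {
    int32_t best = 0;
    for (int32_t i = 1; i < (int32_t) logits.size(); i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

// Rejection sampling (roughly how llama.cpp grammars behave): sample
// first, check only the sampled token; if invalid, forbid it and resample.
// Near-free when the model mostly proposes grammar-valid tokens.
// Assumes the grammar accepts at least one token.
static int32_t sample_rejection(std::vector<float> logits) {
    for (;;) {
        int32_t tok = sample_from(logits);
        if (grammar_accepts(tok)) return tok;
        logits[tok] = -INFINITY;
    }
}

// Full mask (roughly the llguidance approach): test every vocabulary
// token on every step, so each per-token check must be extremely cheap.
static int32_t sample_full_mask(std::vector<float> logits) {
    for (int32_t tok = 0; tok < (int32_t) logits.size(); tok++) {
        if (!grammar_accepts(tok)) logits[tok] = -INFINITY;
    }
    return sample_from(logits);
}
```

This is why the two perform similarly when the model rarely proposes invalid tokens: rejection sampling then does almost no grammar work, while the full mask pays a fixed per-step cost that llguidance keeps low via its lexer.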
#7810 is one issue where the stacks get so large that we SIGSEGV out (GBNF is available there). But that's CFG, so I'm not sure how useful it would be to you. Sadly, while I've collected a number of particularly brutal GBNF grammars, I don't have any particularly juicy JSON schema specs for you to try. #7703 has some tricky bits re: [...]

That said, I'm interested in learning how I can best help to put this through its paces -- I'm excited by the option of having competing grammar engines in llama.cpp!
@ochafik, do you have any JSON schemas available that caused the GBNF engine to bog down?
I tried the grammar from #7810 converted to lark; see xml-lark.txt. I needed to change [...]. I run it with [...]
The generation with and without grammar is the same (35t/s). The grammar without the TEXT optimization is quite slow (20t/s) and eventually fails due to Earley sets growing too large. This is actually a nice test case - we can try to optimize it a bit, but mainly I just want to fail it at runtime instead of taking this long.
Very nice!!! Is there no support for hex characters in LLG grammars? It's relatively common in GBNF grammars to use \x00 for null (you can see it in use in json.gbnf as an example), or to specify Unicode ranges. If I were looking to dive into llguidance, I suppose that feature might make a good first contribution.
Fascinating. I guess I don't know enough about the vagaries of grammar parsers, but I wonder if making this same change would let the GBNF parser successfully work through this grammar as well? Or is this a quirk of how llguidance handles things?
Very nice!! How does the speed compare if you make the same
Agreed. Fundamentally, I'm not sure that a stack-based parser is going to be able to parse a grammar like this (though neither do I think a precompiled grammar engine like Outlines would be able to handle it). I wonder if the only way to handle this would be an engine that uses backtracking (and so would have some wasteful token generation), but at least it would be able to work its way through it.

The other option that I've considered for grammars that grow this large is to simply do random culling of the tree once it grows above a certain limit (possibly with some heuristic that prioritizes shorter / simpler stacks over longer / more complex ones). You would lose the advantage of generating across every conceivable valid path, but at least what you generate would still be grammatically correct, and it could actually finish without crashing. But that's an optimization for another day.

For now, I'm interested in seeing how LLG performance stacks up against the GBNF grammar engine (so far I'm really optimistic!) and, secondly, seeing what we can do to make the coupling / separation cleaner. One issue is compilation / linking, and the second is how to specify LLG vs. GBNF. I actually like the way that you use "llg:" as a special prefix; see the dispatch sketch below. The alternative would be to specify [...]

A little bit of duct tape is acceptable (especially because I don't currently feel strongly about which path is best), but I would like us to be thinking ahead about the ideal scenario for incorporating multiple grammar engines alongside each other. I'm very interested in hearing others' thoughts on this topic.
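The prefix scheme under discussion amounts to a small dispatch step when the sampler is constructed. A hypothetical sketch (the function and enum names are invented for illustration; only the `"llg:<kind>:<data>"` format comes from the PR):

```cpp
#include <string>

enum class grammar_engine { GBNF, LLG };

// Hypothetical dispatch on the grammar string: "llg:<kind>:<data>"
// selects llguidance; anything else falls through to GBNF as before.
static grammar_engine pick_engine(const std::string & grammar,
                                  std::string & kind, std::string & data) {
    const std::string prefix = "llg:";
    if (grammar.rfind(prefix, 0) == 0) {
        const size_t colon = grammar.find(':', prefix.size());
        if (colon == std::string::npos) {
            // malformed: treat everything after "llg:" as the kind
            kind = grammar.substr(prefix.size());
            data.clear();
        } else {
            kind = grammar.substr(prefix.size(), colon - prefix.size());
            data = grammar.substr(colon + 1);
        }
        return grammar_engine::LLG;
    }
    data = grammar; // plain GBNF text, unchanged
    return grammar_engine::GBNF;
}
```

For example, `pick_engine("llg:regex:[A-Z ]+", kind, data)` yields `kind == "regex"` and `data == "[A-Z ]+"`.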
I currently use JSON string parsing for string literals; it's easy to fix (it's just a lark frontend issue). You can use [...]. Created issue guidance-ai/llguidance#54.

TL;DR: you can't do that TEXT trick with llama.cpp grammars.

Back in the 1970s, when computers were slow, people figured out that you can first deal with words (also called tokens (not to be confused with LLM tokens) or lexemes), and only then deal with syntax. This is because splitting text into words is cheaper than parsing it. And so regular expressions were used for "lexing" (splitting into words or lexemes) and context-free grammars were used for the higher-level parsing. Now, this is theoretically unnecessary, since regular languages are a subset of context-free languages. It's just that doing lexing first and parsing on top of larger items happens to be quite a bit faster.

Llama.cpp grammars do not have this separation between lexing and parsing, which is definitely cleaner and easier to understand. However, while computers are much faster now, token masking is this specific problem where you have to do lots of parsing in a very short time. I don't think you can do it fast enough without a lexer. Also, LLM tokens are typically somewhat aligned with lexemes, meaning that when you walk the prefix tree of all tokens, you do lexer operations 99.5+% of the time, and they are much cheaper. On the plus side, virtually all programming language definitions (including JSON) have this lexer/parser separation.

Oh, and BTW, if you have a lexer (and you mostly do), the size of the linked grammar is absolutely not a problem. We've been successfully running 4MB JSON schemas through LLG.
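To make the lexer/parser split concrete, here is a toy C++ sketch (illustrative only; llguidance's actual code is Rust and far more sophisticated). The "grammar" is just space-separated runs of digits; the point is that the per-byte work is a trivial DFA step, and the expensive parser is consulted only at lexeme boundaries:

```cpp
#include <string>
#include <vector>

// Stub for the expensive parser step (an Earley-parser advance in
// llguidance); invoked only when a whole lexeme has been scanned.
static bool parser_accepts_lexeme(const std::string & lexeme) {
    return !lexeme.empty(); // placeholder for real context-free parsing
}

// Cheap per-byte scan: most bytes of most candidate tokens are resolved
// by the lexer (here, a trivial DFA for the lexeme [0-9]+ with spaces
// as separators) without ever touching the parser.
static bool token_allowed(const std::string & text) {
    std::string lexeme;
    for (char c : text) {
        if (c >= '0' && c <= '9') {
            lexeme += c;                      // DFA stays inside [0-9]+
        } else if (c == ' ' && !lexeme.empty()) {
            if (!parser_accepts_lexeme(lexeme)) {
                return false;                 // rare slow path at lexeme boundary
            }
            lexeme.clear();
        } else {
            return false;                     // DFA dead state: reject token
        }
    }
    return true; // token may still extend into a valid lexeme
}

// The token mask is just token_allowed() over the vocabulary. Real
// implementations walk a shared prefix tree, so common prefixes of the
// ~100k tokens are scanned once rather than once per token.
static std::vector<bool> compute_token_mask(const std::vector<std::string> & vocab) {
    std::vector<bool> mask(vocab.size());
    for (size_t i = 0; i < vocab.size(); i++) {
        mask[i] = token_allowed(vocab[i]);
    }
    return mask;
}
```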
I really look forward to seeing this merged. So far I've used llama.cpp grammars and guidance, but I found it redundant that we have to convert EBNF grammars to guidance's grammar-function syntax; with complex grammars this gets ugly quickly.
We have just released a large JSON Schema benchmark and a paper. Of particular interest might be the isolated mask-generation benchmarks comparing LLGuidance, Outlines, XGrammar, and llama.cpp grammars. If there is any interest in merging this, I can try to get this PR back in shape - it would still likely require building llguidance with cargo, which I understand is a big hurdle.
The conditional API in `llama.h` is not an option. I think a better approach is to implement this as part of the libcommon - the `llama_sampler` API allows user code to implement new samplers. This way the llguidance support will ship with the `llama.cpp` examples and people will be able to use it. If we see that it is really useful, later on we can think of ways to integrate it more tightly with `libllama`.
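For reference, the `llama_sampler` interface referred to here is a small vtable of callbacks plus an opaque context pointer. A rough sketch of what a user-code sampler looks like (the callback set follows `llama.h`'s `llama_sampler_i` as of this PR; the bodies are placeholders, not the actual llguidance integration, and details may differ across llama.cpp versions):

```cpp
#include "llama.h"

// Placeholder per-sampler state; the real integration would hold the
// compiled llguidance constraint here.
struct llg_ctx {
    const llama_vocab * vocab;
};

static const char * llg_name(const struct llama_sampler * /*smpl*/) {
    return "llguidance";
}

static void llg_accept(struct llama_sampler * /*smpl*/, llama_token /*token*/) {
    // advance the grammar state by the token that was just committed
}

static void llg_apply(struct llama_sampler * /*smpl*/, llama_token_data_array * /*cur_p*/) {
    // compute the token mask and set logits of disallowed tokens to -INFINITY
}

static void llg_reset(struct llama_sampler * /*smpl*/) {
    // rewind the grammar to its start state
}

static struct llama_sampler * llg_clone(const struct llama_sampler * smpl);
static void                   llg_free (struct llama_sampler * smpl);

static struct llama_sampler_i llg_iface = {
    /* .name   = */ llg_name,
    /* .accept = */ llg_accept,
    /* .apply  = */ llg_apply,
    /* .reset  = */ llg_reset,
    /* .clone  = */ llg_clone,
    /* .free   = */ llg_free,
};

struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
        const char * /*grammar_kind*/, const char * /*grammar_data*/) {
    return new llama_sampler { /* .iface = */ &llg_iface, /* .ctx = */ new llg_ctx { vocab } };
}

static struct llama_sampler * llg_clone(const struct llama_sampler * smpl) {
    const llg_ctx * ctx = (const llg_ctx *) smpl->ctx;
    return llama_sampler_init_llg(ctx->vocab, nullptr, nullptr);
}

static void llg_free(struct llama_sampler * smpl) {
    delete (llg_ctx *) smpl->ctx;
    delete smpl;
}
```

Because the whole thing lives behind function pointers, libcommon can ship the llguidance sampler without `libllama` knowing anything about it.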
@ggerganov I re-did the code structure to move it to libcommon, and force-pushed to clean up history. I also generally cleaned up (including detection of llg grammars with a string that is valid in them) and added docs. I now build with cmake's ExternalProject_Add() - seems to work quite well, provided [...]
@ggerganov sorry for the build failures; I just ran the build on Linux myself, should be good now.
FWIW I'm working on more cross-compatibility between the Lark syntax used in llguidance and GBNF. These are only changes in llguidance though, so nothing is preventing review and merge of this PR.
I ended up not adding GBNF compat to the Lark syntax - they conflict too much. Instead, I added a conversion script. This has the advantage that it tries to identify which symbols are terminals and which are non-terminals. It works for all samples in [...]. I also added more docs in the llguidance repo about the lark syntax.
common/sampling.h
Outdated
```cpp
#ifdef LLAMA_USE_LLGUIDANCE
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
        const char * grammar_kind, const char * grammar_data);
#endif // LLAMA_USE_LLGUIDANCE
```
This `ifdef` should be moved inside the implementation. Generally, we want to avoid `ifdef`s in the headers.
@ggerganov I made it return nullptr if not enabled, let me know if you meant something else. I also linked the conversion script from [...]
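A sketch of the resulting pattern (assumed shape, not the exact committed code): the header declares the function unconditionally, and only the disabled stub lives behind the flag:

```cpp
// common/sampling.h (sketch): declared unconditionally, no ifdef.
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
        const char * grammar_kind, const char * grammar_data);

// common/llguidance.cpp (sketch): only the definition is guarded; the
// stub logs and returns nullptr when llguidance is not compiled in.
#ifndef LLAMA_USE_LLGUIDANCE
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * /*vocab*/,
        const char * /*grammar_kind*/, const char * /*grammar_data*/) {
    LOG_ERR("llguidance is not enabled - rebuild with -DLLAMA_LLGUIDANCE=ON\n");
    return nullptr;
}
#endif // LLAMA_USE_LLGUIDANCE
```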
Thank you for the patience - will merge this after the #9639 PR is merged. Will be tagging you for support in case any issues appear with the functionality. Thanks!
common/llguidance.cpp
Outdated
#include "common.h" | ||
#include "sampling.h" | ||
#include "log.h" | ||
#include "llama.h" | ||
|
||
#include <cmath> |
Remove unused headers:

```diff
-#include "common.h"
 #include "sampling.h"
 #include "log.h"
 #include "llama.h"
-#include <cmath>
```
`<cmath>` is actually required on Linux for the `INFINITY` constant; I did remove `common.h` though, thank you!
CMakeLists.txt
Outdated
```diff
@@ -79,6 +79,7 @@ option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})

 # 3rd party libs
 option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
+option(LLAMA_LLGUIDANCE "llama: build LLGuidance library for structured output" OFF)
```
It should somehow be indicated that this option is used by the `common` library. I am not sure what would be the best way. Maybe rename it to `LLAMA_COMMON_LLGUIDANCE`? But even if we leave it like this, it's ok - just making a note.
I re-worded the help string, maybe that is enough for now?
Waiting for things to stabilize after the #9639 change and will merge after. There are also some merge conflicts to be resolved here.
@ggerganov I fixed the merge conflicts and also updated llguidance to the latest 0.6.12 - it supports [...]
@mmoskal I'd like to add support for this to https://github.com/ShelbyJenkins/llm_client. It's a Rust crate that wraps llama.cpp (think Ollama), but focuses on using LLMs as control flow vs. chatbots. If I'm understanding this thread, I just need to:
In the future: |
@ShelbyJenkins definitely yes on grammar conversion; there might be some things you were doing that are now easier (e.g. the llguidance syntax supports embedded JSON and special tokens). As for the integration, you have a few options:
Note that mistral.rs, which you also seem to be using, already has llguidance integrated, but I guess some old version. Should be easy to bump. For any more questions, feel free to open an issue on your (or the llguidance) repo and tag me!
hey @mmoskal, I ran into a hiccup trying to test this out on my Windows machine using MSVC. When running [...] I get a bunch of errors trying to find libllguidance.a. Are .a files created with MSVC? My experience with Rust is nearly zero, but my understanding is that with MSVC we'd see a .lib file created. I adjusted the CMakeLists to the following to get it working for me, but this was strictly a "works on my machine" solution - there might be a cleverer solution out there:
This is an experimental, very draft PR which adds support for the Rust-based llguidance constrained-sampling library. This is mostly meant to elicit comments from users and maintainers on if and how it could be integrated.
`llguidance` provides features similar to llama.cpp grammars (full context-free grammar parsing and JSON schemas); however, it takes a somewhat different approach to parsing - it splits the allowed token set (mask) computation between a lexer and a parser. The lexer uses derivatives of regular expressions, while the parser uses the Earley algorithm. Due to lexer usage and lots of low-level optimizations, llguidance can compute the token mask for 100k tokens in about 1ms for all typical JSON schemas and most other grammars as well. Just as with llama.cpp grammars, there is no significant pre-computation at startup.

`llguidance` can also "fast-forward" tokens: for example, in the case of a JSON schema, after `{"` is generated, the full key name (consisting of a few tokens) can be processed in a parallel prefill step. This is however not yet hooked up to llama.cpp in this patch. If you're interested, this is hooked up in Guidance via the llama.cpp Python bindings.

This patch adds `llama_sampler_init_llg()`, which takes two strings: the grammar type and the grammar. The following types are supported:

- `"regex"` - regular expressions (following Rust regex crate syntax)
- `"json"` or `"json_schema"` - a large subset of JSON schemas (but see issue)
- `"lark"` - context-free grammars in (a subset of) Lark format
- `"llguidance"` or `"guidance"` - internal (JSON-based) format

Supporting llama.cpp grammars as they currently are would be difficult, since they do not distinguish between lexer and parser. The `lark` format is conceptually very similar, though.

I also hacked `common/sampling.cpp` to recognize the `"llg:"` prefix in the grammar string. You can for example pass `"llg:regex:[A-Z ]+"`. I also hacked `common/json-schema-to-grammar.cpp` to just return `"llg:json_schema:" + schema`, so the `-j` option to `llama-cli` and the JSON mode options to `llama-server` use llguidance (when enabled).

Trying it out
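As a rough illustration of wiring the new sampler up from C++ (a sketch assuming the standard `llama_sampler_chain` API; the grammar kind/data strings follow the types listed above, and the helper function name is invented):

```cpp
#include "llama.h"

// Sketch: constrain generation to the regex [A-Z ]+ via llguidance,
// then sample from the masked distribution.
llama_sampler * make_constrained_chain(const llama_vocab * vocab) {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_llg(vocab, "regex", "[A-Z ]+"));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain;
}
```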