support for llguidance grammars #10224

Merged (34 commits) on Feb 2, 2025

Conversation

@mmoskal (Contributor) commented Nov 9, 2024

This is an experimental, very much draft PR which adds support for the Rust-based llguidance constrained-sampling library. It is mostly meant to elicit comments from users and maintainers on whether and how it could be integrated.

llguidance provides features similar to llama.cpp grammars (full context-free grammar parsing and JSON schemas), but it takes a somewhat different approach to parsing: it splits the allowed-token-set (mask) computation between a lexer and a parser. The lexer uses derivatives of regular expressions, while the parser uses the Earley algorithm. Thanks to the lexer and lots of low-level optimizations, llguidance can compute the token mask for a vocabulary of 100k tokens in about 1ms for all typical JSON schemas, and for most other grammars as well. Just as with llama.cpp grammars, there is no significant pre-computation at startup.

llguidance can also "fast-forward" tokens: for example, with a JSON schema, after {" is generated, the full key name (consisting of a few tokens) can be processed in a parallel prefill step. This is, however, not yet hooked up to llama.cpp in this patch. If you're interested, it is hooked up in Guidance via the llama.cpp Python bindings.

This patch adds llama_sampler_init_llg(), which takes two strings: the grammar type and the grammar itself. The following types are supported (a usage sketch follows the list):

  • "regex" - regular expressions (following Rust regex crate syntax)
  • "json" or "json_schema" - a large subset of JSON schemas (but see issue)
  • "lark" - context-free grammars in (a subset of) Lark format
  • "llguidance" or "guidance" - internal (JSON-based) format

Supporting llama.cpp grammars as they currently are would be difficult, since they do not distinguish between the lexer and the parser. The Lark format is conceptually very similar, though.

I also hacked common/sampling.cpp to recognize the "llg:" prefix in the grammar string; you can, for example, pass "llg:regex:[A-Z ]+". I also hacked common/json-schema-to-grammar.cpp to just return "llg:json_schema:" + schema, so the -j option of llama-cli and the JSON-mode options of llama-server use llguidance (when enabled).

Trying it out

# build llguidance
git clone https://github.com/microsoft/llguidance
cd llguidance/parser
cargo build --release
# build llama.cpp with llguidance support
cd ../../llama.cpp
make LLGUIDANCE_PATH=../llguidance/parser/target/release -j llama-server

@vonjackustc

Awesome. Is this a llama.cpp implementation of dottxt-ai/outlines?

@mmoskal (Contributor, Author) commented Nov 12, 2024

No. The approach of llguidance is more like the current llama.cpp grammars, in that both compute the token mask on the fly. Outlines pre-computes the token masks for all states of the automaton resulting from compiling the constraint. This limits the expressiveness of constraints and has high startup costs (though of course near-zero sampling costs).

I added some notes to the llguidance README.

@HanClinto (Collaborator)

Really excited to see this -- great work!

Currently, Guidance has the best JSON schema coverage of all constrained decoding libraries.

Out of curiosity, how does llguidance compare to llama.cpp? Current JSON schema -> GBNF limitations are documented here: https://github.com/ggerganov/llama.cpp/tree/master/grammars#json-schemas--gbnf

I would love to do some speed tests with similarly complex grammars to see how llguidance compares against our stock implementation. Have you happened to run any of those tests yourself?

@mmoskal (Contributor, Author) commented Nov 14, 2024

I believe llguidance's JSON schema coverage will be more comprehensive once guidance-ai/llguidance#48 is merged. We should have more data on this soon.

As for performance, on the example I ran there doesn't seem to be much difference between llama.cpp grammars, llguidance, and unconstrained generation. I think this is because llama.cpp grammars use rejection sampling (resampling), while llguidance currently always computes the full mask (though I want to also allow rejection sampling). @HanClinto, can you point me to some examples where the current llama.cpp grammars have performance issues?
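
To illustrate the difference between the two strategies (a toy C++ sketch, not either library's actual code; grammar_accepts() and pick_token() are hypothetical stand-ins for the per-token grammar check and the sampling step):

#include <cmath>
#include <cstddef>
#include <vector>

using token_t = int;

// Hypothetical stand-ins for the grammar check and the sampling step.
bool    grammar_accepts(token_t t);
token_t pick_token(const std::vector<float> & logits);

// Rejection sampling (roughly what llama.cpp grammars are described as doing):
// sample first, check only the sampled token, and resample on rejection.
token_t sample_rejection(std::vector<float> logits) {
    for (;;) {
        token_t t = pick_token(logits);
        if (grammar_accepts(t)) {
            return t;                    // cheap when most samples are valid
        }
        logits[(size_t) t] = -INFINITY;  // mask the rejected token, try again
    }
}

// Full-mask sampling (what llguidance is described as doing): check every
// token in the vocabulary up front, then sample once from the masked logits.
token_t sample_full_mask(std::vector<float> logits) {
    for (size_t t = 0; t < logits.size(); ++t) {
        if (!grammar_accepts((token_t) t)) {
            logits[t] = -INFINITY;
        }
    }
    return pick_token(logits);
}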

@HanClinto (Collaborator)

@HanClinto, can you point me to some examples where the current llama.cpp grammars have performance issues?

#7810 is one issue where the stacks get so large that we SIGSEGV out (GBNF is available here). But that's CFG, so I'm not sure how useful it would be to you.

Sadly, while I've collected a number of particularly brutal GBNF grammars, I don't have any particularly juicy JSON schema specs for you to try. #7703 has some tricky bits re: required and oneOf both being set (a bit of a conflict / edge case in the spec), but that particular schema seemed to parse just fine.

That said, I'm interested in learning how I can best help to put this through its paces -- I'm excited by the option of having competing grammar engines in llama.cpp!

@HanClinto (Collaborator)

@ochafik, do you have any JSON schemas available that caused the GBNF engine to bog down?

@mmoskal (Contributor, Author) commented Nov 14, 2024

I tried the grammar from #7810 converted to Lark; see xml-lark.txt

I needed to change ::= to :, \x22 to \", and add /.../ around regexes. There is one more optimization: making sure text is a single terminal, which is done by naming it TEXT (upper-case). If you keep it as lower-case text, parsing will be slow and may eventually fail.
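
For illustration, a hypothetical fragment (not the actual grammar from #7810) showing those conversions side by side:

# GBNF original
text  ::= [^<&]+
quote ::= "\x22"

// converted Lark-style grammar: the regex is wrapped in /.../, \x22 becomes
// an escaped quote, and the upper-case name makes TEXT a single lexer terminal
TEXT: /[^<&]+/
quote: "\""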

I ran it with

make LLGUIDANCE_PATH=../llguidance/parser/target/release/  -j10 llama-cli
./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf -p "Here an RFC about email formats in XML:\n" --grammar-file tmp/xml.lark  -n 200

The generation speed with and without the grammar is the same (35 t/s).

The grammar without the TEXT optimization is quite slow (20 t/s) and eventually fails due to the Earley sets growing too large. This is actually a nice test case: we can try to optimize it a bit, but mainly I just want it to fail at runtime instead of taking this long.

@HanClinto (Collaborator)

I tried the grammar from #7810 converted to Lark; see xml-lark.txt

I needed to change ::= to :, \x22 to \", and add /.../ around regexes.

Very nice!!!

Is there no support for hex characters in LLG grammars? It's relatively common in GBNF grammars to use \x00 for null (you can see it in use in json.gbnf, for example), or to specify Unicode ranges.

If I were looking to dive into llguidance, I suppose that feature might make a good first contribution.

There is one more optimization: making sure text is a single terminal, which is done by naming it TEXT (upper-case). If you keep it as lower-case text, parsing will be slow and may eventually fail.

Fascinating. I guess I don't know enough about the vagaries of grammar parsers, but I wonder if making this same change would let the GBNF parser successfully work through this grammar as well? Or is this a quirk of how llguidance handles things?

I ran it with

make LLGUIDANCE_PATH=../llguidance/parser/target/release/  -j10 llama-cli
./llama-cli -m ../models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf -p "Here an RFC about email formats in XML:\n" --grammar-file tmp/xml.lark  -n 200

The generation speed with and without the grammar is the same (35 t/s).

Very nice!!

How does the speed compare if you make the same text / TEXT change in the GBNF version? This feels like it would make a very nice apples-to-apples comparison.

The grammar without the TEXT optimization is quite slow (20 t/s) and eventually fails due to the Earley sets growing too large. This is actually a nice test case: we can try to optimize it a bit, but mainly I just want it to fail at runtime instead of taking this long.

Agreed. Fundamentally, I'm not sure that a stack-based parser is going to be able to parse a grammar like this (though neither do I think a precompiled grammar engine like Outlines would be able to handle it) -- I wonder if the only way to handle this would be an engine that uses backtracking (and so would have some wasteful token generation), but at least it would be able to work its way through it.

The other option that I've considered with grammars that grow this large is to simply do random culling of the tree once it grows above a certain limit (possibly with some heuristic that prioritizes shorter / simpler stacks over longer / more-complex ones). You would lose the advantage of generating across every conceivable valid path, but at least what you generate would still be grammatically correct, and it could actually finish without crashing.

But that's an optimization for another day.

For now, I'm interested in seeing how LLG performance stacks up against the GBNF grammar engine (so far I'm really optimistic!), and secondly in seeing what we can do to make the coupling / separation cleaner. One issue is compilation / linking, and the second is how to specify LLG vs. GBNF.

I actually like the way that you use "llg:" as a special prefix. The alternative would be to specify --grammar as one command line option, and --grammar-llg (or maybe just --llg) as a second command-line option, and whatever parameter is set determines which grammar sampler is used.

A little bit of duct tape is acceptable (especially because I don't currently feel strongly about which path is best), but I would like us to be thinking ahead about the ideal scenario for incorporating multiple grammar engines alongside each other. I'm very interested in hearing others' thoughts on this topic.

@mmoskal (Contributor, Author) commented Nov 14, 2024

Is there no support for hex characters in LLG grammars? It's relatively common in GBNF grammars to use \x00 for null (you can see it in use in json.gbnf, for example), or to specify Unicode ranges.

I currently use JSON string parsing for string literals; it's easy to fix (it's just a Lark-frontend issue). You can use \u0000 with no problem, and similarly for Unicode ranges. For regexes, I use the regex-syntax Rust crate, which does support \x00 and similar (though I think it requires these to be valid UTF-8: \x00 is, but e.g. \xFF isn't).

Created issue guidance-ai/llguidance#54

TL;DR: you can't do that TEXT trick with llama.cpp grammars.

Back in the 1970s, when computers were slow, people figured out that you can first deal with words (also called lexemes, or tokens, not to be confused with LLM tokens) and only then deal with syntax. This is because splitting text into words is cheaper than parsing it. And so regular expressions were used for "lexing" (splitting text into words or lexemes), and context-free grammars were used for the higher-level parsing. This is theoretically unnecessary, since regular languages are a subset of context-free languages; it's just that lexing first and parsing on top of the larger items happens to be quite a bit faster.

Llama.cpp grammars do not have this separation between lexing and parsing, which is definitely cleaner and easier to understand. However, while computers are much faster now, token masking is the one problem where you have to do lots of parsing in a very short time, and I don't think you can do it fast enough without a lexer.

Also, LLM tokens are typically somewhat aligned with lexemes, meaning that when you walk the prefix tree of all tokens, you do lexer operations 99.5+% of the time, and those are much cheaper than parser operations.

On the plus side, virtually all programming language definitions (including JSON) have this lexer/parser separation.
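
To make the split concrete, here is a hypothetical Lark-style sketch of a tiny JSON subset (upper-case names are lexer terminals matched by regular expressions; lower-case names are parser rules handled by the Earley parser):

start: value
value: object | STRING | NUMBER
object: "{" (pair ("," pair)*)? "}"
pair: STRING ":" value
STRING: /"[^"]*"/
NUMBER: /-?[0-9]+(\.[0-9]+)?/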

Oh, and BTW: if you have a lexer (and you mostly do), the size of the linked grammar is absolutely not a problem. We've been successfully running 4 MB JSON schemas through LLG.

@ibehnam (Contributor) commented Jan 8, 2025

I'm really looking forward to seeing this merged. So far I've used llama.cpp grammars and Guidance, but I found it redundant that we have to convert EBNF grammars to Guidance's grammar-function syntax; with complex grammars this gets ugly quickly.

@mmoskal (Contributor, Author) commented Jan 22, 2025

We have just released a large JSON Schema benchmark and a paper. Of particular interest might be the isolated mask-generation benchmarks, comparing LLGuidance, Outlines, XGrammar, and llama.cpp grammars.


If there is any interest in merging this, I can try to get this PR back into shape. It would still likely require building llguidance with cargo, which I understand is a big hurdle.

@ggerganov (Owner) left a comment

The conditional API in llama.h is not an option. I think a better approach is to implement this as part of libcommon; the llama_sampler API allows user code to implement new samplers. This way, llguidance support will ship with the llama.cpp examples and people will be able to use it. If we see that it is really useful, we can later think of ways to integrate it more tightly with libllama.
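
For reference, the extension point being suggested looks roughly like this (a sketch against the llama_sampler_i interface in llama.h; the struct layout and field order here are assumptions and may differ between versions):

#include "llama.h"

#include <cmath>

// Toy sampler that masks out a single banned token in apply(); a grammar
// sampler does the same thing with a grammar-derived mask.
struct ban_ctx {
    llama_token banned;
};

static const char * ban_name(const struct llama_sampler * /*smpl*/) {
    return "ban-one";
}

static void ban_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    const auto * ctx = (const ban_ctx *) smpl->ctx;
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].id == ctx->banned) {
            cur_p->data[i].logit = -INFINITY;
        }
    }
}

static void ban_free(struct llama_sampler * smpl) {
    delete (ban_ctx *) smpl->ctx;
}

static struct llama_sampler_i ban_iface = {
    /* .name   = */ ban_name,
    /* .accept = */ nullptr,
    /* .apply  = */ ban_apply,
    /* .reset  = */ nullptr,
    /* .clone  = */ nullptr,
    /* .free   = */ ban_free,
};

struct llama_sampler * make_ban_sampler(llama_token banned) {
    return new llama_sampler { &ban_iface, new ban_ctx { banned } };
}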

@github-actions bot added the documentation and build labels on Jan 26, 2025
@mmoskal (Contributor, Author) commented Jan 26, 2025

@ggerganov I re-did the code structure to move it to libcommon, and force-pushed to clean up the history.

I also did a general cleanup (including detecting LLG grammars via a marker string that is valid in them) and added docs.

I now build with CMake's ExternalProject_Add(); it seems to work quite well, provided cargo is installed.

@mmoskal marked this pull request as ready for review on January 26, 2025
@mmoskal requested a review from ggerganov on January 26, 2025
@github-actions bot added the testing label on January 26, 2025
@mmoskal (Contributor, Author) commented Jan 26, 2025

@ggerganov sorry for the build failures; I just ran the build on Linux myself, should be good now.

@mmoskal (Contributor, Author) commented Jan 27, 2025

FWIW I'm working on more cross-compatibility between the Lark syntax used in llguidance and GBNF. These are changes only in llguidance though, so nothing prevents review and merging of this PR.

@mmoskal (Contributor, Author) commented Jan 28, 2025

I ended up not adding GBNF compatibility to the Lark syntax; they conflict too much. Instead, I added a conversion script. This has the advantage that it tries to identify which symbols are terminals and which are non-terminals. It works for all samples in llama.cpp/grammars/*.gbnf.

I also added more docs about the Lark syntax in the llguidance repo.

Comment on lines 105 to 109

#ifdef LLAMA_USE_LLGUIDANCE
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
const char * grammar_kind, const char * grammar_data);
#endif // LLAMA_USE_LLGUIDANCE
@ggerganov (Owner):

This ifdef should be moved inside the implementation. Generally, we want to avoid ifdefs in the headers.

@mmoskal (Contributor, Author) commented Jan 29, 2025

@ggerganov I made it return nullptr if not enabled; let me know if you meant something else.

I also linked the conversion script from llguidance.md.
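
On the first point, a minimal sketch of the stub pattern (the actual code in the PR may differ in details):

#ifdef LLAMA_USE_LLGUIDANCE
// ... real implementation backed by the llguidance library ...
#else
struct llama_sampler * llama_sampler_init_llg(const llama_vocab *,
                                              const char *, const char *) {
    LOG_WRN("llguidance (LLAMA_LLGUIDANCE) is not enabled in this build\n");
    return nullptr;
}
#endif // LLAMA_USE_LLGUIDANCE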

@ggerganov (Owner) left a comment

Thank you for your patience; I will merge this after the #9639 PR is merged, and will tag you for support in case any issues appear with the functionality. Thanks!

Comment on lines 1 to 6
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"

#include <cmath>
@ggerganov (Owner):

Remove unused headers:

Suggested change:

-#include "common.h"
 #include "sampling.h"
 #include "log.h"
 #include "llama.h"
-#include <cmath>

@mmoskal (Contributor, Author):

<cmath> is actually required on Linux for the INFINITY constant; I did remove common.h though, thank you!

CMakeLists.txt (outdated)

@@ -79,6 +79,7 @@ option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})

 # 3rd party libs
 option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
+option(LLAMA_LLGUIDANCE "llama: build LLGuidance library for structured output" OFF)
@ggerganov (Owner):

It should somehow be indicated that this option is used by the common library. I am not sure what the best way would be; maybe rename it to LLAMA_COMMON_LLGUIDANCE? But even if we leave it like this, it's OK - just making a note.

@mmoskal (Contributor, Author):

I re-worded the help string; maybe that is enough for now?

@ggerganov (Owner)

Waiting for things to stabilize after the #9639 change and will merge after that. There are also some merge conflicts to be resolved here.

@mmoskal (Contributor, Author) commented Jan 31, 2025

@ggerganov I fixed the merge conflicts and also updated llguidance to the latest 0.6.12; it supports the {m,n} repetition syntax and has a few bug fixes.

@ggerganov merged commit ff22770 into ggerganov:master on Feb 2, 2025 (46 checks passed)
@ShelbyJenkins (Contributor)

@mmoskal I'd like to add support for this to https://github.com/ShelbyJenkins/llm_client.

It's a Rust crate that wraps llama.cpp (think Ollama), but focuses on using LLMs for control flow rather than as chat bots.

If I'm understanding this thread correctly, I just need to:

  • Add LLAMA_LLGUIDANCE to the build scripts.
  • Convert the existing GBNF grammars to Lark.

In the future: I am currently working on a GUI that can be distributed as a binary, but I plan on having llama.cpp built for the platform of the user's device. Am I understanding correctly that your integration requires the Rust compiler to build llama.cpp?

@mmoskal deleted the llg branch on February 3, 2025
@mmoskal (Contributor, Author) commented Feb 3, 2025

@ShelbyJenkins definitely yes on the grammar conversion; there might be some things you were doing that are now easier (e.g., the llguidance syntax supports embedded JSON and special tokens).

As for the integration, you have a few options:

  • Just enable the feature as you're saying; you may end up with a duplicate copy of various Rust crates (which may be a good thing from a versioning perspective).
  • Don't enable the feature, but re-implement the llguidance.cpp file either in Rust or by copying the C++ into your crate, and then plug it in using the external llama.cpp APIs; this way the user might not need Rust to build llama.cpp locally.
  • As above, but take over sampling in Rust.

Note that mistral.rs, which you also seem to be using, already has llguidance integrated, though I guess at some old version. Should be easy to bump.

For any more questions, feel free to open an issue on your (or the llguidance) repo and tag me!

@phil-scott-78

Hey @mmoskal, I ran into a hiccup trying to test this out on my Windows machine using MSVC: when running the build, I get a bunch of errors about being unable to find libllguidance.a.

Are .a files even created with MSVC? My experience with Rust is nearly zero, but my understanding is that with MSVC we'd see a .lib file created instead.

I adjusted the CMakeLists to the following to get it working for me, but this was strictly a "works on my machine" solution; there might be a cleverer one out there:

if (LLAMA_LLGUIDANCE)
    include(ExternalProject)
    set(LLGUIDANCE_SRC ${CMAKE_BINARY_DIR}/llguidance/source)
    set(LLGUIDANCE_PATH ${LLGUIDANCE_SRC}/target/release)

    # Set the correct library file extension based on platform
    if (WIN32)
        set(LLGUIDANCE_LIB_NAME "llguidance.lib")
        # Add Windows-specific libraries
        set(PLATFORM_LIBS 
            ws2_32    # Windows Sockets API
            userenv   # For GetUserProfileDirectoryW
            ntdll     # For NT functions
            bcrypt    # For BCryptGenRandom
        )
    else()
        set(LLGUIDANCE_LIB_NAME "libllguidance.a")
        set(PLATFORM_LIBS "")
    endif()

    ExternalProject_Add(llguidance_ext
        GIT_REPOSITORY https://github.com/guidance-ai/llguidance
        GIT_TAG ced1c9023d47ec194fa977932d35ce65c2ebfc09
        PREFIX ${CMAKE_BINARY_DIR}/llguidance
        SOURCE_DIR ${LLGUIDANCE_SRC}
        BUILD_IN_SOURCE TRUE
        CONFIGURE_COMMAND ""
        BUILD_COMMAND cargo build --release
        INSTALL_COMMAND ""
        BUILD_BYPRODUCTS ${LLGUIDANCE_PATH}/${LLGUIDANCE_LIB_NAME} ${LLGUIDANCE_PATH}/llguidance.h
        UPDATE_COMMAND ""
    )
    target_compile_definitions(${TARGET} PUBLIC LLAMA_USE_LLGUIDANCE)

    add_library(llguidance STATIC IMPORTED)
    set_target_properties(llguidance PROPERTIES IMPORTED_LOCATION ${LLGUIDANCE_PATH}/${LLGUIDANCE_LIB_NAME})
    add_dependencies(llguidance llguidance_ext)

    target_include_directories(${TARGET} PRIVATE ${LLGUIDANCE_PATH})
    # Add platform libraries to the main target
    set(LLAMA_COMMON_EXTRA_LIBS ${LLAMA_COMMON_EXTRA_LIBS} llguidance ${PLATFORM_LIBS})
endif ()
