support for llguidance grammars #10224
Conversation
Awesome. Is it a llama.cpp implementation of dottxt-ai/outlines?
No. The approach of llguidance is more like the current llama.cpp grammars, in that both compute the token mask on the fly. Outlines pre-computes the token masks for all states of the automaton resulting from compiling the constraint. This limits the expressiveness of constraints and has high startup costs (though of course near-zero sampling costs). I added some notes to the llguidance readme.
Really excited to see this -- great work!
Out of curiosity, how does llguidance compare to llama.cpp? Current JSON schema -> GBNF limitations are documented here: https://github.com/ggerganov/llama.cpp/tree/master/grammars#json-schemas--gbnf

I would love to do some speed tests with similarly complex grammars to see how llguidance compares against our stock implementation. Have you happened to run any of those tests yourself?
I believe llguidance JSON schema coverage will be more comprehensive once guidance-ai/llguidance#48 is merged. We should have more data on this soon. As for performance, on the example I ran there doesn't seem to be much difference between llama.cpp grammar, llguidance, and unconstrained. I think this is due to llama.cpp grammars using rejection sampling (resampling). Llguidance currently always computes the full mask (though I want to also allow rejection sampling). @HanClinto can you point me to some examples where the current llama.cpp grammars have performance issues?
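To make the distinction concrete, here is a minimal C++ sketch of the two strategies. The grammar check and the sampler are toy stand-ins (an "accepts even token ids" grammar and greedy argmax), not actual llama.cpp or llguidance code:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Toy stand-ins, for illustration only: a "grammar" that accepts even
// token ids, and greedy (argmax) "sampling".
static bool grammar_accepts(int32_t token) { return token % 2 == 0; }

static int32_t sample_from(const std::vector<float> & logits) {
    int32_t best = 0;
    for (int32_t i = 1; i < (int32_t) logits.size(); i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

// Rejection sampling (roughly how llama.cpp grammars behave): sample
// first, check only the sampled token; if invalid, forbid it and resample.
// Near-free when the model mostly proposes grammar-valid tokens.
// Assumes the grammar accepts at least one token.
static int32_t sample_rejection(std::vector<float> logits) {
    for (;;) {
        int32_t tok = sample_from(logits);
        if (grammar_accepts(tok)) return tok;
        logits[tok] = -INFINITY;
    }
}

// Full mask (roughly the llguidance approach): test every vocabulary
// token on every step, so each per-token check must be extremely cheap.
static int32_t sample_full_mask(std::vector<float> logits) {
    for (int32_t tok = 0; tok < (int32_t) logits.size(); tok++) {
        if (!grammar_accepts(tok)) logits[tok] = -INFINITY;
    }
    return sample_from(logits);
}
```

This is why the two perform similarly when the model rarely proposes invalid tokens: rejection sampling then does almost no grammar work, while the full mask pays a fixed per-step cost that llguidance keeps low via its lexer.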
#7810 is one issue where the stacks get so large that we SIGSEGV out (GBNF is available there). But that's CFG, so I'm not sure how useful it would be to you. Sadly, while I've collected a number of particularly brutal GBNF grammars, I don't have any particularly juicy JSON schema specs for you to try. #7703 has some tricky bits re: [...]

That said, I'm interested in learning how I can best help to put this through its paces -- I'm excited by the option of having competing grammar engines in llama.cpp!
@ochafik, do you have any JSON schemas available that caused the GBNF engine to bog down?
I tried the grammar from #7810 converted to lark; see xml-lark.txt. I needed to change [...]. I run it with [...]
The generation with and without grammar is the same (35t/s). The grammar without the TEXT optimization is quite slow (20t/s) and eventually fails due to Earley sets growing too large. This is actually a nice test case - we can try to optimize it a bit, but mainly I just want to fail it at runtime instead of taking this long.
Very nice!!! Is there no support for hex characters in LLG grammars? It's relatively common in GBNF grammars to use \x00 for null (you can see it in use in json.gbnf as an example), or to specify Unicode ranges. If I were looking to dive into llguidance, I suppose that feature might make a good first contribution.
Fascinating. I guess I don't know enough about the vagaries of grammar parsers, but I wonder if making this same change would let the GBNF parser successfully work through this grammar as well? Or is this a quirk of how llguidance handles things?
Very nice!! How does the speed compare if you make the same
Agreed. Fundamentally, I'm not sure that a stack-based parser is going to be able to parse a grammar like this (though neither do I think a precompiled grammar engine like Outlines would be able to handle it). I wonder if the only way to handle this would be an engine that uses backtracking (and so would have some wasteful token generation), but at least it would be able to work its way through it.

The other option that I've considered for grammars that grow this large is to simply do random culling of the tree once it grows above a certain limit (possibly with some heuristic that prioritizes shorter / simpler stacks over longer / more complex ones). You would lose the advantage of generating across every conceivable valid path, but at least what you generate would still be grammatically correct, and it could actually finish without crashing. But that's an optimization for another day.

For now, I'm interested in seeing how LLG performance stacks up against the GBNF grammar engine (so far I'm really optimistic!) and, secondly, seeing what we can do to make the coupling / separation cleaner. One issue is compilation / linking, and the second is how to specify LLG vs. GBNF. I actually like the way that you use "llg:" as a special prefix; see the dispatch sketch below. The alternative would be to specify [...]

A little bit of duct tape is acceptable (especially because I don't currently feel strongly about which path is best), but I would like us to be thinking ahead about the ideal scenario for incorporating multiple grammar engines alongside each other. I'm very interested in hearing others' thoughts on this topic.
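The prefix scheme under discussion amounts to a small dispatch step when the sampler is constructed. A hypothetical sketch (the function and enum names are invented for illustration; only the `"llg:<kind>:<data>"` format comes from the PR):

```cpp
#include <string>

enum class grammar_engine { GBNF, LLG };

// Hypothetical dispatch on the grammar string: "llg:<kind>:<data>"
// selects llguidance; anything else falls through to GBNF as before.
static grammar_engine pick_engine(const std::string & grammar,
                                  std::string & kind, std::string & data) {
    const std::string prefix = "llg:";
    if (grammar.rfind(prefix, 0) == 0) {
        const size_t colon = grammar.find(':', prefix.size());
        if (colon == std::string::npos) {
            // malformed: treat everything after "llg:" as the kind
            kind = grammar.substr(prefix.size());
            data.clear();
        } else {
            kind = grammar.substr(prefix.size(), colon - prefix.size());
            data = grammar.substr(colon + 1);
        }
        return grammar_engine::LLG;
    }
    data = grammar; // plain GBNF text, unchanged
    return grammar_engine::GBNF;
}
```

For example, `pick_engine("llg:regex:[A-Z ]+", kind, data)` yields `kind == "regex"` and `data == "[A-Z ]+"`.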
I currently use JSON string parsing for string literals; it's easy to fix (it's just a lark frontend issue). You can use [...]. Created issue guidance-ai/llguidance#54.

TL;DR: you can't do that TEXT trick with llama.cpp grammars.

Back in the 1970s, when computers were slow, people figured out that you can first deal with words (also called tokens (not to be confused with LLM tokens) or lexemes), and only then deal with syntax. This is because splitting text into words is cheaper than parsing it. And so regular expressions were used for "lexing" (splitting into words or lexemes) and context-free grammars were used for the higher-level parsing. Now, this is theoretically unnecessary, since regular languages are a subset of context-free languages. It's just that doing lexing first and parsing on top of larger items happens to be quite a bit faster.

Llama.cpp grammars do not have this separation between lexing and parsing, which is definitely cleaner and easier to understand. However, while computers are much faster now, token masking is this specific problem where you have to do lots of parsing in a very short time. I don't think you can do it fast enough without a lexer. Also, LLM tokens are typically somewhat aligned with lexemes, meaning that when you walk the prefix tree of all tokens, you do lexer operations 99.5+% of the time, and they are much cheaper. On the plus side, virtually all programming language definitions (including JSON) have this lexer/parser separation.

Oh, and BTW, if you have a lexer (and you mostly do), the size of the linked grammar is absolutely not a problem. We've been successfully running 4MB JSON schemas through LLG.
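To make the lexer/parser split concrete, here is a toy C++ sketch (illustrative only; llguidance's actual code is Rust and far more sophisticated). The "grammar" is just space-separated runs of digits; the point is that the per-byte work is a trivial DFA step, and the expensive parser is consulted only at lexeme boundaries:

```cpp
#include <string>
#include <vector>

// Stub for the expensive parser step (an Earley-parser advance in
// llguidance); invoked only when a whole lexeme has been scanned.
static bool parser_accepts_lexeme(const std::string & lexeme) {
    return !lexeme.empty(); // placeholder for real context-free parsing
}

// Cheap per-byte scan: most bytes of most candidate tokens are resolved
// by the lexer (here, a trivial DFA for the lexeme [0-9]+ with spaces
// as separators) without ever touching the parser.
static bool token_allowed(const std::string & text) {
    std::string lexeme;
    for (char c : text) {
        if (c >= '0' && c <= '9') {
            lexeme += c;                      // DFA stays inside [0-9]+
        } else if (c == ' ' && !lexeme.empty()) {
            if (!parser_accepts_lexeme(lexeme)) {
                return false;                 // rare slow path at lexeme boundary
            }
            lexeme.clear();
        } else {
            return false;                     // DFA dead state: reject token
        }
    }
    return true; // token may still extend into a valid lexeme
}

// The token mask is just token_allowed() over the vocabulary. Real
// implementations walk a shared prefix tree, so common prefixes of the
// ~100k tokens are scanned once rather than once per token.
static std::vector<bool> compute_token_mask(const std::vector<std::string> & vocab) {
    std::vector<bool> mask(vocab.size());
    for (size_t i = 0; i < vocab.size(); i++) {
        mask[i] = token_allowed(vocab[i]);
    }
    return mask;
}
```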
I really look forward to seeing this merged. So far I've used llama.cpp grammars and guidance, but I found it redundant that we have to convert EBNF grammars to guidance's grammar-function syntax; with complex grammars this gets ugly quickly.
We have just released a large JSON Schema benchmark and a paper. Of particular interest might be the isolated mask-generation benchmarks comparing LLGuidance, Outlines, XGrammar, and llama.cpp grammars. If there is any interest in merging this, I can try to get this PR back in shape - it would still likely require building llguidance with cargo, which I understand is a big hurdle.
The conditional API in `llama.h` is not an option. I think a better approach is to implement this as part of the libcommon - the `llama_sampler` API allows user code to implement new samplers. This way the llguidance support will ship with the `llama.cpp` examples and people will be able to use it. If we see that it is really useful, later on we can think of ways to integrate it more tightly with `libllama`.
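For reference, the `llama_sampler` interface referred to here is a small vtable of callbacks plus an opaque context pointer. A rough sketch of what a user-code sampler looks like (the callback set follows `llama.h`'s `llama_sampler_i` as of this PR; the bodies are placeholders, not the actual llguidance integration, and details may differ across llama.cpp versions):

```cpp
#include "llama.h"

// Placeholder per-sampler state; the real integration would hold the
// compiled llguidance constraint here.
struct llg_ctx {
    const llama_vocab * vocab;
};

static const char * llg_name(const struct llama_sampler * /*smpl*/) {
    return "llguidance";
}

static void llg_accept(struct llama_sampler * /*smpl*/, llama_token /*token*/) {
    // advance the grammar state by the token that was just committed
}

static void llg_apply(struct llama_sampler * /*smpl*/, llama_token_data_array * /*cur_p*/) {
    // compute the token mask and set logits of disallowed tokens to -INFINITY
}

static void llg_reset(struct llama_sampler * /*smpl*/) {
    // rewind the grammar to its start state
}

static struct llama_sampler * llg_clone(const struct llama_sampler * smpl);
static void                   llg_free (struct llama_sampler * smpl);

static struct llama_sampler_i llg_iface = {
    /* .name   = */ llg_name,
    /* .accept = */ llg_accept,
    /* .apply  = */ llg_apply,
    /* .reset  = */ llg_reset,
    /* .clone  = */ llg_clone,
    /* .free   = */ llg_free,
};

struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
        const char * /*grammar_kind*/, const char * /*grammar_data*/) {
    return new llama_sampler { /* .iface = */ &llg_iface, /* .ctx = */ new llg_ctx { vocab } };
}

static struct llama_sampler * llg_clone(const struct llama_sampler * smpl) {
    const llg_ctx * ctx = (const llg_ctx *) smpl->ctx;
    return llama_sampler_init_llg(ctx->vocab, nullptr, nullptr);
}

static void llg_free(struct llama_sampler * smpl) {
    delete (llg_ctx *) smpl->ctx;
    delete smpl;
}
```

Because the whole thing lives behind function pointers, libcommon can ship the llguidance sampler without `libllama` knowing anything about it.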
@ggerganov I re-did the code structure to move it to libcommon, and force-pushed to clean up history. I also generally cleaned up (including detection of llg grammars with a string that is valid in them) and added docs. I now build with cmake's ExternalProject_Add() - seems to work quite well, provided [...]
@ggerganov sorry for the build failures; I just ran the build on Linux myself, should be good now.
FWIW I'm working on more cross-compatibility between the Lark syntax used in llguidance and GBNF. These are only changes in llguidance though, so nothing is preventing review and merge of this PR.
I ended up not adding GBNF compat to the Lark syntax - they conflict too much. Instead, I added a conversion script. This has the advantage that it tries to identify which symbols are terminals and which are non-terminals. It works for all samples in [...]. I also added more docs in the llguidance repo about the lark syntax.
common/sampling.h
Outdated
```cpp
#ifdef LLAMA_USE_LLGUIDANCE
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
        const char * grammar_kind, const char * grammar_data);
#endif // LLAMA_USE_LLGUIDANCE
```
This `ifdef` should be moved inside the implementation. Generally, we want to avoid `ifdef`s in the headers.
@ggerganov I made it return nullptr if not enabled, let me know if you meant something else. I also linked the conversion script from [...]
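A sketch of the resulting pattern (assumed shape, not the exact committed code): the header declares the function unconditionally, and only the disabled stub lives behind the flag:

```cpp
// common/sampling.h (sketch): declared unconditionally, no ifdef.
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * vocab,
        const char * grammar_kind, const char * grammar_data);

// common/llguidance.cpp (sketch): only the definition is guarded; the
// stub logs and returns nullptr when llguidance is not compiled in.
#ifndef LLAMA_USE_LLGUIDANCE
struct llama_sampler * llama_sampler_init_llg(const llama_vocab * /*vocab*/,
        const char * /*grammar_kind*/, const char * /*grammar_data*/) {
    LOG_ERR("llguidance is not enabled - rebuild with -DLLAMA_LLGUIDANCE=ON\n");
    return nullptr;
}
#endif // LLAMA_USE_LLGUIDANCE
```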
Thank you for the patience - will merge this after the #9639 PR is merged. Will be tagging you for support in case any issues appear with the functionality. Thanks!
common/llguidance.cpp
Outdated
#include "common.h" | ||
#include "sampling.h" | ||
#include "log.h" | ||
#include "llama.h" | ||
|
||
#include <cmath> |
Remove unused headers:

```diff
-#include "common.h"
 #include "sampling.h"
 #include "log.h"
 #include "llama.h"
-#include <cmath>
```
`<cmath>` is actually required on Linux for the `INFINITY` constant; I did remove `common.h` though, thank you!
CMakeLists.txt
Outdated
```diff
@@ -79,6 +79,7 @@ option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})

 # 3rd party libs
 option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
+option(LLAMA_LLGUIDANCE "llama: build LLGuidance library for structured output" OFF)
```
It should somehow be indicated that this option is used by the `common` library. I am not sure what would be the best way. Maybe rename it to `LLAMA_COMMON_LLGUIDANCE`? But even if we leave it like this, it's ok - just making a note.
I re-worded the help string, maybe that is enough for now?
Waiting for things to stabilize after the #9639 change and will merge after. There are also some merge conflicts to be resolved here.
@ggerganov I fixed the merge conflicts and also updated llguidance to the latest 0.6.12 - it supports [...]
@mmoskal I'd like to add support for this to https://github.com/ShelbyJenkins/llm_client. It's a Rust crate that wraps llama.cpp (think Ollama), but focuses on using LLMs as control flow vs. chatbots. If I'm understanding this thread, I just need to:
In the future: |
@ShelbyJenkins definitely yes on grammar conversion; there might be some things you were doing that are now easier (e.g. the llguidance syntax supports embedded JSON and special tokens). As for the integration, you have a few options:
Note that mistral.rs, which you also seem to be using, already has llguidance integrated, but I guess some old version. Should be easy to bump. For any more questions, feel free to open an issue on your (or the llguidance) repo and tag me!
hey @mmoskal, I ran into a hiccup trying to test this out on my Windows machine using MSVC. When running [...] I get a bunch of errors trying to find libllguidance.a. Are .a files created with MSVC? My experience with Rust is nearly zero, but my understanding is that with MSVC we'd see a .lib file created. I adjusted the CMakeLists to the following to get it working for me, but this was strictly a "works on my machine" solution - there might be a cleverer solution out there:
This is an experimental, very draft PR which adds support for the Rust-based llguidance constrained-sampling library. This is mostly meant to elicit comments from users and maintainers on if and how it could be integrated.
`llguidance` provides features similar to llama.cpp grammars (full context-free grammar parsing and JSON schemas); however, it takes a somewhat different approach to parsing - it splits the allowed token set (mask) computation between a lexer and a parser. The lexer uses derivatives of regular expressions, while the parser uses the Earley algorithm. Due to lexer usage and lots of low-level optimizations, llguidance can compute the token mask for 100k tokens in about 1ms for all typical JSON schemas and most other grammars as well. Just as with llama.cpp grammars, there is no significant pre-computation at startup.

`llguidance` can also "fast-forward" tokens: for example, in the case of a JSON schema, after `{"` is generated, the full key name (consisting of a few tokens) can be processed in a parallel prefill step. This is however not yet hooked up to llama.cpp in this patch. If you're interested, this is hooked up in Guidance via the llama.cpp Python bindings.

This patch adds `llama_sampler_init_llg()`, which takes two strings: the grammar type and the grammar. The following types are supported:

- `"regex"` - regular expressions (following Rust regex crate syntax)
- `"json"` or `"json_schema"` - a large subset of JSON schemas (but see issue)
- `"lark"` - context-free grammars in (a subset of) Lark format
- `"llguidance"` or `"guidance"` - internal (JSON-based) format

Supporting llama.cpp grammars as they currently are would be difficult, since they do not distinguish between lexer and parser. The `lark` format is conceptually very similar, though.

I also hacked `common/sampling.cpp` to recognize the `"llg:"` prefix in the grammar string. You can for example pass `"llg:regex:[A-Z ]+"`. I also hacked `common/json-schema-to-grammar.cpp` to just return `"llg:json_schema:" + schema`, so the `-j` option to `llama-cli` and the JSON mode options to `llama-server` use llguidance (when enabled).

Trying it out
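As a rough illustration of wiring the new sampler up from C++ (a sketch assuming the standard `llama_sampler_chain` API; the grammar kind/data strings follow the types listed above, and the helper function name is invented):

```cpp
#include "llama.h"

// Sketch: constrain generation to the regex [A-Z ]+ via llguidance,
// then sample from the masked distribution.
llama_sampler * make_constrained_chain(const llama_vocab * vocab) {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_llg(vocab, "regex", "[A-Z ]+"));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain;
}
```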