Unify Model Interfaces #695
Replies: 3 comments
-
What bug?
The difference between llama.cpp and transformers was temporary; it was quicker to integrate via logits processors than fixing the existing implementation. Eventually it will use
That could easily be added. However, all model calls in Outlines are synchronous for now, so we wouldn't benefit from vLLM's continuous batching feature.
Aren't you moving complexity from one part of the codebase to another? As long as
I am not opposed to using logits processors in We can always "clean" the
Ok if there's an easy way to do this.
It does as far as the user is concerned.
That's planned. Again, using
How does that work with OpenAI?
I'm fine with that, but we'll need to do something to handle the ever-growing test suite duration.
-
Ultimately, we have such diversity because the backends used by the different models are different. The main

My first suggestion is to drop API-based models. This is a tough one, but whatever direction we choose to go with Outlines there will always be some level of discrepancy between what we can do with, say, OpenAI and with OSS models. At the end of the day we will never be able to reconcile the irreconcilable, and there are good libraries out there, such as instructor, that handle API-based models perfectly fine.

My second suggestion comes from my experience with another multiple-backend library, and from discussions with the author of outlines-mlx. We should dispatch functions based on the backend used, and not on the type of model. We would be able to support any backend in In
Every other function would have to be implemented for each backend. Of course, feature parity between the different backends cannot be guaranteed, but such is life in a world with many tensor libraries. When it comes to model integrations we will need to design a clean(er) interface that every model needs to obey, but that shouldn't be too difficult.
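As a rough illustration of what dispatching on the backend could look like, here is a minimal sketch that keys on the array type rather than on the model class; `bias_logits` is a made-up example function, not existing Outlines code:

```python
# Sketch only: dispatch a generic operation on the backend's array type
# instead of on the model class. `bias_logits` is a hypothetical example.
from functools import singledispatch

import numpy as np
import torch


@singledispatch
def bias_logits(logits, bias):
    raise NotImplementedError(f"No implementation for backend type {type(logits)}")


@bias_logits.register
def _(logits: torch.Tensor, bias) -> torch.Tensor:
    # torch backend (transformers, exllamav2, mamba, ...)
    return logits + torch.as_tensor(bias, dtype=logits.dtype, device=logits.device)


@bias_logits.register
def _(logits: np.ndarray, bias) -> np.ndarray:
    # numpy backend (what llama-cpp-python hands to its logits processors)
    return logits + np.asarray(bias, dtype=logits.dtype)
```

With this pattern, a project like outlines-mlx could register its own overload for mlx arrays without touching the shared code.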
-
The fact that https://github.com/outlines-dev/outlines/blob/main/outlines/generate/regex.py#L49-L53
Not really; the idea is to give their tokenizers and models identical interfaces without changing the SequenceGenerator logic at all, but rather to seamlessly allow their use within SequenceGenerator.
It will allow us to simplify sequence_generator so that it only applies logits processors. Applying any type of repetition penalty, an automata-based constrainer, and in the future things like mirostat and other logit biasing, will be as simple as passing a union of logits processors. As currently designed, sequence_generator needs to be aware of the logic for handling FSMs, and would also need to be aware of the logic for handling repetition penalties, etc.
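To make that concrete, here is a minimal sketch of what "sequence_generator only applies logits processors" could look like; the `LogitsProcessor` protocol and `apply_processors` helper below are illustrative, not existing Outlines classes:

```python
# Sketch: sequence_generator reduces the logits through an arbitrary list of
# processors (FSM constrainer, repetition penalty, mirostat, ...) without
# knowing anything about their internals.
from typing import Protocol, Sequence

import torch


class LogitsProcessor(Protocol):
    def __call__(self, input_ids: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        ...


def apply_processors(
    processors: Sequence[LogitsProcessor],
    input_ids: torch.Tensor,
    logits: torch.Tensor,
) -> torch.Tensor:
    for processor in processors:
        logits = processor(input_ids, logits)
    return logits
```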
It's easy for vLLM; however, for
It doesn't; as I mentioned, we're "Ignoring OpenAI / Azure". OpenAI will remain a special case; all other locally-run inference engines would share interfaces.
Yes, GPU integration tests would need a
This is a mild inconvenience IMO; the only cost is wrapping llamacpp with logic that ensures it uses torch tensors. I wrote some logic for this in the base logits processor, though it should be moved to the model itself. Again, what's important is ensuring models and tokenizers have identical interfaces, so that sequence_generator can use any model or tokenizer without considering their implementation details.
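For illustration, the tensor normalisation could live in a base class roughly like the sketch below; this is an assumption about the shape of the code, not the exact logic currently in the base logits processor:

```python
# Sketch: llama-cpp-python passes numpy arrays to its logits processors, so a
# base class can convert to torch tensors on the way in and convert back on
# the way out, letting subclasses work with torch only.
import numpy as np
import torch


class BaseLogitsProcessor:
    def process_logits(
        self, input_ids: torch.Tensor, logits: torch.Tensor
    ) -> torch.Tensor:
        raise NotImplementedError

    def __call__(self, input_ids, logits):
        received_numpy = isinstance(logits, np.ndarray)
        torch_input_ids = torch.as_tensor(np.asarray(input_ids))
        torch_logits = torch.as_tensor(logits)
        processed = self.process_logits(torch_input_ids, torch_logits)
        # Return the same array type the caller gave us.
        return processed.cpu().numpy() if received_numpy else processed
```

Moving this conversion into the llamacpp model itself, as suggested above, would keep the processors themselves backend-agnostic.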
They will never be able to fully benefit from the features available in Outlines. They probably should be dropped, but only after understanding user needs.
This is exactly what I want to avoid. We should have no special logic for any inference engine anywhere outside of its
-
What behavior of the library made you think about the improvement?
Outlines supports 5 inference engines (Ignoring OpenAI / Azure). We should ensure their interfaces and behaviors are identical. We are accumulating technical debt because of their inconsistencies.
For example, I'm hesitant to implement #644 / #647 because I would have to implement it for three distinct interfaces. Another example is the fact that #614 doesn't work with vLLM or LlamaCpp. Additionally, `outlines.generate.regex_llamacpp` exists to enforce a sampler constraint on `llamacpp`; however, we have a bug resulting from the fact that `outlines.generate.json_llamacpp` doesn't exist and doesn't enforce the same sampler constraint.

A breakdown of the discrepancies follows.
Logit modification:
- Logits are modified by `SequenceGenerator` utilizing a passed FSM.
- Logits processors in `outlines.models.llamacpp`.
- Logits processors in `outlines.serve.vllm`; vLLM's logits processors repeat substantial portions of LlamaCpp's logits processors.

Sequence Generation:
- `model()` generates the logits, and `SequenceGenerator` samples and manages the generation.
- `SequenceGenerator` is not used; `outlines.generate.regex_llamacpp` patches the model to apply logits processors.
- There is no `outlines.models.vllm`; Outlines provides a web server and the logits processors, and `vllm` does all the work.

Tokenizer:
- `outlines.models.tokenizer.Tokenizer` (a sketch of a shared tokenizer interface of this kind follows below).
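For reference, a shared tokenizer interface along the lines of `outlines.models.tokenizer.Tokenizer` might look roughly like the sketch below; the exact attributes and methods shown are assumptions for illustration, not the class's actual definition:

```python
# Sketch of a backend-agnostic tokenizer interface; the real
# outlines.models.tokenizer.Tokenizer may expose a different set of members.
from typing import Dict, List, Protocol, Tuple

import torch


class Tokenizer(Protocol):
    eos_token_id: int
    pad_token_id: int
    vocabulary: Dict[str, int]

    def encode(self, prompt: str) -> Tuple[torch.Tensor, torch.Tensor]:
        """Return token ids and an attention mask."""
        ...

    def decode(self, token_ids: torch.Tensor) -> List[str]:
        """Map token ids back to strings."""
        ...

    def convert_token_to_string(self, token: str) -> str:
        """Undo byte/BPE encoding quirks so tokens can be matched against a regex."""
        ...
```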
How would you like it to behave?
Unification should look as follows:

Step 1: #676
- `outlines.models.tokenizer.Tokenizer`

Result: logits processors have a single canonical implementation

Step 2:
- `sequence_generator` applies `LogitsProcessors`, and doesn't manage FSMs
- every `outlines.generate` function passes its FSM constraint as a logits processor

Result: all 5 models use logits processors; the business logic for constraining the next token based on an FSM is encapsulated in one place

Step 3: all models work with `SequenceGenerator`

Step 3a:
- introduce `outlines.models.vllm`
- `outlines.models.vllm.__call__` returns logits and kv cache, and can be used with `SequenceGenerator`
- `outlines.models.llamacpp.__call__` returns logits and kv cache, and can be used with `SequenceGenerator` (see the interface sketch after this list)

Step 3b:
- `outlines.generate.regex_llamacpp` (same for the other `outlines.generate.*_llamacpp` functions): the interface should work with `outlines.generate.regex`
- `llama_cpp.Llama`, such that `llama.Llama.generate()` doesn't call `Llama.sample()` but instead allows Outlines to handle the sampling logic

Result: all 5 models compatible with `SequenceGenerator`, model-agnostic `outlines.generate`, `outlines.models.vllm` available

Step 3c:

Step 4:

Result: any model servable via `outlines.serve`
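To make Step 3a concrete, here is a rough sketch of the shared call signature the models could converge on; `OutlinesModel`, `greedy_step`, and the parameter names are assumptions for illustration, not a settled design:

```python
# Sketch: every model exposes __call__ returning next-token logits plus an
# opaque kv cache, so SequenceGenerator can drive generation for any backend.
from typing import Any, Optional, Protocol, Tuple

import torch


class OutlinesModel(Protocol):
    def __call__(
        self,
        input_ids: torch.Tensor,        # (batch, seq_len)
        attention_mask: torch.Tensor,   # (batch, seq_len)
        past_key_values: Optional[Any] = None,
    ) -> Tuple[torch.Tensor, Any]:
        """Return logits of shape (batch, vocab_size) and the updated kv cache."""
        ...


def greedy_step(model: OutlinesModel, input_ids, attention_mask, kv_cache=None):
    # One step of the loop SequenceGenerator would run, backend-agnostically:
    # get logits, (apply logits processors here), pick a token, keep the cache.
    logits, kv_cache = model(input_ids, attention_mask, past_key_values=kv_cache)
    next_token = torch.argmax(logits, dim=-1, keepdim=True)
    return next_token, kv_cache
```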