Unify Model Interfaces #695
Replies: 3 comments
-
What bug?
The difference between llama.cpp and transformers was temporary; it was quicker to integrate via logits processors than fixing the existing implementation. Eventually it will use
That could easily be added. However, all model calls in Outlines are synchronous for now, so we wouldn't benefit from vLLM's continuous batching feature.
Aren't you moving complexity from one part of the codebase to another? As long as
I am not opposed to using logits processors in We can always "clean" the
Ok if there's an easy way to do this.
It does as far as the user is concerned.
That's planned. Again, using
How does that work with OpenAI?
I'm fine with that, but we'll need to do something to handle the ever-growing test suite duration.
-
Ultimately, we have such diversity because the backends used by the different models are different. The main

My first suggestion is to drop API-based models. This is a tough one, but whatever direction we choose to go with Outlines there will always be some level of discrepancy between what we can do with, say, OpenAI and with OSS models. At the end of the day we will never be able to reconcile the irreconcilable, and there are good libraries out there, such as instructor, that handle API-based models perfectly fine.

My second suggestion comes from my experience with another multiple-backend library, and from discussions with the author of outlines-mlx. We should dispatch functions based on the backend used, and not on the type of model. We would be able to support any backend in In
Every other function would have to be implemented for each backend. Of course, feature parity between the different backends cannot be guaranteed, but such is life in a world with many tensor libraries. When it comes to model integrations we will need to design a clean(er) interface that every model needs to obey, but that shouldn't be too difficult.
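As a rough illustration of what dispatching on the backend could look like, here is a minimal sketch that keys on the array type rather than on the model class; `bias_logits` is a made-up example function, not existing Outlines code:

```python
# Sketch only: dispatch a generic operation on the backend's array type
# instead of on the model class. `bias_logits` is a hypothetical example.
from functools import singledispatch

import numpy as np
import torch


@singledispatch
def bias_logits(logits, bias):
    raise NotImplementedError(f"No implementation for backend type {type(logits)}")


@bias_logits.register
def _(logits: torch.Tensor, bias) -> torch.Tensor:
    # torch backend (transformers, exllamav2, mamba, ...)
    return logits + torch.as_tensor(bias, dtype=logits.dtype, device=logits.device)


@bias_logits.register
def _(logits: np.ndarray, bias) -> np.ndarray:
    # numpy backend (what llama-cpp-python hands to its logits processors)
    return logits + np.asarray(bias, dtype=logits.dtype)
```

With this pattern, a project like outlines-mlx could register its own overload for mlx arrays without touching the shared code.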
-
The fact that https://github.com/outlines-dev/outlines/blob/main/outlines/generate/regex.py#L49-L53
Not really; the idea is to give their tokenizers and models identical interfaces without changing the SequenceGenerator logic at all, but rather to seamlessly allow their use within SequenceGenerator.
It will allow us to simplify sequence_generator so that it only applies logits processors. Applying any type of repetition penalty, an automata-based constrainer, and in the future things like mirostat and other logit biasing, will be as simple as passing a union of logits processors. As currently designed, sequence_generator needs to be aware of the logic for handling FSMs, and would also need to be aware of the logic for handling repetition penalties, etc.
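To make that concrete, here is a minimal sketch of what "sequence_generator only applies logits processors" could look like; the `LogitsProcessor` protocol and `apply_processors` helper below are illustrative, not existing Outlines classes:

```python
# Sketch: sequence_generator reduces the logits through an arbitrary list of
# processors (FSM constrainer, repetition penalty, mirostat, ...) without
# knowing anything about their internals.
from typing import Protocol, Sequence

import torch


class LogitsProcessor(Protocol):
    def __call__(self, input_ids: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        ...


def apply_processors(
    processors: Sequence[LogitsProcessor],
    input_ids: torch.Tensor,
    logits: torch.Tensor,
) -> torch.Tensor:
    for processor in processors:
        logits = processor(input_ids, logits)
    return logits
```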
It's easy for vLLM; however, for
It doesn't; as I mentioned, we're "Ignoring OpenAI / Azure". OpenAI will remain a special case; all other locally-run inference engines would share interfaces.
Yes, GPU integration tests would need a
This is a mild inconvenience IMO; the only cost is wrapping llamacpp with logic that ensures it uses torch tensors. I wrote some logic for this in the base logits processor, though it should be moved to the model itself. Again, what's important is ensuring models and tokenizers have identical interfaces, so that sequence_generator can use any model or tokenizer without considering their implementation details.
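For illustration, the tensor normalisation could live in a base class roughly like the sketch below; this is an assumption about the shape of the code, not the exact logic currently in the base logits processor:

```python
# Sketch: llama-cpp-python passes numpy arrays to its logits processors, so a
# base class can convert to torch tensors on the way in and convert back on
# the way out, letting subclasses work with torch only.
import numpy as np
import torch


class BaseLogitsProcessor:
    def process_logits(
        self, input_ids: torch.Tensor, logits: torch.Tensor
    ) -> torch.Tensor:
        raise NotImplementedError

    def __call__(self, input_ids, logits):
        received_numpy = isinstance(logits, np.ndarray)
        torch_input_ids = torch.as_tensor(np.asarray(input_ids))
        torch_logits = torch.as_tensor(logits)
        processed = self.process_logits(torch_input_ids, torch_logits)
        # Return the same array type the caller gave us.
        return processed.cpu().numpy() if received_numpy else processed
```

Moving this conversion into the llamacpp model itself, as suggested above, would keep the processors themselves backend-agnostic.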
They will never be able to fully benefit from the features available in Outlines. They probably should be dropped, but only after understanding user needs.
This is exactly what I want to avoid. We should have no special logic for any inference engine anywhere outside of its
-
What behavior of the library made you think about the improvement?
Outlines supports 5 inference engines (Ignoring OpenAI / Azure). We should ensure their interfaces and behaviors are identical. We are accumulating technical debt because of their inconsistencies.
For example, I'm hesitant to implement #644 / #647 because I would have to implement it for three distinct interfaces. Another example is the fact that #614 doesn't work with vLLM or LlamaCpp. Additionally, `outlines.generate.regex_llamacpp` exists to enforce a sampler constraint on `llamacpp`; however, we have a bug resulting from the fact that `outlines.generate.json_llamacpp` doesn't exist and doesn't enforce the same sampler constraint.

A breakdown of the discrepancies follows.
Logit modification:
- Logits are modified by `SequenceGenerator` utilizing a passed FSM.
- Logits processors in `outlines.models.llamacpp`.
- Logits processors in `outlines.serve.vllm`; vLLM's logits processors repeat substantial portions of LlamaCpp's logits processors.

Sequence Generation:
- `model()` generates the logits, and `SequenceGenerator` samples and manages the generation.
- `SequenceGenerator` is not used; `outlines.generate.regex_llamacpp` patches the model to apply logits processors.
- There is no `outlines.models.vllm`; Outlines provides a web server and the logits processors, and `vllm` does all the work.

Tokenizer:
- `outlines.models.tokenizer.Tokenizer` (a sketch of a shared tokenizer interface of this kind follows below).
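For reference, a shared tokenizer interface along the lines of `outlines.models.tokenizer.Tokenizer` might look roughly like the sketch below; the exact attributes and methods shown are assumptions for illustration, not the class's actual definition:

```python
# Sketch of a backend-agnostic tokenizer interface; the real
# outlines.models.tokenizer.Tokenizer may expose a different set of members.
from typing import Dict, List, Protocol, Tuple

import torch


class Tokenizer(Protocol):
    eos_token_id: int
    pad_token_id: int
    vocabulary: Dict[str, int]

    def encode(self, prompt: str) -> Tuple[torch.Tensor, torch.Tensor]:
        """Return token ids and an attention mask."""
        ...

    def decode(self, token_ids: torch.Tensor) -> List[str]:
        """Map token ids back to strings."""
        ...

    def convert_token_to_string(self, token: str) -> str:
        """Undo byte/BPE encoding quirks so tokens can be matched against a regex."""
        ...
```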
How would you like it to behave?
Unification should look as follows:

Step 1: #676
- `outlines.models.tokenizer.Tokenizer`

Result: logits processors have a single canonical implementation

Step 2:
- `sequence_generator` applies `LogitsProcessors`, and doesn't manage FSMs
- every `outlines.generate` function passes its FSM constraint as a logits processor

Result: all 5 models use logits processors; the business logic for constraining the next token based on an FSM is encapsulated in one place

Step 3: all models work with `SequenceGenerator`

Step 3a:
- introduce `outlines.models.vllm`
- `outlines.models.vllm.__call__` returns logits and kv cache, and can be used with `SequenceGenerator`
- `outlines.models.llamacpp.__call__` returns logits and kv cache, and can be used with `SequenceGenerator` (see the interface sketch after this list)

Step 3b:
- `outlines.generate.regex_llamacpp` (same for the other `outlines.generate.*_llamacpp` functions): the interface should work with `outlines.generate.regex`
- `llama_cpp.Llama`, such that `llama.Llama.generate()` doesn't call `Llama.sample()` but instead allows Outlines to handle the sampling logic

Result: all 5 models compatible with `SequenceGenerator`, model-agnostic `outlines.generate`, `outlines.models.vllm` available

Step 3c:

Step 4:

Result: any model servable via `outlines.serve`
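To make Step 3a concrete, here is a rough sketch of the shared call signature the models could converge on; `OutlinesModel`, `greedy_step`, and the parameter names are assumptions for illustration, not a settled design:

```python
# Sketch: every model exposes __call__ returning next-token logits plus an
# opaque kv cache, so SequenceGenerator can drive generation for any backend.
from typing import Any, Optional, Protocol, Tuple

import torch


class OutlinesModel(Protocol):
    def __call__(
        self,
        input_ids: torch.Tensor,        # (batch, seq_len)
        attention_mask: torch.Tensor,   # (batch, seq_len)
        past_key_values: Optional[Any] = None,
    ) -> Tuple[torch.Tensor, Any]:
        """Return logits of shape (batch, vocab_size) and the updated kv cache."""
        ...


def greedy_step(model: OutlinesModel, input_ids, attention_mask, kv_cache=None):
    # One step of the loop SequenceGenerator would run, backend-agnostically:
    # get logits, (apply logits processors here), pick a token, keep the cache.
    logits, kv_cache = model(input_ids, attention_mask, past_key_values=kv_cache)
    next_token = torch.argmax(logits, dim=-1, keepdim=True)
    return next_token, kv_cache
```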