-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multimodal support with Phi 3 Vision + Transformers #1020
base: main
Are you sure you want to change the base?
Conversation
guidance/models/_model.py
Outdated
@@ -72,7 +88,7 @@ def get_chat_template(self): # TODO [HN]: Add more logic here...should we instan | |||
def reset_metrics(self): | |||
self.metrics = GuidanceEngineMetrics() | |||
|
|||
def start(self, prompt, grammar, ensure_bos_token=True) -> TokenParser: | |||
def start(self, prompt, grammar, media: dict, ensure_bos_token=True) -> TokenParser: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make "media" optional?
"""The current prompt in bytes (which is the state without the context close tags).""" | ||
return format_pattern.sub("", self._state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slundberg do you mind if I get rid of this?
guidance/models/_model.py
Outdated
def append_image(self, image): | ||
""" | ||
Appends image bytes to the model's state. | ||
""" | ||
return self._append_multimodal(image, Modality.IMAGE) | ||
|
||
|
||
def append_audio_bytes(self, audio): | ||
""" | ||
Appends audio bytes to the model's state. | ||
""" | ||
return self._append_multimodal(audio, Modality.AUDIO) | ||
|
||
|
||
def append_video_bytes(self, video): | ||
""" | ||
Appends video bytes to the model's state. | ||
""" | ||
return self._append_multimodal(video, Modality.VIDEO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can just collapse these into the OG _append_multimodal?
if isinstance(prompt, bytes): | ||
prompt = prompt.decode("utf-8") | ||
elif isinstance(prompt, str): | ||
prompt = prompt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe missing a str -> byte conversion, or skip conditional altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Side note, but I want us to be a bit more careful about strings vs bytes in general -- maybe some annotations would help
model_inputs = self.processor( | ||
text=processed_prompt, | ||
images=images if len(images) > 0 else None, | ||
return_tensors="pt", | ||
).to(self.device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure that this gets called if images are added again later, after the model starts generating. Otherwise we should trigger a "re-process" upon the next user image being inserted right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be good to add some tests using unittest.patch
if there are any assertions around how many times this should be called for a given image, what should be reused, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we'll have to re-process when new images are added. I'll look into how to tell if re-processing is necessary. For the moment, I actually haven't seen a Transformers model yet that was trained to support multiple images during inference. As far as I can tell, Phi 3 vision and llama 3.2 were trained for 1 image per context. That being said, we should still support multi-image scenarios here.
return self._cached_logits | ||
|
||
|
||
class TransformersPhi3Vision(Model): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hope we can get rid of this 🥺
@@ -37,6 +37,7 @@ | |||
"openai": ["openai>=1.0"], | |||
"schemas": ["jsonschema"], | |||
"server": ["fastapi-slim", "uvicorn"], | |||
"image": ["pillow"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
☁️
): | ||
# add the beginning of sequence token if needed | ||
processed_tokens = [bos_token_id] + processed_tokens | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tokens will likely need to be recoded before being sent to the LLM for logits (I think only in the case that we just added a BOS token).
You could probably just throw that line in create_token_parser
after calling process_prompt
..? Just since you'll have access to a tokenizer there.
See tests/model_integration/test_model.py::test_associativity
guidance/_parser.py
Outdated
if ensure_bos_token and tokenizer.bos_token_id is not None: | ||
bos_token_id = tokenizer.bos_token_id | ||
else: | ||
bos_token_id = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nitpick -- you don't need to check the bos_token_id
against None
in the if
statement if you're assigning to None
in the else clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should throw an error in the case where ensure_bos_token is True and tokenizer.bos_token_id is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm good question. To my knowledge, we're never calling this with ensure_bos_token set to False... In other words, I'm not sure the kwarg is really providing any value. I don't think an exception is really necessary in this case. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just going to log a warning in that case for now
guidance/_parser.py
Outdated
return processed_tokens | ||
|
||
|
||
def process_grammar(grammar: Union[GrammarFunction, str]) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give this a more informative name? E.g. serialize_grammar
match_str = match.group(0) | ||
modality_type = match.group(1) | ||
if modality_type != Modality.IMAGE.name: | ||
logger.debug("Skipping non-image modality: %s", match_str) | ||
continue | ||
media_id = match.group(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Named groups and a call to groupdict
may yield something a little more readable and less fragile to changes in the pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting suggestion, I will look into that
Thanks for your reviews Hudson & Harsha! I am picking this PR back up now and will make revisions based on your feedback. I'm also going to work on integrating Llama 3.2 to come up with a more general solution. |
Adds a general framework for supporting multimodal models in Guidance, as well as an implementation of Phi 3 Vision, using the transformers library.
_parser.py
. Changed the__init__
function for TokenParser so that it contains less logic. This was necessary to implement more flexible processing of the prompt to get the token IDs when dealing with various data formats. Now engines can do more specific preparation of the TokenParser if they. need to.Modality
enum which is used to indicate the modality of data in a prompt segment. Some types have been created to indicate that the data points to a URL. This might enable us to support APIs like Gemini in the future, which needs users to supply a Google Cloud URL for the blob data in the API request.<|_{modality.name}:{str(id(data))}|>
. It's essentially a placeholder for where the larger blob data should be inserted later on when prompting the model.append_image()
,append_audio_bytes()
, etc. These functions are meant to be used by user-facing guidance library functions such asimage()
so those guidance functions can load the data into the model state.media
dict parameter was added to__call__()
,get_next_token()
, andget_logits()
in Engine. There is a default implementation of these provided in the Engine base class. Subclasses of Engine can override these functions as necessary to parse the prompt string and pack the media data as needed depending on the API.prompt
string parameter sent to engine will still contain the placeholders formatted like<|_{modality.name}:{str(id(data))}|>
. Engines will parse this string and extract the ID part using a regex. This ID is used to map to the actual blob data in the media dict:{id: blob_data}
get_next_token()
andget_logits()
also receive the media dict parameter because sometimes engines will need the media data, along with the prompt string, at that particular point in time to prepare an API request. The idea is to ensure there's enough flexibility to handle various kinds of models.sp_model
itselfTransformersPhi3VisionEngine
. I found it difficult to subclass the existing Transformers classes to add the needed new code. I think it's best to consider this newTransformersPhi3VisionEngine
as a prototype for what a multimodal Transformers engine looks like. For now, there were some Phi 3 behaviors I had to account for. In the future as we support more multimodal Transformers models, we might notice specific patterns arising that we can use to make the implementation cleaner and more generalAutoProcessor
instead ofAutoTokenizer
to prepare model inputs. The model inputs for Phi 3 include token ids, image pixel values, and an attention mask"Hello <image_1> what is this <image_2>?
:[30, 50, 22, -1, -1, -1, -1, -1, -1, ..., -1, 54, 893, 250, -2, -2, -2, -2, -2, ..., -2, 542]
process_prompt()
function which does the initial token healing and grammar forcing for a prompt string. For multimodal prompts, there are boundaries between text data and multimodal data in the token space. Token healing or forcing cannot be applied across those boundaries. So, we will preserve the existing tokens provided by the initial tokenization, then only send the text tokens at the end of the prompt, after all multimodal data, toprocess_prompt()
. If the prompt ends with a multimodal input, we will not useprocess_prompt()
. We might improve on this later on, but it seems to work for now.TODO: