[Vision] Support Phi-3.5-vision, the first VLM in WebLLM #563
Merged
Conversation
- Enable content being an array, which can have image_url
- Introduce ModelType.VLM so that only a VLM can handle non-string message content
- Thus pass loadedModelType to postInitCheck, hence add loadedModelIdToModelType in Engine
- Change unit tests correspondingly
- getPromptArray element can now be an array itself
- A Conversation.messages message can be either a string or an array of content parts
- Update compareConversationObject
- Next step is to update llm_chat.ts getInputTokens
- Implement getInputData to replace getInputTokens
  - Instead of returning a list of tokens, return a list that mixes number[] and imageUrl
- Implement getChunkedPrefillInputData that transforms the output of getInputData into chunks; thoroughly tested
- Replace forward with embedAndForward, which takes in a chunk, embeds it, and forwards
  - Within embedAndForward, we first embed all components in the chunk
  - Note chunking is taken care of in getChunkedPrefillInputData, so embedAndForward sees data shorter than the prefill chunk size
- TODOs: implement getImageEmbeddings, concatenation of embeddings, E2E test with image input
- Tested:
  - simple chat, logit processor, get_started, multi-round example
  - prefill multiple messages
  - 32 prefill chunk size, prefill 300 tokens
- Implement getImageEmbedding, which loads into ImageData and calls the embed_image kernel
- Use TVM global function concatEmbeddings to combine text and image embeddings
- Implement a helper function that loads ImageData from a URL, either http or base64
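That URL-loading helper might look roughly like the sketch below. This is an assumption-laden illustration, not the actual WebLLM code; it relies on the standard browser behavior that `fetch` accepts both http(s) URLs and base64 `data:` URLs.

```ts
// Hypothetical sketch of loading an image URL into ImageData for embed_image.
async function getImageDataFromURL(url: string): Promise<ImageData> {
  // fetch() handles both http(s) URLs and base64 data URLs (data:image/...;base64,...)
  const response = await fetch(url);
  const blob = await response.blob();
  const bitmap = await createImageBitmap(blob);

  // Draw onto an offscreen canvas to extract raw RGBA pixel data.
  const canvas = new OffscreenCanvas(bitmap.width, bitmap.height);
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(bitmap, 0, 0);
  return ctx.getImageData(0, 0, bitmap.width, bitmap.height);
}
```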
CharlieFRuan changed the title [WIP][Vision] Support Phi-3.5-vision → [Vision] Support Phi-3.5-vision, the first VLM in WebLLM on Sep 23, 2024
CharlieFRuan added a commit that referenced this pull request on Sep 23, 2024:
### Changes
- The only change is the support of Phi-3.5-vision: #563
  - Added `Phi-3.5-vision-instruct-q4f16_1-MLC` and `Phi-3.5-vision-instruct-q4f32_1-MLC` to the prebuilt model list
  - See `examples/vision-model` on how to use a vision language model in WebLLM

### TVMjs
- Compiled at apache/tvm@931efc7
- Cherry-picked apache/tvm#17404 on top
- Note this does not require us to recompile non-vision models, because text-only inputs will not need embeddings concatenation
- WASMs: still the same `v0_2_48` WASMs
Excellent work @CharlieFRuan, it works really nicely. Please continue the same good work and include Qwen2-VL as well. 👍 Eagerly waiting.
This PR supports the first Vision Language Model, Phi-3.5-vision. For a full example, see `examples/vision-model`. Overall usage follows the OpenAI API and is shown below. We add `Phi-3.5-vision-instruct-q4f16_1-MLC` and `Phi-3.5-vision-instruct-q4f32_1-MLC` to the prebuilt model list.
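A minimal usage sketch follows. It assumes the OpenAI-style content-part types referenced below; the image URL is a placeholder, and details may differ from `examples/vision-model`.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Load the vision model added in this PR (listed in the prebuilt models).
  const engine = await CreateMLCEngine("Phi-3.5-vision-instruct-q4f16_1-MLC");

  // Message content can now be an array of parts, mixing text and image_url.
  const reply = await engine.chat.completions.create({
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What is in this image?" },
          // Placeholder URL; http(s) and base64 data URLs are both supported.
          { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
        ],
      },
    ],
  });
  console.log(reply.choices[0].message.content);
}

main();
```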
### Implementation details

To support the vision model, we need various internal changes:

- Update `Conversation` to support non-string messages: `Conversation.messages` can now hold not only a string, but also an `Array<ChatCompletionContentPart>`, the exact two types of `ChatCompletionUserMessageParam.content`
- `getPromptArray()`, which converts conversation history to a list of messages to prefill, returns `Array<string | Array<string | ImageURL>>` rather than just `Array<string>`
- Implement `getChunkedPrefillInputData()` and `getInputData()` to work with `ImageURL`s
  - `getChunkedPrefillInputData()` chunks the output of `getInputData()`, where each chunk will be embedded and forwarded, then the next chunk is processed (see the sketch after this list)
  - `getChunkedPrefillInputData()` is tested thoroughly in unit tests
- Replace `forward()` with `embedAndForward()`, which takes in a chunk, embeds it, and prefills / decodes
- Implement `getImageEmbeddings()` and `getTokensEmbeddings()`
- Implement `concatEmbeddings()` in TVMjs to concatenate token and image embeddings: [WASM] Implement concat embeddings apache/tvm#17404
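One plausible way to implement the chunking described above is sketched here. The types and the fixed image size are assumptions; the actual `getChunkedPrefillInputData()` in `llm_chat.ts` may differ.

```ts
// Assumed types: input data is a mixture of token-id arrays and image URLs.
type ImageURL = { url: string };
type InputData = Array<number[] | ImageURL>;

// Hardcoded per the TODO section below: every image embeds to (1921, hidden_size).
const IMAGE_EMBED_SIZE = 1921;

function getChunkedPrefillInputData(
  data: InputData,
  prefillChunkSize: number,
): InputData[] {
  const chunks: InputData[] = [];
  let cur: InputData = [];
  let curLen = 0;
  const flush = () => {
    if (cur.length > 0) {
      chunks.push(cur);
      cur = [];
      curLen = 0;
    }
  };
  for (const part of data) {
    if (Array.isArray(part)) {
      // Token arrays may be split across chunk boundaries.
      let tokens = part;
      while (tokens.length > 0) {
        const take = tokens.slice(0, prefillChunkSize - curLen);
        cur.push(take);
        curLen += take.length;
        tokens = tokens.slice(take.length);
        if (curLen === prefillChunkSize) flush();
      }
    } else {
      // An image contributes IMAGE_EMBED_SIZE embeddings and is not split here,
      // so start a new chunk if it would overflow the current one.
      if (curLen + IMAGE_EMBED_SIZE > prefillChunkSize) flush();
      cur.push(part);
      curLen += IMAGE_EMBED_SIZE;
      if (curLen >= prefillChunkSize) flush();
    }
  }
  flush();
  return chunks;
}
```

Each resulting chunk is then passed to `embedAndForward()`, which embeds every part (token or image embeddings), concatenates them, and runs prefill, so the pipeline never embeds more than one prefill chunk at a time.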
### Tested

Since we changed the logic in crucial methods like `Conversations.getPromptArray()` and `LLMChatPipeline`'s `getInputData()`, `getChunkedPrefillInputData()`, and `forward()`, we need to test thoroughly for both vision and non-vision use cases. The following are tested E2E: simple chat, logit processor, get_started, and multi-round examples; prefilling multiple messages; and prefilling 300 tokens with a prefill chunk size of 32. Unit tests cover `Conversations` and `getChunkedPrefillInputData()`.
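As an illustration, a unit test over the chunking sketch above might check the 300-token / 32-chunk-size case mentioned in the commit messages. This uses the hypothetical `getChunkedPrefillInputData` from the sketch; the PR's actual tests may differ.

```ts
import { describe, expect, test } from "@jest/globals";

describe("getChunkedPrefillInputData (sketch)", () => {
  test("300 tokens with prefill chunk size 32 yield 10 in-order chunks", () => {
    const tokens = Array.from({ length: 300 }, (_, i) => i);
    const chunks = getChunkedPrefillInputData([tokens], 32);

    // 9 full chunks of 32 tokens plus one final chunk of 12 tokens.
    expect(chunks.length).toBe(10);

    // Flattening the chunks reproduces the original token stream.
    const flat = chunks.flatMap((chunk) =>
      chunk.flatMap((part) => (Array.isArray(part) ? part : [])),
    );
    expect(flat).toEqual(tokens);
  });
});
```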
### TODO

- We currently hardcode `IMAGE_EMBED_SIZE` to 1921; that is, all images' embeddings are `(1921, hidden_size)`. We also hardcode `num_crops` to be 16 in the kernel. We should expose `num_crops` in the future and generalize `IMAGE_EMBED_SIZE`.
- `embed_image` should have input of type `uint8`, but the current workaround uses `uint32` input, since there is no `Array<u8>` in WGSL. This should be fixed. Using `uint8` in the TIR kernel results in a runtime error: `@group(0) @binding(0) var<storage, read_write> T_transpose : array<u8>` --> `error: unresolved type 'u8'`
- Consider a static `embeddings` buffer that we keep writing into, instead of building an `NDArray[]` and calling `concatEmbeddings()` to combine it into a new `NDArray` (the approach implemented in [WASM] Implement concat embeddings apache/tvm#17404)
  - With the current approach, the concatenated `embeddings` and the current chunk's `embeddings` exist at the same time (i.e. the previous chunk's `embeddings` are not released promptly), causing extra overhead compared to having static memory

### Related PRs