
[Vision] Support Phi-3.5-vision, the first VLM in WebLLM #563

Merged · 7 commits · Sep 23, 2024

Commits on Sep 18, 2024

  1. Support image input in chatCompletionRequest

    - Enable content to be an array, which can include image_url parts
    - Introduce ModelType.VLM so that only VLMs can handle non-string message content
    - Pass loadedModelType into postInitCheck; accordingly, add loadedModelIdToModelType to Engine
    - Update unit tests to match
    CharlieFRuan committed Sep 18, 2024 · fad3df9
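
    To make the new request shape concrete, below is a minimal sketch of a chat completion request with mixed text and image content, following the OpenAI-style content-part format that WebLLM mirrors. The model id and image URL are placeholders for illustration, not values taken from this PR.

    ```typescript
    import * as webllm from "@mlc-ai/web-llm";

    async function main() {
      // Model id below is illustrative; use the actual Phi-3.5-vision id from the model list.
      const engine = await webllm.CreateMLCEngine("Phi-3.5-vision-instruct-q4f16_1-MLC");

      const reply = await engine.chat.completions.create({
        messages: [
          {
            role: "user",
            // content may now be a string or an array of content parts;
            // only VLMs (ModelType.VLM) accept the array form with image_url.
            content: [
              { type: "text", text: "What is in this image?" },
              { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
            ],
          },
        ],
      });
      console.log(reply.choices[0].message.content);
    }
    ```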

Commits on Sep 20, 2024

  1. Support formatting image input in Conversation.getPromptArray

    - getPromptArray elements can now be arrays themselves
    - Each message in Conversation.messages can be either a string or an array of content parts
    - Update compareConversationObject
    - Next step: update getInputTokens in llm_chat.ts
    CharlieFRuan committed Sep 20, 2024 · 19c9991
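
    As a type-level sketch of the shapes this commit describes: apart from the names quoted in the commit message, the type names below are hypothetical and only mirror the bullet points above.

    ```typescript
    // A content part is either text or an image URL, in the OpenAI-style shape.
    type TextPart = { type: "text"; text: string };
    type ImagePart = { type: "image_url"; image_url: { url: string } };

    // A message in Conversation.messages: a plain string, or an array of content parts.
    type MessageContent = string | Array<TextPart | ImagePart>;

    // One getPromptArray element: a formatted string, or an array mixing formatted
    // strings and image URLs, kept in order for later embedding.
    type PromptArrayElement = string | Array<string | { imageUrl: string }>;
    ```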
  2. Trivial

    CharlieFRuan committed Sep 20, 2024 · b398412

Commits on Sep 22, 2024

  1. Separate embed from forward, add getChunkedPrefillInputData

    - Implement getInputData to replace getInputTokens
      - Instead of returning a list of tokens, it returns a list mixing number[] and imageUrl
    - Implement getChunkedPrefillInputData, which transforms the output of getInputData into chunks; thoroughly tested
    - Replace forward with embedAndForward, which takes in a chunk, embeds it, and forwards it
    - Within embedAndForward, we first embed all components in the chunk
    - Note that chunking is handled in getChunkedPrefillInputData, so the data embedAndForward receives is always shorter than the prefill chunk size
    - TODOs: implement getImageEmbeddings, concatenation of embeddings, E2E test with image input
    - Tested:
      - simple chat, logit processor, get_started, multi-round example
      - prefilling multiple messages
      - prefilling 300 tokens with a 32-token prefill chunk size
    CharlieFRuan committed Sep 22, 2024 · db4bb25
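
    To illustrate the chunking idea, here is a hedged sketch of splitting a mixed list of token arrays and image URLs into prefill-sized chunks. The function name, the fixed per-image token budget, and the splitting policy are assumptions for illustration, not WebLLM's exact implementation.

    ```typescript
    type InputData = number[] | { imageUrl: string };

    function chunkPrefillInput(
      data: InputData[],
      prefillChunkSize: number,
      imageTokenBudget: number, // assumed fixed embedding length per image
    ): InputData[][] {
      const chunks: InputData[][] = [];
      let current: InputData[] = [];
      let currentLen = 0;

      // Start a new chunk whenever adding an item would exceed the chunk size.
      const push = (item: InputData, len: number) => {
        if (currentLen + len > prefillChunkSize && current.length > 0) {
          chunks.push(current);
          current = [];
          currentLen = 0;
        }
        current.push(item);
        currentLen += len;
      };

      for (const item of data) {
        if (Array.isArray(item)) {
          // Token arrays can be split across chunk boundaries.
          let tokens = item;
          while (tokens.length > 0) {
            const room = prefillChunkSize - currentLen;
            const take = Math.min(room > 0 ? room : prefillChunkSize, tokens.length);
            push(tokens.slice(0, take), take);
            tokens = tokens.slice(take);
          }
        } else {
          // An image is embedded as one unit and is not split.
          push(item, imageTokenBudget);
        }
      }
      if (current.length > 0) chunks.push(current);
      return chunks;
    }
    ```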
  2. Implement getImageEmbedding, make phi3_5-vision work E2E

    - Implement getImageEmbedding, which loads the image into ImageData and calls the embed_image kernel
    - Use the TVM global function concatEmbeddings to combine text and image embeddings
    - Implement a helper function that loads ImageData from a URL, either http or base64
    CharlieFRuan committed Sep 22, 2024 · fd57f7c
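
    As a sketch of the URL-loading helper mentioned above (the helper name here is hypothetical), standard browser APIs can decode both http(s) URLs and base64 data: URLs into ImageData along the same path:

    ```typescript
    // Fetch works for both https:// URLs and base64 data: URLs, so one code path
    // covers both cases: decode the bytes into a bitmap, draw it onto an
    // offscreen canvas, and read back the raw pixels as ImageData.
    async function getImageDataFromURL(url: string): Promise<ImageData> {
      const response = await fetch(url);
      const blob = await response.blob();
      const bitmap = await createImageBitmap(blob);
      const canvas = new OffscreenCanvas(bitmap.width, bitmap.height);
      const ctx = canvas.getContext("2d")!;
      ctx.drawImage(bitmap, 0, 0);
      return ctx.getImageData(0, 0, bitmap.width, bitmap.height);
    }
    ```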

Commits on Sep 23, 2024

  1. Trivial

    CharlieFRuan committed Sep 23, 2024 · 576a5c7
  2. Trivial

    CharlieFRuan committed Sep 23, 2024 · 453f9c9