
[Vision] Support Phi-3.5-vision, the first VLM in WebLLM #563

Merged · 7 commits · Sep 23, 2024

Commits on Sep 18, 2024

  1. Support image input in chatCompletionRequest

    - Enable content to be an array, which can include image_url parts
    - Introduce ModelType.VLM so that only VLMs can handle non-string message content
    - Pass loadedModelType into postInitCheck; accordingly, add loadedModelIdToModelType to Engine
    - Update unit tests to match
    CharlieFRuan committed Sep 18, 2024 · fad3df9
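
    To make the new request shape concrete, below is a minimal sketch of a chat completion request with mixed text and image content, following the OpenAI-style content-part format that WebLLM mirrors. The model id and image URL are placeholders for illustration, not values taken from this PR.

    ```typescript
    import * as webllm from "@mlc-ai/web-llm";

    async function main() {
      // Model id below is illustrative; use the actual Phi-3.5-vision id from the model list.
      const engine = await webllm.CreateMLCEngine("Phi-3.5-vision-instruct-q4f16_1-MLC");

      const reply = await engine.chat.completions.create({
        messages: [
          {
            role: "user",
            // content may now be a string or an array of content parts;
            // only VLMs (ModelType.VLM) accept the array form with image_url.
            content: [
              { type: "text", text: "What is in this image?" },
              { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
            ],
          },
        ],
      });
      console.log(reply.choices[0].message.content);
    }
    ```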

Commits on Sep 20, 2024

  1. Support formatting image input in Conversation.getPromptArray

    - getPromptArray elements can now be arrays themselves
    - Each message in Conversation.messages can be either a string or an array of content parts
    - Update compareConversationObject
    - Next step: update getInputTokens in llm_chat.ts
    CharlieFRuan committed Sep 20, 2024 · 19c9991
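
    As a type-level sketch of the shapes this commit describes: apart from the names quoted in the commit message, the type names below are hypothetical and only mirror the bullet points above.

    ```typescript
    // A content part is either text or an image URL, in the OpenAI-style shape.
    type TextPart = { type: "text"; text: string };
    type ImagePart = { type: "image_url"; image_url: { url: string } };

    // A message in Conversation.messages: a plain string, or an array of content parts.
    type MessageContent = string | Array<TextPart | ImagePart>;

    // One getPromptArray element: a formatted string, or an array mixing formatted
    // strings and image URLs, kept in order for later embedding.
    type PromptArrayElement = string | Array<string | { imageUrl: string }>;
    ```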
  2. Trivial

    CharlieFRuan committed Sep 20, 2024 · b398412

Commits on Sep 22, 2024

  1. Separate embed from forward, add getChunkedPrefillInputData

    - Implement getInputData to replace getInputTokens
      - Instead of returning a list of tokens, it returns a list mixing number[] and imageUrl
    - Implement getChunkedPrefillInputData, which transforms the output of getInputData into chunks; thoroughly tested
    - Replace forward with embedAndForward, which takes in a chunk, embeds it, and forwards it
    - Within embedAndForward, we first embed all components in the chunk
    - Note that chunking is handled in getChunkedPrefillInputData, so the data embedAndForward receives is always shorter than the prefill chunk size
    - TODOs: implement getImageEmbeddings, concatenation of embeddings, E2E test with image input
    - Tested:
      - simple chat, logit processor, get_started, multi-round example
      - prefilling multiple messages
      - prefilling 300 tokens with a 32-token prefill chunk size
    CharlieFRuan committed Sep 22, 2024 · db4bb25
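
    To illustrate the chunking idea, here is a hedged sketch of splitting a mixed list of token arrays and image URLs into prefill-sized chunks. The function name, the fixed per-image token budget, and the splitting policy are assumptions for illustration, not WebLLM's exact implementation.

    ```typescript
    type InputData = number[] | { imageUrl: string };

    function chunkPrefillInput(
      data: InputData[],
      prefillChunkSize: number,
      imageTokenBudget: number, // assumed fixed embedding length per image
    ): InputData[][] {
      const chunks: InputData[][] = [];
      let current: InputData[] = [];
      let currentLen = 0;

      // Start a new chunk whenever adding an item would exceed the chunk size.
      const push = (item: InputData, len: number) => {
        if (currentLen + len > prefillChunkSize && current.length > 0) {
          chunks.push(current);
          current = [];
          currentLen = 0;
        }
        current.push(item);
        currentLen += len;
      };

      for (const item of data) {
        if (Array.isArray(item)) {
          // Token arrays can be split across chunk boundaries.
          let tokens = item;
          while (tokens.length > 0) {
            const room = prefillChunkSize - currentLen;
            const take = Math.min(room > 0 ? room : prefillChunkSize, tokens.length);
            push(tokens.slice(0, take), take);
            tokens = tokens.slice(take);
          }
        } else {
          // An image is embedded as one unit and is not split.
          push(item, imageTokenBudget);
        }
      }
      if (current.length > 0) chunks.push(current);
      return chunks;
    }
    ```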
  2. Implement getImageEmbedding, make phi3_5-vision work E2E

    - Implement getImageEmbedding, which loads the image into ImageData and calls the embed_image kernel
    - Use the TVM global function concatEmbeddings to combine text and image embeddings
    - Implement a helper function that loads ImageData from a URL, either http or base64
    CharlieFRuan committed Sep 22, 2024 · fd57f7c
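
    As a sketch of the URL-loading helper mentioned above (the helper name here is hypothetical), standard browser APIs can decode both http(s) URLs and base64 data: URLs into ImageData along the same path:

    ```typescript
    // Fetch works for both https:// URLs and base64 data: URLs, so one code path
    // covers both cases: decode the bytes into a bitmap, draw it onto an
    // offscreen canvas, and read back the raw pixels as ImageData.
    async function getImageDataFromURL(url: string): Promise<ImageData> {
      const response = await fetch(url);
      const blob = await response.blob();
      const bitmap = await createImageBitmap(blob);
      const canvas = new OffscreenCanvas(bitmap.width, bitmap.height);
      const ctx = canvas.getContext("2d")!;
      ctx.drawImage(bitmap, 0, 0);
      return ctx.getImageData(0, 0, bitmap.width, bitmap.height);
    }
    ```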

Commits on Sep 23, 2024

  1. Trivial

    CharlieFRuan committed Sep 23, 2024 · 576a5c7
  2. Trivial

    CharlieFRuan committed Sep 23, 2024 · 453f9c9