[Embeddings][OpenAI] Support embeddings via engine.embeddings.create() #538

Merged
merged 2 commits into mlc-ai:main on Aug 12, 2024

Conversation


@CharlieFRuan CharlieFRuan commented Aug 12, 2024

This PR adds support for embedding models via engine.embeddings.create().

For a running example, see examples/embeddings, where we run embeddings with the OpenAI API via engine.embeddings.create() and also integrate with LangChain's EmbeddingsInterface and MemoryVectorStore.

  import * as webllm from "@mlc-ai/web-llm";

  // Load a prebuilt embedding model, then embed both documents in one call.
  const documents = ["[CLS] The Data Cloud! [SEP]", "[CLS] Mexico City of Course! [SEP]"];
  const engine: webllm.MLCEngineInterface = await webllm.CreateMLCEngine(
    "snowflake-arctic-embed-m-q0f32-MLC-b4",
  );
  const docReply = await engine.embeddings.create({ input: documents });
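
For the LangChain side, a minimal adapter can implement EmbeddingsInterface on top of engine.embeddings.create(). Below is a sketch assuming the OpenAI-style reply shape reply.data[i].embedding; see examples/embeddings for the full version:

  import * as webllm from "@mlc-ai/web-llm";
  import { EmbeddingsInterface } from "@langchain/core/embeddings";
  import { MemoryVectorStore } from "langchain/vectorstores/memory";

  class WebLLMEmbeddings implements EmbeddingsInterface {
    constructor(private engine: webllm.MLCEngineInterface) {}

    // Embed one query string.
    async embedQuery(text: string): Promise<number[]> {
      const reply = await this.engine.embeddings.create({ input: [text] });
      return reply.data[0].embedding;
    }

    // Embed a batch of documents with a single create() call.
    async embedDocuments(texts: string[]): Promise<number[][]> {
      const reply = await this.engine.embeddings.create({ input: texts });
      return reply.data.map((d) => d.embedding);
    }
  }

  const vectorStore = await MemoryVectorStore.fromExistingIndex(
    new WebLLMEmbeddings(engine),
  );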

Currently, only snowflake-arctic-embed-s and snowflake-arctic-embed-m are supported. We add the following models to the prebuilt model list:

  • "snowflake-arctic-embed-m-q0f32-MLC-b32"
  • "snowflake-arctic-embed-m-q0f32-MLC-b4"
  • "snowflake-arctic-embed-s-q0f32-MLC-b32"
  • "snowflake-arctic-embed-s-q0f32-MLC-b4"

b32 means the model is compiled to support a maximum batch size of 32. If an input with more than 32 entries is provided, we call forward() multiple times (e.g. if the input has 67 entries, we forward 3 times). The larger the maximum batch size, the more memory it takes to load the model. See ModelRecord.vram_required_MB in config.ts for specifics.
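
The number of forward() calls is just ceiling division over the input size; an illustrative sketch (not the library's actual code):

  // Illustrative only: number of forward() calls for a given input size.
  function numForwards(inputLength: number, maxBatchSize: number): number {
    return Math.ceil(inputLength / maxBatchSize);
  }

  console.log(numForwards(67, 32)); // 3 (batches of 32, 32, and 3)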

In addition, we currently do not allow loading multiple models into a single engine, which makes use cases like RAG a bit inconvenient; engines with multiple loaded models will be supported soon.
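
Until then, one workaround for RAG-style apps is to instantiate a separate engine per model. A sketch; the chat model ID below is only an example from the prebuilt list:

  // One engine per model until multi-model engines are supported.
  const embeddingEngine = await webllm.CreateMLCEngine(
    "snowflake-arctic-embed-m-q0f32-MLC-b4",
  );
  const chatEngine = await webllm.CreateMLCEngine(
    "Llama-3-8B-Instruct-q4f32_1-MLC",
  );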

Internal code changes

  • Implement EmbeddingPipeline in src/embedding.ts, parallel to LLMChatPipeline in llm_chat.ts
  • In engine.ts, determine which pipeline to load based on ModelRecord.model_type (see the sketch after this list)
  • Implement embedding() in MLCEngineInterface, hence supporting both MLCEngine and WebWorkerMLCEngine
  • Implement the API specification in src/openai_api_protocols/embedding.ts
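
The dispatch mentioned above might look roughly like the following; this is a hypothetical sketch that assumes webllm exports the ModelRecord type and a ModelType enum, and the real engine.ts logic may differ:

  import * as webllm from "@mlc-ai/web-llm";

  // Hypothetical sketch: pick the pipeline kind from ModelRecord.model_type.
  type PipelineKind = "EmbeddingPipeline" | "LLMChatPipeline";

  function pipelineFor(record: webllm.ModelRecord): PipelineKind {
    return record.model_type === webllm.ModelType.embedding
      ? "EmbeddingPipeline"
      : "LLMChatPipeline";
  }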

Tested

  • Input of size 64, with 512 tokens per entry, using a b32 model; finishes in 2 iterations
  • Ensured output in examples/embeddings is consistent with transformers in Python
  • Tested with WebWorkerMLCEngine

@CharlieFRuan CharlieFRuan merged commit 1690aa6 into mlc-ai:main Aug 12, 2024
1 check passed
CharlieFRuan added a commit that referenced this pull request Aug 12, 2024
### Change

- Supports embeddings via the OpenAI API `engine.embeddings.create()`:
  - #538
- Currently, only `snowflake-arctic-embed-s` and `snowflake-arctic-embed-m` are supported. We add the following models to the prebuilt model list:
  - `snowflake-arctic-embed-m-q0f32-MLC-b32`
  - `snowflake-arctic-embed-m-q0f32-MLC-b4`
  - `snowflake-arctic-embed-s-q0f32-MLC-b32`
  - `snowflake-arctic-embed-s-q0f32-MLC-b4`
- `b32` means the model is compiled to support a maximum batch size of 32. If an input with more than 32 entries is provided, we call `forward()` multiple times (e.g. if the input has 67 entries, we forward 3 times). The larger the maximum batch size, the more memory it takes to load the model. See `ModelRecord.vram_required_MB` in `config.ts` for specifics.


### TVMjs
Still compiled at apache/tvm@1fcb620, no change.