Support concurrent requests to a single model instance #522
Comments
Thanks for reporting this! I'll look into fixing this, perhaps by blocking subsequent requests until the current one finishes. However, if you instantiate multiple engines, two requests can be processed concurrently. We will soon support having multiple models loaded in a single engine, and in that case the same thing applies.
Thank you. I don't strictly need this to run in parallel (though that would be nice). The concurrency bug is very nonintuitive, though, and worth fixing. I did some investigation and this is a gross start, but I think it would work if you separate out the per-request state: some sort of key to identify the specific request and keep track of its specific resources.
A model cannot handle more than one concurrent request (e.g. more than one call to `chat.completions.create()`) since we do not support continuous batching, and each request requires its own resources such as the KV cache. (Concurrent requests to *different* models in the same engine are supported.) As a result, as pointed out in #522, when users try something like the following code:

```typescript
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

async function sendRequest() {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    max_tokens: 64,
  });
  console.log(reply.choices[0].message.content);
}

await Promise.all([sendRequest(), sendRequest()]);
```

the model's state and the generation result get corrupted. To resolve this, we implement `CustomLock` using Promises, maintaining a queue to ensure FCFS ordering of incoming requests to a model, so that for a single model, a request only starts after all previous requests have finished. The code above now works.

### Implementation Details

- We add `loadedModelIdToLock` to MLCEngine, maintaining a lock for each loaded model
  - Reminder: the critical section is only needed per model, since each loaded model has its own `LLMChatPipeline` / `EmbeddingPipeline`
  - `loadedModelIdToLock` is cleared in `unload()` and set in `reloadInternal()`
- We acquire the lock at the very beginning of `completion()`, `chatCompletion()`, and `embedding()`, once we know which model the current call will use
- We release the lock at the end of `embedding()`, `completion()`, and `chatCompletion()` (for non-streaming cases), and `asyncGenerate()` (for streaming cases)
- Since we also want to release the lock when errors occur, we wrap the code in a big `try` / `finally`
  - Since `asyncGenerate()` is an async generator, we add `try` / `catch` fine-grainedly, only in places that can throw errors
  - This makes the code less readable, but it is not clear whether there is a better solution
- For WebWorkerMLCEngine, no special handling is needed, since WebWorkerMLCEngineHandler calls the underlying engine's APIs (e.g. `chatCompletion()`), which will block

### Tested

- `CustomLock` implementation with a unit test (implementation follows [this blog post](https://jackpordi.com/posts/locks-in-js-because-why-not))
- The example above now works
- [get-started, get-started-web-worker] x [streaming, non-streaming] x [concurrent requests, single request]
- examples/simple-chat-ts
- examples/multi-models
- WebLLMChat (with generation interrupts, manual termination of service worker)
- Opening two WebLLMChat tabs and sending concurrent requests: the latter request waits for the previous one to finish (prior to this PR, garbage output was generated, just like in the simple example above, since the two WebLLMChat tabs share the same service worker and hence the same engine)
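For reference, a minimal sketch of what a Promise-based FCFS lock along these lines could look like. This is illustrative only and not the exact `CustomLock` from the PR; names and details are assumptions, following the queue-of-promises idea from the linked blog post:

```typescript
// Illustrative sketch; the actual CustomLock in the PR may differ in detail.
class CustomLock {
  // Resolvers for requests waiting to acquire the lock, in FCFS order.
  private waiting: Array<() => void> = [];
  private locked = false;

  async acquire(): Promise<void> {
    if (!this.locked) {
      this.locked = true;
      return;
    }
    // Queue up and wait until a previous holder releases the lock.
    await new Promise<void>((resolve) => this.waiting.push(resolve));
  }

  release(): void {
    const next = this.waiting.shift();
    if (next !== undefined) {
      // Hand the lock directly to the next waiter; it stays locked.
      next();
    } else {
      this.locked = false;
    }
  }
}

// Usage pattern analogous to what the PR describes for chatCompletion():
// acquire before touching the model, release in `finally` so errors
// cannot leave the lock held forever. `withModelLock` is a hypothetical helper.
async function withModelLock<T>(lock: CustomLock, fn: () => Promise<T>): Promise<T> {
  await lock.acquire();
  try {
    return await fn();
  } finally {
    lock.release();
  }
}
```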
Hi @LEXNY this should be fixed in #549 and reflected in npm 0.2.61. You can check out the PR description for the specifics of the problem and the solution. Your example now works, though the second request does not start until the first request is finished, as we maintain a FCFS schedule with only one request running per model. However, there can be multiple models running in an engine, hence multiple requests can be running per engine. For more, you can try examples/multi-models.
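A hedged sketch of that per-engine concurrency across two models. It assumes `CreateMLCEngine` accepts a list of model ids and that a request selects a loaded model via its `model` field, as demonstrated in examples/multi-models; the second model id is only an example, so check that example for the actual API:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Assumption: passing a list of model ids loads both into one engine.
const engine = await CreateMLCEngine([
  "Phi-3-mini-4k-instruct-q4f16_1-MLC",
  "Llama-3-8B-Instruct-q4f16_1-MLC", // example second model id
]);

async function ask(model: string, content: string) {
  const reply = await engine.chat.completions.create({
    model, // route the request to one of the loaded models
    messages: [{ role: "user", content }],
    max_tokens: 64,
  });
  console.log(`${model}: ${reply.choices[0].message.content}`);
}

// Requests to *different* models can run concurrently within one engine;
// requests to the *same* model are serialized FCFS by the per-model lock.
await Promise.all([
  ask("Phi-3-mini-4k-instruct-q4f16_1-MLC", "Hello!"),
  ask("Llama-3-8B-Instruct-q4f16_1-MLC", "Hello!"),
]);
```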
Powerhouse!
Closing this issue as completed. Feel free to reopen/open new ones if issues arise!
If you make multiple requests with the same engine without awaiting, you get garbage.
I would like to make multiple concurrent (ideally parallel) requests to the same engine, without loading the same model into memory multiple times.
Even with `stream: false`, the engine uses streaming internally; those streams interleave, and the engine gets confused.

Reproduction:
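The exact reproduction snippet is not preserved here; a minimal equivalent, mirroring the example quoted in the PR description above, is two un-awaited requests against the same engine:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

async function sendRequest() {
  const reply = await engine.chat.completions.create({
    stream: false,
    messages: [{ role: "user", content: "Hello!" }],
    max_tokens: 64,
  });
  console.log(reply.choices[0].message.content);
}

// Without awaiting one request before starting the next, the two
// generations interleave and the output is garbage (before 0.2.61).
await Promise.all([sendRequest(), sendRequest()]);
```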
Output: