
[Fix] Implement lock to ensure FCFS of requests to same model #549

Merged
merged 2 commits into mlc-ai:main from pr-0814-lock
Aug 19, 2024

Conversation

@CharlieFRuan (Contributor) commented Aug 14, 2024

A model cannot handle more than one concurrent request (e.g. more than one call to `chat.completions.create()`), since we do not support continuous batching, and each request requires its own resources such as the KV cache. (Concurrent requests to different models in the same engine are supported, though.)

As a result, as pointed out in #522, when users try something like the following code:

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

async function sendRequest() {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    max_tokens: 64,
  });
  console.log(reply.choices[0].message.content);
}

// Two concurrent requests to the same model:
await Promise.all([sendRequest(), sendRequest()]);

the model's internal state and the generation results get corrupted.

To resolve this, we implement a `CustomLock` using `Promise`, maintaining a queue to ensure first-come-first-served (FCFS) ordering of incoming requests to a model, such that a request to a given model only starts once all previous requests to it have finished. The code above now works.
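
For reference, a minimal sketch of such a promise-based queue lock (the actual `CustomLock` in this PR may differ in detail):

class CustomLock {
  private acquired = false;
  // FCFS queue of waiters; each entry resumes one pending acquire().
  private queue: Array<() => void> = [];

  async acquire(): Promise<void> {
    if (!this.acquired) {
      this.acquired = true; // lock is free; take it immediately
      return;
    }
    // Lock is held; wait in line until a release() resumes us.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  async release(): Promise<void> {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the lock directly to the next waiter, in FCFS order
    } else {
      this.acquired = false;
    }
  }
}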

Implementation Details

  • We add `loadedModelIdToLock` to `MLCEngine`, maintaining a lock for each loaded model
    • Reminder: the need for a critical section is only per model, since each loaded model has its own `LLMChatPipeline` / `EmbeddingPipeline`
  • `loadedModelIdToLock` is cleared in `unload()` and set in `reloadInternal()`
  • We acquire the lock at the very beginning of `completion()`, `chatCompletion()`, and `embedding()`, once we know which model the current call will use
  • We release the lock at the end of `embedding()`, `completion()`, and `chatCompletion()` (for non-streaming cases), and of `asyncGenerate()` (for streaming cases)
  • Since we also want to release the lock when errors occur, we wrap the code in a big try/finally (see the sketch after this list)
  • Since `asyncGenerate()` is an async generator, we instead add fine-grained try/catch blocks there, only in the places that can throw errors
    • This makes the code less readable, but it is unclear whether there is a better solution.
  • For `WebWorkerMLCEngine`, no special handling is needed, since `WebWorkerMLCEngineHandler` calls the underlying engine's APIs (e.g. `chatCompletion()`), which will block
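
As referenced above, a sketch of how the lock might wrap the non-streaming path (illustrative only; `getModelIdToUse()` and `chatCompletionInternal()` are hypothetical stand-ins for the engine's internals):

async chatCompletion(request: ChatCompletionRequest): Promise<ChatCompletion> {
  // Resolve which loaded model this call targets (hypothetical helper).
  const selectedModelId = this.getModelIdToUse(request.model);
  const lock = this.loadedModelIdToLock.get(selectedModelId)!;
  await lock.acquire(); // FCFS: wait until earlier requests to this model finish
  try {
    return await this.chatCompletionInternal(request);
  } finally {
    await lock.release(); // released even if generation throws
  }
}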

Tested

  • Tested the `CustomLock` implementation with a unit test (the implementation follows this blog post); a sketch of such a test appears after this list
  • The example above now works
  • [get-started, get-started-web-worker] x [streaming, non-streaming] x [concurrent requests, single request]
  • examples/simple-chat-ts
  • examples/multi-models
  • WebLLMChat (with generation interrupts and manual termination of the service worker)
    • Opening two tabs of WebLLMChat and sending concurrent requests, the latter request waits for the previous one to finish (prior to this PR, garbage output would be generated just like in the simple example above, since the two WebLLMChat tabs share the same service worker, hence the same engine).
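
A minimal sketch of such an FCFS unit test (hypothetical; the PR's actual test may differ):

test("CustomLock serves requests first-come-first-served", async () => {
  const lock = new CustomLock();
  const finishOrder: number[] = [];

  const task = async (id: number, delayMs: number) => {
    await lock.acquire();
    try {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      finishOrder.push(id);
    } finally {
      await lock.release();
    }
  };

  // Task 0 is slower but acquires first, so it must still finish first.
  await Promise.all([task(0, 50), task(1, 0)]);
  expect(finishOrder).toEqual([0, 1]);
});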

@CharlieFRuan CharlieFRuan marked this pull request as ready for review August 14, 2024 06:53
@Neet-Nestor (Contributor) commented:

This is great work. Thanks Charlie!

One suggestion: maybe we can use the official Web Locks API instead of implementing a custom lock, as it has more features, such as synchronization across tabs and workers and support for AbortSignal. It already has pretty good compatibility across browsers, so it may be worth a look.

@Neet-Nestor Neet-Nestor self-assigned this Aug 14, 2024
@Neet-Nestor Neet-Nestor self-requested a review August 14, 2024 12:25
@Neet-Nestor Neet-Nestor removed their assignment Aug 14, 2024
@CharlieFRuan (Contributor, Author) commented Aug 14, 2024

@Neet-Nestor Thanks for the review, this is a good point! After taking a look at the Web Locks API, it seems that the following usage:

await navigator.locks.request("my_resource", async (lock) => {
  // The lock has been acquired.
  await do_something_with_lock();
  await do_something_else_with_lock();
  // Now the lock will be released.
});

may not fit our needs well, compared to explicit `lock.acquire()` and `lock.release()`.

For instance, for a streaming request, we need to acquire the lock in `chatCompletion()` but release it inside `asyncGenerate()` only when the generation ends. I am not exactly sure how the callback format of the Web Locks API can achieve this. Similarly, I am not sure how the callback format works with `yield` in async generators.

I am not very familiar with the Web Locks API, but it seems `CustomLock` can provide more fine-grained control.
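
For illustration, the streaming path with explicit acquire/release might look roughly like this (a simplified sketch: it uses a single try/finally for brevity, whereas the actual generator uses the fine-grained try/catch described above):

async chatCompletion(request: ChatCompletionRequest) {
  const lock = this.loadedModelIdToLock.get(selectedModelId)!; // id resolved earlier
  await lock.acquire();
  if (request.stream) {
    // Hand the generator to the caller; the lock stays held until it finishes.
    return this.asyncGenerate(request, lock);
  }
  // ... non-streaming path generates, then releases in a finally block ...
}

private async *asyncGenerate(request: ChatCompletionRequest, lock: CustomLock) {
  try {
    // ... yield chunks as they are generated ...
  } finally {
    await lock.release(); // only released once the stream ends or errors
  }
}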

Edit: found this https://developer.mozilla.org/en-US/docs/Web/API/Web_Locks_API#advanced_use:

// Capture promise control functions:
let resolve, reject;
const p = new Promise((res, rej) => {
  resolve = res;
  reject = rej;
});

// Request the lock:
navigator.locks.request(
  "my_resource",
  // Lock is acquired.
  (lock) => p, // Now lock will be held until either resolve() or reject() is called.
);

@CharlieFRuan (Contributor, Author) commented:

Discussed with @Neet-Nestor offline. The conclusion is that we will use `CustomLock`.

  • Though Web Locks is the official API while `CustomLock` is self-implemented, the latter is simple enough that we can fix it ourselves if issues are encountered (and it provides all we need as of now)
  • Consider the case where two identical WebWorker-based apps are opened (e.g. opening simple-chat-ts twice)
    • Each app has its own web worker, hence its own engine
    • However, using Web Locks would synchronize them, since the locks share the same name
      • "While a lock is held, requests for the same lock from this execution context, or from other tabs/workers, will be queued."
    • Such implicit synchronization of two engines is not what we want (it could be solved by giving each engine a unique identifier, but that is an unnecessary burden)
    • The scope of `CustomLock`, on the other hand, is strictly per engine
  • On a side note, service-worker apps behave differently: opening two tabs of chat.webllm.ai uses the same service worker, hence the same engine

@CharlieFRuan (Contributor, Author) commented:

> Opening two tabs of WebLLMChat and sending concurrent requests, the latter request waits for the previous one to finish (prior to this PR, garbage output would be generated just like in the simple example above, since the two WebLLMChat tabs share the same service worker, hence the same engine).

Demonstration on WebLLMChat:

[Video demo: webllm-chat-concurent-req.mov]

@Neet-Nestor Neet-Nestor merged commit 1b691a5 into mlc-ai:main Aug 19, 2024
1 check passed
@Neet-Nestor Neet-Nestor deleted the pr-0814-lock branch August 19, 2024 20:00
CharlieFRuan added a commit that referenced this pull request Aug 19, 2024
### Changes
- #549
- In Engine, internally, a `CustomLock` is required to run `completion()`, `chatCompletion()`, and `embedding()`
  - A lock is mapped in `loadedModelIdToLock`

### TVMjs
Still compiled at apache/tvm@1fcb620, no change.