
[Fix] Implement lock to ensure FCFS of requests to same model #549

Merged
merged 2 commits into mlc-ai:main from pr-0814-lock
Aug 19, 2024

Conversation

@CharlieFRuan (Contributor) commented Aug 14, 2024

A model cannot handle more than one concurrent request (e.g. more than one call to `chat.completions.create()`), since we do not support continuous batching, and each request requires its own resources such as the KV cache. (Concurrent requests to different models in the same engine are supported, though.)

As a result, as pointed out in #522, when users try something like the following code:

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

async function sendRequest() {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    max_tokens: 64,
  });
  console.log(reply.choices[0].message.content);
}

// Two concurrent requests to the same model:
await Promise.all([sendRequest(), sendRequest()]);

the model's internal state and the generation results get corrupted.

To resolve this, we implement a `CustomLock` using `Promise`, maintaining a queue to ensure first-come-first-served (FCFS) ordering of incoming requests to a model, such that a request to a given model only starts once all previous requests to it have finished. The code above now works.
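
For reference, a minimal sketch of such a promise-based queue lock (the actual `CustomLock` in this PR may differ in detail):

class CustomLock {
  private acquired = false;
  // FCFS queue of waiters; each entry resumes one pending acquire().
  private queue: Array<() => void> = [];

  async acquire(): Promise<void> {
    if (!this.acquired) {
      this.acquired = true; // lock is free; take it immediately
      return;
    }
    // Lock is held; wait in line until a release() resumes us.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  async release(): Promise<void> {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the lock directly to the next waiter, in FCFS order
    } else {
      this.acquired = false;
    }
  }
}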

Implementation Details

  • We add `loadedModelIdToLock` to `MLCEngine`, maintaining a lock for each loaded model
    • Reminder: the need for a critical section is only per model, since each loaded model has its own `LLMChatPipeline` / `EmbeddingPipeline`
  • `loadedModelIdToLock` is cleared in `unload()` and set in `reloadInternal()`
  • We acquire the lock at the very beginning of `completion()`, `chatCompletion()`, and `embedding()`, once we know which model the current call will use
  • We release the lock at the end of `embedding()`, `completion()`, and `chatCompletion()` (for non-streaming cases), and of `asyncGenerate()` (for streaming cases)
  • Since we also want to release the lock when errors occur, we wrap the code in a big try/finally (see the sketch after this list)
  • Since `asyncGenerate()` is an async generator, we instead add fine-grained try/catch blocks there, only in the places that can throw errors
    • This makes the code less readable, but it is unclear whether there is a better solution.
  • For `WebWorkerMLCEngine`, no special handling is needed, since `WebWorkerMLCEngineHandler` calls the underlying engine's APIs (e.g. `chatCompletion()`), which will block
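
As referenced above, a sketch of how the lock might wrap the non-streaming path (illustrative only; `getModelIdToUse()` and `chatCompletionInternal()` are hypothetical stand-ins for the engine's internals):

async chatCompletion(request: ChatCompletionRequest): Promise<ChatCompletion> {
  // Resolve which loaded model this call targets (hypothetical helper).
  const selectedModelId = this.getModelIdToUse(request.model);
  const lock = this.loadedModelIdToLock.get(selectedModelId)!;
  await lock.acquire(); // FCFS: wait until earlier requests to this model finish
  try {
    return await this.chatCompletionInternal(request);
  } finally {
    await lock.release(); // released even if generation throws
  }
}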

Tested

  • Tested the `CustomLock` implementation with a unit test (the implementation follows this blog post); a sketch of such a test appears after this list
  • The example above now works
  • [get-started, get-started-web-worker] x [streaming, non-streaming] x [concurrent requests, single request]
  • examples/simple-chat-ts
  • examples/multi-models
  • WebLLMChat (with generation interrupts and manual termination of the service worker)
    • Opening two tabs of WebLLMChat and sending concurrent requests, the latter request waits for the previous one to finish (prior to this PR, garbage output would be generated just like in the simple example above, since the two WebLLMChat tabs share the same service worker, hence the same engine).
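
A minimal sketch of such an FCFS unit test (hypothetical; the PR's actual test may differ):

test("CustomLock serves requests first-come-first-served", async () => {
  const lock = new CustomLock();
  const finishOrder: number[] = [];

  const task = async (id: number, delayMs: number) => {
    await lock.acquire();
    try {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      finishOrder.push(id);
    } finally {
      await lock.release();
    }
  };

  // Task 0 is slower but acquires first, so it must still finish first.
  await Promise.all([task(0, 50), task(1, 0)]);
  expect(finishOrder).toEqual([0, 1]);
});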

@CharlieFRuan CharlieFRuan marked this pull request as ready for review August 14, 2024 06:53
@Neet-Nestor (Contributor) commented:

This is great work. Thanks Charlie!

One suggestion: maybe we can use the official Web Locks API instead of implementing a custom lock, as it has more features, such as synchronization across tabs and workers and support for AbortSignal. It already has pretty good compatibility across browsers, so it may be worth a look.

@Neet-Nestor Neet-Nestor self-assigned this Aug 14, 2024
@Neet-Nestor Neet-Nestor self-requested a review August 14, 2024 12:25
@Neet-Nestor Neet-Nestor removed their assignment Aug 14, 2024
@CharlieFRuan (Contributor, Author) commented Aug 14, 2024

@Neet-Nestor Thanks for the review, this is a good point! After taking a look at the Web Locks API, it seems that the following usage:

await navigator.locks.request("my_resource", async (lock) => {
  // The lock has been acquired.
  await do_something_with_lock();
  await do_something_else_with_lock();
  // Now the lock will be released.
});

may not fit our needs well, compared to explicit `lock.acquire()` and `lock.release()`.

For instance, for a streaming request, we need to acquire the lock in `chatCompletion()` but release it inside `asyncGenerate()` only when the generation ends. I am not exactly sure how the callback format of the Web Locks API can achieve this. Similarly, I am not sure how the callback format works with `yield` in async generators.

I am not very familiar with the Web Locks API, but it seems `CustomLock` can provide more fine-grained control.
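
For illustration, the streaming path with explicit acquire/release might look roughly like this (a simplified sketch: it uses a single try/finally for brevity, whereas the actual generator uses the fine-grained try/catch described above):

async chatCompletion(request: ChatCompletionRequest) {
  const lock = this.loadedModelIdToLock.get(selectedModelId)!; // id resolved earlier
  await lock.acquire();
  if (request.stream) {
    // Hand the generator to the caller; the lock stays held until it finishes.
    return this.asyncGenerate(request, lock);
  }
  // ... non-streaming path generates, then releases in a finally block ...
}

private async *asyncGenerate(request: ChatCompletionRequest, lock: CustomLock) {
  try {
    // ... yield chunks as they are generated ...
  } finally {
    await lock.release(); // only released once the stream ends or errors
  }
}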

Edit: found this https://developer.mozilla.org/en-US/docs/Web/API/Web_Locks_API#advanced_use:

// Capture promise control functions:
let resolve, reject;
const p = new Promise((res, rej) => {
  resolve = res;
  reject = rej;
});

// Request the lock:
navigator.locks.request(
  "my_resource",
  // Lock is acquired.
  (lock) => p, // Now lock will be held until either resolve() or reject() is called.
);

@CharlieFRuan (Contributor, Author) commented:

Discussed with @Neet-Nestor offline. The conclusion is that we will use `CustomLock`.

  • Though Web Locks is the official API while `CustomLock` is self-implemented, the latter is simple enough that we can fix it ourselves if issues are encountered (and it provides all we need as of now)
  • Consider the case where two identical WebWorker-based apps are opened (e.g. opening simple-chat-ts twice)
    • Each app has its own web worker, hence its own engine
    • However, using Web Locks would synchronize them, since the locks share the same name
      • "While a lock is held, requests for the same lock from this execution context, or from other tabs/workers, will be queued."
    • Such implicit synchronization of two engines is not what we want (it could be solved by giving each engine a unique identifier, but that is an unnecessary burden)
    • The scope of `CustomLock`, on the other hand, is strictly per engine
  • On a side note, service-worker apps behave differently: opening two tabs of chat.webllm.ai uses the same service worker, hence the same engine

@CharlieFRuan (Contributor, Author) commented:

> Opening two tabs of WebLLMChat and sending concurrent requests, the latter request waits for the previous one to finish (prior to this PR, garbage output would be generated just like in the simple example above, since the two WebLLMChat tabs share the same service worker, hence the same engine).

Demonstration on WebLLMChat:

[Video demo: webllm-chat-concurent-req.mov]

@Neet-Nestor Neet-Nestor merged commit 1b691a5 into mlc-ai:main Aug 19, 2024
1 check passed
@Neet-Nestor Neet-Nestor deleted the pr-0814-lock branch August 19, 2024 20:00
CharlieFRuan added a commit that referenced this pull request Aug 19, 2024
### Changes
- #549
- In Engine, internally, a `CustomLock` is required to run `completion()`, `chatCompletion()`, and `embedding()`
  - A lock is mapped in `loadedModelIdToLock`

### TVMjs
Still compiled at apache/tvm@1fcb620, no change.