[Fix] Implement lock to ensure FCFS of requests to same model #549
Conversation
This is great work. Thanks Charlie! One suggestion: maybe we can use the official Web Locks API instead of implementing a custom lock, as it has more features such as cross-tab and worker synchronization, among others.
@Neet-Nestor Thanks for the review, this is a good point! After taking a look at the Web Locks API a bit, it seems that the following usage:

```js
await navigator.locks.request("my_resource", async (lock) => {
  // The lock has been acquired.
  await do_something_with_lock();
  await do_something_else_with_lock();
  // Now the lock will be released.
});
```

may not fit our needs well compared to explicit acquire/release. For instance, for the streaming request we need to acquire the lock in `chatCompletion()` but can only release it when the async generator finishes, so the lock cannot be scoped to a single callback. I am not quite familiar with the Web Locks API, but it seems the basic usage above does not support this.

Edit: found this https://developer.mozilla.org/en-US/docs/Web/API/Web_Locks_API#advanced_use:

```js
// Capture promise control functions:
let resolve, reject;
const p = new Promise((res, rej) => {
  resolve = res;
  reject = rej;
});

// Request the lock:
navigator.locks.request(
  "my_resource",
  // Lock is acquired.
  (lock) => p, // Now lock will be held until either resolve() or reject() is called.
);
```
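For illustration only, here is a hedged sketch of how that advanced pattern could be wrapped to obtain explicit acquire/release semantics; the function name is hypothetical and this is not what the PR ended up using:

```typescript
// Wrap the callback-based Web Locks API into an explicit acquire/release shape:
// the lock stays held until the returned release() function is called.
function acquireWebLock(name: string): Promise<() => void> {
  return new Promise<() => void>((onAcquired) => {
    navigator.locks.request(name, () => {
      let release!: () => void;
      const held = new Promise<void>((res) => (release = res));
      onAcquired(release); // hand the release function back to the caller
      return held;         // keep the lock until release() is invoked
    });
  });
}
```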
Discussed with @Neet-Nestor offline. The conclusion is that we will use the custom lock implementation.
Demonstration on WebLLMChat: webllm-chat-concurent-req.mov
### Changes

- #549
- In Engine, internally, a `CustomLock` is required to run `completion()`, `chatCompletion()`, and `embedding()`
- A lock is mapped in `loadedModelIdToLock`

### TVMjs

Still compiled at apache/tvm@1fcb620, no change.
A model cannot handle more than one concurrent request (e.g. more than one call to `chat.completions.create()`), since we do not support continuous batching and each request requires its own resources such as the KV cache. (Concurrent requests to different models in the same engine are supported.) As a result, as pointed out in #522, when users issue overlapping requests to the same model, for example in the snippet sketched below, the model's state and the generation result are messed up.
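The exact snippet from #522 is not reproduced here; the following is a minimal sketch of the kind of usage that triggers the problem (the model id is illustrative):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Illustrative model id; any single loaded model reproduces the issue.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

  // Two requests to the same model fired concurrently (neither awaited before
  // the other starts). Before this fix, both generations interleaved on the
  // same pipeline, corrupting the chat state / KV cache and the outputs.
  const [r1, r2] = await Promise.all([
    engine.chat.completions.create({
      messages: [{ role: "user", content: "Tell me a joke." }],
    }),
    engine.chat.completions.create({
      messages: [{ role: "user", content: "Summarize what WebLLM is." }],
    }),
  ]);

  console.log(r1.choices[0].message.content);
  console.log(r2.choices[0].message.content);
}

main();
```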
To resolve this, we implement `CustomLock` using Promise, maintaining a queue to ensure FCFS for incoming requests to a model, such that for a single model a request only starts when all previous requests have finished. The code above now works.

### Implementation Details
- Add `loadedModelIdToLock` to `MLCEngine`, maintaining a lock for each loaded `LLMChatPipeline` / `EmbeddingPipeline`
  - `loadedModelIdToLock` is cleared in `unload()`, set in `reloadInternal()`
- Acquire the lock at the beginning of `completion()`, `chatCompletion()` and `embedding()`, after knowing which model the current call will use
- Release the lock at the end of `embedding()`, `completion()` and `chatCompletion()` (for non-streaming cases), and `asyncGenerate()` (for streaming cases)
- Use `try` / `finally` to guarantee the lock is released even when an error is thrown
  - Since `asyncGenerate()` is an async generator, we add `try` / `catch` fine-grainedly, only in places that can throw errors; otherwise an unreleased lock would cause the next request to the same model (e.g. another `chatCompletion()`) to block forever

### Tested
- `CustomLock` implementation with unit test (implementation follows this blog post)
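For reference, a minimal sketch of the Promise-chain FCFS idea described above, assuming a lock whose `acquire()` returns a release callback; the actual `CustomLock` in this PR and the linked blog post may differ in details:

```typescript
// A minimal FCFS lock sketch: each acquire() chains onto the previous
// holder's release, so requests proceed strictly in arrival order.
class CustomLock {
  private last: Promise<void> = Promise.resolve();

  // Resolves once every earlier acquirer has released; returns a release callback.
  async acquire(): Promise<() => void> {
    let release!: () => void;
    const next = new Promise<void>((resolve) => (release = resolve));
    const prev = this.last;
    this.last = next; // later acquirers will wait on this request
    await prev;       // wait until all earlier requests have finished
    return release;
  }
}

// Usage: ensure only one request per model runs at a time.
async function withLock<T>(lock: CustomLock, fn: () => Promise<T>): Promise<T> {
  const release = await lock.acquire();
  try {
    return await fn();
  } finally {
    release(); // always release, even if fn() throws
  }
}
```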