Feature request: engine.preload() #529
I think you can simply achieve this by creating a second instance of the engine.
Interesting idea, thanks. The thing is, I don't always need to also start the model. For example, a user might want to go on a long airplane trip and pre-download some models from a list (kind of like pre-loading the map of Spain into OSMAND, or your map app of choice, before going on holiday). But maybe I can just forego switching to the new engine instance? Then the files will still be downloaded anyway, right? For comparison, this is how Wllama does it: it's just a helper function that loads the chunks into the cache, and then stops there.
@CharlieFRuan Following up on this: if I do something like this to create and load an additional engine instance, but never actually run a completion on it, would that download the additional models without causing GPU memory issues?
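(The original snippet isn't preserved in this thread; the sketch below shows what such a download-only second engine might look like, assuming the @mlc-ai/web-llm exports CreateMLCEngine, initProgressCallback, and unload(). The model ID is a placeholder.)

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Sketch only: the model ID below is a placeholder, not a recommendation.
// Creating the engine downloads the model artifacts into the browser cache;
// we never run chat completions on this instance.
const downloadEngine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (report) => {
    console.log(`progress ${report.progress}: ${report.text}`);
  },
});

// Release GPU resources right away; the downloaded files remain in the cache,
// so a later reload of this model on the main engine should be a cache hit.
await downloadEngine.unload();
```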
Thanks for the thoughts and discussions @Neet-Nestor @flatsiedatsie! The code above will work fine. Therefore, one way to "only download a model, without touching WebGPU" is to:
On the webllm side:
I ended up writing a custom function that manually loads the files into the cache. I didn't expect splitting the downloading from the inference to have such a big effect, but it has helped simplify my code. It's now also possible for users to load and use models they have already downloaded while they wait for a new one to finish downloading.
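For reference, a stripped-down version of such a helper might look like the sketch below. It only relies on the standard Cache Storage API; the cache name and the list of shard URLs are assumptions, since WebLLM's internal cache layout isn't documented in this thread.

```ts
// Hypothetical download-only helper: fetch each model shard into Cache Storage
// so a later engine.reload() finds the files already cached.
// "webllm/model" and shardUrls are placeholders, not documented WebLLM values.
async function prefetchModelShards(
  shardUrls: string[],
  cacheName = "webllm/model",
): Promise<void> {
  const cache = await caches.open(cacheName);
  for (const url of shardUrls) {
    // Skip shards that are already present so the helper is resumable.
    if (!(await cache.match(url))) {
      await cache.add(url); // fetches the URL and stores the response
    }
  }
}
```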
Perhaps related to this PR, but the opposite:
I'd like to be able to easily ask WebLLM to download a second (or third, etc.) model to the cache while continuing to use the existing, already loaded model, and then get a callback when the second model has finished downloading, so that I can inform the user they can now switch to the other model if they prefer.
Or is there already an optimal way to do this?
Currently my idea is to write a separate function that manually loads the new shards into the cache, outside of WebLLM. But I'd prefer to use WebLLM for this if such a feature already exists (I searched the repo but couldn't find one).
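To make the request concrete, a hypothetical shape for such an API could look like the sketch below. engine.preload() does not exist in WebLLM; the option names, callback, and model ID are illustrative only, while engine.reload() is an existing method.

```ts
// Keep serving the currently loaded model while a second one downloads.
// preload() and onProgress are the *requested* (hypothetical) API, not real.
await engine.preload("SmolLM2-1.7B-Instruct-q4f16_1-MLC", {
  onProgress: (report) => showDownloadProgress(report.progress), // hypothetical UI hook
});

// Later, only if the user opts in, switch to the newly cached model.
await engine.reload("SmolLM2-1.7B-Instruct-q4f16_1-MLC");
```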