This repository contains an example of running Phi-3-mini-4k-instruct in your browser using ONNX Runtime Web with WebGPU.
You can try out the live demo here.
We keep this example simple and use the onnxruntime-web API directly; ONNX Runtime Web also powers higher-level frameworks such as transformers.js.
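For context, creating a session with the onnxruntime-web API directly takes only a few lines. The sketch below is illustrative, assuming a placeholder model.onnx URL rather than the files this example actually downloads:

```js
// Minimal sketch: an ONNX Runtime Web inference session on WebGPU.
// 'model.onnx' is a placeholder URL, not the file shipped with this example.
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
});
console.log('inputs:', session.inputNames, 'outputs:', session.outputNames);
```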
Ensure that you have Node.js installed on your machine.
Install the required dependencies:
`npm install`
Build the project:
`npm run build`
The output can be found in the dist directory.
`npm run dev`
This will build the project and start a dev server. Point your browser to http://localhost:8080/.
The model used in this example is hosted on Hugging Face. It is an optimized ONNX version specific to the Web and slightly different from the ONNX models for CUDA or CPU:
- The model output 'logits' is kept as float32 (even for float16 models) because JavaScript does not support float16.
- Our WebGPU implementation uses the custom MultiHeadAttention operator instead of GroupQueryAttention.
- Phi-3 is larger than 2GB, so we need to use external data files. To keep them cacheable in the browser, both model.onnx and model.onnx.data are kept under 2GB (see the loading sketch after this list).
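Putting the notes above together, a sketch of loading the split model files and reading the float32 logits could look like the following. The function names are hypothetical, and the feeds (input_ids, attention mask, KV-cache tensors) are left to the caller:

```js
import * as ort from 'onnxruntime-web/webgpu';

// Load model.onnx together with its external data file; keeping both files
// under 2GB lets the browser cache each of them.
async function loadPhi3(modelUrl, dataUrl) {
  return ort.InferenceSession.create(modelUrl, {
    executionProviders: ['webgpu'],
    externalData: [{ data: dataUrl, path: 'model.onnx.data' }],
  });
}

// 'feeds' must supply input_ids, the attention mask, and KV-cache tensors.
// The 'logits' output is float32 even for a float16 model, so it reads back
// directly as a Float32Array in JavaScript.
async function getLogits(session, feeds) {
  const results = await session.run(feeds);
  return results.logits.data; // Float32Array
}
```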
If you would like to optimize your fine-tuned PyTorch Phi-3-mini model, you can use Olive, which supports float data type conversion and the ONNX GenAI model builder toolkit. An example of how to optimize the Phi-3-mini model for ONNX Runtime Web with Olive can be found here.
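For illustration only, the GenAI model builder is typically invoked as a Python module, roughly as below. The exact flags, and whether your installed version accepts web as an execution-provider target, are assumptions to verify against the linked Olive example:

```sh
# Assumed invocation of the ONNX Runtime GenAI model builder; the -p/-e
# flags and the 'web' target may differ across versions.
python -m onnxruntime_genai.models.builder \
    -m microsoft/Phi-3-mini-4k-instruct \
    -o ./phi3-mini-web \
    -p int4 \
    -e web
```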