We'll likely explore more specific APIs for tool- and function-calling in the future; follow along in [issue #7](https://github.com/webmachinelearning/prompt-api/issues/7).
### Multimodal inputs
All of the above examples have been of text prompts. Some language models also support other inputs. Our design initially includes the potential to support images and audio clips as inputs. This is done by using objects in the form `{ type: "image", content }` and `{ type: "audio", content }` instead of strings. The `content` values can be the following:
* For image inputs: [`ImageBitmapSource`](https://html.spec.whatwg.org/#imagebitmapsource), i.e. `Blob`, `ImageData`, `ImageBitmap`, `VideoFrame`, `OffscreenCanvas`, `HTMLImageElement`, `SVGImageElement`, `HTMLCanvasElement`, or `HTMLVideoElement` (will get the current frame). Also raw bytes via `BufferSource` (i.e. `ArrayBuffer` or typed arrays).
* For audio inputs: for now, `Blob`, `AudioBuffer`, or raw bytes via `BufferSource`. Other possibilities we're investigating include `HTMLAudioElement`, `AudioData`, and `MediaStream`, but we're not yet sure if those are suitable to represent "clips": most other uses of them on the web platform are able to handle streaming data.
Sessions that will include these inputs need to be created using the `expectedInputs` option, to ensure that any necessary downloads are done as part of session creation, and that if the model is not capable of such multimodal prompts, the session creation fails. (See also the below discussion of [expected input languages](#multilingual-content-and-expected-languages), not just expected input types.)
A sample of using these APIs:
```js
const session = await ai.languageModel.create({
  // { type: "text" } is not necessary to include explicitly, unless
  // you also want to include expected input languages for text.
  expectedInputs: [
    { type: "audio" },
    { type: "image" }
  ]
});
```
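
Prompting with these inputs then uses the same `{ type, content }` form. A brief sketch of what that could look like (assuming `prompt()` accepts an array mixing strings and typed entries; `recordSomeAudio()` is a hypothetical helper, not part of the API):

```js
// Hypothetical inputs: a user-selected image file and a recorded audio clip.
const imageFile = document.querySelector("input[type=file]").files[0]; // a Blob
const audioClip = await recordSomeAudio(); // assumed to return an AudioBuffer or Blob

const result = await session.prompt([
  "Describe what is in this image, then answer the spoken question about it:",
  { type: "image", content: imageFile },
  { type: "audio", content: audioClip }
]);

console.log(result);
```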
Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)
Details:
* Cross-origin data that has not been exposed using the `Access-Control-Allow-Origin` header cannot be used with the prompt API, and will reject with a `"SecurityError"` `DOMException`. This applies to `HTMLImageElement`, `SVGImageElement`, `HTMLVideoElement`, `HTMLCanvasElement`, and `OffscreenCanvas`. Note that this is stricter than `createImageBitmap()`, which has a tainting mechanism that allows creating opaque image bitmaps from unexposed cross-origin resources. For the prompt API, such resources will simply fail, including attempts to use cross-origin-tainted canvases.
* Raw-bytes cases (`Blob` and `BufferSource`) will apply the appropriate sniffing rules ([for images](https://mimesniff.spec.whatwg.org/#rules-for-sniffing-images-specifically), [for audio](https://mimesniff.spec.whatwg.org/#rules-for-sniffing-audio-and-video-specifically)) and reject with a `"NotSupportedError"` `DOMException` if the format is not supported. This behavior is similar to that of `createImageBitmap()`. (See the error-handling sketch after this list.)
* For animated images, only the first frame will be used (as with `createImageBitmap()`). In the future, animated image input may be supported via a separate opt-in, similar to video clip input, but in the initial version we want to avoid interoperability problems arising from some implementations supporting animated images and others not.
* For `HTMLVideoElement`, even a single frame might not yet be downloaded when the prompt API is called. In such cases, calling into the prompt API will force at least a single frame's worth of video to download. (The intent is to behave the same as `createImageBitmap(videoEl)`.)
* Text prompts can also be done via `{ type: "text", content: aString }`, instead of just `aString`. This can be useful for generic code.
* Attempting to supply an invalid combination, e.g. `{ type: "audio", content: anImageBitmap }`, `{ type: "image", content: anAudioBuffer }`, or `{ type: "text", content: anArrayBuffer }`, will reject with a `TypeError`.
* As described [above](#customizing-the-role-per-prompt), you can also supply a `role` value in these objects, so that the full form is `{ role, type, content }`. However, for now, using any role besides the default `"user"` role with an image or audio prompt will reject with a `"NotSupportedError"` `DOMException`. (As we explore multimodal outputs, this restriction might be lifted in the future.)
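
Because the failure modes in this list surface as promise rejections, code that forwards page-supplied media to the prompt API can handle them explicitly. A sketch of such handling (the `describeImage()` helper and its fallback messages are illustrative, not part of the API):

```js
// `imageSource` can be any of the supported image types, e.g. a Blob, an
// ImageBitmap, or an <img> element.
async function describeImage(session, imageSource) {
  try {
    return await session.prompt([
      { type: "text", content: "Describe this image briefly:" },
      { type: "image", content: imageSource }
    ]);
  } catch (e) {
    if (e.name === "NotSupportedError") {
      // Raw bytes that did not sniff as a supported image format.
      return "Sorry, that image format isn't supported.";
    }
    if (e.name === "SecurityError") {
      // A cross-origin element or tainted canvas that was not CORS-exposed.
      return "Sorry, that image can't be read from this page.";
    }
    throw e; // e.g. a TypeError from an invalid type/content combination
  }
}
```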
### Structured output or JSON output
To help with programmatic processing of language model responses, the prompt API supports structured outputs defined by a JSON schema.
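
As an illustrative sketch only, one plausible shape is supplying a schema as a per-prompt option; the `responseConstraint` option name and the JSON-string result shown here are assumptions of this sketch, not details confirmed by this section:

```js
// Sketch only: `responseConstraint` and the JSON-string result are assumptions.
const schema = {
  type: "object",
  properties: {
    rating: { type: "number", minimum: 1, maximum: 5 },
    summary: { type: "string" }
  },
  required: ["rating", "summary"]
};

const result = await session.prompt(
  "Rate and summarize this review: it arrived late, but works great.",
  { responseConstraint: schema }
);

// Assuming the result is a JSON string conforming to the schema:
const { rating, summary } = JSON.parse(result);
```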
```js
    prefer speaking in Japanese, and return to the Japanese conversation once any sidebars are
    concluded.
  `,
  expectedInputs: [{
    type: "text",
    languages: ["en" /* for the system prompt */, "ja", "ko"]
  }]
});
```
The expected input languages are supplied alongside the [expected input types](#multimodal-inputs), and can vary per type. Our above example assumes the default of `type: "text"`, but more complicated combinations are possible, e.g.:
```js
const session = await ai.languageModel.create({
  expectedInputs: [
    // Be sure to download any material necessary for English and Japanese text
    // prompts, or fail-fast if the model cannot support that.
    { type: "text", languages: ["en", "ja"] },

    // `languages` omitted: audio input processing will be best-effort based on
    // the base model's capability.
    { type: "audio" },

    // Be sure to download any material necessary for OCRing French text in
    // images, or fail-fast if the model cannot support that.
    { type: "image", languages: ["fr"] }
  ]
});
```
If the download fails, then `downloadprogress` events will stop being emitted, and the promise returned by `create()` will be rejected with a "`NetworkError`" `DOMException`.
Note that when multiple entities are downloaded (e.g., a base model plus [LoRA fine-tunings](https://arxiv.org/abs/2106.09685) for the `expectedInputs`), web developers do not get the ability to monitor the individual downloads. All of them are bundled into the overall `downloadprogress` events, and the `create()` promise is not fulfilled until all downloads and loads are successful.
The event is a [`ProgressEvent`](https://developer.mozilla.org/en-US/docs/Web/API/ProgressEvent) whose `loaded` property is between 0 and 1, and whose `total` property is always 1. (The exact number of total or downloaded bytes is not exposed; see the discussion in [webmachinelearning/writing-assistance-apis issue #15](https://github.com/webmachinelearning/writing-assistance-apis/issues/15).)
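
A page can surface this progress in its own UI. A minimal sketch, assuming a `monitor` creation option through which the `downloadprogress` events are delivered (treat the exact registration mechanism as an assumption of this sketch):

```js
const progressBar = document.querySelector("progress"); // e.g. <progress max="1">

const session = await ai.languageModel.create({
  expectedInputs: [{ type: "image" }],
  // Assumption of this sketch: a `monitor` option exposing downloadprogress events.
  monitor(m) {
    m.addEventListener("downloadprogress", e => {
      // `loaded` is between 0 and 1; `total` is always 1.
      progressBar.value = e.loaded;
    });
  }
});
```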