
Commit 331914a

Add image and audio prompting API
Closes #40. Somewhat helps with #70.
1 parent cbd111e commit 331914a

File tree: 1 file changed, +139 -24 lines


README.md (+139 -24)
@@ -173,6 +173,65 @@ console.log(await promptWithCalculator("What is 2 + 2?"));
 
 We'll likely explore more specific APIs for tool- and function-calling in the future; follow along in [issue #7](https://github.com/webmachinelearning/prompt-api/issues/7).
 
+### Multimodal inputs
+
+All of the above examples have been of text prompts. Some language models also support other inputs. Our design initially includes the potential to support images and audio clips as inputs. This is done by using objects in the form `{ type: "image", content }` and `{ type: "audio", content }` instead of strings. The `content` values can be the following:
+
+* For image inputs: [`ImageBitmapSource`](https://html.spec.whatwg.org/#imagebitmapsource), i.e. `Blob`, `ImageData`, `ImageBitmap`, `VideoFrame`, `OffscreenCanvas`, `HTMLImageElement`, `SVGImageElement`, `HTMLCanvasElement`, or `HTMLVideoElement` (will get the current frame). Also raw bytes via `BufferSource` (i.e. `ArrayBuffer` or typed arrays).
+
+* For audio inputs: for now, `Blob`, `AudioBuffer`, or raw bytes via `BufferSource`. Other possibilities we're investigating include `HTMLAudioElement`, `AudioData`, and `MediaStream`, but we're not yet sure if those are suitable to represent "clips": most other uses of them on the web platform are able to handle streaming data.
+
+Sessions that will include these inputs need to be created using the `expectedInputs` option, to ensure that any necessary downloads are done as part of session creation, and that if the model is not capable of such multimodal prompts, the session creation fails. (See also the below discussion of [expected input languages](#multilingual-content-and-expected-languages), not just expected input types.)
+
+A sample of using these APIs:
+
+```js
+const session = await ai.languageModel.create({
+  // { type: "text" } is not necessary to include explicitly, unless
+  // you also want to include expected input languages for text.
+  expectedInputs: [
+    { type: "audio" },
+    { type: "image" }
+  ]
+});
+
+const referenceImage = await (await fetch("/reference-image.jpeg")).blob();
+const userDrawnImage = document.querySelector("canvas");
+
+const response1 = await session.prompt([
+  "Give a helpful artistic critique of how well the second image matches the first:",
+  { type: "image", content: referenceImage },
+  { type: "image", content: userDrawnImage }
+]);
+
+console.log(response1);
+
+const audioBlob = await captureMicrophoneInput({ seconds: 10 });
+
+const response2 = await session.prompt([
+  "My response to your critique:",
+  { type: "audio", content: audioBlob }
+]);
+```
+
+Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)
+
+Details:
+
+* Cross-origin data that has not been exposed using the `Access-Control-Allow-Origin` header cannot be used with the prompt API, and will reject with a `"SecurityError"` `DOMException`. This applies to `HTMLImageElement`, `SVGImageElement`, `HTMLVideoElement`, `HTMLCanvasElement`, and `OffscreenCanvas`. Note that this is more strict than `createImageBitmap()`, which has a tainting mechanism that allows creating opaque image bitmaps from unexposed cross-origin resources. For the prompt API, such resources will just fail. This includes attempts to use cross-origin-tainted canvases.
+
+* Raw-bytes cases (`Blob` and `BufferSource`) will apply the appropriate sniffing rules ([for images](https://mimesniff.spec.whatwg.org/#rules-for-sniffing-images-specifically), [for audio](https://mimesniff.spec.whatwg.org/#rules-for-sniffing-audio-and-video-specifically)) and reject with a `"NotSupportedError"` `DOMException` if the format is not supported. This behavior is similar to that of `createImageBitmap()`.
+
+* Animated images will be required to snapshot the first frame (like `createImageBitmap()`). In the future, animated image input may be supported via some separate opt-in, similar to video clip input. But we don't want interoperability problems from some implementations supporting animated images and some not, in the initial version.
+
+* For `HTMLVideoElement`, even a single frame might not yet be downloaded when the prompt API is called. In such cases, calling into the prompt API will force at least a single frame's worth of video to download. (The intent is to behave the same as `createImageBitmap(videoEl)`.)
+
+* Text prompts can also be done via `{ type: "text", content: aString }`, instead of just `aString`. This can be useful for generic code.
+
+* Attempting to supply an invalid combination, e.g. `{ type: "audio", content: anImageBitmap }`, `{ type: "image", content: anAudioBuffer }`, or `{ type: "text", content: anArrayBuffer }`, will reject with a `TypeError`.
+
+* As described [above](#customizing-the-role-per-prompt), you can also supply a `role` value in these objects, so that the full form is `{ role, type, content }`. However, for now, using any role besides the default `"user"` role with an image or audio prompt will reject with a `"NotSupportedError"` `DOMException`. (As we explore multimodal outputs, this restriction might be lifted in the future.)
+
 ### Structured output or JSON output
 
 To help with programmatic processing of language model responses, the prompt API supports structured outputs defined by a JSON schema.
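The last two bullets in the "Details" list of this hunk lend themselves to a short illustration. A minimal sketch, reusing the hypothetical `session`, `referenceImage`, and `userDrawnImage` from the sample earlier in the hunk (not part of this commit):

```js
// Generic code can always spell out the text form explicitly rather than
// passing a bare string.
const critique = await session.prompt([
  { type: "text", content: "Describe the palette of this image in one sentence:" },
  { type: "image", content: referenceImage }
]);

// A mismatched type/content pair (here, a canvas element supplied as audio)
// rejects with a TypeError.
try {
  await session.prompt([{ type: "audio", content: userDrawnImage }]);
} catch (e) {
  console.assert(e instanceof TypeError);
}
```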
@@ -369,7 +428,30 @@ const session = await ai.languageModel.create({
     prefer speaking in Japanese, and return to the Japanese conversation once any sidebars are
     concluded.
   `,
-  expectedInputLanguages: ["en" /* for the system prompt */, "ja", "ko"]
+  expectedInputs: [{
+    type: "text",
+    languages: ["en" /* for the system prompt */, "ja", "ko"]
+  }]
+});
+```
+
+The expected input languages are supplied alongside the [expected input types](#multimodal-inputs), and can vary per type. Our above example assumes the default of `type: "text"`, but more complicated combinations are possible, e.g.:
+
+```js
+const session = await ai.languageModel.create({
+  expectedInputs: [
+    // Be sure to download any material necessary for English and Japanese text
+    // prompts, or fail-fast if the model cannot support that.
+    { type: "text", languages: ["en", "ja"] },
+
+    // `languages` omitted: audio input processing will be best-effort based on
+    // the base model's capability.
+    { type: "audio" },
+
+    // Be sure to download any material necessary for OCRing French text in
+    // images, or fail-fast if the model cannot support that.
+    { type: "image", languages: ["fr"] }
+  ]
 });
 ```
 
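As a rough sketch of how the last session above might then be used, an image expected to contain French text is supplied like any other image prompt (the fetch URL and prompt wording here are illustrative only, not part of this commit):

```js
// The session declared { type: "image", languages: ["fr"] }, so any material
// needed for handling French text in images should already be available.
const menuPhoto = await (await fetch("/menu-photo.jpg")).blob();

const summary = await session.prompt([
  "Summarize in English the French text in this image:",
  { type: "image", content: menuPhoto }
]);
```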

@@ -391,7 +473,13 @@ The method will return a promise that fulfills with one of the following availab
 An example usage is the following:
 
 ```js
-const options = { expectedInputLanguages: ["en", "es"], temperature: 2 };
+const options = {
+  expectedInputs: [
+    { type: "text", languages: ["en", "es"] },
+    { type: "audio", languages: ["en", "es"] }
+  ],
+  temperature: 2
+};
 
 const availability = await ai.languageModel.availability(options);
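The `options` object shown here also satisfies `AILanguageModelCreateCoreOptions` (see the IDL further down: `temperature` and `expectedInputs` are both core options), so it can plausibly be reused when creating the session. A short sketch of that follow-up, which assumes the availability value returned above was acceptable to the application:

```js
// Reuse the same core options (expectedInputs, temperature) for creation.
const session = await ai.languageModel.create(options);
```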

@@ -424,7 +512,7 @@ const session = await ai.languageModel.create({
 
 If the download fails, then `downloadprogress` events will stop being emitted, and the promise returned by `create()` will be rejected with a "`NetworkError`" `DOMException`.
 
-Note that in the case that multiple entities are downloaded (e.g., a base model plus a [LoRA fine-tuning](https://arxiv.org/abs/2106.09685) for the `expectedInputLanguages`) web developers do not get the ability to monitor the individual downloads. All of them are bundled into the overall `downloadprogress` events, and the `create()` promise is not fulfilled until all downloads and loads are successful.
+Note that in the case that multiple entities are downloaded (e.g., a base model plus [LoRA fine-tunings](https://arxiv.org/abs/2106.09685) for the `expectedInputs`) web developers do not get the ability to monitor the individual downloads. All of them are bundled into the overall `downloadprogress` events, and the `create()` promise is not fulfilled until all downloads and loads are successful.
 
 The event is a [`ProgressEvent`](https://developer.mozilla.org/en-US/docs/Web/API/ProgressEvent) whose `loaded` property is between 0 and 1, and whose `total` property is always 1. (The exact number of total or downloaded bytes are not exposed; see the discussion in [webmachinelearning/writing-assistance-apis issue #15](https://github.com/webmachinelearning/writing-assistance-apis/issues/15).)
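For concreteness, a condensed sketch of observing this aggregate progress while creating a session whose `expectedInputs` trigger an extra download. It assumes the `monitor` callback (an `AICreateMonitorCallback` in the IDL below) receives an event target that fires the `downloadprogress` events described above:

```js
const session = await ai.languageModel.create({
  expectedInputs: [{ type: "audio" }],
  monitor(m) {
    // Progress is aggregated over every entity being downloaded; per-entity
    // progress is intentionally not exposed.
    m.addEventListener("downloadprogress", e => {
      console.log(`Downloaded ${Math.round(e.loaded * 100)}% of what is needed.`);
    });
  }
});
```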

@@ -481,17 +569,26 @@ interface AILanguageModelFactory {
 
 [Exposed=(Window,Worker), SecureContext]
 interface AILanguageModel : EventTarget {
-  Promise<DOMString> prompt(AILanguageModelPromptInput input, optional AILanguageModelPromptOptions options = {});
-  ReadableStream promptStreaming(AILanguageModelPromptInput input, optional AILanguageModelPromptOptions options = {});
-
-  Promise<unsigned long long> countPromptTokens(AILanguageModelPromptInput input, optional AILanguageModelPromptOptions options = {});
+  // These will throw "NotSupportedError" DOMExceptions if role = "system"
+  Promise<DOMString> prompt(
+    AILanguageModelPromptInput input,
+    optional AILanguageModelPromptOptions options = {}
+  );
+  ReadableStream promptStreaming(
+    AILanguageModelPromptInput input,
+    optional AILanguageModelPromptOptions options = {}
+  );
+
+  Promise<unsigned long long> countPromptTokens(
+    AILanguageModelPromptInput input,
+    optional AILanguageModelPromptOptions options = {}
+  );
   readonly attribute unsigned long long maxTokens;
   readonly attribute unsigned long long tokensSoFar;
   readonly attribute unsigned long long tokensLeft;
 
   readonly attribute unsigned long topK;
   readonly attribute float temperature;
-  readonly attribute FrozenArray<DOMString>? expectedInputLanguages;
 
   attribute EventHandler oncontextoverflow;
 
@@ -518,25 +615,15 @@ dictionary AILanguageModelCreateCoreOptions {
   unrestricted double topK;
   unrestricted double temperature;
 
-  sequence<DOMString> expectedInputLanguages;
-}
+  sequence<AILanguageModelExpectedInput> expectedInputs;
+};
 
 dictionary AILanguageModelCreateOptions : AILanguageModelCreateCoreOptions {
   AbortSignal signal;
   AICreateMonitorCallback monitor;
 
   DOMString systemPrompt;
-  sequence<AILanguageModelInitialPrompt> initialPrompts;
-};
-
-dictionary AILanguageModelInitialPrompt {
-  required AILanguageModelInitialPromptRole role;
-  required DOMString content;
-};
-
-dictionary AILanguageModelPrompt {
-  required AILanguageModelPromptRole role;
-  required DOMString content;
+  sequence<AILanguageModelPrompt> initialPrompts;
 };
 
 dictionary AILanguageModelPromptOptions {
@@ -548,10 +635,38 @@ dictionary AILanguageModelCloneOptions {
   AbortSignal signal;
 };
 
-typedef (DOMString or AILanguageModelPrompt or sequence<AILanguageModelPrompt>) AILanguageModelPromptInput;
+dictionary AILanguageModelExpectedInput {
+  required AILanguageModelPromptType type;
+  sequence<DOMString> languages;
+};
+
+// The argument to the prompt() method and others like it
+
+typedef (AILanguageModelPrompt or sequence<AILanguageModelPrompt>) AILanguageModelPromptInput;
+
+// Prompt lines
+
+typedef (
+  DOMString // interpreted as { role: "user", type: "text", content: providedValue }
+  or AILanguageModelPromptDict // canonical form
+) AILanguageModelPrompt;
+
+dictionary AILanguageModelPromptDict {
+  AILanguageModelPromptRole role = "user";
+  AILanguageModelPromptType type = "text";
+  required AILanguageModelPromptContent content;
+};
+
+enum AILanguageModelPromptRole { "system", "user", "assistant" };
+
+enum AILanguageModelPromptType { "text", "image", "audio" };
 
-enum AILanguageModelInitialPromptRole { "system", "user", "assistant" };
-enum AILanguageModelPromptRole { "user", "assistant" };
+typedef (
+  ImageBitmapSource
+  or AudioBuffer
+  or BufferSource
+  or DOMString
+) AILanguageModelPromptContent;
 ```
 
 ### Instruction-tuned versus base models
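Tying the IDL back to the earlier JavaScript examples: per the typedef comment on `AILanguageModelPrompt` and the dictionary defaults in `AILanguageModelPromptDict`, the following calls are equivalent ways of sending a single user text prompt (a sketch assuming an existing `session`):

```js
await session.prompt("Tell me a joke.");
await session.prompt({ content: "Tell me a joke." }); // role and type default to "user" / "text"
await session.prompt([
  { role: "user", type: "text", content: "Tell me a joke." }
]);
```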
