[Hold] Video and audio file processing #592

Draft
wants to merge 5 commits into
base: main
@@ -7,5 +7,10 @@ strategies other than **Auto** for sets of documents of different types could pr
including reduction in transformation quality.

- **VLM**: For the highest-quality transformation of these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
- **High Res**: For all other [supported file types](/ui/supported-file-types), and for the generation of bounding box coordinates.
- **Fast**: For text-only documents.
- **High Res**: For all other [supported file types](/ui/supported-file-types) except video and audio files, and for the generation of bounding box coordinates.
- **Fast**: For text-only documents.
- **Multimedia**: For video and audio files.

<Note>
Video and audio file partitioning is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>
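
The strategy guidance above can be sketched as a small lookup. This is only an illustration, not part of Unstructured: the function name `suggest_strategy`, the `text_only` flag, and the order in which the rules are checked are assumptions made for this example. The extension sets are copied from the guidance and from the audio and video rows of the supported file types table in this change.

```python
from pathlib import Path

# Extensions the guidance lists for the VLM strategy.
VLM_EXTENSIONS = {
    ".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp",
}

# Union of the Audio and Video rows in the supported file types table.
MULTIMEDIA_EXTENSIONS = {
    ".3gp", ".aac", ".avi", ".flac", ".flv", ".m4a", ".mov", ".mp2", ".mp3",
    ".mp4", ".mpeg", ".mpegs", ".mpg", ".mpgs", ".ogg", ".opus", ".pcm",
    ".wav", ".webm", ".wmv",
}


def suggest_strategy(filename: str, text_only: bool = False) -> str:
    """Return the strategy name the guidance suggests for one file.

    Hypothetical helper: the precedence (Multimedia, then Fast, then VLM,
    then High Res) is an assumption, since the guidance does not rank them.
    """
    ext = Path(filename).suffix.lower()
    if ext in MULTIMEDIA_EXTENSIONS:
        return "Multimedia"
    if text_only:
        return "Fast"
    if ext in VLM_EXTENSIONS:
        return "VLM"
    return "High Res"
```

For mixed sets of documents, **Auto** remains the simpler choice; a lookup like this is only useful when a workflow handles a single known file type.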
22 changes: 22 additions & 0 deletions snippets/general-shared-text/supported-file-types-platform.mdx
@@ -4,7 +4,10 @@ By file extension:

| File extension |
| --- |
| `.3gp` |
| `.aac` |
| `.abw` |
| `.avi` |
| `.bmp` |
| `.csv` |
| `.cwk` |
@@ -19,6 +22,8 @@ By file extension:
| `.epub` |
| `.et` |
| `.eth` |
| `.flac` |
| `.flv` |
| `.fods` |
| `.gif` |
| `.heic` |
@@ -27,14 +32,26 @@ By file extension:
| `.hwp` |
| `.jpeg` |
| `.jpg` |
| `.m4a` |
| `.md` |
| `.mcw` |
| `.mov` |
| `.mp2` |
| `.mp3` |
| `.mp4` |
| `.mpeg` |
| `.mpegs` |
| `.mpg` |
| `.mpgs` |
| `.mw` |
| `.odt` |
| `.ogg` |
| `.opus` |
| `.org` |
| `.p7s` |
| `.pages` |
| `.pbd` |
| `.pcm` |
| `.pdf` |
| `.png` |
| `.pot` |
@@ -55,9 +72,12 @@ By file extension:
| `.uof` |
| `.uos1` |
| `.uos2` |
| `.wav` |
| `.web` |
| `.webm` |
| `.webp` |
| `.wk2` |
| `.wmv` |
| `.xls` |
| `.xlsb` |
| `.xlsm` |
@@ -71,6 +91,7 @@ By file type:
| Category | File types |
| --- | --- |
| Apple | `.cwk`, `.mcw`, `.pages` |
| Audio | `.aac`, `.flac`, `.m4a`, `.mp2`, `.mp3`, `.mp4`, `.ogg`, `.opus`, `.pcm`, `.wav`, `.webm` |
| CSV | `.csv` |
| Data interchange | `.dif` |
| dBase | `.dbf` |
@@ -90,5 +111,6 @@ By file type:
| Spreadsheet | `.et`, `.fods`, `.uos1`, `.uos2`, `.wk2`, `.xls`, `.xlsb`, `.xlsm`, `.xlsx`, `.xlw` |
| StarOffice | `.sxg` |
| TSV | `.tsv` |
| Video | `.3gp`, `.avi`, `.flv`, `.mov`, `.mp4`, `.mpeg`, `.mpegs`, `.mpg`, `.mpgs`, `.webm`, `.wmv` |
| Word processing | `.abw`, `.doc`, `.docm`, `.docx`, `.dot`, `.dotm`, `.hwp`, `.zabw` |
| XML | `.xml` |
60 changes: 25 additions & 35 deletions snippets/quickstarts/single-file-ui.mdx
@@ -6,39 +6,21 @@ You can download that processed data as a `.json` file to your local machine.
This approach enables rapid, local, run-adjust-repeat prototyping of end-to-end Unstructured ETL+ workflows with a full range of Unstructured features.
After you get the results you want, you can then attach remote source and destination connectors to both ends of your existing workflow to begin processing remote files and data at scale in production.

To run this quickstart, you will need a local file with a size of 10 MB or less and one of the following file types:

| File type |
|---|
| `.bmp` |
| `.csv` |
| `.doc` |
| `.docx` |
| `.email` |
| `.epub` |
| `.heic` |
| `.html` |
| `.jpg` |
| `.md` |
| `.odt` |
| `.org` |
| `.pdf` |
| `.pot` |
| `.potm` |
| `.ppt` |
| `.pptm` |
| `.pptx` |
| `.rst` |
| `.rtf` |
| `.sgl` |
| `.tiff` |
| `.txt` |
| `.tsv` |
| `.xls` |
| `.xlsx` |
| `.xml` |
To run this quickstart, you will need a local file with a size of 20 MB or less for video and audio files, and 10 MB or less for
all other file types. This quickstart supports the following file types:

| | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `.3gp` | `.aac` | `.avi` | `.bmp` | `.csv` | `.doc` | `.docx` | `.email` | `.epub` |
| `.flac` | `.flv` | `.heic` | `.html` | `.jpg` | `.m4a` | `.md` | `.mov` | `.mp2` |
| `.mp3` | `.mp4` | `.mpeg` | `.mpegs` | `.mpg` | `.mpgs` | `.odt` | `.ogg` | `.opus` |
| `.org` | `.pcm` | `.pdf` | `.pot` | `.potm` | `.ppt` | `.pptm` | `.pptx` | `.rst` |
| `.rtf` | `.sgl` | `.tiff` | `.txt` | `.tsv` | `.wav` | `.webm` | `.wmv` | `.xls` |
| `.xlsx` | `.xml` |
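
The size and file type requirements above can be checked before uploading. The following is a minimal sketch under stated assumptions: `check_quickstart_file` is a hypothetical helper, not an Unstructured API, and it treats 1 MB as 2**20 bytes, which the quickstart does not specify. The extension sets are copied from the table above.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the quickstart's "MB" means 2**20 bytes

# Video and audio extensions from the quickstart table (20 MB limit).
MEDIA_EXTENSIONS = {
    ".3gp", ".aac", ".avi", ".flac", ".flv", ".m4a", ".mov", ".mp2", ".mp3",
    ".mp4", ".mpeg", ".mpegs", ".mpg", ".mpgs", ".ogg", ".opus", ".pcm",
    ".wav", ".webm", ".wmv",
}

# All other extensions from the quickstart table (10 MB limit).
OTHER_EXTENSIONS = {
    ".bmp", ".csv", ".doc", ".docx", ".email", ".epub", ".heic", ".html",
    ".jpg", ".md", ".odt", ".org", ".pdf", ".pot", ".potm", ".ppt", ".pptm",
    ".pptx", ".rst", ".rtf", ".sgl", ".tiff", ".txt", ".tsv", ".xls",
    ".xlsx", ".xml",
}


def check_quickstart_file(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Return (ok, reason) for a candidate quickstart file."""
    ext = Path(filename).suffix.lower()
    if ext in MEDIA_EXTENSIONS:
        limit_mb = 20
    elif ext in OTHER_EXTENSIONS:
        limit_mb = 10
    else:
        return False, f"{ext or '(no extension)'} is not in the quickstart's file type list"
    if size_bytes > limit_mb * MB:
        return False, f"file exceeds the {limit_mb} MB limit for {ext} files"
    return True, "ok"
```

For a file on disk, the size can be read with `Path(path).stat().st_size` and passed as `size_bytes`.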

<Note>
Video and audio file processing is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>

For processing remote files at scale in production, Unstructured supports many more file types than these. [See the list of supported file types](/ui/supported-file-types).

Unstructured also supports processing files from remote object stores, and data from remote sources in websites, web apps, databases, and vector stores. For more information, see the [source connector overview](/ui/sources/overview) and the [remote quickstart](/ui/quickstart#remote-quickstart).
@@ -79,15 +61,23 @@ import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-si
</Step>
<Step title="Process a local file">
1. Drag the file that you want Unstructured to process from your local machine's file browser app and drop it into the **Source** node's **Drop file to test** area.
The file must have a size of 10 MB or less and one of the file types listed at the beginning of this quickstart.
The file must have a size of 20 MB or less for video and audio files, and 10 MB or less for all other file types.
The file must be one of the supported file types listed at the beginning of this quickstart.

If you are not able to drag and drop the file, you can click **Drop file to test** and then browse to and select the file instead.

Alternatively, you can use a sample file that Unstructured offers. To do this, click the **Source** node, and then in the **Source** pane, with
**Details** selected, on the **Local file** tab, click one of the files under **Or use a provided sample file**. To view the file's contents before you
select it, click the eyes button next to the file.

2. Above the **Source** node, click **Test**.
2. If you are using a video or audio file, you must use a multimedia partitioning strategy; otherwise, you might get an error during processing.
To select the multimedia partitioning strategy, click the **Partitioner** node, and then click **Auto** or **Multimedia**.

<Note>
Video and audio file processing is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>

3. Above the **Source** node, click **Test**.

![Testing a single local file workflow](/img/ui/Workflow-Test-Source.png)

@@ -98,12 +88,12 @@ import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-si

![Viewing single local file output](/img/ui/Workflow-Test-Single-File-Output.png)

3. In the **Test output** pane, you can:
4. In the **Test output** pane, you can:

- Search through the processed, JSON-formatted representation of the file by using the **Search JSON** box.
- Download the full JSON as a `.json` file to your local machine by clicking **Download full JSON**.

4. When you are done, click the **Close** button in the **Test output** pane.
5. When you are done, click the **Close** button in the **Test output** pane.

</Step>
<Step title="Add more nodes to the workflow">
61 changes: 44 additions & 17 deletions ui/document-elements.mdx
@@ -42,23 +42,29 @@ of the file and not care about its headers and footers. You can easily filter ou
Here are some examples of the element types your file might contain:

| Element type | Description |
|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Address` | A text element for capturing physical addresses. |
| `CodeSnippet` | A text element for capturing code snippets. |
| `EmailAddress` | A text element for capturing email addresses. |
| `FigureCaption` | An element for capturing text associated with figure captions. |
| `Footer` | An element for capturing document footers. |
| `FormKeysValues` | An element for capturing key-value pairs in a form. |
| `Formula` | An element containing formulas in a file. |
| `Header` | An element for capturing document headers. |
| `Image` | A text element for capturing image metadata. |
| `ListItem` | `ListItem` is a `NarrativeText` element that is part of a list. |
| `NarrativeText` | `NarrativeText` is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions. |
| `PageBreak` | An element for capturing page breaks. |
| `PageNumber` | An element for capturing page numbers. |
| `Table` | An element for capturing tables. |
| `Title` | A text element for capturing titles. |
| `UncategorizedText` | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |
|--------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Address` | A text element for capturing physical addresses. |
| `CodeSnippet` | A text element for capturing code snippets. |
| `EmailAddress` | A text element for capturing email addresses. |
| `FigureCaption` | An element for capturing text associated with figure captions. |
| `Footer` | An element for capturing document footers. |
| `FormKeysValues` | An element for capturing key-value pairs in a form. |
| `Formula` | An element containing formulas in a file. |
| `Header` | An element for capturing document headers. |
| `Image` | A text element for capturing image metadata. |
| `ListItem` | `ListItem` is a `NarrativeText` element that is part of a list. |
| `NarrativeText`      | `NarrativeText` is an element consisting of multiple, well-formulated sentences. This excludes elements such as titles, headers, footers, and captions. |
| `PageBreak` | An element for capturing page breaks. |
| `PageNumber` | An element for capturing page numbers. |
| `SceneDescription` | An element for capturing scene descriptions, for example a description of a scene in a video. |
| `Table` | An element for capturing tables. |
| `Title` | A text element for capturing titles. |
| `TranscriptFragment` | An element for capturing transcription of speech, for example a speaker's words in an audio clip or video. |
| `UncategorizedText` | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |

<Note>
`SceneDescription` and `TranscriptFragment` are specific to video and audio file processing, which is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>

If you apply chunking, you will also see the `CompositeElement` type.
`CompositeElement` is a chunk formed from text (non-`Table`) elements.
@@ -149,6 +155,27 @@ file.
Headers and footers in Word files include a `header_footer_type` indicating which page a header or footer applies to.
Valid values are `"primary"`, `"even_only"`, and `"first_page"`.

#### Video files

<Note>
Video file processing is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>

Elements for video files include a `start_time` and `end_time`, representing the start and end times of the clip within
the parent video file to which this element belongs. Also included are the `model_version`, representing the model that was used to
generate the element, and the `average_log_probability`, representing the model's average confidence in its output across the document.
Values closer to zero indicate higher confidence.
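
To make the "closer to zero" rule concrete, here is a minimal sketch that converts `average_log_probability` into a value between 0 and 1 and flags weak elements. It rests on several assumptions not stated above: that the value is a natural-log probability, that it is stored under each element's `metadata` key, and that 0.5 is a reasonable cutoff. The function names are hypothetical.

```python
import math


def confidence_from_log_probability(average_log_probability: float) -> float:
    """Map an average log probability to (0, 1], assuming natural logarithms.

    A value of 0.0 maps to 1.0; more negative values map closer to 0.
    """
    return math.exp(average_log_probability)


def low_confidence_elements(elements: list[dict], min_confidence: float = 0.5) -> list[dict]:
    """Return elements whose converted confidence is below min_confidence.

    Elements without an average_log_probability are skipped, not flagged.
    """
    flagged = []
    for element in elements:
        log_p = element.get("metadata", {}).get("average_log_probability")
        if log_p is not None and confidence_from_log_probability(log_p) < min_confidence:
            flagged.append(element)
    return flagged
```

Because the value is an average over the model's output, it is better suited to ranking elements for review than to serving as a calibrated probability.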

#### Audio files

<Note>
Audio file processing is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
</Note>

Elements for audio files include a `start_time`, `end_time`, and `speaker`, representing the start and end times of a clip of audio
made by a specific speaker, as part of the parent audio file to which this element belongs.
If the speaker cannot be determined, `speaker` is set to `0` or `unknown`.
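
As an illustration of how these fields might be used, the sketch below totals speaking time per speaker from a downloaded JSON result. It assumes that each element is a dictionary with `type` and `metadata` keys, that `start_time` and `end_time` are numeric values in seconds, and that both `0` and `unknown` should be reported as one "unknown" speaker. None of this is confirmed by the text above, and `talk_time_by_speaker` is a hypothetical name.

```python
from collections import defaultdict

# Values the documentation says are used when the speaker cannot be determined.
UNKNOWN_SPEAKERS = {"0", "unknown"}


def talk_time_by_speaker(elements: list[dict]) -> dict[str, float]:
    """Sum clip durations per speaker across TranscriptFragment elements."""
    totals: dict[str, float] = defaultdict(float)
    for element in elements:
        if element.get("type") != "TranscriptFragment":
            continue  # ignore non-transcript elements
        metadata = element.get("metadata", {})
        speaker = str(metadata.get("speaker", "unknown"))
        if speaker in UNKNOWN_SPEAKERS:
            speaker = "unknown"
        # Assumption: times are numeric seconds from the start of the file.
        totals[speaker] += float(metadata["end_time"]) - float(metadata["start_time"])
    return dict(totals)
```

If the times turn out to be formatted strings instead of numbers, they would need to be parsed before subtraction.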

### Table-specific metadata

For `Table` elements, the raw text of the table will be stored in the `text` attribute for the element, and HTML representation
1 change: 1 addition & 0 deletions ui/workflows.mdx
@@ -62,6 +62,7 @@ By default, this workflow partitions, chunks, and generates embeddings as follow
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
- If the page or document is a video or audio file, **Multimedia** partitioning is used.

[Learn about partitioning strategies](/ui/partitioning).
