diff --git a/api-reference/workflow/workflows.mdx b/api-reference/workflow/workflows.mdx index ea06fce0..8784400a 100644 --- a/api-reference/workflow/workflows.mdx +++ b/api-reference/workflow/workflows.mdx @@ -1030,17 +1030,21 @@ A **Partitioner** node has a `type` of `partition`. ```python auto_partitioner_workflow_node = WorkflowNode( name="Partitioner", - subtype="vlm", + subtype="unstructured_api", type="partition", settings={ - "provider": "anthropic", - "model": "claude-3-5-sonnet-20241022", - "output_format": "text/html", - "user_prompt": None, - "format_html": True, - "unique_element_ids": True, - "is_dynamic": True, - "allow_fast": True + "strategy": "auto", + "provider": "", + "provider_api_key": None, + "model": "", + "output_format": "", + "prompt": { + "text": "" + }, + "format_html": , + "unique_element_ids": , + "is_dynamic": , + "allow_fast": } ) ``` @@ -1050,22 +1054,53 @@ A **Partitioner** node has a `type` of `partition`. { "name": "Partitioner", "type": "partition", - "subtype": "vlm", + "subtype": "unstructured_api", "settings": { - "provider": "anthropic", - "model": "claude-3-5-sonnet-20241022", - "output_format": "text/html", - "user_prompt": null, - "format_html": true, - "unique_element_ids": true, - "is_dynamic": true, - "allow_fast": true + "strategy": "auto", + "provider": "", + "provider_api_key": null, + "model": "", + "output_format": "", + "prompt": { + "text": "" + }, + "format_html": , + "unique_element_ids": , + "is_dynamic": , + "allow_fast": } } ``` +Fields for `settings` include: + +- `strategy`: _Required_. The partitioning strategy to use. This field must be set to `auto`. +- `provider`: _Optional_. If the Auto partitioning strategy needs to use the VLM partitioning strategy, then use the specified VLM provider. Allowed values include `auto`, `openai`, `anthropic`, and `bedrock`. The default value is `anthropic`. +- `provider_api_key`: _Optional_. 
If specified, use a non-default API key for calls to the specified VLM provider as needed. The default is none, which means to rely on using Unstructured's internal default API key for the VLM provider. +- `model`: _Optional_. If the Auto partitioning strategy needs to use the VLM partitioning strategy, then use the specified VLM. The default value is `claude-3-5-sonnet-20241022`. + + - For `openai`, available values for `model` are `gpt-4o` and `gpt-4o-mini`. + - For `anthropic`, available values for `model` are `claude-3-5-sonnet-20241022` and `claude-3-7-sonnet-20250219`. + - For `bedrock`, available values for `model` are: + + - `us.amazon.nova-lite-v1:0` + - `us.amazon.nova-pro-v1:0` + - `us.anthropic.claude-3-opus-20240229-v1:0` + - `us.anthropic.claude-3-haiku-20240307-v1:0` + - `us.anthropic.claude-3-sonnet-20240229-v1:0` + - `us.anthropic.claude-3-5-sonnet-20241022-v2:0` + - `us.meta.llama3-2-11b-instruct-v1:0` + - `us.meta.llama3-2-90b-instruct-v1:0` + +- `output_format`: _Optional_. The format of the response. Allowed values include `text/html` and `application/json`. The default is `text/html`. +- `prompt.text`: _Optional_. If the Auto partitioning strategy needs to use the VLM partitioning strategy, then use the specified prompt when calling the specified VLM. The default value is none, which means to rely on using Unstructured's internal default prompt when calling the VLM. +- `format_html`: _Optional_. If the Auto partitioning strategy needs to use the VLM partitioning strategy, true (the default) to apply Beautiful Soup's `prettify` method to the HTML that is generated by the VLM partitioner, which for example adds indentation for better readability. +- `unique_element_ids`: _Optional_. True (the default) to assign UUIDs to element IDs, which guarantees their uniqueness. This is useful for example when using them as primary keys in a database. False to assign a SHA-256 hash of the element's text as its element ID. +- `is_dynamic`: _Optional_. 
True (the default) to enable dynamic routing of pages to Fast, High Res, or VLM as needed for better overall performance and cost savings. +- `allow_fast`: _Optional_. True (the default) to allow routing of pages to Fast as needed for better overall performance and cost savings. + #### VLM strategy @@ -1077,11 +1112,16 @@ A **Partitioner** node has a `type` of `partition`. type="partition", settings={ "provider": "", + "provider_api_key": None, "model": "", - "output_format": "text/html", - "user_prompt": None, - "format_html": True, - "unique_element_ids": + "output_format": "", + "prompt": { + "text": "" + }, + "format_html": , + "unique_element_ids": , + "is_dynamic": , + "allow_fast": } ) ``` @@ -1094,41 +1134,47 @@ A **Partitioner** node has a `type` of `partition`. "subtype": "vlm", "settings": { "provider": "", + "provider_api_key": null, "model": "", - "output_format": "text/html", - "user_prompt": null, - "format_html": true, - "unique_element_ids": + "output_format": "", + "prompt": { + "text": "" + }, + "format_html": , + "unique_element_ids": , + "is_dynamic": , + "allow_fast": } } ``` -Allowed values for `provider` and `model` include: - -- `"provider": "anthropic"` +Fields for `settings` include: - - `"model": "claude-3-5-sonnet-20241022"` +- `provider`: _Optional_. Use the specified VLM provider. Allowed values include `auto`, `openai`, `anthropic`, and `bedrock`. The default value is `anthropic`. +- `provider_api_key`: _Optional_. If specified, use a non-default API key for calls to the specified VLM provider as needed. The default is none, which means to rely on using Unstructured's internal default API key for the VLM provider. +- `model`: _Optional_. Use the specified VLM. The default value is `claude-3-5-sonnet-20241022`. -- `"provider": "openai"` + - For `openai`, available values for `model` are `gpt-4o` and `gpt-4o-mini`. 
+ - For `anthropic`, available values for `model` are `claude-3-5-sonnet-20241022` and `claude-3-7-sonnet-20250219`. + - For `bedrock`, available values for `model` are: - - `"model": "gpt-4o"` + - `us.amazon.nova-lite-v1:0` + - `us.amazon.nova-pro-v1:0` + - `us.anthropic.claude-3-opus-20240229-v1:0` + - `us.anthropic.claude-3-haiku-20240307-v1:0` + - `us.anthropic.claude-3-sonnet-20240229-v1:0` + - `us.anthropic.claude-3-5-sonnet-20241022-v2:0` + - `us.meta.llama3-2-11b-instruct-v1:0` + - `us.meta.llama3-2-90b-instruct-v1:0` -- `"provider": "bedrock"` - - - `"model": "us.anthropic.claude-3-5-sonnet-20241022-v2:0"` - - `"model": "us.anthropic.claude-3-opus-20240229-v1:0"` - - `"model": "us.anthropic.claude-3-haiku-20240307-v1:0"` - - `"model": "us.anthropic.claude-3-sonnet-20240229-v1:0"` - - `"model": "us.amazon.nova-pro-v1:0"` - - `"model": "us.amazon.nova-lite-v1:0"` - - `"model": "us.meta.llama3-2-90b-instruct-v1:0"` - - `"model": "us.meta.llama3-2-11b-instruct-v1:0"` - -- `"provider": "vertexai"` - - - `"model": "gemini-2.0-flash-001"` +- `output_format`: _Optional_. The format of the response. Allowed values include `text/html` and `application/json`. The default is `text/html`. +- `prompt.text`: _Optional_. Use the specified prompt when calling the specified VLM. The default value is none, which means to rely on using Unstructured's internal default prompt when calling the VLM. +- `format_html`: _Optional_. True (the default) to apply Beautiful Soup's `prettify` method to the HTML that is generated by the VLM partitioner, which for example adds indentation for better readability. +- `unique_element_ids`: _Optional_. True (the default) to assign UUIDs to element IDs, which guarantees their uniqueness. This is useful for example when using them as primary keys in a database. False to assign a SHA-256 hash of the element's text as its element ID. +- `is_dynamic`: _Optional_. This setting has no effect for the VLM strategy. The default is false. 
+- `allow_fast`: _Optional_. This setting has no effect for the VLM strategy. The default is true. #### High Res strategy @@ -1142,6 +1188,7 @@ Allowed values for `provider` and `model` include: settings={ "strategy": "hi_res", "include_page_breaks": , + "pdf_infer_table_structure": , "exclude_elements": [ "", "" @@ -1156,7 +1203,7 @@ Allowed values for `provider` and `model` include: "image", "table" ], - "skip_infer_table_types": + "infer_table_structure": , } ) ``` @@ -1170,6 +1217,7 @@ Allowed values for `provider` and `model` include: "settings": { "strategy": "hi_res", "include_page_breaks": , + "pdf_infer_table_structure": , "exclude_elements": [ "", "" @@ -1184,13 +1232,38 @@ Allowed values for `provider` and `model` include: "image", "table" ], - "skip_infer_table_types": + "infer_table_structure": } } ``` +- `strategy`: _Required_. The partitioning strategy to use. This field must be set to `hi_res`. +- `include_page_breaks`: _Optional_. True to include page breaks in the output if supported by the file type. The default is false. +- `pdf_infer_table_structure`: _Optional_. True for any `Table` elements extracted from a PDF to include an additional metadata field, `text_as_html`, where the value (string) is just a transformation of the data into an HTML table. The default is false. +- `exclude_elements`: _Optional_. A list of any Unstructured element types to exclude from the output. The default is none. Available values include: + + - `FigureCaption` + - `NarrativeText` + - `ListItem` + - `Title` + - `Address` + - `Table` + - `PageBreak` + - `Header` + - `Footer` + - `UncategorizedText` + - `Image` + - `Formula` + - `EmailAddress` + +- `xml_keep_tags`: _Optional_. True to retain any XML tags in the output. False (the default) to just extract the text from any XML tags instead. +- `encoding`: _Optional_. The encoding method used to decode the text input. The default is `utf-8`. +- `ocr_languages`: _Optional_. 
A list of languages present in the input, for use in partitioning, OCR, or both. Multiple languages indicate that the text could be in any of the specified languages. The default is `[ 'eng' ]`. [See the language codes list](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common/lang.py). +- `extract_image_block_types`: _Optional_. A list of the Unstructured element types for use in extracting image blocks as Base64 encoded data stored in `metadata` fields. Available values include `Image` and `Table`. The default is `[ 'Image', 'Table' ]`. +- `infer_table_structure`: _Optional_. True to have any table elements extracted from a PDF include an additional `metadata` field named `text_as_html`, containing an HTML `<table>` transformation. The default is false. + #### Fast strategy @@ -1203,6 +1276,7 @@ Allowed values for `provider` and `model` include: settings={ "strategy": "fast", "include_page_breaks": , + "pdf_infer_table_structure": , "exclude_elements": [ "", "" @@ -1217,7 +1291,7 @@ Allowed values for `provider` and `model` include: "image", "table" ], - "skip_infer_table_types": + "infer_table_structure": } ) ``` @@ -1231,6 +1305,7 @@ Allowed values for `provider` and `model` include: "settings": { "strategy": "fast", "include_page_breaks": , + "pdf_infer_table_structure": , "exclude_elements": [ "", "" @@ -1245,13 +1320,40 @@ Allowed values for `provider` and `model` include: "image", "table" ], - "skip_infer_table_types": + "infer_table_structure": } } ``` +Fields for `settings` include: + +- `strategy`: _Required_. The partitioning strategy to use. This field must be set to `fast`. +- `include_page_breaks`: _Optional_. True to include page breaks in the output if supported by the file type. The default is false. +- `pdf_infer_table_structure`: _Optional_. Although this field is listed, it applies only to the `hi_res` strategy and will not work if set to true. The default is false. +- `exclude_elements`: _Optional_. 
A list of any Unstructured element types to exclude from the output. The default is none. Available values include: + + - `FigureCaption` + - `NarrativeText` + - `ListItem` + - `Title` + - `Address` + - `Table` + - `PageBreak` + - `Header` + - `Footer` + - `UncategorizedText` + - `Image` + - `Formula` + - `EmailAddress` + +- `xml_keep_tags`: _Optional_. True to retain any XML tags in the output. False (the default) to just extract the text from any XML tags instead. +- `encoding`: _Optional_. The encoding method used to decode the text input. The default is `utf-8`. +- `ocr_languages`: _Optional_. A list of languages present in the input, for use in partitioning, OCR, or both. Multiple languages indicate that the text could be in any of the specified languages. The default is `[ 'eng' ]`. [See the language codes list](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common/lang.py). +- `extract_image_block_types`: _Optional_. A list of the Unstructured element types for use in extracting image blocks as Base64 encoded data stored in `metadata` fields. Available values include `Image` and `Table`. The default is `[ 'Image', 'Table' ]`. +- `infer_table_structure`: _Optional_. True to have any table elements extracted from a PDF include an additional `metadata` field named `text_as_html`, containing an HTML `<table>` transformation. The default is false. + ### Chunker node A **Chunker** node has a `type` of `chunk`. @@ -1268,12 +1370,14 @@ A **Chunker** node has a `type` of `chunk`. subtype="chunk_by_character", type="chunk", settings={ + "unstructured_api_url": None, + "unstructured_api_key": None, "include_orig_elements": , "new_after_n_chars": , "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1" + "contextual_chunking_strategy": "" } ) ``` @@ -1285,18 +1389,31 @@ A **Chunker** node has a `type` of `chunk`. "type": "chunk", "subtype": "chunk_by_character", "settings": { + "unstructured_api_url": null, + "unstructured_api_key": null, "include_orig_elements": , "new_after_n_chars": , "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1" + "contextual_chunking_strategy": "" } } ``` +Fields for `settings` include: + +- `unstructured_api_url`: _Optional_. If specified, use a non-default API URL for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API URL for the chunker. +- `unstructured_api_key`: _Optional_. If specified, use a non-default API key for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API key for the chunker. +- `include_orig_elements`: _Optional_. True to have the elements that are used to form a chunk appear in `.metadata.orig_elements` for that chunk. The default is false. +- `new_after_n_chars`: _Optional_. Closes new sections after reaching a length of this many characters. This is an approximate limit. The default is none. +- `max_characters`: _Optional_. The absolute maximum number of characters in a chunk. The default is none. +- `overlap`: _Optional_. Applies a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is none. 
+- `overlap_all`: _Optional_. True to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. The default is false. +- `contextual_chunking_strategy`: _Optional_. If specified, prepends chunk-specific explanatory context to each chunk. Allowed values include `v1`. The default is none. + #### Chunk by Title strategy @@ -1307,6 +1424,8 @@ A **Chunker** node has a `type` of `chunk`. subtype="chunk_by_title", type="chunk", settings={ + "unstructured_api_url": None, + "unstructured_api_key": None, "multipage_sections": , "combine_text_under_n_chars": , "include_orig_elements": , @@ -1314,7 +1433,7 @@ A **Chunker** node has a `type` of `chunk`. "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1" + "contextual_chunking_strategy": "" } ) ``` @@ -1326,6 +1445,8 @@ A **Chunker** node has a `type` of `chunk`. "type": "chunk", "subtype": "chunk_by_title", "settings": { + "unstructured_api_url": null, + "unstructured_api_key": null, "multipage_sections": , "combine_text_under_n_chars": , "include_orig_elements": , @@ -1333,13 +1454,25 @@ A **Chunker** node has a `type` of `chunk`. "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1" + "contextual_chunking_strategy": "" } } ``` +Fields for `settings` include: + +- `unstructured_api_url`: _Optional_. If specified, use a non-default API URL for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API URL for the chunker. +- `unstructured_api_key`: _Optional_. If specified, use a non-default API key for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API key for the chunker. +- `multipage_sections`: _Optional_. ... The default is false. +- `combine_text_under_n_chars`: _Optional_. 
Combines elements from a section into a chunk until a section reaches a length of this many characters. The default is none. +- `include_orig_elements`: _Optional_. True to have the elements that are used to form a chunk appear in `.metadata.orig_elements` for that chunk. The default is false. +- `new_after_n_chars`: _Optional_. Closes new sections after reaching a length of this many characters. This is an approximate limit. The default is none. +- `max_characters`: _Optional_. The absolute maximum number of characters in a chunk. The default is none. +- `overlap`: _Optional_. Applies a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is none. +- `overlap_all`: _Optional_. True to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. The default is false. +- `contextual_chunking_strategy`: _Optional_. If specified, prepends chunk-specific explanatory context to each chunk. Allowed values include `v1`. The default is none. + #### Chunk by Page strategy @@ -1350,12 +1483,14 @@ A **Chunker** node has a `type` of `chunk`. subtype="chunk_by_page", type="chunk", settings={ + "unstructured_api_url": None, + "unstructured_api_key": None, "include_orig_elements": , "new_after_n_chars": , "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1" + "contextual_chunking_strategy": "" } ) ``` @@ -1367,18 +1502,31 @@ A **Chunker** node has a `type` of `chunk`. "type": "chunk", "subtype": "chunk_by_page", "settings": { + "unstructured_api_url": null, + "unstructured_api_key": null, "include_orig_elements": , "new_after_n_chars": , "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1" + "contextual_chunking_strategy": "" } } ``` +Fields for `settings` include: + +- `unstructured_api_url`: _Optional_. 
If specified, use a non-default API URL for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API URL for the chunker. +- `unstructured_api_key`: _Optional_. If specified, use a non-default API key for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API key for the chunker. +- `include_orig_elements`: _Optional_. True to have the elements that are used to form a chunk appear in `.metadata.orig_elements` for that chunk. The default is false. +- `new_after_n_chars`: _Optional_. Closes new sections after reaching a length of this many characters. This is an approximate limit. The default is none. +- `max_characters`: _Optional_. The absolute maximum number of characters in a chunk. The default is none. +- `overlap`: _Optional_. Applies a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is none. +- `overlap_all`: _Optional_. True to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. The default is false. +- `contextual_chunking_strategy`: _Optional_. If specified, prepends chunk-specific explanatory context to each chunk. Allowed values include `v1`. The default is none. + #### Chunk by Similarity strategy @@ -1389,12 +1537,14 @@ A **Chunker** node has a `type` of `chunk`. 
subtype="chunk_by_similarity", type="chunk", settings={ + "unstructured_api_url": None, + "unstructured_api_key": None, "include_orig_elements": , "new_after_n_chars": , "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1", + "contextual_chunking_strategy": "", "similarity_threshold": } ) @@ -1407,12 +1557,14 @@ A **Chunker** node has a `type` of `chunk`. "type": "chunk", "subtype": "chunk_by_similarity", "settings": { + "unstructured_api_url": null, + "unstructured_api_key": null, "include_orig_elements": , "new_after_n_chars": , "max_characters": , "overlap": , "overlap_all": , - "contextual_chunking_strategy": "v1", + "contextual_chunking_strategy": "", "similarity_threshold": } } @@ -1420,6 +1572,18 @@ A **Chunker** node has a `type` of `chunk`. +Fields for `settings` include: + +- `unstructured_api_url`: _Optional_. If specified, use a non-default API URL for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API URL for the chunker. +- `unstructured_api_key`: _Optional_. If specified, use a non-default API key for calls to the specified chunker as needed. The default is none, which means to rely on using Unstructured's internal default API key for the chunker. +- `include_orig_elements`: _Optional_. True to have the elements that are used to form a chunk appear in `.metadata.orig_elements` for that chunk. The default is false. +- `new_after_n_chars`: _Optional_. Closes new sections after reaching a length of this many characters. This is an approximate limit. The default is none. +- `max_characters`: _Optional_. The absolute maximum number of characters in a chunk. The default is none. +- `overlap`: _Optional_. Applies a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is none. +- `overlap_all`: _Optional_. 
True to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. The default is false. +- `contextual_chunking_strategy`: _Optional_. If specified, prepends chunk-specific explanatory context to each chunk. Allowed values include `v1`. The default is none. +- `similarity_threshold`: _Optional_. The minimum similarity that text in consecutive elements must have to be included in the same chunk. This must be a value between `0.0` and `1.0`, exclusive (`0.01` to `0.99`). The default is none. + ### Enrichment node An **Enrichment** node has a `type` of `prompter`. @@ -1462,7 +1626,6 @@ Allowed values for `` include: - `openai_image_description` - `anthropic_image_description` - `bedrock_image_description` -- `vertexai_image_description` #### Table Description task @@ -1498,7 +1661,6 @@ Allowed values for `` include: - `openai_table_description` - `anthropic_table_description` - `bedrock_table_description` -- `vertexai_table_description` #### Table to HTML task @@ -1536,7 +1698,7 @@ import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrich ```python ner_enrichment_workflow_node = WorkflowNode( name="Enrichment", - subtype="openai_ner", + subtype="", type="prompter", settings={ "prompt_interface_overrides": { @@ -1553,7 +1715,7 @@ import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrich { "name": "Enrichment", "type": "prompter", - "subtype": "openai_ner", + "subtype": "", "settings": { "prompt_interface_overrides": { "prompt": { @@ -1566,6 +1728,15 @@ import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrich +Fields for `settings` include: + +- `prompt_interface_overrides.prompt.user`: _Optional_. Any alternative prompt to use with the underlying NER model. The default is none, which means to rely on using Unstructured's internal default prompt when calling the NER model. 
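Editor's note: the dotted field name `prompt_interface_overrides.prompt.user` describes a nested object, which may be easier to see spelled out. The following is a minimal sketch using plain dictionaries rather than the SDK's `WorkflowNode` class. The helper name `build_ner_node` is hypothetical, the subtype values are the two this change documents for the NER task, and sending an empty `settings` object when no prompt override is given is an assumption, not confirmed behavior.

```python
import json

# NER subtypes documented by this change.
NER_SUBTYPES = ("openai_ner", "anthropic_ner")


def build_ner_node(subtype, user_prompt=None):
    """Hypothetical helper: build the JSON payload for an NER Enrichment node."""
    if subtype not in NER_SUBTYPES:
        raise ValueError(f"unsupported NER subtype: {subtype!r}")

    settings = {}
    if user_prompt is not None:
        # `prompt_interface_overrides.prompt.user` expands to three nested keys.
        settings = {
            "prompt_interface_overrides": {"prompt": {"user": user_prompt}}
        }

    return {
        "name": "Enrichment",
        "type": "prompter",
        "subtype": subtype,
        "settings": settings,  # Assumption: empty settings means "use the default prompt".
    }


print(json.dumps(build_ner_node("openai_ner", "Extract people and places only."), indent=2))
```

Calling the helper with an unlisted subtype raises `ValueError` before any request would be made.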
+ +Allowed values for `` include: + +- `openai_ner` +- `anthropic_ner` + ### Embedder node An **Embedder** node has a `type` of `embed`. @@ -1620,3 +1791,14 @@ Allowed values for `subtype` and `model_name` include: - `"model_name": "togethercomputer/m2-bert-80M-2k-retrieval"` - `"model_name": "togethercomputer/m2-bert-80M-8k-retrieval"` - `"model_name": "togethercomputer/m2-bert-80M-32k-retrieval"` + +- `"subtype": "voyageai"` + + - `"model_name": "voyage-3"` + - `"model_name": "voyage-3-large"` + - `"model_name": "voyage-3-lite"` + - `"model_name": "voyage-code-3"` + - `"model_name": "voyage-finance-2"` + - `"model_name": "voyage-law-2"` + - `"model_name": "voyage-code-2"` + - `"model_name": "voyage-multimodal-3"` \ No newline at end of file diff --git a/open-source/how-to/set-ocr-agent.mdx b/open-source/how-to/set-ocr-agent.mdx index ac918793..56a393e9 100644 --- a/open-source/how-to/set-ocr-agent.mdx +++ b/open-source/how-to/set-ocr-agent.mdx @@ -28,7 +28,7 @@ This example uses a PNG file with an embedded combination of English and Korean Language codes will differ depending on the OCR agent you use: -- For Tesseract OCR, [see the language codes list](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/lang.py). +- For Tesseract OCR, [see the language codes list](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common/lang.py). - For Paddle OCR, [see the language codes list](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/lang.py) and [language names list](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#language_abbreviations). - For Google Cloud Vision OCR, [see the language codes list](https://cloud.google.com/vision/docs/languages).
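Editor's note: the provider and model combinations that the `workflows.mdx` changes above document for the VLM partitioner can be checked before a node is submitted. The sketch below is one possible way to do that with plain dictionaries. The helper name `build_vlm_partitioner_node` and the lookup table are hypothetical, the values are copied from the field descriptions in this diff, and the `auto` provider is deliberately not handled because the diff lists no models for it.

```python
import json

# Provider -> allowed model names, copied from the field descriptions in this diff.
ALLOWED_VLM_MODELS = {
    "openai": {"gpt-4o", "gpt-4o-mini"},
    "anthropic": {"claude-3-5-sonnet-20241022", "claude-3-7-sonnet-20250219"},
    "bedrock": {
        "us.amazon.nova-lite-v1:0",
        "us.amazon.nova-pro-v1:0",
        "us.anthropic.claude-3-opus-20240229-v1:0",
        "us.anthropic.claude-3-haiku-20240307-v1:0",
        "us.anthropic.claude-3-sonnet-20240229-v1:0",
        "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
        "us.meta.llama3-2-11b-instruct-v1:0",
        "us.meta.llama3-2-90b-instruct-v1:0",
    },
}
ALLOWED_OUTPUT_FORMATS = {"text/html", "application/json"}


def build_vlm_partitioner_node(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    output_format="text/html",
):
    """Hypothetical helper: validate the inputs, then build the node payload."""
    if provider not in ALLOWED_VLM_MODELS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if model not in ALLOWED_VLM_MODELS[provider]:
        raise ValueError(f"model {model!r} is not listed for provider {provider!r}")
    if output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    return {
        "name": "Partitioner",
        "type": "partition",
        "subtype": "vlm",
        "settings": {
            "provider": provider,
            "provider_api_key": None,  # None: rely on the internal default key.
            "model": model,
            "output_format": output_format,
            "format_html": True,
            "unique_element_ids": True,
        },
    }


print(json.dumps(build_vlm_partitioner_node(provider="openai", model="gpt-4o"), indent=2))
```

The defaults mirror the documented defaults (`anthropic`, `claude-3-5-sonnet-20241022`, `text/html`). Keeping the allowed values in one table means a later change to the supported models touches a single place.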