Commit
* add Docling converter

Signed-off-by: Panos Vagenas <[email protected]>

* Update and rename docling-converter.md to docling.md

---------

Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Bilge Yücel <[email protected]>
1 parent 4d729d8, commit 71895fa
Showing 2 changed files with 94 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
--- | ||
layout: integration | ||
name: Docling | ||
description: Use Docling to locally parse and chunk PDF, DOCX, and other document types in Haystack | ||
authors: | ||
- name: DS4SD | ||
socials: | ||
github: DS4SD | ||
pypi: https://pypi.org/project/docling-haystack | ||
repo: https://github.com/DS4SD/docling-haystack | ||
type: Data Ingestion | ||
report_issue: https://github.com/DS4SD/docling/issues | ||
logo: /logos/docling.png | ||
version: Haystack 2.0 | ||
toc: true | ||
--- | ||
### **Table of Contents** | ||
- [Overview](#overview) | ||
- [Installation](#installation) | ||
- [Usage](#usage) | ||
- [License](#license) | ||
|
||
## Overview | ||
|
||
[Docling](https://github.com/DS4SD/docling) locally parses PDF, DOCX, HTML, and other | ||
document formats into a rich, standardized representation (including layout and tables), | |
which it can then export to Markdown, JSON, and other formats. | |
|
||
Check out the [Docling docs](https://ds4sd.github.io/docling/) for more details. | ||
|
||
This integration introduces Docling support, enabling Haystack users to: | ||
- use various document types in LLM applications with ease and speed, and | ||
- leverage Docling's rich format for advanced, document-native grounding. | ||
|
||
## Installation | ||
|
||
```bash | ||
pip install docling-haystack | ||
``` | ||
|
||
## Usage | ||
|
||
### Components | ||
|
||
This integration introduces `DoclingConverter`, a component that reads documents from | |
local file paths or URLs and outputs Haystack `Document` objects. | |
|
||
`DoclingConverter` supports two export modes; see the `export_type` initialization | |
argument below. | |
|
||
### Use Docling Converter | ||
|
||
#### Docling Converter Initialization | ||
|
||
`DoclingConverter` can be configured through the following `__init__()` | |
arguments, most of which refer to the initialization and usage of the underlying Docling | ||
[`DocumentConverter`](https://ds4sd.github.io/docling/usage/) and | ||
[chunker](https://ds4sd.github.io/docling/concepts/chunking/) instances: | ||
|
||
- `converter`: The Docling `DocumentConverter` to use; if not set, a system default is | ||
used. | ||
- `convert_kwargs`: Any parameters to pass to Docling conversion; if not set, a system | ||
default is used. | ||
- `export_type`: The export mode to use: `ExportType.DOC_CHUNKS` (default) chunks each | ||
input document (see `chunker`) and captures each individual chunk as a separate | ||
Haystack `Document`, while `ExportType.MARKDOWN` captures each input document as a | ||
separate Haystack `Document` (in which case splitting is likely required downstream). | ||
- `md_export_kwargs`: Any parameters to pass to Markdown export (used only with | |
`ExportType.MARKDOWN`). | |
- `chunker`: The Docling chunker instance to use with `ExportType.DOC_CHUNKS`; if not | |
set, a system default is used. | |
- `meta_extractor`: The extractor instance to use for populating the output document | ||
metadata; if not set, a system default is used. | ||
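
As a rough illustration of the `export_type` argument described above, the sketch below selects between the two export modes. It assumes that `ExportType` can be imported from `docling_haystack.converter` alongside `DoclingConverter`; the helper name `build_converter` and its `markdown` flag are hypothetical and not part of the integration.

```python
def build_converter(markdown: bool = False):
    """Return a DoclingConverter set up for one of the two export modes.

    `build_converter` and its `markdown` flag are illustrative names only.
    """
    # Imported lazily so the function can be defined even where
    # docling-haystack is not installed yet.
    # Assumption: ExportType lives next to DoclingConverter in this module.
    from docling_haystack.converter import DoclingConverter, ExportType

    # DOC_CHUNKS yields one Haystack Document per chunk;
    # MARKDOWN yields one Haystack Document per input file.
    export_type = ExportType.MARKDOWN if markdown else ExportType.DOC_CHUNKS
    return DoclingConverter(export_type=export_type)
```

With `markdown=True`, each input file becomes a single `Document`, so a downstream splitter is likely needed before embedding.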
|
||
#### Standalone | ||
|
||
```python | ||
from docling_haystack.converter import DoclingConverter | ||
|
||
converter = DoclingConverter() | ||
documents = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])["documents"] | ||
|
||
print(repr(documents[2].content)) | ||
# -> Abstract\nThis technical report introduces Docling [...] | ||
``` | ||
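
To see what the converter produced beyond a single index, a small helper along these lines can list a short preview of every returned document. It relies only on the `content` attribute used in the example above; the name `preview_documents` and the `width` parameter are hypothetical.

```python
def preview_documents(documents, width=80):
    """Return (index, snippet) pairs for objects exposing a `content` attribute."""
    previews = []
    for index, doc in enumerate(documents):
        text = doc.content or ""  # treat missing content as an empty string
        previews.append((index, text[:width]))
    return previews
```

For instance, `for i, snippet in preview_documents(documents): print(i, repr(snippet))` prints one line per chunk, which makes it easier to find the chunk of interest.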
|
||
#### In a Pipeline | ||
|
||
Check out [this notebook](https://ds4sd.github.io/docling/examples/rag_haystack/) | ||
illustrating usage in a complete example with indexing and RAG pipelines. | ||
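
For orientation before opening the notebook, here is a minimal indexing sketch, not the notebook's own code. It assumes Haystack 2.x with its in-memory document store and omits the embedding step that a real RAG setup would include; the function name `build_indexing_pipeline` and the component names `"converter"` and `"writer"` are arbitrary choices.

```python
def build_indexing_pipeline(document_store=None):
    """Wire DoclingConverter into a writer; returns (pipeline, document_store)."""
    # Lazy imports: haystack-ai and docling-haystack must be installed to call this.
    from haystack import Pipeline
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from docling_haystack.converter import DoclingConverter

    if document_store is None:
        document_store = InMemoryDocumentStore()

    pipeline = Pipeline()
    pipeline.add_component("converter", DoclingConverter())
    pipeline.add_component("writer", DocumentWriter(document_store=document_store))
    # Feed the converter's `documents` output into the writer's `documents` input.
    pipeline.connect("converter.documents", "writer.documents")
    return pipeline, document_store
```

Running `pipeline.run({"converter": {"paths": ["https://arxiv.org/pdf/2408.09869"]}})` on the returned pipeline would then convert the file and store the resulting chunks.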
|
||
## License | |
|
||
MIT License. |