Document export as markdown missing out some texts #953

Open
penquin17 opened this issue Feb 13, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@penquin17

Bug

When I convert a document and export it to Markdown, the output differs between docling==2.20 and docling==2.21: the last line of text in the document is present in the 2.20 export but missing from the 2.21 export.
The export labels I used were DEFAULT_EXPORT_LABELS plus DocItemLabel.FORM and DocItemLabel.KEY_VALUE_REGION.
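
For reference, here is a minimal sketch of how such a label set could be passed to the export call. The helper name `export_with_labels` is hypothetical and not part of docling; it only assumes that the document object accepts a `labels` set in `export_to_markdown`, as `DoclingDocument` does. The commented usage at the bottom shows the intended call with the constants named above and requires docling to be installed. It illustrates the setup described here and is not claimed to change the behaviour reported in this issue.

```python
from typing import Any, Iterable


def export_with_labels(
    document: Any, base_labels: Iterable[Any], extra_labels: Iterable[Any]
) -> str:
    """Export `document` to Markdown using base_labels plus extra_labels.

    Hypothetical helper: `document` is expected to expose
    export_to_markdown(labels=...), as docling's DoclingDocument does.
    """
    # Merge both label collections into one set before exporting.
    labels = set(base_labels) | set(extra_labels)
    return document.export_to_markdown(labels=labels)


# Intended use with docling (assumes docling / docling-core are installed):
#   from docling_core.types.doc import DocItemLabel
#   from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS
#   markdown = export_with_labels(
#       result.document,
#       DEFAULT_EXPORT_LABELS,
#       {DocItemLabel.FORM, DocItemLabel.KEY_VALUE_REGION},
#   )
```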

Steps to reproduce

  1. Install docling==2.20 or docling==2.21
  2. Run the following:
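import os

from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import DocumentStream, InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
    TesseractOcrOptions,
)
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline

source = "./test-document.pdf"  # local path to the document
print(os.path.exists(source))  # confirm the input file exists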
IMAGE_RESOLUTION_SCALE = 2.0

# previous `PipelineOptions` is now `PdfPipelineOptions`
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
ocr_options = TesseractOcrOptions(force_full_page_ocr=False, lang=["auto"])
pipeline_options.ocr_options = ocr_options
accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.CPU
)
pipeline_options.accelerator_options = accelerator_options
# ...

# Custom options are now defined per format.
doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
            InputFormat.XLSX,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,  # pipeline options go here.
                backend=DoclingParseV2DocumentBackend  # optional: pick an alternative backend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # default for office formats and HTML
            ),
        },
    )
)

# read the file into memory and wrap it in a DocumentStream
from io import BytesIO
with open(source, mode="rb") as f:
    stream = DocumentStream(name=source, stream=BytesIO(f.read()))
result = doc_converter.convert(stream, raises_on_error=False)

# export to md
print(result.document.export_to_markdown())

Export results

docling==2.20

<!-- image -->

## INTERNATIONAL MEDICAL CENTER

Address: 1234 Fake Street, District 9, City Hotline: (555) 123-4567 | Email: [email protected]

## Patient Medical Record Report

## Title: Well Child Visit (Procedure)

| Aspect                     | Before Treatment              | After Treatment              |
|----------------------------|-------------------------------|------------------------------|
| Immunizations Administered | Hib (PRP-OMP)                 | Hib (PRP-OMP)                |
|                            | Rotavirus, monovalent         | Rotavirus, monovalent        |
|                            | IPV                           | IPV                          |
|                            | DTaP                          | DTaP                         |
|                            | Pneumococcal conjugate PCV 13 | Pneumococcal conjugal PCV 13 |

## Notes:

- ·  The patient received the following immunizations during the well-child visit: Hib (PRP-OMP), rotavirus, monovalent, IPV, DTaP, and Pneumococcal conjugate PCV 13.
- ·  No adverse reactions were noted following the administration of these immunizations.

Confidential Medical Record | All rights reserved Generated on: 2025-02-11

docling==2.21

<!-- image -->

## INTERNATIONAL MEDICAL CENTER

Address: 1234 Fake Street, District 9, City Hotline: (555) 123-4567 | Email: [email protected]

## Patient Medical Record Report

## Title: Well Child Visit (Procedure)

| Aspect                     | Before Treatment              | After Treatment              |
|----------------------------|-------------------------------|------------------------------|
| Immunizations Administered | Hib (PRP-OMP)                 | Hib (PRP-OMP)                |
|                            | Rotavirus, monovalent         | Rotavirus, monovalent        |
|                            | IPV                           | IPV                          |
|                            | DTaP                          | DTaP                         |
|                            | Pneumococcal conjugate PCV 13 | Pneumococcal conjugal PCV 13 |

## Notes:

- ·  The patient received the following immunizations during the well-child visit: Hib (PRP-OMP), rotavirus, monovalent, IPV, DTaP, and Pneumococcal conjugate PCV 13.
- ·  No adverse reactions were noted following the administration of these immunizations.

Missing from the 2.21 output: Confidential Medical Record | All rights reserved Generated on: 2025-02-11
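
To locate such differences without comparing the two exports by eye, a small sketch using the standard library's `difflib` can list the lines that appear in one export but not in the other. The function name `missing_lines` and the shortened sample strings are illustrative assumptions; in practice the two arguments would be the Markdown strings produced under each docling version.

```python
import difflib


def missing_lines(old_export: str, new_export: str) -> list:
    """Return non-blank lines present in old_export but absent from new_export."""
    old_lines = old_export.splitlines()
    new_lines = new_export.splitlines()
    missing = []
    for line in difflib.ndiff(old_lines, new_lines):
        # ndiff marks lines found only in the first sequence with "- ".
        if line.startswith("- ") and line[2:].strip():
            missing.append(line[2:])
    return missing


# Shortened stand-ins for the 2.20 and 2.21 exports shown above.
export_2_20 = (
    "## Notes:\n\n"
    "- No adverse reactions were noted.\n\n"
    "Confidential Medical Record | All rights reserved Generated on: 2025-02-11\n"
)
export_2_21 = "## Notes:\n\n- No adverse reactions were noted.\n"

print(missing_lines(export_2_20, export_2_21))
```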

Docling version

docling==2.21 or docling==2.20
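
Since the behaviour depends on the installed release, a short check like the following can confirm which versions are active in the environment before running the reproduction. It uses only `importlib.metadata` from the standard library; the helper name `installed_version` is an illustrative assumption, and including `docling-core` in the check is a guess that the core package version may matter as well.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions in the same pin format used in this report.
for name in ("docling", "docling-core"):
    print(f"{name}=={installed_version(name)}")
```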

Python version

python=3.12

Attachment

test-doc.pdf

@penquin17 penquin17 added the bug Something isn't working label Feb 13, 2025
@Adolar13

Same issue in docling 2.22.0.
The Markdown export is missing footer text even when the label is set explicitly.
It works in docling 2.20.0.

@Adolar13

Possibly tied to this pull request:
DS4SD/docling-core#148
