Extracting images #77

tulas75 · 2024-10-24T16:52:12Z

Hi guys.

I'm trying to extract images from PDFs, but I don't understand how to work with the format of the image output.

I used the following code:

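import json

import openparse

temp_file = 'myfile.pdf'
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown"
    }
)
# Parse the PDF into a document made of nodes
parsed_basic_doc = parser.parse(temp_file)

# Serialize the parsed document to JSON, then load it back as a plain dict
chunks = json.loads(parsed_basic_doc.model_dump_json())
print(chunks)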

Printing the chunks, I can extract all the information (text, tokens, bbox, etc.) from the nodes, but I don't understand the format used for images. This is the truncated output:

{'embedding': None, 'node_id': 'df849f05-4c01-45e4-8ae1-407f4cf6c78e', 'variant': ['image'], 'tokens': 512, 'images': [{'text': '', 'bbox': {'page': 0, 'page_height': 842.0, 'page_width': 596.0, 'x0': 439.5, 'y0': 595.25, 'x1': 612.75002, 'y1': 833.0}, 'image': 'QEBAPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Ozs7Ozs7AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...
...

The PDF was generated from Google Docs.

How can I extract the image from the node?

Thank you
.andrea
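
On the question of getting the image out of a node, here is a minimal sketch. It assumes the `image` value in the dump is a base64-encoded string (it looks like one, but nothing in this thread confirms it). The helper name `save_node_images` and the `.bin` extension are made up for illustration; the dict keys (`node_id`, `images`, `image`) are the ones visible in the output above.

```python
import base64
from pathlib import Path


def save_node_images(node: dict, out_dir: str = ".") -> list:
    """Decode every image attached to one node dict and write it to disk.

    Assumption: each entry in node["images"] stores its payload as a
    base64 string under the "image" key, as the printed output suggests.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(node.get("images", [])):
        raw = base64.b64decode(img["image"])  # base64 text -> raw bytes
        # Use a neutral extension: the real format is not known at this point
        path = out / f"{node['node_id']}_{index}.bin"
        path.write_bytes(raw)
        written.append(path)
    return written
```

If the dumped document keeps its nodes under a top-level `nodes` key (also an assumption), usage would be something like `for node in chunks["nodes"]: save_node_images(node, "images")`.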

tulas75 · 2024-10-25T20:26:10Z

It seems the problem is related to the PNG format. Documents with JPEG images are parsed correctly.
If I use the pdf2txt.py tool from pdfminer, the same library used by openparse, it seems to work correctly, extracting all the images (JPEGs and PNGs):

pdf2txt.py myfile.pdf --output-dir .

Any tips?
.a
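
One way to check the PNG-versus-JPEG suspicion is to look at the first bytes of the decoded payload. The sketch below assumes the `image` field is base64 text; `sniff_image_format` is a hypothetical helper, not part of openparse or pdfminer. The PNG and JPEG signatures themselves are the standard ones.

```python
import base64

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG start-of-image marker


def sniff_image_format(b64_data: str) -> str:
    """Return "png", "jpeg", or "unknown" based on the leading bytes."""
    # 16 base64 characters decode to 12 bytes, enough for both signatures
    head = base64.b64decode(b64_data[:16])
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"
```

For the value shown in the first post, the leading characters `QEBAPT09...` decode to `@@@===...`, which matches neither signature. That would be consistent with the field holding something other than a complete PNG file for this document, though it does not say where in the pipeline that happens.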

tulas75 · 2024-10-25T20:49:49Z

I just noticed that "pdf2txt.py myfile.pdf --output-dir ." converts all the PNG images to JPG.

tulas75 · 2024-11-08T13:29:32Z

I have just installed the new release, 0.6.1, but the problem with PNG is still present. Could I please get some feedback on whether I am doing something wrong or whether this is a bug?

Thank you
.t
