Replies: 3 comments
-
It seems that the problem is related to the PNG format. Documents with JPEG images are parsed correctly. I ran "pdf2txt.py myfile.pdf --output-dir ." Any tips?
-
I just noticed that "pdf2txt.py myfile.pdf --output-dir ." converts all the PNG images to JPG.
-
I have just installed the new release 0.6.1, but the problem with PNG is still present. Could I please get some feedback on whether I am doing something wrong or whether what I encountered is a bug? Thank you.
-
Hi guys.
I tried to extract images from PDFs, but I don't understand how to work with the output format of the images.
I used the following code:
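import json

import openparse

temp_file = 'myfile.pdf'
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",
    }
)
# Parse the PDF into a document made of nodes.
parsed_basic_doc = parser.parse(temp_file)

# Serialize the parsed document to JSON, then load it back as a plain dict.
chunks = json.loads(parsed_basic_doc.model_dump_json())
print(chunks)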
When I print the chunks I can extract all the information (text, tokens, bbox, etc.) from the nodes, but I don't understand the format used for images. This is the truncated output:
{'embedding': None, 'node_id': 'df849f05-4c01-45e4-8ae1-407f4cf6c78e', 'variant': ['image'], 'tokens': 512, 'images': [{'text': '', 'bbox': {'page': 0, 'page_height': 842.0, 'page_width': 596.0, 'x0': 439.5, 'y0': 595.25, 'x1': 612.75002, 'y1': 833.0}, 'image': 'QEBAPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Ozs7Ozs7AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...
...
The PDF was generated from Google Docs.
How can I extract the image from the node?
Thank you
.andrea
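For anyone landing here with the same question, here is a minimal sketch of one way to turn the 'image' string back into bytes. It assumes the field is a base64-encoded string (the sample above only uses base64 characters, but that is an assumption, not something confirmed in this thread). The helper names sniff_extension, decode_node_images and save_node_images are hypothetical and not part of openparse; they work on a plain node dict shaped like the output above.

```python
import base64

# Magic-byte prefixes for a few common image formats.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"GIF8": "gif",
}


def sniff_extension(data: bytes) -> str:
    """Guess a file extension from the leading bytes; 'bin' if unknown."""
    for magic, ext in _SIGNATURES.items():
        if data.startswith(magic):
            return ext
    return "bin"


def decode_node_images(node: dict) -> list:
    """Return (extension, raw_bytes) for each entry in node['images'].

    Assumes each entry's 'image' value is a base64 string.
    """
    decoded = []
    for entry in node.get("images", []):
        raw = base64.b64decode(entry["image"])
        decoded.append((sniff_extension(raw), raw))
    return decoded


def save_node_images(node: dict, prefix: str) -> list:
    """Write each decoded image to '<prefix>_<index>.<ext>' and return the paths."""
    paths = []
    for index, (ext, raw) in enumerate(decode_node_images(node)):
        path = f"{prefix}_{index}.{ext}"
        with open(path, "wb") as handle:
            handle.write(raw)
        paths.append(path)
    return paths
```

Calling save_node_images(node, "page0") on one node dict from the printed output would write one file per image. Note that the start of the string shown above, QEBAPT09, decodes to the bytes @@@===, which is neither a PNG nor a JPEG signature, so this sketch would save that image with a .bin extension rather than guess its format.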