Zero height box found, cannot convert properly #33

Closed
theesfeld opened this issue Dec 6, 2023 · 11 comments

Comments

@theesfeld

Every single PDF I have tried gets the following error:

```
Zero height box found, cannot convert properly
Traceback (most recent call last):
  File "/Users/grim/src/marker/convert_single.py", line 22, in <module>
    full_text, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, parallel_factor=args.parallel_factor)
  File "/Users/grim/src/marker/marker/convert.py", line 108, in convert_single_pdf
    block_types = detect_document_block_types(
  File "/Users/grim/src/marker/marker/segmentation.py", line 51, in detect_document_block_types
    encodings, metadata, sample_lengths = get_features(doc, blocks)
  File "/Users/grim/src/marker/marker/segmentation.py", line 160, in get_features
    encoding, other_data = get_page_encoding(doc[i], blocks[i])
  File "/Users/grim/src/marker/marker/segmentation.py", line 104, in get_page_encoding
    raise ValueError
ValueError
```

macOS 14.2
python 3.9

installed via exact instructions from git repo

@VikParuchuri
Owner

That's strange, that will only happen when detected bboxes are 0 or less height (something weird with the pdf). Can you share the pdfs you're trying?
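For anyone who cannot share their files, a quick way to confirm this condition locally is to scan the detected boxes for a height of zero or less. The following is only a minimal sketch: the `(x0, y0, x1, y1)` tuple layout and the helper name `find_zero_height_boxes` are assumptions for illustration, not marker's actual data structures or API.

```python
from typing import List, Tuple

# Assumed layout for illustration: (x0, y0, x1, y1) with y growing downward.
BBox = Tuple[float, float, float, float]


def find_zero_height_boxes(bboxes: List[BBox]) -> List[int]:
    """Return the indices of boxes whose height is zero or negative."""
    bad = []
    for i, (x0, y0, x1, y1) in enumerate(bboxes):
        height = y1 - y0
        if height <= 0:
            bad.append(i)
    return bad


if __name__ == "__main__":
    boxes = [
        (0, 0, 10, 5),   # normal box, height 5
        (0, 5, 10, 5),   # zero height
        (0, 8, 10, 3),   # negative height (y1 < y0)
    ]
    print(find_zero_height_boxes(boxes))
```

Reporting which page and which box index trips the check may be enough to describe the problem without sharing a sensitive PDF.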

@drewocarr

@VikParuchuri
Happening with select pdfs for me as well
design-for-the-real-world-victor-papanek.pdf

@VikParuchuri
Owner

Thanks for the example, will take a look tomorrow.

@theesfeld
Author

> That's strange, that will only happen when detected bboxes are 0 or less height (something weird with the pdf). Can you share the pdfs you're trying?

Unfortunately I cannot as they are sensitive - I am getting the same issue with the pdf that @Blacktothefuture posted though

@VikParuchuri
Owner

TIL that pdfs can be rotated, and the coordinates of the bboxes for the text will not be rotated accordingly.

Basically, this bug is due to trying to convert pdfs that have had pages rotated. I'm looking into a fix.
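To illustrate the mismatch described here, the sketch below maps a bbox from unrotated page coordinates into the coordinates of the page as displayed after rotation. It is a pure-Python illustration of the geometry only, not the fix that went into marker. The function name `rotate_bbox`, the top-left origin with y growing downward, and the clockwise rotation convention are all assumptions made for this example.

```python
from typing import Tuple

# Assumed layout for illustration: (x0, y0, x1, y1), origin at top-left.
BBox = Tuple[float, float, float, float]


def rotate_point(x: float, y: float, rotation: int,
                 width: float, height: float) -> Tuple[float, float]:
    """Map a point on the unrotated page to the page rotated clockwise.

    width and height are the dimensions of the unrotated page.
    """
    if rotation == 0:
        return x, y
    if rotation == 90:
        return height - y, x
    if rotation == 180:
        return width - x, height - y
    if rotation == 270:
        return y, width - x
    raise ValueError("rotation must be one of 0, 90, 180, 270")


def rotate_bbox(bbox: BBox, rotation: int,
                width: float, height: float) -> BBox:
    """Rotate a bbox with the page and re-normalize so x0 <= x1, y0 <= y1."""
    x0, y0, x1, y1 = bbox
    ax, ay = rotate_point(x0, y0, rotation, width, height)
    bx, by = rotate_point(x1, y1, rotation, width, height)
    return min(ax, bx), min(ay, by), max(ax, bx), max(ay, by)


if __name__ == "__main__":
    # A wide, short text line on a 200 x 300 page.
    line = (10, 20, 110, 40)
    for rot in (0, 90, 180, 270):
        print(rot, rotate_bbox(line, rot, width=200, height=300))
```

After a 90 degree rotation the 100 x 20 line becomes a 20 x 100 box, so any code that mixes rotated page dimensions with unrotated text coordinates will be working with boxes that no longer line up with the page.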

@theesfeld
Author

good find!!

@VikParuchuri
Owner

@theesfeld @Blacktothefuture I've pushed a fix to the dev branch. It works with the pdf example above. This needs more testing with a range of pdfs to make sure it works properly (and doesn't cause issues with other pdfs) before I merge it.

@theesfeld
Author

can I privately send you a pdf I am having issues with?

@VikParuchuri
Owner

Sure, you can email [email protected] or join the about-to-launch discord and DM me (https://discord.gg//KuZwXNGnfH)

@VikParuchuri
Owner

If anyone has bandwidth to test the fix currently on the dev branch here, I'd appreciate it!

@VikParuchuri
Owner

I've merged the PR, so am closing this. Please re-open if you notice any issues.
