Wrong characters / difference between extraction and display #160

Open
keto33 opened this issue Sep 18, 2023 · 1 comment
Labels
question Further information is requested

Comments

keto33 commented Sep 18, 2023

I noticed that characters are displayed correctly but extracted wrongly in old PDFs (probably digitized ones). Since the purpose of pdfalto and GROBID is to extract text from PDFs whenever the original text is not available, this issue could be quite important. However, I am not sure whether there is a solution for it.

I came across many examples, but this PDF is an excellent example. All instances of the word "awkward" are displayed identically by PDF viewers, but the last one is extracted by pdfalto as "awkM'ard".

On inspection, the text object really is awkM'ard, but it would be very beneficial if it were mapped to the correct, or at least a meaningful, character.
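For readers who want to experiment with the kind of mapping requested here, the following is a minimal sketch of a dictionary-checked repair step applied to already extracted text. It is not part of pdfalto or GROBID. The function name `repair_token`, the confusion pairs, and the tiny vocabulary are all hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical post-processing step; NOT part of pdfalto or GROBID.
# The confusion pairs and the vocabulary are illustrative assumptions.
CONFUSIONS = [("M'", "w"), ("vv", "w"), ("rn", "m")]


def repair_token(token, vocabulary):
    """Return a vocabulary word reachable by one confusion fix, else the token unchanged."""
    # Tokens that are already known words are never touched.
    if token.lower() in vocabulary:
        return token
    for wrong, right in CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            candidate = token[:start] + right + token[start + len(wrong):]
            # Accept a substitution only if it yields a known word.
            if candidate.lower() in vocabulary:
                return candidate
            start = token.find(wrong, start + 1)
    return token


vocabulary = {"awkward", "modern"}
print(repair_token("awkM'ard", vocabulary))
print(repair_token("modern", vocabulary))
```

Checking every candidate against a vocabulary keeps the step conservative: a token is changed only when the substitution produces a known word, so valid words that happen to contain a confusable sequence are left alone.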

Owner

kermitt2 commented Nov 6, 2023

Hi @keto33 !

The goal of pdfalto is to extract and normalize typescript documents, more precisely the text layer and layout information. It does not perform OCR. So if the text is awkM'ard in the text layer of the PDF (due to bad OCR), this is the text that pdfalto will extract.

If the PDF contains only images or a bad OCR text layer, the idea is to OCR or re-OCR the document before applying pdfalto, e.g. via a user pipeline, selecting the appropriate OCR tool.
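Such a user pipeline could be sketched as below. This is only an illustration under stated assumptions: OCRmyPDF is picked as one possible OCR tool (its `--force-ocr` option rasterizes the pages and replaces any existing text layer), pdfalto is invoked in its basic `pdfalto <PDF-file> <xml-file>` form, and the helper names, the regular expression, and the file names are hypothetical.

```python
import re
import subprocess

# Crude, hypothetical signal for OCR damage such as "awkM'ard":
# lowercase letter, uppercase letter, apostrophe, lowercase letter.
SUSPICIOUS = re.compile(r"[a-z][A-Z]['\u2019][a-z]")


def suspicious_ratio(text):
    """Fraction of whitespace-separated tokens that look OCR-damaged."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if SUSPICIOUS.search(token))
    return hits / len(tokens)


def build_commands(pdf_in, pdf_ocr, xml_out):
    """Command lines for re-OCR followed by pdfalto (both tools assumed to be on PATH)."""
    return [
        ["ocrmypdf", "--force-ocr", pdf_in, pdf_ocr],
        ["pdfalto", pdf_ocr, xml_out],
    ]


def run_pipeline(pdf_in, pdf_ocr, xml_out):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(pdf_in, pdf_ocr, xml_out):
        subprocess.run(command, check=True)


sample = "an awkward pause, then an awkM'ard silence"
print(suspicious_ratio(sample))
print(build_commands("scan.pdf", "scan_ocr.pdf", "scan.xml"))
```

The ratio is meant as a hint for deciding whether a document is worth re-OCRing at all, computed on text from a first extraction pass; the threshold to act on is left to the user.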

The only case where I am considering OCR in pdfalto is to resolve the Unicode code points for loaded fonts and for special characters where we only have the glyph (bitmap) of a character, so a very restricted and targeted use of a custom OCR (no progress on this for a few years, however :D ).

kermitt2 added the question (Further information is requested) label Nov 6, 2023