Error case, missing digits #163

kermitt2 · 2024-02-21T19:09:39Z

The ALTO file resulting from the attached PDF does not include any digits!
The regular Xpdf library works fine.

1909.13722.pdf

Aazhar · 2024-02-22T08:44:04Z

Here are additional examples:
2006.09734.pdf
2001.04340.pdf

clason · 2024-04-10T10:41:23Z

(Managing editor of that overlay journal here, and the one responsible for the style file 👋 )

Note that copying them from a PDF viewer works fine -- how do you extract these numbers?

clason · 2024-04-10T11:41:23Z

It seems you're using Xpdf's pdftotext, which fails to read the oldstyle figures. Not sure there's anything that can be done on your side, except replacing it...

clason · 2024-04-10T13:20:06Z

Aha! It seems that the pdftotext from poppler (https://gitlab.freedesktop.org/poppler/poppler, which is forked from Xpdf 3) can extract oldstyle figures just fine! Maybe you can switch to that (or allow users to provide their own pdftotext)?

1909.13722.txt

clason · 2024-08-20T13:45:31Z

@kermitt2 Any news on this? Is there anything I can help with? This is a big issue for us, so I would love to see this resolved.

lfoppiano · 2024-08-22T12:27:01Z

@clason If poppler is a fork of Xpdf 3, it might be tough to integrate it or to give users the ability to plug it in. Are you familiar with C++ programming? I wonder whether Xpdf 4.05 could provide a solution instead.

PS: I'm slowly trying to help maintain this package; however, my time is limited and I'm not a C++ developer, so any help would be much appreciated ;-)

clason · 2024-08-22T12:36:33Z

Hard to tell; I just looked at the CLI tools, not the underlying library. (And I tested with xpdf 4.05, that makes no difference.) I'm not a C++ programmer myself.

Maybe it would be possible to allow people to provide a manually converted txt file so users could simply use the correct pdftotext CLI tool as a pre-step in their workflow?
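
The pre-step proposed here could look roughly like the following. This is a minimal sketch, assuming a poppler-provided `pdftotext` binary is on the PATH; the helper names `build_command` and `extract_text` are hypothetical, and nothing in this thread establishes that pdfalto or Grobid can consume the resulting text file.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, binary="pdftotext"):
    """Build the argument list for: pdftotext <PDF-file> <text-file>."""
    return [binary, str(pdf_path), str(txt_path)]


def extract_text(pdf_path, binary="pdftotext"):
    """Run pdftotext on pdf_path and return the path of the .txt file written."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_command(pdf_path, txt_path, binary), check=True)
    return txt_path
```

Calling `extract_text("1909.13722.pdf")` would then write `1909.13722.txt` next to the PDF, provided the binary is installed.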

clason · 2024-08-22T12:38:29Z

(If that sweetens the deal: poppler is actually hosted as a public repository so is easier to work with and doesn't need to be vendored.)

kermitt2 · 2024-08-22T15:09:40Z

Hello !

These digit characters correspond to font glyphs that are not mapped correctly to Unicode. What needs to be done is to examine the PDF, identify the font used for these digit characters, and look at the problematic Unicode mapping for these "digit" values (the ToUnicode CMap for this font might be problematic or missing). We could look into why those values were correctly mapped in Xpdf version 3 and no longer in version 4.0 (there are extra mappings and heuristics for this in Xpdf).

Moving back to an unmaintained, 10-year-old version of Xpdf is not a solution ;)
... nor is using Poppler, I think, given that Xpdf has been very well maintained since version 4.0 (so Poppler is no longer a clearly relevant replacement), and it would mean more or less rewriting pdfalto entirely.

@clason This will certainly be fixed in pdfalto at some point, but I am wondering if the font used for these digits in your LaTeX package is something particular and could be replaced by a more standard font?

clason · 2024-08-22T15:18:12Z

> We could look into why those values were correctly mapped in Xpdf version 3 and no longer in version 4.0 (there are extra mappings and heuristics for this in Xpdf).

I don't think this is a regression but part of the better support that the poppler fork has received. And for the record: the suggestion was never to revert to Xpdf 3. (I will take your word for it that Xpdf 4 is much better maintained and responsive to issues. Personally, I'm a bit concerned about the commercial entity behind Xpdf and its motivation for open source development -- it would probably also affect the possibility of backporting patches from poppler. But it's your project, and I certainly understand the effort argument.)

> This will certainly be fixed in pdfalto at some point, but I am wondering if the font used for these digits in your LaTeX package is something particular and could be replaced by a more standard font?

This is a standard font (Linux Libertine), and changing it is unfortunately not an option for us (as it's part of the visual identity and chosen deliberately for typographic reasons). Switching now wouldn't help with the already published articles, either. Again, the issue is the use of oldstyle figures, not the font itself.

I am more than willing to help dig into the font mappings, but the fact that poppler extracts them correctly and Xpdf doesn't indicates that this is something the latter should backport from the former. I am not sufficiently familiar with either project (or C++ in general) to do that, though -- someone else would have to take care of that part. The CharCodeToUnicode.cc has diverged quite a bit...

lfoppiano · 2024-12-25T20:58:55Z

A solution is explained here: https://forum.xpdfreader.com/viewtopic.php?p=46796#p46796

lfoppiano · 2024-12-31T09:48:53Z

Thanks to the comment I posted last week, I think I might have found a solution which does not require any modification in pdfalto. @clason could you please check the Grobid branch https://github.com/kermitt2/grobid/tree/update-pdfalto-recognition ?

clason · 2024-12-31T10:34:12Z

How would I do that? Do you have a running instance somewhere? (Otherwise it'd just be simpler for you to download the file in the top comment and check if extraction works.)

clason · 2024-12-31T10:46:58Z

Ah, I found the pre-built binary. Yes, that seems to extract the numbers correctly from the linked file (where pdfalto HEAD fails).

(Curiously, pdfalto works fine for a PDF -- using the same style and fonts! -- that I compiled locally just now, so recent changes in pdflatex or the font may just fix that. Of course, it'll be a while before arXiv updates their TeX stack, so the fix would still be welcome.)

lfoppiano · 2024-12-31T10:51:12Z

@clason The fix was made in Grobid, by adding the missing mapping for a non-standard Adobe font name. You can test it here: https://huggingface.co/spaces/lfoppiano/grobid-dev

In these PDFs there are also ff and tt ligatures that were not mapped correctly (but I did not find the correct font mapping).

clason · 2024-12-31T11:06:14Z

That doesn't do anything for me?

(Weird that calling the bundled pdfalto on the same file then produces an XML with correct reference numbers, where locally built pdfalto fails.)

lfoppiano · 2024-12-31T11:09:20Z

> That doesn't do anything for me?

I don't understand. You might need to select TEI -> Process Fulltext Document and upload the PDF file.

The reference markers are now visible (e.g. [25]).

> (Weird that calling the bundled pdfalto on the same file then produces an XML with correct reference numbers, where locally built pdfalto fails.)

Bundled pdfalto in grobid? If you use the branch I gave you, it will pick up the updated configuration file.

If you now update the master branch of pdfalto and rebuild it, it should also work. Make sure it includes commit 8f38bb1.

clason · 2024-12-31T11:16:57Z

> I don't understand. You might need to select TEI -> Process Fulltext Document and upload the PDF file.

Thank you. I don't know the Grobid stack; I'm just working with pdf->text extraction (which failed previously). This seems to work, as in the references appear in the exported TEI.

There are still some errors, though; it seems, for example, that the 8 is not extracted? See, e.g., the references with id b44 and b45 in 1909.13722.pdf.tei.xml.zip

> Bundled pdfalto in grobid? If you use the branch I gave you, it will pick up the updated configuration file.

Yes, there are pre-built pdfalto binaries under https://github.com/kermitt2/grobid/tree/update-pdfalto-recognition/grobid-home/pdfalto.

clason · 2024-12-31T11:27:48Z

This seems only to apply to older PDFs from arXiv such as the one in the first comments; current (late 2022+) PDFs seem to work fine.

lfoppiano · 2024-12-31T11:29:44Z

> There are still some errors, though; it seems, for example, that the 8 is not extracted? See, e.g., the references with id b44 and b45 in 1909.13722.pdf.tei.xml.zip

Yes, there seem to be some issues with certain characters. I need to investigate a bit more. Thanks for checking that.

clason · 2024-12-31T11:33:14Z

Seems only to be 8, from what I can tell (which breaks detection of years and pages etc.)

(Ligatures are annoying but don't break the number format, so less problematic.)
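
A quick way to spot this kind of failure in an extracted text file is to check which digits never occur at all. This is only a heuristic sketch: the function name `missing_digits` is hypothetical, and a short text can of course lack a digit for perfectly legitimate reasons.

```python
def missing_digits(text):
    """Return the ASCII digits that never occur in text, in ascending order."""
    return sorted(set("0123456789") - set(text))


if __name__ == "__main__":
    import sys

    # Usage: python check_digits.py extracted.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(missing_digits(handle.read()))
```

On a full-length article with a bibliography, an empty result is expected; a lone missing digit such as 8 hints at an unmapped glyph rather than a genuine absence.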

lfoppiano · 2024-12-31T11:59:27Z

Yes, but not everywhere. The DOI, for example, is correct. After inspecting further, it seems that 8 is mapped to a different font name, eight.taboldstyle.

Most of them should be fixed, but since the number of non-standard fonts is hard to predict, there might be a few other missing characters somewhere.

lfoppiano · 2024-12-31T12:03:37Z

@flydutch could you please remind me what hack was done with poppler to solve most of these issues?

clason · 2024-12-31T12:05:56Z

> Yes, but not everywhere. The DOI, for example, is correct.

Yes, because that's a different font. It's only oldstyle figures in Linux Libertine (pre-2022).

> Most of them should be fixed, but since the number of non-standard fonts is hard to predict, there might be a few other missing characters somewhere.

Yeah, sorry about that. (There's also math mode (newtxmath, https://ctan.org/pkg/newtx) where, e.g., subscript numbers are missing.)

The good news is that "modern" PDFs seem to work fine, so this only affects a finite set.

lfoppiano · 2024-12-31T12:18:35Z

I cannot find what the font name is.

@clason If you could provide me with the unicode -> font name pairs, I'm happy to add them to the mapping right away. I see there are a lot of formula characters that might also not be mapped (e.g. the cursive l in formulas).

For example I had to map:

0030 zero.oldstyle
0031 one.oldstyle
...
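
Mapping lines in this `HEX name` shape can be generated rather than typed by hand. The sketch below assumes that every digit follows the same naming pattern as the two entries shown above, and that both the `oldstyle` and `taboldstyle` suffixes are needed (only `eight.taboldstyle` is actually reported in this thread); the function name `mapping_lines` is hypothetical, and the output should be checked against the glyph names really present in the font.

```python
# Standard glyph names for the ASCII digits 0-9.
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def mapping_lines(suffixes=("oldstyle", "taboldstyle")):
    """Return 'HEX name.suffix' lines, one per digit and suffix."""
    lines = []
    for suffix in suffixes:  # assumed suffix list, see lead-in
        for value, name in enumerate(DIGIT_NAMES):
            codepoint = ord("0") + value  # U+0030 .. U+0039
            lines.append(f"{codepoint:04X} {name}.{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(mapping_lines()))
```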

clason · 2024-12-31T12:31:32Z

Does this help?
https://tug.ctan.org/fonts/libertine/latex/LinLibertine_R.tex

(The font name for math should be nxlmi and ntxsy*.)

lfoppiano · 2024-12-31T17:45:17Z

I covered some cases. I am probably still missing the || || parallel symbol, but I cannot find the name that should be used.

I created a branch called fix-non-standard-font-names in this repository so that we can add all the missing mappings without rushing.

clason · 2024-12-31T17:49:20Z

The 8 is the big issue, since it breaks metadata detection. The rest is just nice to have (with the ligatures being the most bang for the buck).

lfoppiano · 2024-12-31T17:51:54Z

Ah, the 8 was fixed since this afternoon ^_^

<biblStruct xml:id="b44">
    <analytic>
        <title level="a" type="main">Optimal control of quasistatic plasticity with linear kinematic hardening II: Regularization and differentiability</title>
        <author>
            <persName>
                <forename type="first">G</forename>
                <surname>Wachsmuth</surname>
            </persName>
        </author>
        <idno type="DOI">10.4171/ZAA/1546</idno>
    </analytic>
    <monogr>
        <title level="j">Z. Anal. Anwend</title>
        <imprint>
            <biblScope unit="volume">34</biblScope>
            <biblScope unit="page" from="391" to="418"/>
            <date type="published" when="2015">2015</date>
        </imprint>
    </monogr>
</biblStruct>
<biblStruct xml:id="b45">
    <analytic>
        <title level="a" type="main">Optimal control of quasistatic plasticity with linear kinematic hardening III: Optimality conditions</title>
        <author>
            <persName>
                <forename type="first">G</forename>
                <surname>Wachsmuth</surname>
            </persName>
        </author>
        <idno type="DOI">10.4171/ZAA/1556</idno>
    </analytic>
    <monogr>
        <title level="j">Z. Anal. Anwend</title>
        <imprint>
            <biblScope unit="volume">35</biblScope>
            <biblScope unit="page" from="81" to="118"/>
            <date type="published" when="2016">2016</date>
        </imprint>
    </monogr>
</biblStruct>

I was trying to improve the formulas, which are now mostly covered, but some small characters are still missing. As soon as I develop a method to recognise and systematize these problems, we should be able to scale it and speed it up.

clason · 2024-12-31T17:54:14Z

Oooh, taboldstyle. That sounds like a font error...

flydutch · 2025-01-02T09:38:14Z

> @flydutch could you please remind me what hack was done with poppler to solve most of these issues?

(I am by no means a PDF expert) I have seen that in poppler there is a parseCharName() function that ignores all characters after a period (that's why 'one.oldstyle' & co. don't give problems). Other fixes/workarounds are also done by that function (ligatures, ...).
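
The idea described here can be sketched as follows. This is not poppler's code: it is a rough illustration, loosely following the glyph-naming conventions of the Adobe Glyph List specification (drop everything from the first period, split ligature components on underscores, accept `uniXXXX` and `uXXXX` names). The table `BASE_GLYPHS` is a tiny hypothetical excerpt, and the function name is made up for this example.

```python
import re

# Tiny excerpt of a glyph-name table; a real one has thousands of entries.
BASE_GLYPHS = {name: str(i) for i, name in enumerate(
    ["zero", "one", "two", "three", "four",
     "five", "six", "seven", "eight", "nine"])}
BASE_GLYPHS.update({"f": "f", "t": "t"})


def glyph_name_to_text(name):
    """Map a glyph name to text, or return None if any component is unknown."""
    base = name.split(".", 1)[0]  # "one.oldstyle" -> "one"
    out = []
    for component in base.split("_"):  # "f_f" -> ["f", "f"]
        if component in BASE_GLYPHS:
            out.append(BASE_GLYPHS[component])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", component):
            digits = component[3:]
            # Each group of four hex digits is one code point.
            out.append("".join(chr(int(digits[i:i + 4], 16))
                               for i in range(0, len(digits), 4)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", component):
            out.append(chr(int(component[1:], 16)))
        else:
            return None
    return "".join(out)
```

With this approach, suffixed names such as `one.oldstyle` or `eight.taboldstyle` resolve without any dedicated mapping entry, which matches the behaviour described above.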

lfoppiano · 2025-01-11T10:27:31Z

At the current stage, after adding sufficient mappings, the output from pdfalto looks better than what can be obtained with poppler. However, I'm not sure the new mappings won't break other documents.

kermitt2 self-assigned this Feb 22, 2024

lfoppiano added the bug label Mar 25, 2024

lfoppiano added the help wanted label Aug 22, 2024

lfoppiano self-assigned this Dec 25, 2024

lfoppiano added the implemented label Dec 31, 2024

lfoppiano removed the implemented label Dec 31, 2024

lfoppiano added the implemented label Dec 31, 2024

lfoppiano added the error case label Jan 1, 2025
