Font name encoding issue #2971

Closed
dothinking opened this issue Jan 3, 2024 · 4 comments
Labels: not a bug (not a bug / user error / unable to reproduce), wontfix (no intention to resolve)


@dothinking

Description of the bug

In my test case below, the font name (which contains Chinese characters) appears to be decoded incorrectly when extracted with get_fonts() or get_text('rawdict'). Please look into it, thanks.

How to reproduce the bug

import fitz
doc = fitz.Document('sample.pdf')
doc[0].get_fonts()

# output:
#[(6,
#  'ttf',
#  'TrueType',
#  'BCDEEE+å\x8d\x8eæ\x96\x87ä»¿å®\x8b',    <- according to a PDF viewer, the name should be 华文仿宋
#  'F1',
#  'WinAnsiEncoding')]

sample.pdf

PyMuPDF version

1.23.8

Operating system

Windows

Python version

3.8

@JorjMcKie
Collaborator

There is no way to do that. The font name is stored in the PDF like this:
[image: screenshot of the font's definition in the PDF source]

The name of a font is a PDF name object. By definition, a PDF name object starts with a "/" followed by 8-bit characters. Unicode is not supported.
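
To illustrate how such a name ends up looking like the garbled output reported above, here is a minimal sketch. It assumes that the PDF producer wrote the font name as UTF-8 bytes and that each stored byte is then surfaced as one character. The variable names are only for illustration.

```python
# Assumption: the producer stored the font name as UTF-8 bytes inside the
# PDF name object, and each byte is surfaced as one character.
name = "华文仿宋"

raw = name.encode("utf-8")        # the 8-bit values stored in the PDF
as_chars = raw.decode("latin-1")  # one character per byte, code points 0-255

print(len(name), len(raw))        # 4 characters become 12 bytes
print(ascii(as_chars))            # escaped form of the byte-per-character string
```

Latin-1 is used here only because it maps byte values 0-255 to the code points with the same numbers, which matches the byte-per-character view of the name.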

JorjMcKie added the labels "not a bug" and "wontfix" on Jan 3, 2024
@dothinking
Author

Thanks for the quick reply. But why can a PDF viewer display the correct font name? Could this be an upstream issue in MuPDF?

@JorjMcKie
Collaborator

> Thanks for the quick reply. But why can a PDF viewer display the correct font name? Could this be an upstream issue in MuPDF?

No, of course not.
You are looking at the PDF source code, where the object with xref number 6 is defined. That definition contains PDF objects describing the font's characteristics. Of course, one could take the single bytes in the /BaseFont definition and try to interpret them as UTF-8. We simply do not bother to do that at the moment, that's all.

We also do not look into the font's binary at all at this point.

Access to the font's self-identification is included with text extraction and with font extraction itself, see here:

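import fitz
doc = fitz.open("sample.pdf")
ff = doc.extract_font(6)  # 6 is the font's xref; the last tuple item is the font file content
font = fitz.Font(fontbuffer=ff[-1])  # build a Font from the embedded font binary
font
# output: Font('STFangsong Regular')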

You can also use Python to reinterpret the bytes as UTF-8:

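page = doc[0]
item = page.get_fonts()[0]
fontname = item[3]  # the font name as returned, one character per byte
# map each character back to its byte value, then decode the bytes as UTF-8
realname = bytes([ord(c) for c in fontname]).decode()
realname
# output: 'BCDEEE+华文仿宋'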
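
The one-liner above raises an exception if a name contains a character above code point 255 or if its bytes are not valid UTF-8. A more defensive variant could look like the sketch below. Note that `decode_pdf_font_name` is a hypothetical helper written for this example, not part of PyMuPDF, and falling back to the unchanged name is just one possible choice.

```python
def decode_pdf_font_name(fontname: str) -> str:
    """Reinterpret a byte-per-character font name as UTF-8 (hypothetical helper)."""
    try:
        # Latin-1 maps code points 0-255 to the same byte values, so this is
        # equivalent to bytes([ord(c) for c in fontname]).
        raw = fontname.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not byte-per-character, or not valid UTF-8: keep the name unchanged.
        return fontname


garbled = "BCDEEE+\xe5\x8d\x8e\xe6\x96\x87\xe4\xbb\xbf\xe5\xae\x8b"
print(decode_pdf_font_name(garbled))
print(decode_pdf_font_name("Helvetica"))
```

Plain ASCII names pass through unchanged because ASCII is a subset of UTF-8, so the helper can be applied to every name returned by get_fonts().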

@dothinking
Author

Thanks for the detailed explanation. I fully understand now.
