import fitz  # PyMuPDF


def get_text_pdf(input_pdf):
    """Print the text of every span, with a separator line after each block."""
    pdf = fitz.open(input_pdf)
    for page in pdf:
        blocks = page.get_text("dict", sort=True)["blocks"]
        for block in blocks:
            # Image blocks have no "lines" key, so fall back to an empty list.
            for line in block.get("lines", []):
                for span in line["spans"]:
                    print(span["text"])
            print("---------------")
PyMuPDF version
1.23.9rc1
Operating system
Linux
Python version
3.9
Nancis1130 changed the title from "Get_ Text() cannot correctly partition blocks on Chinese documents" to "get_text() cannot correctly partition blocks on Chinese documents" on Jan 4, 2024
Every text extraction method returns text in the order in which it is stored in the page's /Contents (the source code that generates the page's appearance).
Structuring text into blocks, lines, etc. happens exclusively in MuPDF: nothing in the PDF itself represents such a structure.
So the algorithm uses a number of criteria, such as font, font size, writing direction, line distances, inter-character distances and much more, to come up with that structure.
That works in many cases, but success is by no means guaranteed in general.
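To see which of these criteria differ between two neighboring blocks, one option is to list each span's font, size and bottom coordinate. The following is only a sketch: `describe_spans` is a hypothetical helper, not part of PyMuPDF, and it assumes the block layout returned by page.get_text("dict"), where each span carries "text", "font", "size" and "bbox" keys.

```python
def describe_spans(blocks):
    """Return (block_no, text, font, size, bottom_y) for every span.

    `blocks` is assumed to be page.get_text("dict")["blocks"].
    """
    rows = []
    for block_no, block in enumerate(blocks):
        # Image blocks carry no "lines" key and are skipped here.
        for line in block.get("lines", []):
            for span in line["spans"]:
                bottom_y = round(span["bbox"][3], 1)  # bbox is (x0, y0, x1, y1)
                rows.append(
                    (block_no, span["text"], span["font"], span["size"], bottom_y)
                )
    return rows
```

Spans that visually share a line but report a different bottom_y, font or size are likely candidates for the place where the block was split.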
Unfortunately, you did not provide the problem file, so we cannot be sure what is really going on there.
It looks, however, as if text pieces such as ":" and "---" were (1) inserted later and (2) placed with a different bottom coordinate than the neighboring text. There is no way to heal this situation other than additional code on your side.
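As one possible shape for such additional code, the sketch below merges text blocks that overlap horizontally and sit at most a few points apart vertically. It is an illustration under assumptions, not a PyMuPDF feature: `block_text`, `merge_close_blocks` and the `max_gap` threshold are hypothetical names, and the right threshold depends on the document.

```python
def block_text(block):
    """Join the span texts of one text block, one output line per text line."""
    return "\n".join(
        "".join(span["text"] for span in line["spans"])
        for line in block.get("lines", [])
    )


def merge_close_blocks(blocks, max_gap=5.0):
    """Merge vertically adjacent text blocks at most `max_gap` points apart.

    `blocks` is assumed to be page.get_text("dict")["blocks"].
    `max_gap` is a hypothetical tuning value, not a PyMuPDF setting.
    """
    text_blocks = [b for b in blocks if "lines" in b]  # drop image blocks
    text_blocks.sort(key=lambda b: (b["bbox"][1], b["bbox"][0]))  # top, then left

    merged = []
    for block in text_blocks:
        x0, y0, x1, y1 = block["bbox"]
        if merged:
            px0, py0, px1, py1 = merged[-1]["bbox"]
            overlaps_horizontally = x0 < px1 and px0 < x1
            if overlaps_horizontally and y0 - py1 <= max_gap:
                # Grow the previous block's rectangle and append the text.
                merged[-1]["bbox"] = (
                    min(px0, x0), min(py0, y0), max(px1, x1), max(py1, y1)
                )
                merged[-1]["text"] += "\n" + block_text(block)
                continue
        merged.append({"bbox": tuple(block["bbox"]), "text": block_text(block)})
    return merged
```

With PyMuPDF this could be called as merge_close_blocks(page.get_text("dict", sort=True)["blocks"]). A gap-based rule is deliberately simple and will also join paragraphs that are set very close together, so the threshold needs checking against the actual file.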
Description of the bug
Text that forms one complete block in the source file is split into two blocks by the get_text() method. On English documents, the text blocks are divided correctly.
source file: (image not preserved)
get_text() result: (image not preserved)
source pdf
Thanks for your help!
How to reproduce the bug
Run the get_text_pdf() function listed at the top of this issue on the source PDF.