get_text() cannot correctly partition blocks on Chinese documents #2974

Closed
Nancis1130 opened this issue Jan 4, 2024 · 1 comment
Labels
not a bug not a bug / user error / unable to reproduce

Comments


Nancis1130 commented Jan 4, 2024

Description of the bug

A paragraph that is one continuous piece of text in the source file is split into several blocks by the get_text() method. On English documents, the text blocks are divided correctly.

source file:
image
get_text() result:

内容提要
 :
本文从实证上研究中国金融发展和经济增长之间的关系。由于金融
发展主要包括金融中介体发展和股票市场发展两部分
 ,
本文依次研究中国金融中介
体发展和经济增长之间的实证关系、中国股票市场发展和经济增长之间的实证关系
以及中国金融中介体发展和股票市场发展之间的实证关系。本文的结论是
 ,
在中国
---------------
金融中介体发展和经济增长之间有显著的、很强的正相关关系
 ,
这意味着我国金融中
介体的发展有可能促进经济增长
 ,
同时也意味着金融中介体的发展不能滞后于经济
增长
 ;
在中国股票市场发展和经济增长之间有不显著的负相关关系
 ,
这意味着我国股
票市场发展对经济增长的作用是极其有限的
 ,
即使有那么一点点
 ,
也是不利的
 ;
在中
国金融中介体发展和股票市场发展之间有显著的正相关关系
 ,
这意味着在现阶段的
---------------
我国
 ,
股票市场的发展并不排斥金融中介体的发展。

source pdf

Thanks for your help!

How to reproduce the bug

import fitz  # PyMuPDF

def get_text_pdf(input_pdf):
    """Print the text of every span, with a separator line after each block."""
    pdf = fitz.open(input_pdf)
    for page in pdf:
        blocks = page.get_text("dict", sort=True)["blocks"]
        for block in blocks:
            # Image blocks have no "lines" key, so fall back to an empty list.
            for line in block.get("lines", []):
                for span in line["spans"]:
                    print(span["text"])
            print("---------------")

PyMuPDF version

1.23.9rc1

Operating system

Linux

Python version

3.9

Nancis1130 changed the title from "Get_ Text() cannot correctly partition blocks on Chinese documents" to "get_text() cannot correctly partition blocks on Chinese documents" on Jan 4, 2024
JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) on Jan 4, 2024
JorjMcKie (Collaborator) commented

This is not a bug.

Every text extraction method returns text in the order in which it is stored in the page's /Contents (the source code that generates the page's appearance).
Structuring text into blocks, lines, etc. is done exclusively by MuPDF: nothing in the PDF itself represents such a structure.
So the algorithm uses a number of criteria such as font, font size, writing direction, line distances, inter-character distances and many more to come up with that structure.
This works in many cases, but success is by no means guaranteed.
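
To see which of these criteria might differ between neighboring pieces of text, the span attributes in the "dict" output can be listed side by side. The following is only a minimal sketch: the helper name describe_spans is made up for illustration, and it relies on the "font", "size", "bbox" and "text" keys of span dictionaries returned by page.get_text("dict").

```python
def describe_spans(blocks):
    """Return (block index, font, size, bottom y, text) for every span.

    `blocks` is the "blocks" list of page.get_text("dict").
    describe_spans is a hypothetical helper, not part of PyMuPDF.
    """
    rows = []
    for block_index, block in enumerate(blocks):
        # Image blocks have no "lines" key, so they contribute no rows.
        for line in block.get("lines", []):
            for span in line["spans"]:
                bottom = round(span["bbox"][3], 1)  # bbox is (x0, y0, x1, y1)
                rows.append(
                    (block_index, span["font"], span["size"], bottom, span["text"])
                )
    return rows


# Possible usage with a real page (the file name is an assumption):
#   import fitz
#   page = fitz.open("input.pdf")[0]
#   for row in describe_spans(page.get_text("dict", sort=True)["blocks"]):
#       print(row)
```

Spans whose font, size or bottom coordinate differ from their neighbors are likely candidates for having caused a block or line split.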

Unfortunately, you did not provide the problem file, so we cannot be sure what is really going on there.
However, it looks as if text pieces like ":", "---" etc. have been (1) inserted later and (2) given a different bottom coordinate than the neighboring text. The only way to fix this is with additional code on your side.
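
One possible shape for such additional code is to re-join text blocks that sit directly below each other. This is a rough sketch under stated assumptions, not something PyMuPDF provides: merge_close_blocks and its max_gap threshold are hypothetical, the blocks are assumed to come from page.get_text("dict", sort=True) in top-to-bottom order, and the 5-point gap is an arbitrary value that would need tuning per document.

```python
def block_text(block):
    """Concatenate the text of all spans in one text block."""
    return "".join(
        span["text"] for line in block.get("lines", []) for span in line["spans"]
    )


def merge_close_blocks(blocks, max_gap=5.0):
    """Merge consecutive text blocks separated by at most max_gap points.

    Assumes `blocks` is ordered top to bottom, as with sort=True.
    max_gap is an illustrative threshold, not a PyMuPDF parameter.
    """
    merged = []
    for block in blocks:
        if block.get("type", 0) != 0:  # skip image blocks (type 1)
            continue
        x0, y0, x1, y1 = block["bbox"]
        text = block_text(block)
        if merged:
            px0, py0, px1, py1 = merged[-1]["bbox"]
            # Only join blocks that share some horizontal extent.
            overlaps = min(px1, x1) > max(px0, x0)
            if overlaps and y0 - py1 <= max_gap:
                merged[-1]["bbox"] = (
                    min(px0, x0), min(py0, y0), max(px1, x1), max(py1, y1)
                )
                merged[-1]["text"] += text
                continue
        merged.append({"bbox": (x0, y0, x1, y1), "text": text})
    return merged
```

The horizontal-overlap check is there so that blocks in separate columns are not joined by accident; multi-column or heavily annotated pages would probably need a more careful rule.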
