Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate value for merged cell instead of None #420

Open
tungph opened this issue Apr 22, 2021 · 3 comments
Open

Duplicate value for merged cell instead of None #420

tungph opened this issue Apr 22, 2021 · 3 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@tungph
Copy link

tungph commented Apr 22, 2021

Hi there,
Firstly, I want to thank you for creating this awesome pdfplumber project. I have been using it a lot lately.
Secondly, I'm facing a problem and want to propose change to the output of the table module regarding the utility extract_tables with merged cells.
As it is now, only the first row of the merged cell has value, the remaining cells get None.
image

Header 1, Header 2
Merged cell 1, Cell 2-1
None, Cell 2-2

It would be more semantically correct to have the value in merged cell duplicated like this

Header 1, Header 2
Merged cell 1, Cell 2-1
Merged cell 1, Cell 2-2

I'm preparing a PR, please consider it.
Thank you!

Update 1

Here's the PR: #422

Approach: finding the centroid of each "cell", and then get the cells from the centroids for each row.
image

Update 2:

I've encountered this case:
image
Expected output is:

1-1, 2-1
None, 2-2
None, 2-3

None means not a cell there, and it makes the number of elements is even between rows which is convenient for further processing.
Note: the PR is updated

@tungph tungph added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Apr 22, 2021
@jsvine
Copy link
Owner

jsvine commented Apr 26, 2021

Hi @tungph, thanks for your interest in this library and thank you for the PR. I agree that pdfplumber's current approach to complex tables is not optimal, but it is the result of not wanting to over-impose assumptions about any given table. The handling of merged cells in tables is a tricky topic, and one with a lot of edge cases — what may be a good solution for one set of tables may be a poor one for others. So I'll need to consider this suggestion and PR somewhat carefully, and think about (and test) other PDFs.

Relatedly: I have an idea for providing a richer representation of complex tables — one that would make the structure of the tables much clearer — but it would be a substantial change and so it will probably have to wait for v0.6.0.

@tungph
Copy link
Author

tungph commented Apr 29, 2021

Thank you for the consideration, @jsvine. Please let me know if you found bug with my PR. I'm happy to provide fix.

@Safrone
Copy link

Safrone commented Mar 20, 2024

I would love to have this as an option when extracting tables even if it is not the default behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"
Projects
None yet
Development

No branches or pull requests

3 participants