fix: PyPDFToDocument initializes documents with content and meta #8698

julian-risch · 2025-01-09T18:29:50Z

Related Issues

Related to "Document ID doesn't updated upon metadata update" #8692

Proposed Changes:

Initialize the Document returned by PyPDFToDocument with both content and meta so that both are taken into account for document ID generation. Previously, the Document was initialized with only the content and the metadata was added afterwards, so the metadata did not influence the ID.
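
To illustrate why the order of initialization matters, here is a minimal sketch. It assumes that the ID is a hash over content and meta computed once at construction time. `SketchDocument` and the `file_path` values are hypothetical stand-ins for illustration only, not the library's actual class or hashing scheme.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document class (not the library's implementation)."""

    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Assumption: the ID is derived once, at construction, from content and meta.
        if not self.id:
            payload = f"{self.content}{self.meta}"
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


text = "Page 1 content\fPage 2 content"

# Old pattern: construct from content only, attach metadata afterwards.
# The ID was already computed, so the later metadata cannot influence it.
late_a = SketchDocument(content=text)
late_a.meta.update({"file_path": "a.pdf"})
late_b = SketchDocument(content=text)
late_b.meta.update({"file_path": "b.pdf"})

# New pattern: pass content and meta together at construction.
early_a = SketchDocument(content=text, meta={"file_path": "a.pdf"})
early_b = SketchDocument(content=text, meta={"file_path": "b.pdf"})

ids_collide_when_meta_added_late = late_a.id == late_b.id
ids_differ_when_meta_passed_at_init = early_a.id != early_b.id
```

Under these assumptions, two files with identical text but different metadata end up with the same ID in the old pattern and with distinct IDs in the new one.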

How did you test it?

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

wochinge

🙌🏻 Let's go!

wochinge · 2025-01-09T18:33:25Z

test/components/converters/test_pypdf_to_document.py

@@ -113,8 +113,8 @@ def test_default_convert(self):
            layout_mode_font_height_weight=1.5,
        )

-        doc = converter._default_convert(mock_reader)
-        assert doc.content == "Page 1 content\fPage 2 content"
+        text = converter._default_convert(mock_reader)


Testing against private methods isn't really best practice, but I guess that's a thing for a different time 😁

coveralls · 2025-01-09T19:13:27Z

Pull Request Test Coverage Report for Build 12696715605

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 91.094%

Totals
Change from base Build 12694180152:	0.0%
Covered Lines:	8653
Relevant Lines:	9499

💛 - Coveralls

julian-risch added 2 commits January 9, 2025 19:27

initialize document with content and meta

a193d1b

update test

0f92855

github-actions bot added the topic:tests label Jan 9, 2025

wochinge approved these changes Jan 9, 2025

julian-risch marked this pull request as ready for review January 9, 2025 18:46

julian-risch requested review from a team as code owners January 9, 2025 18:46

julian-risch requested review from dfokina and davidsbatista and removed request for a team January 9, 2025 18:46

add test checking that not only content is used for id generation

89007bf

julian-risch enabled auto-merge (squash) January 9, 2025 19:00

julian-risch merged commit dd9660f into main Jan 9, 2025
18 checks passed

julian-risch deleted the pypdf-docid branch January 9, 2025 19:12
