fix: PyPDFToDocument initializes documents with content and meta #8698

Merged
julian-risch merged 3 commits into main from pypdf-docid
Jan 9, 2025

Member

@julian-risch julian-risch commented Jan 9, 2025

Related Issues

Proposed Changes:

  • Initialize the Document returned by PyPDFToDocument with both content and meta so that both are taken into account when the document ID is generated. Previously, the Document was initialized with the content only and the metadata was added afterwards, so the metadata did not contribute to the ID.
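The effect described above can be illustrated with a minimal, self-contained sketch. It does not use Haystack itself: `SketchDocument` is a hypothetical stand-in that, by assumption, hashes its content and meta once at construction time, and the file paths are made up for illustration. The actual hashing scheme in Haystack may differ.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in: the ID is derived once, at construction time."""

    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(default="", init=False)

    def __post_init__(self) -> None:
        # Assumption: the ID is a hash over content AND meta, computed only here.
        payload = f"{self.content}{sorted(self.meta.items())}"
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


text = "Page 1 content\fPage 2 content"

# Old pattern: build from content only, attach meta afterwards.
# The ID was already computed, so the meta never influences it.
late_a = SketchDocument(content=text)
late_a.meta.update({"file_path": "a.pdf"})
late_b = SketchDocument(content=text)
late_b.meta.update({"file_path": "b.pdf"})

# New pattern: pass content and meta together at initialization.
init_a = SketchDocument(content=text, meta={"file_path": "a.pdf"})
init_b = SketchDocument(content=text, meta={"file_path": "b.pdf"})

print(late_a.id == late_b.id)  # True: same text, meta ignored, IDs collide
print(init_a.id == init_b.id)  # False: meta is part of the hash
```

In this sketch, two files with identical text end up with the same ID when the metadata is attached late, and with distinct IDs when it is passed at initialization.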

How did you test it?

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

Contributor

@wochinge wochinge left a comment

🙌🏻 Let's go! :shipit:

@@ -113,8 +113,8 @@ def test_default_convert(self):
             layout_mode_font_height_weight=1.5,
         )

-        doc = converter._default_convert(mock_reader)
-        assert doc.content == "Page 1 content\fPage 2 content"
+        text = converter._default_convert(mock_reader)
Contributor

Testing against private methods isn't really best practice, but I guess that's a topic for another time 😁

@julian-risch julian-risch marked this pull request as ready for review January 9, 2025 18:46
@julian-risch julian-risch requested review from a team as code owners January 9, 2025 18:46
@julian-risch julian-risch requested review from dfokina and davidsbatista and removed request for a team January 9, 2025 18:46
@julian-risch julian-risch enabled auto-merge (squash) January 9, 2025 19:00
@julian-risch julian-risch merged commit dd9660f into main Jan 9, 2025
18 checks passed
@julian-risch julian-risch deleted the pypdf-docid branch January 9, 2025 19:12
@coveralls
Collaborator

Pull Request Test Coverage Report for Build 12696715605

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 91.094%

Totals Coverage Status
Change from base Build 12694180152: 0.0%
Covered Lines: 8653
Relevant Lines: 9499

💛 - Coveralls

3 participants