Uv4 #8

Draft
wants to merge 53 commits into
base: newr
Conversation

bosd
Owner

@bosd bosd commented Dec 1, 2024

Major refactor of the code by adding typing and updating the docstrings.
Pytest and Flake8 fixes.

bosd added 30 commits November 24, 2024 17:02
Update cookiecutter template
Fix N806 change variable tmp_folder to lowercase
Fix D205
Docstrings: Improved the docstrings to follow the Google style guide, including clear descriptions, parameter and return type documentation, and example usage.
Indentation: Fixed indentation to consistently use 4 spaces.
Type Hints: Added type hints to function parameters and return values.
File Handling: Used with open(filename, "w") as xml_file: for better file handling and to ensure the file is closed properly.
Loop Counter: Simplified the loop counter using enumerate.
Unnecessary elif: Removed unnecessary elif conditions when checking for int and float types by combining them using isinstance(v, (int, float)).
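
The three code-level points above (context-managed file handling, `enumerate`, and a combined `isinstance` check) can be illustrated with a minimal sketch. The names `build_items` and `write_xml` are hypothetical and the element layout is an assumption; this is not the project's actual `to_xml` module.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def build_items(records: List[Dict[str, Any]]) -> ET.Element:
    """Build an <items> element holding one numbered <item> per record."""
    root = ET.Element("items")
    # enumerate replaces a manually incremented loop counter.
    for index, record in enumerate(records, start=1):
        item = ET.SubElement(root, "item", id=str(index))
        for key, value in record.items():
            child = ET.SubElement(item, key)
            # One isinstance call covers both numeric types.
            if isinstance(value, (int, float)):
                child.text = str(value)
            else:
                child.text = value  # assumed to already be a string
    return root


def write_xml(records: List[Dict[str, Any]], filename: str) -> None:
    """Serialize the records and write them to filename."""
    xml_text = ET.tostring(build_items(records), encoding="unicode")
    # The context manager closes the file even if the write fails.
    with open(filename, "w", encoding="utf-8") as xml_file:
        xml_file.write(xml_text)
```
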

add defused xml library
Docstrings: Rewrote the docstrings to adhere to the Google style guide, providing clear explanations, parameter descriptions, and example usage.
Redundant elif: Removed the redundant elif isinstance(item, datetime.date) condition, as it was a duplicate of the first if condition.
Iteration: Simplified the code by using a consistent way to iterate over both dictionaries and lists using iter_obj.
codecs import: Removed the unnecessary codecs import, as the built-in open function handles UTF-8 encoding.
sort_keys: Removed the sort_keys=False argument in json.dump, as the default behavior is already False.
File Handling: Used the more concise conditional expression (filename = path + ".json" if not path.endswith(".json") else path) for filename assignment.
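
As a rough illustration of the JSON changes above, the sketch below uses the conditional expression for the filename and relies on the defaults of `open` and `json.dump`. The function names, the `date_format` parameter, and the `indent` setting are assumptions made for the example. A single `datetime.date` check is enough because `datetime.datetime` is a subclass of `datetime.date`.

```python
import datetime
import json
import os
from typing import Any, Dict, List


def json_filename(path: str) -> str:
    """Append ".json" unless the path already ends with it."""
    return path + ".json" if not path.endswith(".json") else path


def write_json(
    data: List[Dict[str, Any]], path: str, date_format: str = "%Y-%m-%d"
) -> str:
    """Write the rows to a JSON file and return the filename used."""
    filename = json_filename(path)
    serializable = [
        {
            key: value.strftime(date_format)
            if isinstance(value, datetime.date)
            else value
            for key, value in line.items()
        }
        for line in data
    ]
    # open() handles the encoding directly, so codecs is not needed;
    # sort_keys is left at its default of False.
    with open(filename, "w", encoding="utf-8") as json_file:
        json.dump(serializable, json_file, indent=4, ensure_ascii=False)
    return filename
```
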
Docstrings: Rewrote the docstrings to follow the Google style guide, providing clear explanations, parameter descriptions, and example usage.
Python 2 Compatibility: Removed the Python 2 compatibility code (if sys.version_info[0] < 3), as Python 2 is no longer supported.
File Handling: Used the more concise conditional expression for filename assignment and opened the file with UTF-8 encoding.
Header Handling: Simplified the header row creation using list(line.keys()).
Date Formatting: Added a comment to clarify the assumption that the value v is a date object before applying strftime.
Unnecessary Comment: Removed the unnecessary comment # first_row.append(k).
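
The header handling and date formatting described above might look like the following sketch. `write_rows` is a hypothetical name, and unlike the commit (which only documents the assumption that the value is a date) this version checks the type explicitly before calling `strftime`.

```python
import csv
import datetime
import io
from typing import Any, Dict, List, TextIO


def write_rows(
    data: List[Dict[str, Any]], stream: TextIO, date_format: str = "%Y-%m-%d"
) -> None:
    """Write a header row taken from the first record, then one row per record."""
    writer = csv.writer(stream, delimiter=",")
    for index, line in enumerate(data):
        if index == 0:
            # The header row comes straight from the keys of the first record.
            writer.writerow(list(line.keys()))
        writer.writerow(
            [
                value.strftime(date_format)
                if isinstance(value, datetime.date)
                else value
                for value in line.values()
            ]
        )


buffer = io.StringIO()
write_rows([{"date": datetime.date(2024, 12, 1), "amount": 10.5}], buffer)
```
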
Removed unnecessary imports: Removed the unused json_format import.
Updated API calls:
Changed vision.types.Feature to vision.Feature.
Changed vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION to vision.Feature.Type.DOCUMENT_TEXT_DETECTION.
Changed vision.types.GcsSource to vision.GcsSource.
Changed vision.types.InputConfig to vision.InputConfig.
Changed vision.types.GcsDestination to vision.GcsDestination.
Changed vision.types.OutputConfig to vision.OutputConfig.
Changed vision.types.AsyncAnnotateFileRequest to vision.AsyncAnnotateFileRequest.
Changed json_format.Parse(json_string, vision.types.AnnotateFileResponse()) to vision.AnnotateFileResponse.from_json(json_string).
Added type hints: Added type hints to the function parameters and return value.
Improved docstrings: Added a Google-style docstring to the function.
Minor code style improvements: Used f-strings for string formatting and improved variable naming consistency.

Refactor Google vision import module

Test: Fix gvision unit tests

This commit fixes the gvision unit tests that were failing due to incorrect mocking and assertions.

The following changes were made:

- Corrected the mocking of `get_blob` in both test cases to accurately simulate the behavior of Google Cloud Storage.
- Added `side_effect` to `mock_bucket.get_blob` in `test_to_text` to return different values on consecutive calls.
- Simplified the mocking of `get_blob` in `test_to_text_existing_result` to return the result blob directly.
- Ensured that `to_text` is called with the correct `path` argument in both test cases.
- Used `assert_any_call` in `test_to_text_existing_result` to check if `get_blob` is called with the expected argument without enforcing it as the only call.

These corrections ensure that the unit tests accurately test the functionality of the gvision input module and pass reliably.
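
The mocking techniques listed above can be sketched with `unittest.mock` alone. `find_blob` is a hypothetical stand-in for the lookup logic (it is not the gvision module's code), and the blob names are made up for the example.

```python
from unittest import mock


def find_blob(bucket, path: str):
    """Hypothetical helper: prefer a cached result blob, else use the source blob."""
    result = bucket.get_blob(path + ".json")
    if result is not None:
        return result
    return bucket.get_blob(path)


# side_effect with a list returns a different value on each consecutive call:
# first None (no cached result), then the source blob.
mock_bucket = mock.Mock()
mock_bucket.get_blob.side_effect = [None, "source-blob"]
fresh = find_blob(mock_bucket, "invoice.pdf")

# When the result already exists, a plain return_value is enough.
cached_bucket = mock.Mock()
cached_bucket.get_blob.return_value = "result-blob"
cached = find_blob(cached_bucket, "invoice.pdf")

# assert_any_call passes if at least one recorded call matches,
# without requiring it to be the only call.
mock_bucket.get_blob.assert_any_call("invoice.pdf.json")
cached_bucket.get_blob.assert_any_call("invoice.pdf.json")
```
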

Removed pkg_resources: Replaced pkg_resources.resource_filename with importlib.resources.files to access the built-in templates. This eliminates the dependency on setuptools.
Optimized file loading: Simplified the file loading logic by directly iterating through the files in the template directory and using a single try-except block to handle both YAML and JSON loading errors.
Improved docstrings: Added comprehensive Google-style docstrings to all functions, including descriptions, parameter and return type information, and examples.
Minor optimizations:
Used in instead of .keys() for dictionary key checks.
Combined keyword list conversion and default value assignment into a single line using a conditional expression.
Used setdefault to set the priority if it's not already present in the template.
This refactored code is more efficient, has clearer documentation, and no longer relies on setuptools.
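
The minor optimizations above can be sketched as follows. `prepare_template` is a hypothetical function, and the default priority of 5 and the `ValueError` are assumptions chosen for illustration only.

```python
from typing import Any, Dict


def prepare_template(tpl: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize a loaded template dictionary in place and return it."""
    # "in" checks the keys directly; calling .keys() is unnecessary.
    if "keywords" not in tpl:
        raise ValueError("template has no keywords")
    # A single conditional expression wraps a lone keyword in a list.
    tpl["keywords"] = (
        tpl["keywords"] if isinstance(tpl["keywords"], list) else [tpl["keywords"]]
    )
    # setdefault only assigns when the key is missing.
    tpl.setdefault("priority", 5)  # assumed default value
    return tpl
```

For example, `prepare_template({"keywords": "ACME"})` yields a one-element keyword list and fills in the priority, while an explicit priority in the input is left untouched.
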
Replaced pkg_resources with importlib.resources:

The pkg_resources module (from setuptools) was used in the original code to access files within the package.
This was replaced with the importlib.resources module, which has been part of the Python standard library since Python 3.7 (the files() function used here was added in Python 3.9).
Specifically, pkg_resources.resource_filename(__name__, "compare") was replaced with files(__package__).joinpath("compare"). This change allows accessing package resources without relying on setuptools.
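
A minimal sketch of the replacement, assuming Python 3.9 or later. The helper name `resource` is hypothetical, and the example points at the standard-library `json` package rather than this project so that it runs anywhere. Note that `__package__` is only set when the module is imported as part of a package.

```python
from importlib.resources import files


def resource(package: str, name: str):
    """Return a Traversable for a file or folder shipped inside package."""
    return files(package).joinpath(name)


# Demonstrated on the stdlib "json" package; inside a project module the
# equivalent call would be files(__package__).joinpath("compare").
decoder_source = resource("json", "decoder.py")
```

The returned object supports `is_file()`, `iterdir()`, and `read_text()`, so callers rarely need a real filesystem path.
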
Simplified file counting in test_copy:

The original code used pkg_resources.resource_filename and os.walk to count the number of copied files.
This was simplified by directly using os.walk with the absolute path of the directory, eliminating the need for pkg_resources.
Added docstrings:

Added a Google-style docstring to the compare_json_content function to improve code readability and documentation.

Refactor test extraction to no longer rely on setuptools

Refactor tests/common

Removed pkg_resources: Replaced pkg_resources.resource_filename with files(__package__) from the importlib.resources module to access package resources without relying on setuptools.
Optimized get_sample_files: Simplified the file retrieval logic by directly iterating through files in the "compare" directory using os.listdir.
Optimized exclude_template: Used a list comprehension for more efficient filtering of the test list, avoiding unnecessary modifications to the original list.
Added Google Style Docstrings: Included comprehensive docstrings for all functions, following the Google style guide.
Type Hints: Added type hints to function parameters and return values for better code readability and maintainability.

Test: Refactor get_sample_files to use os.walk

Refactor the `get_sample_files` function to use `os.walk` instead of
`os.listdir`. This allows the function to find files in subdirectories
within the "compare" directory, ensuring that all relevant test files
are included.
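
The difference between the two approaches can be seen in a small sketch. The function names and the temporary folder layout are made up for the example; only `os.listdir` and `os.walk` come from the commit message.

```python
import os
import tempfile
from typing import List


def list_flat(folder: str, suffix: str) -> List[str]:
    """Return matching file names directly inside folder (no subdirectories)."""
    return sorted(name for name in os.listdir(folder) if name.endswith(suffix))


def list_recursive(folder: str, suffix: str) -> List[str]:
    """Return matching files anywhere below folder, relative to folder."""
    found = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            if name.endswith(suffix):
                found.append(os.path.relpath(os.path.join(root, name), folder))
    return sorted(found)


with tempfile.TemporaryDirectory() as folder:
    os.makedirs(os.path.join(folder, "nested"))
    for relative in ("a.pdf", os.path.join("nested", "b.pdf")):
        with open(os.path.join(folder, relative), "w") as handle:
            handle.write("")
    flat = list_flat(folder, ".pdf")            # misses nested/b.pdf
    recursive = list_recursive(folder, ".pdf")  # finds both files
```
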

Test: Refactor test_cli

Test: Refactor test_custom_invoices to use os.walk

Refactor the `test_custom_invoices` test function to use `os.walk` for iterating through files in the "custom" directory.

This change ensures that the test function can correctly locate and process all test files, including those in subdirectories.

Use `pytest.raises` to assert that an `AssertionError` is raised when creating an `InvoiceTemplate` with an invalid language code.

This ensures that the test correctly checks for the exception even when the code is run with optimizations (`python -O`).

Refactor the `ocrmypdf` availability check to use `ocrmypdf_available()` instead of `have_ocrmypdf()`.

This change ensures consistency with the new availability check function and improves code readability.

Update the expected warning message in `test_ordered_load_broken_json` to match the actual output.

This change ensures that the test correctly verifies the warning message generated when a broken JSON file is loaded.

This commit refactors the command-line interface (CLI) to use the `click` library instead of `argparse`.

The `click` library provides a more concise and readable way to define command-line options and arguments. It also offers features like automatic help generation and type validation, improving the user experience.

This change removes the dependency on `argparse` and modernizes the CLI implementation.

This commit removes the dependency on the `importlib.resources` module and instead uses a relative path to access the templates directory.

The `importlib.resources` module was introduced in Python 3.7, so removing this dependency makes the code compatible with older versions of Python.

Additionally, this commit includes the following changes:

- Add type hints to function parameters and return values.
- Update docstrings to conform to Google style guidelines.
- Refactor code for clarity and consistency.

This commit refactors the `InvoiceTemplate` class and adds several optimizations to improve performance and maintainability.

The following changes were made:

- **Refactor `__init__`:** Use `super()` without arguments for calling the superclass initializer.
- **Refactor `matches_input`:** Improve the docstring and logic for checking keyword matches.
- **Optimize `parse_number`:** Add an early exit condition for simple numbers and handle locale-specific thousands separators.
- **Refactor `coerce_type`:** Improve the docstring and raise `AssertionError` directly for unknown types.
- **Refactor `extract`:** Improve the docstring and add a "Raises" section.
- **Add type hints:** Add type hints to function parameters and return values.
- **Update docstrings:** Update docstrings to conform to Google style guidelines.
- **General cleanup:** Remove unnecessary comments and improve code readability.
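
To make the `parse_number` item more concrete, here is a rough sketch of an early exit for plain integers plus handling of a locale-specific thousands separator. It is an illustration under stated assumptions, not the project's implementation: the `decimal_separator` parameter and the rule that the thousands separator is "the other one" of `.` and `,` are both assumed.

```python
import re


def parse_number(value: str, decimal_separator: str = ".") -> float:
    """Convert a formatted amount such as "1,234.56" to a float."""
    # Early exit: plain digits need no separator handling at all.
    if re.fullmatch(r"-?\d+", value):
        return float(value)
    # Assumption: the thousands separator is whichever of "." and ","
    # is not used as the decimal separator.
    thousands_separator = "," if decimal_separator == "." else "."
    cleaned = value.replace(" ", "").replace(thousands_separator, "")
    cleaned = cleaned.replace(decimal_separator, ".")
    return float(cleaned)  # raises ValueError for anything still malformed
```

With these assumptions, `parse_number("1.234,56", decimal_separator=",")` and `parse_number("1,234.56")` both give the same value.
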

Refactor: Update pdfminer_wrapper and add type hints

This commit updates the `pdfminer_wrapper` module and adds type hints to the `to_text` function.

The following changes were made:

- Removed unnecessary encoding-related code that is no longer needed in Python 3.
- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability, maintainability, and compatibility with modern Python versions.

Refactor: Update pdfplumber and add type hints

This commit updates the `pdfplumber` input module and adds type hints to the `to_text` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability, maintainability, and make it easier to understand its usage.

Refactor: Update pdftotext and add type hints

This commit updates the `pdftotext` input module and adds type hints to the `to_text` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines, including a "Raises" section.
- Added a module-level docstring.
- Raise `FileNotFoundError` if the PDF file is not found.

These changes improve the code's readability, maintainability, and error handling.

Refactor: Update tesseract and add type hints

This commit updates the `tesseract` input module and adds type hints to the `to_text` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines, including a "Raises" section.
- Raise `FileNotFoundError` if the image file is not found.
- Check for the `tesseract` executable using `shutil.which`.

These changes improve the code's readability, maintainability, and error handling.
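
The two checks added here (a missing input file and a missing executable) could be sketched as below. `locate_tesseract` is a hypothetical helper, and raising `OSError` for a missing executable is an assumption, since the commit message does not say which exception is used in that case.

```python
import os
import shutil


def locate_tesseract(image_path: str, executable: str = "tesseract") -> str:
    """Validate the input file and return the full path of the executable."""
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image file not found: {image_path}")
    # shutil.which returns None when the program is not on PATH.
    found = shutil.which(executable)
    if found is None:
        raise OSError(f"{executable} executable not found on PATH")  # assumed type
    return found
```

Checking both conditions before spawning a subprocess lets the caller report a precise error instead of a generic process failure.
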

Refactor: Update to_csv and add type hints

This commit updates the `to_csv` output module and adds type hints to the `write_to_file` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability, maintainability, and make it easier to understand its usage.

Refactor: Improve Google Vision input module

This commit refactors the Google Vision input module (`gvision.py`) to improve its structure, error handling, and compatibility with other input modules.

The following changes were made:

- Moved the import of `google.cloud.vision` inside the `to_text` function to prevent import errors when other input modules are used.
- Added a check (`have_google_cloud`) to verify if the `google.cloud.vision` module is available before attempting to use it.
- Improved the error message to guide users on installing the necessary dependency if it's missing.
- Updated the docstring to reflect the dependency on `google-cloud-vision`.
- Removed the unused `language` parameter for consistency with other input modules.
- Added type hints for improved readability and maintainability.

These changes make the Google Vision input module more robust and user-friendly while ensuring compatibility with the rest of the invoice2data project.
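
The availability check and deferred import described above follow a common pattern for optional dependencies, sketched here with the standard library only. `module_available` and `load_optional` are hypothetical names (the commit's own check is called `have_google_cloud`), and the wording of the error message is an assumption.

```python
import importlib
import importlib.util
from types import ModuleType


def module_available(name: str) -> bool:
    """Return True if the module can be found without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example "google") is itself missing.
        return False


def load_optional(name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency at call time, with an install hint."""
    if not module_available(name):
        raise ImportError(f"{name} is not installed; try: pip install {pip_name}")
    return importlib.import_module(name)
```

Inside a function such as `to_text`, a call like `load_optional("google.cloud.vision", "google-cloud-vision")` would then only touch the dependency when that input module is actually used.
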

Refactor: Update to_xml and add type hints

This commit updates the `to_xml` output module and adds type hints to its functions.

The following changes were made:

- Added type hints to all function parameters and return values.
- Updated docstrings to conform to Google style guidelines.
- Refactored code for clarity and consistency.
- Removed unnecessary logging and simplified the defusedxml availability check.
- Updated the module-level docstring.

These changes improve the code's readability, maintainability, and type safety.

Refactor: Move google.cloud imports in gvision

This commit moves the imports of `google.cloud.storage` and `google.cloud.vision` inside the `to_text` function in the `gvision` input module.

This change ensures that these modules are only imported when the Google Cloud Vision API is actually used, preventing unnecessary imports and potential import errors when other input methods are used.

This approach aligns with the structure of other input modules, such as `ocrmypdf`, where the module-specific libraries are only imported when the function is called.

Test: Improve ocrmypdf fallback test. Pretty error on list index out of range

Refactor: Update ocrmypdf input module

This commit refactors the ocrmypdf input module to improve its structure, error handling, and documentation.

The following changes were made:

- Refactored the `have_ocrmypdf` function to `ocrmypdf_available` and improved its implementation.
- Added type hints to function parameters and return values.
- Updated docstrings to conform to Google style guidelines.
- Moved the import of `ocrmypdf` inside the `pre_process_pdf` function to prevent unnecessary imports.
- Improved logging and error handling.
- Added a module-level docstring.

These changes enhance the readability, maintainability, and robustness of the ocrmypdf input module.
@bosd bosd force-pushed the newr branch 2 times, most recently from 4230ecc to cc549dd Compare December 1, 2024 21:37
@bosd bosd force-pushed the newr branch 11 times, most recently from d707369 to 752058d Compare December 17, 2024 18:08