Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Exception Handling during website-screenshot fallback and several fixes for pydantic ValidationErrors #115

Merged
merged 6 commits into from
Sep 26, 2024

Conversation

Criamos
Copy link
Contributor

@Criamos Criamos commented Sep 26, 2024

This PR includes the following changes:

  • feat: improve robustness of website screenshot fallback in thumbnail pipeline for edge-cases when neither thumbnail URL nor website URL is available for rendering via a headless browser
    • this edge-case typically happened during crawls of large datasets (e.g. sodix_spider) under two circumstances:
      • the item itself did not provide a valid (or reachable) thumbnail URL, therefore the pipeline tried to fall back to a website screenshot
      • when the first URL in LOM Technical location (cclom:location) would point towards a binary file (e.g. .mp3 or .mp4), the headless browser wouldn't be able to render / take a website screenshot either, causing an AttributeError Exception when trying to access the screenshot_bytes-object (expected type: bytes, but received None)
    • future crawls should handle these edge-cases more gracefully and try a second fallback if a second URL was stored within LOM Technical location
      • if neither fallback 1 nor 2 were successful, the item will be uploaded without a thumbnail
  • fix: several fixes for pydantic ValidationErrors during POST requests to edu-sharing
    • LOM Technical duration: int values are wrapped in a string before submitting the item
    • BaseItem.hash: making sure that (old) crawlers which returned an int-value in getHash() are typecast to str before submitting the item
    • LOM General aggregationLevel: int values are wrapped in a string before submitting the item
  • fix: add missing license.internal mapping for "NONPUBLIC" license values
  • style: code formatting via black to increase readability of pipelines.py

PS: Thank you, Constantin, for the useful debug logs!

…_connector (expected type: str)

- feat: additional check in the pipelines for "lom.technical.duration"-values and conversion to string
background:
- some old crawlers (e.g. merlin_spider) returned an Integer value in their "getHash()"-method, which now causes pydantic ValidationErrors with the new API client (since the API expects a string value for this property)
- this fixes pydantic ValidationErrors in the es_connector during POST requests
…peline

- fix AttributeErrors when a website-screenshot fails:
  - if the first website-screenshot fails (e.g. when the first URL in LOM technical location points towards a .mp3 or .mp4), screenshot_bytes won't be available to work with
  - this caused an AttributeError when the pipeline tried to convert a "None"-type to a thumbnail
- refactor: extracted the functionality of taking a website screenshot with playwright in the pipeline into its own method
- feat: second website screenshot fallback
  - if a second URL is available in LOM technical location, the pipeline will try to take a screenshot of that URL before finally giving up
  - this could happen in edge-cases where the first URL in LOM technical location is unavailable for website-screenshots, but the array contains a second URL that might be more fruitful for a screenshot (e.g. a landing page)
- fix: 2 warnings w.r.t. too broad exception clauses
@Criamos Criamos added bug Something isn't working enhancement New feature or request labels Sep 26, 2024
@Criamos Criamos marked this pull request as ready for review September 26, 2024 19:11
@Criamos Criamos merged commit 6c196ab into develop Sep 26, 2024
3 checks passed
@Criamos Criamos deleted the 2024-09-26-fixes branch September 26, 2024 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant