Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Seed-level video capture setting handling + Job-level PDF-only option #288

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

gretchenleighmiller
Copy link

@gretchenleighmiller gretchenleighmiller commented Sep 12, 2024

This PR covers the following:

  • Adds a new video_capture configuration option on the Seed level. This has four possible values; see newly added documentation for details.
  • Implements the video_capture option, which impacts yt-dlp extraction and MIME types of outlinks. The remainder of video capture handling is accomplished in warcprox via Warcprox-Meta headers.
  • Removes the previous file-based skip_av_seeds functionality.
  • Adds a new pdfs_only configuration option on the Job level. This is a boolean that defaults to False; see newly added documentation for details.
  • Implements the pdfs_only option based on the MIME type of outlinks. The remaining of PDF-only filtering is accomplished in warcprox via Warcprox-Meta headers.
  • Documentation updates and minor style cleanup.

@gretchenleighmiller gretchenleighmiller changed the title Gmiller/2950 skip ytdlp Implement Seed-level video capture setting handling Sep 12, 2024
@gretchenleighmiller gretchenleighmiller changed the title Implement Seed-level video capture setting handling Implement Seed-level video capture setting handling + Job-level PDF-only option Sep 20, 2024
@gretchenleighmiller gretchenleighmiller marked this pull request as ready for review September 20, 2024 23:45
@@ -250,7 +249,17 @@ def brozzle_page(

if not self._needs_browsing(page_headers):
self.logger.info("needs fetch: %s", page)
self._fetch_url(site, page=page)
if site.video_capture in [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with if and elif, shouldn't pdfs_only should get checked first?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should have the correct behavior either way.

The first conditional is checking that 1) there is a rule to limit video capture by MIME type AND 2) that the MIME type specified in the Content-Type header of the response indicates that it's a video. Both have to be true to enter that branch, so we aren't going to accidentally skip the PDF conditional if Content-Type is a PDF.

The second conditional is similarly constrained on the PDF-only option being set and the Content-Type header specifying that the response is a PDF.

I would not expect the order to matter, but maybe I'm missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I did miss self._is_video_type(page_headers) initially.

I still like starting with the pdfs_only check — it seems simpler, and encompasses any limiting VideoCaptureOptions enabled.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swapped them.

job-conf.rst Outdated
+=========+==========+===========+
| boolean | no | ``false`` |
+---------+----------+-----------+
Limits capture to PDFs based on MIME type. This value will only impact
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current code changes only processing of page-level urls with PDF-ish content-type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting a better way of wording this in the documentation? It is called out here that this option specifically has limited impact and must be paired with a warcprox_meta rule to further filter by MIME type.

Do you think "PDF-ish" is better phrasing given the ambiguity of MIME type in Content-Type headers? I would expect MIME type to be pretty straightforward in this case given that PDFs are a pretty well-defined and well-known format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this would be better, I think:

"Limits captures to PDFs, based on page's content-type header.
This value, for now, affects only brozzler's capture of page-level urls.

Note: fully limiting a crawl to PDFs only will also require updates to brozzler's Warcprox-Meta header and warcprox."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded slightly for clarity.

job-conf.rst Outdated Show resolved Hide resolved
job-conf.rst Show resolved Hide resolved
brozzler/worker.py Outdated Show resolved Hide resolved
Copy link
Contributor

@galgeek galgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gretchenleighmiller, thanks for your work on this!

I've left a couple of comments it might be good to address.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants