Implement Seed-level video capture setting handling + Job-level PDF-only option #288

gretchenleighmiller · 2024-09-12T23:54:33Z

This PR covers the following:

Adds a new video_capture configuration option on the Seed level. This has four possible values; see newly added documentation for details.
Implements the video_capture option, which impacts yt-dlp extraction and MIME types of outlinks. The remainder of video capture handling is accomplished in warcprox via Warcprox-Meta headers.
Removes the previous file-based skip_av_seeds functionality.
Adds a new pdfs_only configuration option on the Job level. This is a boolean that defaults to False; see newly added documentation for details.
Implements the pdfs_only option based on the MIME type of outlinks. The remainder of PDF-only filtering is accomplished in warcprox via Warcprox-Meta headers.
Documentation updates and minor style cleanup.
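
As a rough illustration of where the two new options sit, here is a minimal sketch in Python. The dict layout, the helper names, and the placeholder `video_capture` value are assumptions made for this example only; the real accepted values are listed in the documentation added by this PR, and the default for `video_capture` is not stated here, so the sketch simply returns `None` when it is absent.

```python
# Hypothetical job configuration expressed as a Python dict.
# pdfs_only is a Job-level boolean; video_capture is set per Seed.
job_conf = {
    "id": "example-job",
    "pdfs_only": True,
    "seeds": [
        # Placeholder value: see the job configuration docs for the four real values.
        {"url": "https://example.com/", "video_capture": "SOME_DOCUMENTED_VALUE"},
        {"url": "https://example.org/"},
    ],
}


def job_pdfs_only(conf: dict) -> bool:
    """Return the Job-level pdfs_only flag, defaulting to False when unset."""
    return bool(conf.get("pdfs_only", False))


def seed_video_capture(seed: dict):
    """Return the Seed-level video_capture value, or None when unset."""
    return seed.get("video_capture")


if __name__ == "__main__":
    print(job_pdfs_only(job_conf))
    for seed in job_conf["seeds"]:
        print(seed["url"], seed_video_capture(seed))
```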


galgeek · 2024-09-23T20:03:14Z

brozzler/worker.py

@@ -250,7 +249,17 @@ def brozzle_page(

        if not self._needs_browsing(page_headers):
            self.logger.info("needs fetch: %s", page)
-            self._fetch_url(site, page=page)
+            if site.video_capture in [


with if and elif, shouldn't pdfs_only get checked first?

It should have the correct behavior either way.

The first conditional is checking that 1) there is a rule to limit video capture by MIME type AND 2) that the MIME type specified in the Content-Type header of the response indicates that it's a video. Both have to be true to enter that branch, so we aren't going to accidentally skip the PDF conditional if Content-Type is a PDF.

The second conditional is similarly constrained on the PDF-only option being set and the Content-Type header specifying that the response is a PDF.

I would not expect the order to matter, but maybe I'm missing something?

thanks! I did miss self._is_video_type(page_headers) initially.

I still like starting with the pdfs_only check — it seems simpler, and encompasses any limiting VideoCaptureOptions enabled.

Swapped them.
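
The order-independence argument in this thread can be sketched as follows. This is not the code in brozzler/worker.py: the function names, the boolean parameters, and the simplified Content-Type checks are assumptions for illustration. It models the two conditionals as described above (each requires both an enabled option and a matching Content-Type) and checks that either ordering selects the same branch.

```python
from itertools import product


def is_video_type(content_type: str) -> bool:
    """Simplified stand-in: treat any video/* MIME type as video."""
    return content_type.split(";")[0].strip().lower().startswith("video/")


def is_pdf(content_type: str) -> bool:
    """Simplified stand-in: treat application/pdf as PDF."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"


def video_first(limit_video: bool, pdfs_only: bool, content_type: str) -> str:
    """Check the video rule first, then the PDF rule."""
    if limit_video and is_video_type(content_type):
        return "video rule"
    elif pdfs_only and is_pdf(content_type):
        return "pdf rule"
    return "no rule"


def pdf_first(limit_video: bool, pdfs_only: bool, content_type: str) -> str:
    """Check the PDF rule first, then the video rule."""
    if pdfs_only and is_pdf(content_type):
        return "pdf rule"
    elif limit_video and is_video_type(content_type):
        return "video rule"
    return "no rule"


CONTENT_TYPES = ["video/mp4", "application/pdf", "text/html; charset=utf-8"]

# A single Content-Type cannot be both video and PDF, so the two
# conditions never hold together and the ordering cannot change the result.
for limit_video, pdfs_only, ct in product([False, True], [False, True], CONTENT_TYPES):
    assert video_first(limit_video, pdfs_only, ct) == pdf_first(limit_video, pdfs_only, ct)
```

Under these assumptions the choice of which check comes first is purely a readability preference, which matches the outcome of the thread.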

galgeek · 2024-09-23T21:22:50Z

job-conf.rst

+=========+==========+===========+
+| boolean | no       | ``false`` |
+---------+----------+-----------+
+Limits capture to PDFs based on MIME type. This value will only impact


The current code only changes the processing of page-level URLs with a PDF-ish Content-Type.

Are you suggesting a better way of wording this in the documentation? It is called out here that this option specifically has limited impact and must be paired with a warcprox_meta rule to further filter by MIME type.

Do you think "PDF-ish" is better phrasing given the ambiguity of MIME type in Content-Type headers? I would expect MIME type to be pretty straightforward in this case given that PDFs are a pretty well-defined and well-known format.

Something like this would be better, I think:

"Limits captures to PDFs, based on page's content-type header.
This value, for now, affects only brozzler's capture of page-level urls.

Note: fully limiting a crawl to PDFs only will also require updates to brozzler's Warcprox-Meta header and warcprox."

Reworded slightly for clarity.
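
On the question of how ambiguous a Content-Type header can be for PDFs, here is a minimal sketch of extracting the MIME type while ignoring parameters and letter case. It uses the standard library's `email.message.Message` parser. The helper names are hypothetical, and this is not necessarily how brozzler performs the check.

```python
from email.message import Message
from typing import Optional


def mime_type(content_type_header: Optional[str]) -> Optional[str]:
    """Return the lowercased maintype/subtype, or None if no header was sent."""
    if not content_type_header:
        return None
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters such as charset and lowercases the type.
    return msg.get_content_type()


def looks_like_pdf(content_type_header: Optional[str]) -> bool:
    """True only when the header's MIME type is exactly application/pdf."""
    return mime_type(content_type_header) == "application/pdf"


if __name__ == "__main__":
    for header in ["application/pdf", "Application/PDF; qs=0.5", "text/html; charset=utf-8", None]:
        print(repr(header), looks_like_pdf(header))
```

A check like this relies only on what the server declares, so a PDF served with a generic type such as application/octet-stream would not be recognized.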

job-conf.rst

brozzler/worker.py

galgeek

@gretchenleighmiller, thanks for your work on this!

I've left a couple of comments it might be good to address.

Gretchen Miller added 3 commits September 12, 2024 16:45

WT-2950 remove skip_av_seeds

d9ed5c4

WT-2950 replace skip_ytdlp with video_capture

eb227b0

WT-2950 invert conditionals to PEP8 preferred code style (E713)

c3a92b1

gretchenleighmiller changed the title ~~Gmiller/2950 skip ytdlp~~ Implement Seed-level video capture setting handling Sep 12, 2024

Gretchen Miller added 5 commits September 13, 2024 13:33

WT-2950 exclude video file types if site has disabled video capture

c722549

small ruff formatting pass

77e6b9e

WT-2950 video capture options enum

66263f0

another tiny ruff format pass

8275f3e

WT-2950 cleaning up video capture options handling; PDFs only handlin…

dca9630

…g on outlinks

gretchenleighmiller changed the title ~~Implement Seed-level video capture setting handling~~ Implement Seed-level video capture setting handling + Job-level PDF-only option Sep 20, 2024

WT-2950 documentation + better conf handling + linting

41aab1a

gretchenleighmiller requested review from avdempsey and galgeek September 20, 2024 23:44

gretchenleighmiller marked this pull request as ready for review September 20, 2024 23:45

WT-2950 fix RST formatting

6fdc2b9

galgeek reviewed Sep 23, 2024

job-conf.rst (outdated, resolved)

galgeek reviewed Sep 23, 2024

job-conf.rst (resolved)

galgeek reviewed Sep 23, 2024

brozzler/worker.py (outdated, resolved)

galgeek requested changes Sep 23, 2024


Gretchen Miller added 4 commits September 23, 2024 16:38

WT-2950 fix typos

12db06a

WT2590 addressing PR feedback

36b17d2

WT-2950 update job schema

720601a

WT-2950 update job schema pt. 2

9c77961


