-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Seed-level video capture setting handling + Job-level PDF-only option #288
base: master
Are you sure you want to change the base?
Changes from 10 commits
d9ed5c4
eb227b0
c3a92b1
c722549
77e6b9e
66263f0
8275f3e
dca9630
41aab1a
6fdc2b9
12db06a
36b17d2
720601a
9c77961
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -107,6 +107,18 @@ Puts a cap on the number of sites belonging to a given job that can be brozzled | |
simultaneously across the cluster. Addresses the problem of a job with many | ||
seeds starving out other jobs. | ||
|
||
``pdfs_only`` | ||
galgeek marked this conversation as resolved.
Show resolved
Hide resolved
|
||
~~~~~~~~~~~~~~~~~~~~~ | ||
+---------+----------+-----------+ | ||
| type | required | default | | ||
+=========+==========+===========+ | ||
| boolean | no | ``false`` | | ||
+---------+----------+-----------+ | ||
Limits capture to PDFs based on MIME type. This value will only impact | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Current code changes only processing of page-level urls with PDF-ish content-type. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you suggesting a better way of wording this in the documentation? It is called out here that this option specifically has limited impact and must be paired with a Do you think "PDF-ish" is better phrasing given the ambiguity of MIME type in Content-Type headers? I would expect MIME type to be pretty straightforward in this case given that PDFs are a pretty well-defined and well-known format. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Something like this would be better, I think: "Limits captures to PDFs, based on page's content-type header. Note: fully limiting a crawl to PDFs only will also require updates to brozzler's Warcprox-Meta header and warcprox." There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reworded slightly for clarity. |
||
processing of outlinks within Brozzler. Fully limiting a crawl to only PDFs | ||
requires an additional entry in the Warcprox-Meta header ``mime-type-filters`` | ||
key to fully block videos by MIME type. | ||
gretchenleighmiller marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
``seeds`` | ||
~~~~~~~~~ | ||
+------------------------+----------+---------+ | ||
|
@@ -158,6 +170,25 @@ other fields like checkboxes and/or hidden fields, brozzler will leave | |
the default values in place. Brozzler submits login forms after page load. | ||
Then brozzling proceeds as usual. | ||
|
||
``video_capture`` | ||
~~~~~~~~~~~~~~~~~ | ||
+--------+----------+--------------------------+ | ||
| type | required | default | | ||
+========+==========+==========================+ | ||
| string | yes | ``ENABLE_VIDEO_CAPTURE`` | | ||
+--------+----------+--------------------------+ | ||
Determines the level of video capture for the seed. This is an enumeration with four possible values: | ||
|
||
* ENABLE_VIDEO_CAPTURE (default): All video is captured. | ||
* DISABLE_VIDEO_CAPTURE: No video is captured. This is effectively a | ||
combination of the next two values. | ||
* BLOCK_VIDEO_MIME_TYPES: Any response with a Content-Type header containing | ||
the word "video" is not captured. | ||
* DISABLE_YTDLP_CAPTURE: Video capture via yt-dlp is disabled. | ||
|
||
*Note: Ensuring full video MIME type blocking requires an additional entry in | ||
the Warcprox-Meta header `mime-type-filters` key.* | ||
|
||
Seed-level / top-level settings | ||
------------------------------- | ||
These are seed settings that can also be specified at the top level, in which | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with if and elif, shouldn't pdfs_only should get checked first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should have the correct behavior either way.
The first conditional is checking that 1) there is a rule to limit video capture by MIME type AND 2) that the MIME type specified in the Content-Type header of the response indicates that it's a video. Both have to be true to enter that branch, so we aren't going to accidentally skip the PDF conditional if Content-Type is a PDF.
The second conditional is similarly constrained on the PDF-only option being set and the Content-Type header specifying that the response is a PDF.
I would not expect the order to matter, but maybe I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! I did miss
self._is_video_type(page_headers)
initially.I still like starting with the pdfs_only check — it seems simpler, and encompasses any limiting VideoCaptureOptions enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swapped them.