-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Seed-level video capture setting handling + Job-level PDF-only option #288
base: master
Are you sure you want to change the base?
Conversation
brozzler/worker.py
Outdated
@@ -250,7 +249,17 @@ def brozzle_page( | |||
|
|||
if not self._needs_browsing(page_headers): | |||
self.logger.info("needs fetch: %s", page) | |||
self._fetch_url(site, page=page) | |||
if site.video_capture in [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with if and elif, shouldn't pdfs_only should get checked first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should have the correct behavior either way.
The first conditional is checking that 1) there is a rule to limit video capture by MIME type AND 2) that the MIME type specified in the Content-Type header of the response indicates that it's a video. Both have to be true to enter that branch, so we aren't going to accidentally skip the PDF conditional if Content-Type is a PDF.
The second conditional is similarly constrained on the PDF-only option being set and the Content-Type header specifying that the response is a PDF.
I would not expect the order to matter, but maybe I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! I did miss self._is_video_type(page_headers)
initially.
I still like starting with the pdfs_only check — it seems simpler, and encompasses any limiting VideoCaptureOptions enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swapped them.
job-conf.rst
Outdated
+=========+==========+===========+ | ||
| boolean | no | ``false`` | | ||
+---------+----------+-----------+ | ||
Limits capture to PDFs based on MIME type. This value will only impact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code changes only processing of page-level urls with PDF-ish content-type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting a better way of wording this in the documentation? It is called out here that this option specifically has limited impact and must be paired with a warcprox_meta
rule to further filter by MIME type.
Do you think "PDF-ish" is better phrasing given the ambiguity of MIME type in Content-Type headers? I would expect MIME type to be pretty straightforward in this case given that PDFs are a pretty well-defined and well-known format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this would be better, I think:
"Limits captures to PDFs, based on page's content-type header.
This value, for now, affects only brozzler's capture of page-level urls.
Note: fully limiting a crawl to PDFs only will also require updates to brozzler's Warcprox-Meta header and warcprox."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded slightly for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gretchenleighmiller, thanks for your work on this!
I've left a couple of comments it might be good to address.
This PR covers the following:
video_capture
configuration option on the Seed level. This has four possible values; see newly added documentation for details.video_capture
option, which impactsyt-dlp
extraction and MIME types of outlinks. The remainder of video capture handling is accomplished in warcprox viaWarcprox-Meta
headers.skip_av_seeds
functionality.pdfs_only
configuration option on the Job level. This is a boolean that defaults toFalse
; see newly added documentation for details.pdfs_only
option based on the MIME type of outlinks. The remaining of PDF-only filtering is accomplished in warcprox viaWarcprox-Meta
headers.