warcproxy context manager? #18
Comments
We've written a warcprox context manager for Social Feed Manager. In our case, we instantiate warcprox as a separate process rather than a [...] (In reply to Pascal Jürgens, Sat, May 7, 2016 at 4:34 PM.) |
@justinlittman Thanks, that's pretty much like what I had in mind! By the way, I had read about twarc but didn't know about SFM. Looks pretty cool! |
This is great, thanks for the suggestion. I'll need to take a closer look to see where in the code it would live most comfortably. For the questions about integrating outside changes, and 2.x, I opened #19. I'll comment more over there. |
@trifle I am curious about the use case with the context manager. I am working on a generalized component architecture for web archiving which will include a recording proxy, and it would be great to understand your particular use case with the context manager. (A screenshot creation workflow is something that I'd especially like to include.) I think the traditional approach is to start the proxy running in the background and have it record into a WARC (or several WARCs) over a period of time. When is it necessary to create a new proxy, wrapped in a context manager, for each request? Is it to create a new WARC for each request? Is it necessary to turn off the proxy for some other reason? Or is it just for a one-off task? I'm guessing that it is to have more control over which WARC a request is recorded to, but perhaps there are other reasons. |
In the case of Social Feed Manager, it is for control over which WARC a request is recorded to. |
@ikreymer I'd certainly love to see such an architecture! (see #19) Yes, a context manager would be for creating WARC files for a very small number of requests. I guess the single-shot WARCs are a question of your scope: whereas in web archiving your base units might be sites/domains that are crawled in one go, some people require control, error handling and access at the page level. Projects such as perma will create one WARC per single webpage, since they archive individual articles and do one at a time. I'm a mass communication researcher (probably the same crowd that @justinlittman works with), which means that I routinely collect large (100s to 10000s) batches of articles spanning many domains. In most cases, the discovery process is not a crawl with a somewhat predictable frontier but rather external batch or stream events (think Twitter). Now, in such a situation it's often quite inconvenient to produce large WARCs: the grouping of incoming URLs by time and order is unpredictable (bursty) at capture time and probably differs a lot from the grouping needed at access time. This means that if I bundle records by domain while recording them but want to query across domains later on, that's going to be complicated. |
@trifle a pull request against 2.x would be welcome. A pull request against master would also be welcome. Whichever or both. :) |
In my case I have an automated setting with third-party software to perform the actual archival. I'm using warcprox mainly for convenience so I don't have to write my own WARC addition to the software. A context manager would be nice so that I don't have to spawn an additional process for the proxy and could just include it in my (Python-based) code. |
Hi,
I've used warcproxy indirectly through the perma project who (as you probably know) do the phantomjs + warcproxy dance to create archives.
While reading and modifying the code I noticed that the proxy usage pattern almost perfectly matches the use case of context managers:
Would you consider adding such a context manager to the warcproxy project? Adding it here should be the best fit, in case the class API would need to be modified.
PS: cc @jcushman since their code might benefit from this (hope you don't mind the ping)
A rough sketch of the idea looks like this (pasted together from perma.cc code and simplified, not actually runnable):
edit: Ah, and here is a simple usage example:
Now if that's not tidy I don't know what is!
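The general shape of such a context manager can be sketched with the standard library alone. This is only a minimal illustration under stated assumptions, not the perma.cc code and not warcprox's API: `http.server.ThreadingHTTPServer` stands in for the recording proxy, and `running_server` is a hypothetical name chosen for this example.

```python
import contextlib
import http.server
import threading


@contextlib.contextmanager
def running_server(handler_cls=http.server.BaseHTTPRequestHandler,
                   host="127.0.0.1"):
    """Run a server in a background thread for the duration of a with-block.

    Stand-in for a recording proxy; `running_server` is a hypothetical
    helper, not part of warcprox.
    """
    # Port 0 asks the operating system for any free port.
    server = http.server.ThreadingHTTPServer((host, 0), handler_cls)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The caller does its work (e.g. drives a browser) inside the block.
        yield server
    finally:
        # Runs on normal exit and on exceptions alike.
        server.shutdown()      # ask serve_forever() to return
        server.server_close()  # release the listening socket
        thread.join()          # wait for the background thread to finish
```

Usage would then look roughly like `with running_server() as server: host, port = server.server_address`, with the client pointed at that address inside the block. The `try`/`finally` is the point of the pattern: the server is torn down even if the code in the block raises.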