warcproxy context manager? #18

Open
trifle opened this issue May 7, 2016 · 11 comments
Comments

@trifle
Contributor

trifle commented May 7, 2016

Hi,

I've used warcproxy indirectly through the perma project who (as you probably know) do the phantomjs + warcproxy dance to create archives.

While reading and modifying the code I noticed that the proxy usage pattern almost perfectly matches the use case of context managers:

  • set up background scaffolding (the proxy)
  • hand over a handle to the relevant context variables (a class instance or at least the CA file location and the ip:port address)
  • pull down everything once finished (join the threads)

Would you consider adding such a context manager to the warcproxy project? Adding it here should be the best fit, in case the class API would need to be modified.

PS: cc @jcushman since their code might benefit from this (hope you don't mind the ping)

A rough sketch of the idea looks like this (pasted together from perma.cc code and simplified, not actually runnable):

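from contextlib import contextmanager
import queue
import threading

# Module paths assumed from warcprox 1.x as used by perma.cc; adjust for your version
from warcprox.controller import WarcproxController
from warcprox.warcprox import WarcProxy
from warcprox.warcwriter import WarcWriterThread

@contextmanager
def warc_proxy(*args, **kwargs):
    """
    Context manager for warcproxy
    """
    # Queue through which the proxy hands recorded URLs to the writer thread
    recorded_url_q = queue.Queue()
    # Set up proxy instance
    #  use kwargs with default arguments
    proxy = WarcProxy(
        server_address=('127.0.0.1', kwargs.get('port', 27500)),
        recorded_url_q=recorded_url_q,
    )
    writer_thread = WarcWriterThread(recorded_url_q=recorded_url_q)
    proxy.warcprox_controller = WarcproxController(proxy, writer_thread)
    proxy.warcprox_thread = threading.Thread(target=proxy.warcprox_controller.run_until_shutdown)
    proxy.warcprox_thread.start()

    try:
        # whatever we are yielding would need to carry all relevant data
        # such as adding the threads as instance attributes
        yield proxy
    finally:
        # tear down: signal the controller to stop, then wait for its thread
        proxy.warcprox_controller.stop.set()
        proxy.warcprox_thread.join()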

edit: Ah, and here is a simple usage example:

with warc_proxy(port=5000) as proxy:
    browser = setup_browser(ca=proxy.ca.ca_file, address=proxy.server_address)
    browser.do_stuff()
# proxy with all threads disappears at scope exit

Now if that's not tidy I don't know what is!
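For anyone who wants to try the pattern without warcprox installed, here is a minimal self-contained sketch of the same three steps (set up, hand over, tear down). It uses a plain stdlib HTTP server as a stand-in for the proxy; `background_server`, `_HelloHandler` and `fetch_once` are names made up for this illustration and are not part of warcprox.

```python
import threading
import urllib.request
from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the proxy's request handler."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


@contextmanager
def background_server(host="127.0.0.1", port=0):
    """Start a server in a background thread and stop it at scope exit."""
    # Set up: port=0 lets the OS pick a free port
    server = ThreadingHTTPServer((host, port), _HelloHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Hand over: the server object carries server_address for the caller
        yield server
    finally:
        # Tear down: stop the serve loop, close the socket, join the thread
        server.shutdown()
        server.server_close()
        thread.join()


def fetch_once():
    """Fetch one page from the stand-in server, bypassing any system proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with background_server() as server:
        host, port = server.server_address[:2]
        with opener.open(f"http://{host}:{port}/") as response:
            return response.read()
```

The `try`/`finally` around the `yield` is the important part: the teardown runs even if the body of the `with` block raises.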

@justinlittman

We've written a warcprox context manager for Social Feed Manager:
https://github.com/gwu-libraries/sfm-utils/blob/master/sfmutils/warcprox.py

In our case, we instantiate warcprox as a separate process rather than a
separate thread.
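The process-based variant can be sketched in a few lines as well. This is not the sfm-utils code (see the link above for that), just a generic illustration with `subprocess`; `background_process`, `STAND_IN_CMD` and `demo` are hypothetical names, and the stand-in command would need to be replaced by the actual warcprox command line for your setup.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def background_process(cmd, stop_timeout=10):
    """Run cmd as a child process and make sure it is gone at scope exit."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        proc.terminate()  # polite stop (SIGTERM on POSIX)
        try:
            proc.wait(timeout=stop_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # force stop if it does not exit in time
            proc.wait()


# Stand-in command (assumption): a child process that just sleeps
STAND_IN_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def demo():
    """Return (was running inside the block, has exited after the block)."""
    with background_process(STAND_IN_CMD) as proc:
        running = proc.poll() is None
    return running, proc.poll() is not None
```

Compared with a thread, a separate process keeps the proxy's state isolated from the caller, at the cost of passing configuration through the command line.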

(Reply sent by email, quoting Pascal Jürgens's opening post of Sat, May 7, 2016 in full.)

@trifle
Contributor Author

trifle commented May 8, 2016

@justinlittman Thanks, that's pretty much like what I had in mind!

By the way, I had read about twarc but didn't know about SFM. Looks pretty cool!
That said, I think the fact that almost everyone seems to need to fork warcproxy for their project is a sign that it might benefit from integrating changes back into the original project - at least that's what I'd love to see.

@justinlittman

Agree.

I recently noticed that @nlevitt has a mess of changes underway in #17. @nlevitt -- care to comment on the roadmap for 2?

@nlevitt
Contributor

nlevitt commented May 10, 2016

This is great, thanks for the suggestion. I'll need to take a closer look to see where in the code it would live most comfortably.

For the questions about integrating outside changes, and 2.x, I opened #19. I'll comment more over there.

@trifle
Contributor Author

trifle commented May 10, 2016

Great! Would you like a pull request written against #17, @nlevitt ?

@ikreymer
Contributor

@trifle I am curious about the use case with the context manager. I am working on a generalized component architecture for web archiving which will include a recording proxy, and it would be great to understand your particular use case with the context manager. (A screenshot creation workflow is something that I'd like to include especially).

I think the traditional approach is to start the proxy running in the background and have it record into a WARC (or several WARCs) over a period of time. When is it necessary to create a new proxy, wrapped in a context manager, for each request? Is it to create a new WARC for each request? Is it necessary to turn off the proxy for some other reason? Or is it just for a one-off task?

I'm guessing that it is to have more control over which WARC a request is recorded to, but perhaps there are other reasons.

@justinlittman

In the case of Social Feed Manager, it is for control over which WARC a request is recorded to.

@trifle
Contributor Author

trifle commented May 10, 2016

@ikreymer I'd certainly love to see such an architecture! (see #19)

Yes, a context manager would be for creating WARC files for a very small number of requests.

I guess the single-shot WARCs are a question of scope: in web archiving, your base units might be sites or domains that are crawled in one go, whereas some people require control, error handling and access at the page level.

Projects such as perma will create one WARC per single webpage, since they archive individual articles and do one at a time. I'm a mass communication researcher (probably the same crowd that @justinlittman works with), which means that I routinely collect large (100s to 10000s) batches of articles spanning many domains. In most cases, the discovery process is not a crawl with a somewhat predictable frontier but rather external batch or stream events (think twitter).

Now, in such a situation it's often quite inconvenient to produce large WARCs: the grouping of incoming URLs by time and order is probably unpredictable (bursty) and differs a lot between capture time and access time. This means that if I bundle records by domain while recording them but want to query across domains later on, that's going to be complicated.

@nlevitt
Contributor

nlevitt commented May 11, 2016

@trifle a pull request against 2.x would be welcome. A pull request against master would also be welcome. Whichever or both. :)

@nlevitt
Contributor

nlevitt commented May 11, 2016

@trifle @ilya 2.x supports a special request header called "warcprox-meta", which among a whole bunch of other things, lets you specify the name of the warc file (prefix actually, warcprox will add a serial number). That way you can write many small warcs using one long running warcprox process.
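A rough sketch of what a client could look like when using that header. Assumptions on my side: the header value is a JSON object and the key for the file name prefix is `warc-prefix` (check this against the API docs of your warcprox version), the proxy listens on `127.0.0.1:8000`, and only plain `http` URLs are handled (HTTPS would additionally need the proxy's CA certificate to be trusted). `warcprox_meta_headers` and `fetch_via_warcprox` are hypothetical helper names.

```python
import json
import urllib.request


def warcprox_meta_headers(warc_prefix):
    """Build the request header asking warcprox 2.x for a given WARC prefix."""
    # Assumed format: a JSON object with a "warc-prefix" key
    return {"Warcprox-Meta": json.dumps({"warc-prefix": warc_prefix})}


def fetch_via_warcprox(url, warc_prefix, proxy="127.0.0.1:8000"):
    """Fetch url through an already running warcprox at the assumed address."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    )
    request = urllib.request.Request(
        url, headers=warcprox_meta_headers(warc_prefix)
    )
    with opener.open(request) as response:
        return response.read()
```

With something like this, one long-running proxy can serve many small captures: each call picks its own prefix, for example one per article.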

@TheTechRobo
Contributor

In my case I have an automated setting with third-party software to perform the actual archival. I'm using warcprox mainly for convenience so I don't have to write my own WARC addition to the software. A context manager would be nice so that I don't have to spawn an additional process for the proxy and could just include it in my (Python-based) code.
