Improve WARC export speed #433

Open
tokee opened this issue Oct 26, 2023 · 0 comments

tokee commented Oct 26, 2023

The bottleneck for WARC export is the retrieval of raw content from the source WARCs. Currently this is done sequentially: one resource is retrieved and delivered, then the next retrieval is initiated, and so on. If all source WARCs are on a single local disk, that is about as efficient as it can get, but Netarchives are typically stored on networked storage systems, where parallel requests result in higher throughput. The problem is that delivery is by nature a single stream of data, which makes it non-trivial to use multiple concurrent sources.

Ideas:

  1. Do not enforce the order of delivered content: let X threads initiate content retrieval and synchronize on the delivery point. When a thread reaches the state where it is ready to transfer bytes ("connection established / file opened"), it waits until the delivery point is available. This would work well if resolving the WARCs is the primary bottleneck and the byte transfer is fast.
  2. Same as above, but with a buffer for each thread, so that small content is fully resolved before being delivered. This would probably give the best speed.
  3. Idea 2, but keeping the order of the elements, maybe with a queue of Futures?
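
A rough sketch of what idea 3 could look like, assuming each resource can be wrapped in a `Callable<byte[]>` that resolves and fully buffers it. The names `OrderedExportSketch`, `export` and `delayed` are made up for illustration and are not existing code in this project. Retrievals run concurrently on a fixed pool, while a FIFO queue of Futures keeps the delivery order. Because every record is buffered in full, this is only suitable for small records; large ones would need a size cap and a streaming fallback.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OrderedExportSketch {

    /**
     * Resolves up to {@code threads} sources concurrently, but writes the
     * results to {@code out} in the same order as {@code sources}.
     * Hypothetical helper: each Callable stands in for "retrieve one record".
     */
    static void export(List<Callable<byte[]>> sources, OutputStream out, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<Future<byte[]>> pending = new ArrayDeque<>();
        Iterator<Callable<byte[]>> it = sources.iterator();
        try {
            while (it.hasNext() || !pending.isEmpty()) {
                // Keep at most `threads` retrievals in flight to bound memory use.
                while (it.hasNext() && pending.size() < threads) {
                    pending.add(pool.submit(it.next()));
                }
                // Wait for the oldest retrieval, so output order matches input order.
                out.write(pending.remove().get());
            }
        } finally {
            pool.shutdownNow();
        }
    }

    /** Simulates a retrieval that takes {@code millis} ms (demo only). */
    static Callable<byte[]> delayed(String content, long millis) {
        return () -> {
            Thread.sleep(millis);
            return content.getBytes(StandardCharsets.UTF_8);
        };
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // "b" finishes first, but is still delivered after "a".
        export(Arrays.asList(delayed("a", 80), delayed("b", 10), delayed("c", 40)), out, 3);
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8)); // prints "abc"
    }
}
```

The trade-off compared to ideas 1 and 2 is head-of-line blocking: a slow record at the front of the queue holds back records that are already resolved behind it.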

Note: Fully reading the content before delivering it is problematic with Netarchives, where single elements range from a few bytes to gigabytes.
