Improve WARC export speed #433

tokee · 2023-10-26T11:22:58Z

The bottleneck for WARC export is the retrieval of raw content from the source WARCs. Currently this is done sequentially: one resource is retrieved and delivered, then the next retrieval is initiated, and so on. If all source WARCs are on a single local disk, that is about as efficient as it can get, but Netarchives are typically stored on networked storage systems where parallel requests result in higher throughput. The problem is that the delivery is by nature a single stream of data, which makes it non-trivial to use multiple concurrent sources.

Ideas:

1. Do not enforce the order of delivered content: let X threads initiate content retrieval and synchronize on the delivery point. When a thread reaches the state where it is ready to transfer bytes ("connection established / file opened"), it waits until the delivery point is available. This would work well if resolving the WARCs is the primary bottleneck and the byte transfer is fast.
2. Same as above, but with a buffer for each thread so that small content is fully resolved before being delivered. This would probably result in optimum speed.
3. Idea 2, but keeping the order of the elements, maybe with a queue of Futures?

Note: Fully reading the content before delivering it is problematic with Netarchives, where single elements range from a few bytes to gigabytes.
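
A rough sketch of how the third idea could look, combined with a capped per-record buffer to respect the note above. This is only an illustration under assumptions, not the project's code: `OrderedParallelExport`, `RecordSource`, `export` and `maxBuffer` are hypothetical names, and it assumes Java 11+ for `InputStream.readNBytes(int)` and `InputStream.transferTo`. Worker threads open records and prefetch at most `maxBuffer` bytes; a queue of Futures keeps delivery in the requested order, and anything beyond the prefetched bytes is streamed at delivery time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.SequenceInputStream;
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OrderedParallelExport {

    /** Hypothetical abstraction: opens the raw content of one record (the slow step). */
    interface RecordSource {
        InputStream open(String id) throws IOException;
    }

    static void export(List<String> ids, RecordSource source, OutputStream out,
                       int threads, int maxBuffer)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<Future<InputStream>> pending = new ArrayDeque<>();
        Iterator<String> it = ids.iterator();
        try {
            while (it.hasNext() || !pending.isEmpty()) {
                // Keep at most `threads` retrievals in flight.
                while (it.hasNext() && pending.size() < threads) {
                    String id = it.next();
                    Callable<InputStream> task = () -> {
                        InputStream in = source.open(id);
                        // Prefetch a bounded head: small records end up fully
                        // buffered, large ones are only partially read here.
                        byte[] head = in.readNBytes(maxBuffer);
                        return new SequenceInputStream(new ByteArrayInputStream(head), in);
                    };
                    pending.add(pool.submit(task));
                }
                // The head of the queue is the next record in the requested order.
                try (InputStream in = pending.remove().get()) {
                    in.transferTo(out); // buffered head first, then the remaining bytes
                }
            }
        } finally {
            // Sketch only: streams opened by still-pending tasks are not closed on failure.
            pool.shutdownNow();
        }
    }
}
```

Memory use is bounded by roughly `threads * maxBuffer` bytes. The trade-off is that a slow or very large record at the head of the queue stalls delivery of records that are already resolved, which the unordered variants (ideas 1 and 2) would avoid.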

tokee added enhancement backend labels Oct 26, 2023
