The bottleneck for WARC export is the retrieval of raw content from the source WARCs. Currently this is done sequentially: one resource is retrieved and delivered, then the next retrieval is initiated, and so on. If all source WARCs are on a single local disk, that is about as efficient as it can get, but Netarchives are typically stored on networked storage systems where parallel requests result in higher throughput. The problem is that the delivery is by nature a single stream of data, which makes it non-trivial to use multiple concurrent sources.
Ideas:
1. Do not enforce the order of delivered content: Let X threads initiate content retrieval and synchronize on the delivery point. When a thread enters the state where it is ready to transfer bytes ("connection established / file opened"), it waits until the delivery point is available. This would work well if resolving the WARCs is the primary bottleneck and the byte transfer is fast.
2. Same as above, but with a buffer for each thread so that small content is fully resolved before being delivered. This would probably give the best throughput.
3. Idea 2, but keeping the order of the elements, maybe with a queue of Futures?
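The ordered variant with a queue of Futures could look roughly like the sketch below. It is written in Python for brevity and is not taken from the exporter code: `export_in_order`, `fetch`, `sink`, `workers` and `window` are all hypothetical names, and `fetch(key)` is assumed to return the full record as bytes.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def export_in_order(keys, fetch, sink, workers=4, window=8):
    """Fetch records in parallel but write them to sink in the order of keys.

    fetch(key) -> bytes and sink.write(bytes) are assumed interfaces.
    At most `window` retrievals are pending at any time.
    """
    pending = deque()  # futures in submission order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for key in keys:
            pending.append(pool.submit(fetch, key))
            if len(pending) >= window:
                # Block on the oldest retrieval; later ones keep running.
                sink.write(pending.popleft().result())
        # Drain the remaining futures, still in submission order.
        while pending:
            sink.write(pending.popleft().result())
```

Because every result is held fully in memory until it is written, `window` bounds the number of buffered records but not their total size.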
Note: Fully reading the content before delivering it is problematic with Netarchives, where single elements range in size from a few bytes to gigabytes.
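One possible way to combine the unordered ideas with this size concern is to buffer only up to a fixed limit per thread and stream the remainder while holding the delivery point. The following Python sketch illustrates that under stated assumptions: `export_unordered`, `open_source`, `small_limit` and `chunk_size` are hypothetical names, and `open_source(key)` is assumed to return a file-like object that works as a context manager.

```python
import shutil
import threading
from concurrent.futures import ThreadPoolExecutor


def export_unordered(keys, open_source, sink, workers=4,
                     small_limit=1 << 20, chunk_size=64 * 1024):
    """Deliver records in arbitrary order, one record at a time.

    Each thread buffers at most small_limit bytes before it waits for
    the delivery point; anything beyond that is streamed under the lock.
    """
    delivery_lock = threading.Lock()  # the single "delivery point"

    def handle(key):
        with open_source(key) as src:
            # Resolve the source and prefetch outside the lock. Small
            # records are typically fully read here (a raw stream may
            # return fewer bytes per read, which is still correct).
            head = src.read(small_limit)
            with delivery_lock:
                sink.write(head)
                # Stream the rest of a large record; for a small record
                # the first read returns b"" and this is a no-op.
                shutil.copyfileobj(src, sink, chunk_size)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for future in [pool.submit(handle, key) for key in keys]:
            future.result()  # re-raise any retrieval error
```

While a large record is being streamed the other threads wait at the lock, so the gain comes mainly from overlapping the resolve phase and the small reads.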