
s3fs IO errors (was: Error received during crawl) #772

Open
Neo-Oli opened this issue Feb 18, 2025 · 1 comment
Neo-Oli commented Feb 18, 2025

Hi, I have an issue that I am hoping you can help me with.

I am trying to archive a rather large site. Because the resulting archives quickly filled up my VPS's storage, I mounted an S3 space with s3fs. But when I run the crawler I get this error:

{"timestamp":"2025-02-12T01:19:35.728Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","page":"https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22page%22:%22https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:35.729Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:36.732Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"[https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est","workerid":0](https://www.redcross.ch/it/sostenerci/donare/soccorsi-per-le-persone-in-fuga-nell-europa-dell-est%22,%22workerid%22:0)}}
{"timestamp":"2025-02-12T01:19:40.834Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2025-02-12T01:19:41.333Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1252,"total":1252,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2025-02-12T01:19:41.340Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2025-02-12T01:19:41.341Z","logLevel":"info","context":"general","message":"Generating WACZ","details":{}}
{"timestamp":"2025-02-12T01:19:41.432Z","logLevel":"info","context":"general","message":"Num WARC Files: 39","details":{}}
node:events:496
throw er; // Unhandled 'error' event
^

Error: EIO: i/o error, close
Emitted 'error' event on WriteStream instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -5,
  code: 'EIO',
  syscall: 'close'
}

Node.js v20.11.1

I have tried this many times and it always fails with the same error. The s3fs mount stays stable the whole time, so it isn't just a network disconnect. Do you have any idea what could be causing this issue?

@ikreymer (Member) commented

We haven't had a chance to test with s3fs, so can't really help there specifically. However, Browsertrix Crawler actually has native support for uploading to S3-compatible storage. For security, the S3 settings are only provided via environment variables.

docker run \
  -e STORE_ENDPOINT_URL=https://s3-endpoint.example.com/bucket/ \
  -e STORE_ACCESS_KEY=<access key> \
  -e STORE_SECRET_KEY=<secret key> \
  -e STORE_PATH=<optional prefix>/ \
  ... crawl --generateWACZ

This will upload the WACZ to https://s3-endpoint.example.com/bucket/<optional prefix>/. The prefix is optional.

We have some docs on this, but they should be extended to include this example:
https://crawler.docs.browsertrix.com/user-guide/common-options/#uploading-crawl-outputs-to-s3-compatible-storage

A working example can also be found in the tests:
https://github.com/webrecorder/browsertrix-crawler/blob/main/tests/upload-wacz.test.js

You can also set --sizeLimit on the crawl so that it uploads the WACZ to S3 and exits once the limit is reached, and then run the crawler from a script that restarts it to continue crawling; a rough sketch follows below. (We use it this way in the Browsertrix app with Kubernetes.)
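For reference, a minimal sketch of such a restart loop might look like the following. The seed URL, size limit, volume path, and the exit-code handling are illustrative assumptions rather than an official recipe; check the crawler docs for the exact exit behavior when --sizeLimit is reached.

#!/bin/bash
# Illustrative restart loop (assumptions noted below, not an official recipe).
# Each run uploads its WACZ to S3 via the STORE_* environment variables and
# exits once --sizeLimit is reached; the loop then restarts the crawler.
while true; do
  docker run \
    -e STORE_ENDPOINT_URL=https://s3-endpoint.example.com/bucket/ \
    -e STORE_ACCESS_KEY="$STORE_ACCESS_KEY" \
    -e STORE_SECRET_KEY="$STORE_SECRET_KEY" \
    -v "$PWD/crawls:/crawls" \
    webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ \
      --generateWACZ \
      --sizeLimit 5000000000   # assumed ~5 GB per run before upload + exit
  status=$?
  # Assumption: exit code 0 means the crawl finished; any other code means it
  # stopped early (e.g. the size limit was reached) and should be restarted to
  # resume from the saved state in the mounted /crawls volume.
  if [ "$status" -eq 0 ]; then
    echo "Crawl complete"
    break
  fi
  echo "Crawler exited with $status, restarting..."
done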

At this time, only uploading WACZs is supported, and the WACZ upload should stream directly to S3 without requiring any additional local disk space. Hope this helps!

@ikreymer ikreymer changed the title Error received during crawl s3fs IO errors (was: Error received during crawl) Feb 20, 2025