Update README.md
Wesley van Lee committed Oct 9, 2024
1 parent 1fc3bc5 commit 11b5d85
Showing 2 changed files with 10 additions and 14 deletions.
22 changes: 9 additions & 13 deletions README.md
@@ -1,21 +1,17 @@
 # Scrapy Webarchive
 
-A Web Archive extension for Scrapy
+Scrapy Webarchive is a plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
 
+## Features
 
-# Installation
+* Save web crawls in WACZ format (multiple storages supported; local and cloud).
+* Crawl against WACZ format archives.
+* Integrate seamlessly with Scrapy’s spider request and response cycle.
 
-Add to your `settings.py` or your spider configuration.
+## Compatibility
 
-```python
-EXTENSIONS = {
-    'scrapy_webarchive.extensions.WaczExporter': 543,
-}
+* Python 3.8+
 
-DOWNLOADER_MIDDLEWARES = {
-    'scrapy_webarchive.downloadermiddlewares.WaczMiddleware': 543,
-}
+## Documentation
 
-# year, month, day and timestamp are the supported template variables that you can use.
-ARCHIVE_EXPORT_URI = 's3://scrapy-webarchive/{year}/{month}/{day}/'
-```
+Documentation is available online at [developers.thequestionmark.org/scrapy-webarchive/](https://developers.thequestionmark.org/scrapy-webarchive/)
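
Note on the removed configuration snippet: the old README listed `year`, `month`, `day` and `timestamp` as template variables for `ARCHIVE_EXPORT_URI`. The sketch below illustrates how such a template could be expanded with plain `str.format`. It is not the plugin's actual code: the helper name `expand_export_uri` and the `timestamp` layout are assumptions made for illustration only.

```python
from datetime import datetime, timezone


def expand_export_uri(template: str, now: datetime) -> str:
    """Fill the documented placeholders in an export URI template.

    Hypothetical helper; the real plugin may expand the template differently.
    """
    return template.format(
        year=now.strftime("%Y"),
        month=now.strftime("%m"),
        day=now.strftime("%d"),
        # Assumed layout; the source does not specify the timestamp format.
        timestamp=now.strftime("%Y%m%d%H%M%S"),
    )


commit_day = datetime(2024, 10, 9, tzinfo=timezone.utc)
uri = expand_export_uri("s3://scrapy-webarchive/{year}/{month}/{day}/", commit_day)
print(uri)
```

Placeholders that a template does not use are simply ignored by `str.format`, so one helper can serve templates that use any subset of the four variables.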
2 changes: 1 addition & 1 deletion scrapy_webarchive/middleware.py
@@ -58,7 +58,7 @@ def process_start_requests(self, start_requests: Iterable[Request], spider: Spid
 url = entry["url"]
 
 # filter out off-site responses
-if hasattr(spider, 'allowed_domains') and urlparse(url).hostname not in spider.allowed_domains:
+if hasattr(spider, "allowed_domains") and urlparse(url).hostname not in spider.allowed_domains:
     continue
 
 # only accept whitelisted responses if requested by spider
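
Note on the off-site check: the changed line only swaps quote style, so behavior is unchanged. As a minimal standalone sketch of what the condition does, the code below applies the same expression to a few sample entries. `DemoSpider`, `is_offsite` and the sample URLs are hypothetical and exist only for this illustration; the condition itself is copied from the diff.

```python
from urllib.parse import urlparse


class DemoSpider:
    """Hypothetical stand-in for a Scrapy spider; only the attribute matters here."""

    allowed_domains = ["example.com"]


def is_offsite(url: str, spider: object) -> bool:
    # Same condition as in the diff: exact hostname membership in allowed_domains.
    return hasattr(spider, "allowed_domains") and urlparse(url).hostname not in spider.allowed_domains


entries = [
    {"url": "https://example.com/a"},
    {"url": "https://other.org/b"},
    {"url": "https://www.example.com/c"},
]

spider = DemoSpider()
kept = [entry["url"] for entry in entries if not is_offsite(entry["url"], spider)]
print(kept)
```

Because the test is a plain membership check on the hostname, `www.example.com` is treated as off-site when only `example.com` is listed. Scrapy's own off-site filtering also accepts subdomains of allowed domains, so the two may disagree for such URLs.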