A Python-based web crawler that checks for broken links on websites. It handles Cloudflare-protected sites and can output results in multiple formats.
- Recursively crawls websites to find broken links
- Handles Cloudflare-protected websites using `cloudscraper`
- Ignores Cloudflare email protection links to reduce false positives
- Multiple output formats (Text, CSV, JSON)
- Stays within the same domain during crawling
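As an illustration of the same-domain rule, here is a minimal sketch of how a crawler might decide whether to follow a link. The function name `same_domain` is hypothetical and not taken from the script; the sketch assumes that "same domain" means an exact match on the host part of the URL, which may differ from what the script actually does.

```python
from urllib.parse import urljoin, urlparse


def same_domain(start_url: str, link: str) -> bool:
    """Return True when `link` resolves to the same host as `start_url`.

    Hypothetical helper: relative links are resolved against the start URL
    first, then the host parts are compared case-insensitively.
    """
    absolute = urljoin(start_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(start_url).netloc.lower()


print(same_domain("https://example.com/a", "/b"))
print(same_domain("https://example.com", "https://other.org/x"))
```

Under this assumption, subdomains such as `www.example.com` would count as a different domain from `example.com`.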
- Clone the repository
- Install the required packages:
pip install -r requirements.txt
Basic usage:
python broken_link_checker.py URL [--format {text|csv|json}]
- `URL`: The starting URL to crawl (required)
- `--format`: Output format (optional, defaults to text)
  - `text`: Human-readable format
  - `csv`: Comma-separated values
  - `json`: JSON structure
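The command-line interface described above could be expressed with `argparse` roughly as follows. This is a sketch under the assumption that the script uses the standard library parser; the function name `build_parser` and the help strings are illustrative, not copied from `broken_link_checker.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented usage (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="broken_link_checker.py",
        description="Check a website for broken links.",
    )
    # Required positional argument: the starting URL to crawl.
    parser.add_argument("url", metavar="URL", help="The starting URL to crawl")
    # Optional output format, restricted to the three documented choices.
    parser.add_argument(
        "--format",
        choices=["text", "csv", "json"],
        default="text",
        help="Output format (defaults to text)",
    )
    return parser


args = build_parser().parse_args(["https://example.com", "--format", "csv"])
print(args.url, args.format)
```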
- Basic usage with text output:
python broken_link_checker.py https://example.com
- CSV output format:
python broken_link_checker.py https://example.com --format csv
- JSON output format:
python broken_link_checker.py https://example.com --format json
Broken links found:
On page: https://example.com
https://example.com/broken-link -> 404
https://example.com/another-broken -> 500
Page,Broken Link,Status
https://example.com,https://example.com/broken-link,404
https://example.com,https://example.com/another-broken,500
{
"broken_links": [
{
"page": "https://example.com",
"broken_links": [
{
"url": "https://example.com/broken-link",
"status": 404
},
{
"url": "https://example.com/another-broken",
"status": 500
}
]
}
]
}
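The JSON output is convenient for further processing. The following sketch, based only on the structure shown above, counts the broken links reported for each page; the function name `count_broken` is hypothetical.

```python
import json


def count_broken(report: dict) -> dict:
    """Map each page URL to the number of broken links reported on it."""
    return {
        entry["page"]: len(entry["broken_links"])
        for entry in report["broken_links"]
    }


# Sample data copied from the JSON example above.
sample = json.loads("""
{
  "broken_links": [
    {
      "page": "https://example.com",
      "broken_links": [
        {"url": "https://example.com/broken-link", "status": 404},
        {"url": "https://example.com/another-broken", "status": 500}
      ]
    }
  ]
}
""")

print(count_broken(sample))
```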
The easiest way to use the link checker is to pull the pre-built image from GitHub Container Registry:
docker pull ghcr.io/doofusdavid/broken-link-checker:latest
docker run ghcr.io/doofusdavid/broken-link-checker https://example.com
Alternatively, you can build the image locally:
docker build -t broken-link-checker .
Then run the container with your desired URL and options:
# Basic usage with text output
docker run broken-link-checker https://example.com
# Using CSV output format
docker run broken-link-checker https://example.com --format csv
# Using JSON output format
docker run broken-link-checker https://example.com --format json
# Save output to a file on your host machine
docker run broken-link-checker https://example.com --format json > results.json
The script automatically ignores:
- JavaScript links (starting with `javascript:`)
- Email links (starting with `mailto:`)
- Cloudflare email protection links
- Cloudflare email protection documentation pages
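A filter along these lines could implement the first three rules. This is a sketch only: the function name `should_ignore` is hypothetical, and the `/cdn-cgi/l/email-protection` path is an assumption about how Cloudflare email protection links typically look, not something confirmed by the script. The documentation-page rule is omitted because its URL pattern is not given here.

```python
# Schemes listed in the ignore rules above.
IGNORED_PREFIXES = ("javascript:", "mailto:")

# Assumed path fragment for Cloudflare email protection links.
CLOUDFLARE_EMAIL_PATH = "/cdn-cgi/l/email-protection"


def should_ignore(href: str) -> bool:
    """Return True for links that should be skipped (hypothetical helper)."""
    link = href.strip().lower()
    if link.startswith(IGNORED_PREFIXES):
        return True
    return CLOUDFLARE_EMAIL_PATH in link


print(should_ignore("mailto:someone@example.com"))
print(should_ignore("https://example.com/about"))
```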