doofusdavid/broken-link-checker

Broken Link Checker

A Python-based web crawler that checks websites for broken links. It handles Cloudflare-protected sites and can report results as text, CSV, or JSON.

Features

  • Recursively crawls websites to find broken links
  • Handles Cloudflare-protected websites using cloudscraper
  • Ignores Cloudflare email protection links to reduce false positives
  • Multiple output formats (Text, CSV, JSON)
  • Stays within the same domain during crawling
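
As an illustration of the same-domain rule, here is a minimal sketch that uses only the standard library. The function name `is_same_domain` is hypothetical and is not taken from `broken_link_checker.py`; the script's actual check may differ, for example in how it treats subdomains such as `www.`.

```python
from urllib.parse import urljoin, urlparse


def is_same_domain(start_url: str, link: str) -> bool:
    """Return True if `link`, resolved against `start_url`, points at the same host.

    Hypothetical helper for illustration; not the script's own implementation.
    """
    # Resolve relative links such as "/about" or "page.html" first.
    absolute = urljoin(start_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(start_url).netloc.lower()


print(is_same_domain("https://example.com/", "/about"))                  # True
print(is_same_domain("https://example.com/", "https://other.org/page"))  # False
```

In this sketch, hosts are compared exactly, so `www.example.com` and `example.com` count as different domains.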

Installation

  1. Clone the repository
  2. Install the required packages:
pip install -r requirements.txt

Usage

Basic usage:

python broken_link_checker.py URL [--format {text|csv|json}]

Parameters

  • URL: The starting URL to crawl (required)
  • --format: Output format (optional, defaults to text)
    • text: Human-readable format
    • csv: Comma-separated values
    • json: JSON structure
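
The parameters above map naturally onto `argparse`. The following is a sketch of how such a command line could be declared, not a copy of the script's own parser; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented interface (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Check a website for broken links.")
    # Required positional argument: the starting URL.
    parser.add_argument("url", metavar="URL", help="The starting URL to crawl")
    # Optional output format, restricted to the three documented values.
    parser.add_argument(
        "--format",
        choices=["text", "csv", "json"],
        default="text",
        help="Output format (defaults to text)",
    )
    return parser


args = build_parser().parse_args(["https://example.com", "--format", "csv"])
print(args.url, args.format)  # https://example.com csv
```

With `choices` set, `argparse` rejects any value other than `text`, `csv`, or `json` with a usage error.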

Examples

  1. Basic usage with text output:
python broken_link_checker.py https://example.com
  2. CSV output format:
python broken_link_checker.py https://example.com --format csv
  3. JSON output format:
python broken_link_checker.py https://example.com --format json

Output Format Examples

Text Format (Default)

Broken links found:

On page: https://example.com
  https://example.com/broken-link -> 404
  https://example.com/another-broken -> 500

CSV Format

Page,Broken Link,Status
https://example.com,https://example.com/broken-link,404
https://example.com,https://example.com/another-broken,500
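
The CSV output can be post-processed with Python's `csv` module. This is a minimal sketch that assumes the header row shown above and that every `Status` value is an integer; the helper name `load_broken_links` is hypothetical.

```python
import csv
import io

# Sample data copied from the CSV example above.
sample = """Page,Broken Link,Status
https://example.com,https://example.com/broken-link,404
https://example.com,https://example.com/another-broken,500
"""


def load_broken_links(stream):
    """Read the checker's CSV output into (page, link, status) tuples."""
    return [
        (row["Page"], row["Broken Link"], int(row["Status"]))
        for row in csv.DictReader(stream)
    ]


rows = load_broken_links(io.StringIO(sample))
server_errors = [link for _, link, status in rows if status >= 500]
print(server_errors)  # ['https://example.com/another-broken']
```

To read real output, replace `io.StringIO(sample)` with an open file, for example `open("results.csv", newline="")`.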

JSON Format

{
  "broken_links": [
    {
      "page": "https://example.com",
      "broken_links": [
        {
          "url": "https://example.com/broken-link",
          "status": 404
        },
        {
          "url": "https://example.com/another-broken",
          "status": 500
        }
      ]
    }
  ]
}
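
Because the JSON output nests broken links under each page, it can help to flatten it before processing. The sketch below assumes exactly the structure shown above; the helper name `flatten_report` is hypothetical.

```python
import json

# Sample data copied from the JSON example above.
sample = """
{
  "broken_links": [
    {
      "page": "https://example.com",
      "broken_links": [
        {"url": "https://example.com/broken-link", "status": 404},
        {"url": "https://example.com/another-broken", "status": 500}
      ]
    }
  ]
}
"""


def flatten_report(report):
    """Turn the nested report into a flat list of (page, url, status) tuples."""
    return [
        (page["page"], link["url"], link["status"])
        for page in report["broken_links"]
        for link in page["broken_links"]
    ]


flat = flatten_report(json.loads(sample))
print(len(flat))  # 2
```

To process a saved report such as `results.json`, load it with `json.load` on an open file and pass the result to `flatten_report`.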

Docker Usage

Using Pre-built Image from GitHub Container Registry

The easiest way to use the link checker is to pull the pre-built image from GitHub Container Registry:

docker pull ghcr.io/doofusdavid/broken-link-checker:latest
docker run ghcr.io/doofusdavid/broken-link-checker https://example.com

Building Locally

Alternatively, you can build the image locally:

docker build -t broken-link-checker .

Then run the container with your desired URL and options:

# Basic usage with text output
docker run broken-link-checker https://example.com

# Using CSV output format
docker run broken-link-checker https://example.com --format csv

# Using JSON output format
docker run broken-link-checker https://example.com --format json

# Save output to a file on your host machine
docker run broken-link-checker https://example.com --format json > results.json

Note

The script automatically ignores:

  • JavaScript links (starting with javascript:)
  • Email links (starting with mailto:)
  • Cloudflare email protection links
  • Cloudflare email protection documentation pages
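
The filtering described above could be sketched as follows. This is an assumption-laden illustration, not the script's code: the function name `should_ignore` is hypothetical, and the `/cdn-cgi/l/email-protection` path is the one Cloudflare's email obfuscation typically generates. The documentation-page rule is left out because this README does not state which URLs it matches.

```python
from urllib.parse import urlparse

# Link prefixes that never point at a fetchable page.
IGNORED_PREFIXES = ("javascript:", "mailto:")
# Path typically used by Cloudflare's email obfuscation (assumed here).
CLOUDFLARE_EMAIL_PATH = "/cdn-cgi/l/email-protection"


def should_ignore(link: str) -> bool:
    """Return True for links the checker would skip (hypothetical sketch)."""
    lowered = link.strip().lower()
    if lowered.startswith(IGNORED_PREFIXES):
        return True
    return CLOUDFLARE_EMAIL_PATH in urlparse(lowered).path


print(should_ignore("mailto:someone@example.com"))            # True
print(should_ignore("/cdn-cgi/l/email-protection#a1b2c3"))    # True
print(should_ignore("https://example.com/contact"))           # False
```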
