
Architecture

TheTechRobo edited this page Jan 12, 2025 · 1 revision

Tracker

The tracker maintains the queue and stores job results. It uses rue.

Bot

Connects to an http2irc endpoint using bot2h, listening for commands to manipulate the queue. Stored in the irc directory.

Server

Serves items to the workers. See Tracker Protocol for information on how exactly that occurs. Authentication is done via Basic authentication against the server_secrets RethinkDB table, which holds id/value objects: the username maps to the id, and the password maps to the value. Timing attacks can be used to discover the length of the secret, but not its content, so it is recommended to use a fixed-length secret such as a hash or UUID.
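The authentication check described above can be sketched roughly as follows. This is an illustration only, not the server's actual code: the `SERVER_SECRETS` dict is a hypothetical in-memory stand-in for the server_secrets table, `check_basic_auth` is a made-up name, and the use of `hmac.compare_digest` is an assumption. That function's comparison time depends on the length of its inputs but not on their content, which matches the behaviour described.

```python
import base64
import hmac

# Hypothetical stand-in for the server_secrets table: id -> value.
# The UUID-style value gives every secret the same length.
SERVER_SECRETS = {"worker-1": "3f2b8c1e-5d4a-4e6f-9a7b-0c1d2e3f4a5b"}


def check_basic_auth(header: str, secrets: dict) -> bool:
    """Return True if a Basic Authorization header matches a stored secret."""
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return False
    # The username maps to the id, the password maps to the value.
    username, sep, password = decoded.partition(":")
    if not sep:
        return False
    expected = secrets.get(username)
    if expected is None:
        return False
    # Comparison time does not depend on where the first mismatch occurs.
    return hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8"))
```

A fixed-length secret of this kind could be generated with `str(uuid.uuid4())` from the standard library.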

Dashboard

Provides information on the queue and items. Uses Quart and Jinja2.

Pipeline

The pipeline does the actual archiving.

warcprox

warcprox is used to write the WARCs. It connects to a local RethinkDB instance for dedupe. Dedupe entries older than seven days are purged whenever a job is completed.
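The seven-day purge can be pictured with a small sketch. It is not the pipeline's actual code: it uses a plain dict in place of the RethinkDB dedupe table, and `purge_old_entries`, `MAX_AGE`, and the key format are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window, taken from the seven-day rule above.
MAX_AGE = timedelta(days=7)


def purge_old_entries(entries: dict, now: datetime) -> int:
    """Delete dedupe entries older than MAX_AGE and return how many were removed.

    `entries` maps a dedupe key (hypothetical) to the time the entry was written.
    """
    cutoff = now - MAX_AGE
    stale = [key for key, written_at in entries.items() if written_at < cutoff]
    for key in stale:
        del entries[key]
    return len(stale)


def on_job_completed(entries: dict) -> int:
    """Hypothetical hook: run the purge each time a job finishes."""
    return purge_old_entries(entries, datetime.now(timezone.utc))
```

Running the purge on job completion rather than on a timer means no separate scheduler is needed, at the cost of stale entries lingering while the pipeline is idle.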

Worker

The worker does the browsing. It controls a headless Chrome instance using brozzler. app.py contains the main loop, browse.py contains the browsing code, tracker.py handles communication with the tracker, and meta.py contains the version. The version MUST be updated whenever the worker code is changed so that if a version is broken in some way, the affected jobs can be easily discovered.

Uploader

Uploads WARC files to the target. Relies on filename parsing to get certain information without parsing the WARC file.

Target

Runs bullseye.

The three mnbot-specific parts are:

Verifier

Verifies WARCs by decompressing them and checking them with warc-tiny.
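The decompression half of that check might look something like the sketch below. It assumes the WARCs are gzip-compressed and covers only the decompression step; the warc-tiny check is not shown, and `decompresses_cleanly` is a hypothetical name rather than anything from the target's code.

```python
import gzip
import zlib


def decompresses_cleanly(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip file at `path` can be decompressed.

    The file is streamed in chunks so that large WARCs are never held in memory.
    """
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile; EOFError signals a truncated file.
        return False
    return True
```

Python's `gzip` module reads multi-member gzip files, so a WARC stored as a series of concatenated gzip members is read through to the end.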

Packer

Packs verified WARCs into a megawarc.

Uploader

Uploads old megawarcs to the Internet Archive (IA).
