Architecture
The tracker keeps track of the queue and stores job results. The tracker uses rue.
Connects to an http2irc endpoint using bot2h, listening for commands to manipulate the queue. Stored in the irc directory.
Serves items to the workers. See Tracker Protocol for information on how exactly that occurs. Uses the server_secrets RethinkDB table with id/value objects. Timing attacks can be used to discover the length of the secret, but not the content, so it is recommended to use a fixed-length secret such as a hash or UUID. Authentication is done via HTTP Basic authentication, where the username maps to the id and the password maps to the value.
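The authentication scheme described above can be sketched roughly as follows. This is a minimal illustration, not the tracker's actual code: `check_basic_auth` and the in-memory `SERVER_SECRETS` dict (standing in for the server_secrets table) are hypothetical names, and the use of `hmac.compare_digest` is an assumption. It is shown because it has the property described: comparison time may reveal the length of the secret but not its content.

```python
import base64
import hmac

# Hypothetical stand-in for the server_secrets table: id -> value.
# A fixed-length secret (here a UUID) avoids leaking anything useful via length.
SERVER_SECRETS = {"worker-1": "123e4567-e89b-12d3-a456-426614174000"}


def check_basic_auth(header, secrets):
    """Return the authenticated id for a Basic Authorization header, else None."""
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers both invalid base64 and invalid UTF-8.
        return None
    username, sep, password = decoded.partition(":")
    if not sep:
        return None
    expected = secrets.get(username)
    if expected is None:
        return None
    # Constant-time comparison with respect to the secret's content.
    if hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8")):
        return username
    return None
```

A worker would then send `Authorization: Basic base64(id:value)` with every request, using the id and value of its row in the table.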
Provides information on the queue and items. Uses Quart and Jinja2.
The pipeline does the actual archiving.
warcprox is used to write the WARCs. It connects to a local RethinkDB instance for dedupe. Dedupe entries older than seven days are purged whenever a job is completed.
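The seven-day purge can be illustrated with a small sketch. It assumes dedupe entries carry a timestamp and uses a plain dict in place of the RethinkDB table; `purge_old_entries` and the entry layout are hypothetical and do not reflect warcprox's actual schema.

```python
from datetime import datetime, timedelta, timezone

# Maximum age of a dedupe entry, as stated above.
DEDUPE_MAX_AGE = timedelta(days=7)


def purge_old_entries(entries, now=None):
    """Return only the entries whose timestamp is within the last seven days.

    `entries` maps a (hypothetical) dedupe key to the time it was recorded.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - DEDUPE_MAX_AGE
    return {key: ts for key, ts in entries.items() if ts >= cutoff}
```

In this sketch the function would be called once each time a job completes, so the table never needs a separate cleanup task.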
The worker does the browsing. It controls a headless Chrome instance using brozzler. app.py contains the main loop, browse.py contains the browsing code, tracker.py handles communication with the tracker, and meta.py contains the version. The version MUST be updated whenever the worker code is changed so that if a version is broken in some way, the affected jobs can be easily discovered.
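To show why the version bump matters, here is a minimal sketch of how a version string could be attached to each job result and later used to find the jobs a broken release touched. The `VERSION` value, the result fields, and both helper functions are assumptions made for illustration; the real meta.py and result format may look different.

```python
# Hypothetical contents of meta.py: bumped on every change to the worker code.
VERSION = "example-2"


def build_result(job_id, outcome, version=VERSION):
    """Build a job result record tagged with the worker version that produced it."""
    return {"job_id": job_id, "outcome": outcome, "worker_version": version}


def jobs_for_version(results, version):
    """Return the ids of all jobs that were processed by the given worker version."""
    return [r["job_id"] for r in results if r["worker_version"] == version]
```

If two different code states shared the same version string, a lookup like `jobs_for_version` could not separate good jobs from affected ones, which is the reason for the MUST above.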
Uploads WARC files to the target. Relies on filename parsing to get certain information without parsing the WARC file.
Runs bullseye.
The three mnbot-specific parts are:
Verifies WARCs by decompressing them and checking them with warc-tiny.
Packs verified WARCs into a megawarc.
Uploads old megawarcs to IA.
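The decompression half of the verification step can be sketched as below. This only checks that a gzip-compressed WARC decompresses cleanly from start to end; it does not reproduce the warc-tiny check, and `gzip_is_intact` is a hypothetical helper rather than the verifier's actual code.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses without errors.

    Python's gzip module reads concatenated gzip members, so a .warc.gz
    written as one member per record is read through to the end.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Stream in chunks so large WARCs are never held in memory at once.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is an OSError; truncated files raise EOFError.
        return False
    return True
```

A file that fails this check would not be passed on to the packing step in this sketch.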