- update to mcmetadata v1.1.0 for:
- insecure_requests_session()
- is_non_news_domain()
- update to sitemap-tools v2.0.0
- mypy.sh: detect changes to mypy-requirements.txt
- update dokku-scripts/http-proxy.sh for new version of dokku on tarbell
- handle scheme-less link URLs
- ran autopep8.sh
- bring back improvements from web-search dokku-scripts
- use private config repo
- update airtable
- fetcher/config.py: remove REDIS_URL, add UNDEAD_FEED{S,_MAX_DAYS}
- fetcher/tasks.py: if UNDEAD_FEEDS set:
- never disable feeds
- use UNDEAD_FEED_MAX_DAYS instead of MAXIMUM_BACKOFF_MINS (if failures > MAX_FAILURES)
- scripts/gen_daily_story_rss.py: use date range for faster query
- fetcher/rss/rsswriter.py: read item template once
- requirements.txt: update to sitemap-tools v1.1.0 w/ HTML detection
- fetcher/tasks.py: no need for HTML detection
- RSS sync crontab fix
- update dokku-scripts/instance.sh to disable sync of RSS files to S3
- remove mcweb network, use MCWEB_URL from .prod
- update runtime.txt to python-3.10.14
- add sources_id index to stories table
- add sitemap parsing
- add HTTP_KEEP_CONNECTION_HEADER (defaults to off); stops Akamai https npr.org feeds from timing out
- server/rss_entries.py: add fetched_at
- dokku-scripts/config.sh: require/pick-up MCWEB_TOKEN
- scripts/gen_daily_story_rss.py: generate items even if feed_url unavailable
- new server/rss_entries.py: add /api/rss_entries endpoint
- use mcmetadata (v0.12.0) webpages.MEDIA_CLOUD_USER_AGENT
- updated dashboards directory json files
- create OUTPUT_RSS_DIR if it doesn't yet exist!
- cleanup from runs of autopep8.sh and mypy.sh
- add
- fix check function in scripts/update_feeds.py: was not picking up new feeds!
- redeploy with latest mcmetadata
- update to python-3.10.13
- dashboards/rss-fetcher-alerts.json updated
- dokku-scripts/http-proxy.sh: add helpful message
- dokku-scripts/install-dokku.sh: fix apt update command
- dokku-scripts/instance.sh: remove double echo
- added LICENSE (Apache 2.0)
- dokku-scripts/config.sh: set RSS_OUTPUT_DAYS=90 for production, removed MAX_FEEDS
- server/sources.py: use queued.isnot(True)
- .env.template: removed MAX_FEEDS
- fetcher/config.py: lower AUTO_ADJUST_MIN_POLL_MINUTES default to 10 minutes(!)
- dokku-scripts/instance.sh: fix MCWEB URL, redis removal
- scripts/db_archive.py: fix for SQLAlchemy 2.0
- gen_daily_story_rss.py: fix for SQLAlchemy 2.0, write to .tmp file and rename
- dokku-scripts/configure.sh: lower prod/staging worker count to 16
- updated runtime.txt to Python 3.10.10 for security fixes
- server (API) fixes for SQLAlchemy 2.0
- Raise staging/prod workers to 32
- Raise default concurrency to 2
- Fudge SBItem.next_start to avoid extra waits
- Log feeds in Manager process
- Update to sqlalchemy 2.0, psycopg 3.1
- Use "rank" in headhunter ready query (from legacy crawler_provider/init.py)
- All server methods are async
- dokku-scripts cleanup
- add RSS_FETCH_READY_LIMIT
- update prod dashboard .json file
NOTE! Untuned!! almost certainly queries database more than needed!
- fully adaptive fetching (adjusts poll_minutes both up and down)
- replace work queue with direct process management
- replace scripts/{queue_feeds,worker}.py with scripts/fetcher.py
- removed fetcher/queue.py
- added fetcher/{direct,headhunter,scoreboard}.py
- use official PyPI mediacloud package for scripts/update_feeds.py
- dokku-scripts improvements:
- moved dokku instance configuration to config.sh
- run config.sh from push.sh
- instance.sh saves INSTANCE_SH_GIT_HASH, checked by push.sh
Implement auto-adjustment of Feed.poll_minutes
- dokku-scripts/push.sh: up prod workers from 10 to 12
- add server/sources.py: add /api/sources/N/stories/{fetched,published}_by_day
- fetcher/stats.py: add break to loops: fix double increments!
- fetcher/tasks.py:
- split "dup" from "skipped"
- always call mcmetadata.urls.is_homepage_url (to detect bad urls early)
- keep saved_count
- report queue length as gauge at end of processing
- scripts/poll_update.py: handle new "N skipped / N dup / N added" reports
- scripts/queue_feeds.py: add "added2" counter
- removed unused dokku-scripts/sync-feeds.sh
- dokku-scripts/push.sh: complain about unknown arguments
- dokku-scripts/instance.sh: run scripts using "enter"
- Procfile: remove generator/archiver/update
- Procfile: queue feeds once a minute
- fetcher/config.py
- make MAX_FAILURES default 10 (was 4)
- add SKIP_HOME_PAGES config (default to off)
- fetcher/tasks.py
- re-raise JobTimeout in fetch_and_process_feed
- honor SKIP_HOME_PAGES
- scripts/poll_update.py
- update feeds one at a time
- add --fetches and --max-urls for experimentation
- add HTTP_CONDITIONAL_FETCH config variable
- new doc/columns.md -- explain db columns
- new dokku-scripts/dburl.sh
- scripts.poll_update: add options for experimentation
Fix more parse errors
- fetcher/tasks.py: feed response.content to feedparser; response.text decodes utf-16 as utf-8 (with bad results) -- see the sketch after this list
- dokku-scripts/test-feeds.psql: add feeds w/ doctype html, html, utf-16; remove www.mbc.mw urls (all HTML)
- CHANGELOG.md: add dates on 0.12.* versions
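A minimal sketch of the response.content change above: passing raw bytes lets feedparser do its own charset detection, while response.text may have already mis-decoded a UTF-16 body. Function names here are illustrative, not the repo's actual code.

```python
import feedparser
import requests

def parse_feed(url: str) -> feedparser.FeedParserDict:
    resp = requests.get(url, timeout=30)
    # Pass raw bytes so feedparser can sniff the declared/actual encoding
    # (e.g. UTF-16) itself; resp.text may decode with the wrong codec and
    # produce spurious "parse errors".
    return feedparser.parse(resp.content)
```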
Fix spurious parse errors:
- dokku-scripts/test-feeds.psql: create feeds table with small test set
- fetcher/config.py: add SAVE_PARSE_ERRORS param
- fetcher/path.py: add PARSE_ERROR_DIR
- fetcher/tasks.py: ignore feedparser "bozo"; check "version" only; honor SAVE_PARSE_ERRORS
- .env.template: add TZ, SAVE_PARSE_ERRORS
- dokku-scripts/instance.sh: speed up deployment, fix mcweb config
- dokku-scripts/push.sh: vary workers according to instance type; add git:set
- fetcher/config.py: add FAST_POLL_MINUTES
- fetcher/database/models.py: comment
- scripts/poll_update.py:
- use FAST_POLL_MINUTES
- don't overwrite poll period if less than or equal
- add stats
- take pidfile lock before gathering candidates
- scripts/queue_feeds.py:
- order by id % 1001
- stats for stray catcher
Reduce default fetch interval to 6 hours (from 12):
- fetcher/config.py: change _DEFAULT_DEFAULT_INTERVAL_MINS to 6 hours!
- dokku-scripts/randomize-feeds.sh: change from 12 to 6 hours
Implement Feed.poll_minutes override, for feeds that publish uniformly short lists of items, with little overlap when polled normally (sketch after this list):
- fetcher/database/models.py: add poll_minutes (poll period override) (currently only set by scripts/poll_update.py)
- fetcher/database/versions/20230111_1237_add_poll_minutes.py: migration
- fetcher/tasks.py: implement policy changes to honor poll_minutes
- scripts/poll_update.py: script to set poll_minutes for "short fast" feeds
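A rough sketch of how the override might be applied when scheduling the next fetch; everything other than Feed.poll_minutes is an assumption for illustration (the real constants live in fetcher.config).

```python
from datetime import datetime, timedelta

# Hypothetical defaults; the real values come from fetcher.config.
DEFAULT_INTERVAL_MINS = 6 * 60

def next_poll_minutes(feed) -> int:
    # poll_minutes, when set, overrides whatever period would otherwise be
    # derived from the feed's advertised update rate / configured default.
    if feed.poll_minutes is not None:
        return feed.poll_minutes
    return feed.update_minutes or DEFAULT_INTERVAL_MINS

def schedule_next(feed, now: datetime | None = None) -> datetime:
    now = now or datetime.utcnow()
    return now + timedelta(minutes=next_poll_minutes(feed))
```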
Administrivia:
- dokku-scripts/instance.sh:
- fix/use app_http_url function
- save feed update script output
- add crontab entry for poll_updates
Create/use global /storage/lock directory:
- fetcher/path.py: add LOCK_DIR
- fetcher/pidfile: use fetcher.path.LOCK_DIR, create if needed (see the sketch below)
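A minimal sketch of a pidfile exclusion lock kept under the shared lock directory; the actual fetcher.pidfile API may differ, and the path shown is illustrative.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_DIR = "/app/storage/lock"   # illustrative; the real path comes from fetcher.path

@contextmanager
def pid_lock(name: str):
    os.makedirs(LOCK_DIR, exist_ok=True)          # create the directory if needed
    path = os.path.join(LOCK_DIR, f"{name}.pid")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Raises BlockingIOError if another instance already holds the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        os.ftruncate(fd, 0)
        os.write(fd, f"{os.getpid()}\n".encode())
        yield
    finally:
        os.close(fd)                               # closing releases the lock
```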
Cleanup:
- scripts/update_feeds.py: import LogArgumentParser in main
- Procfile: add "update" for update_feeds.py
- instance.sh:
- configure rss-fetcher AND mcweb Dokku app networking
- install crontab entry for update in production
- Fix mypy complaint about _MAXHEADERS
- Add MAX_URL -- max URL length to accept
- Add /api/{feeds,sources}/ID/stories endpoints
- New: fetcher.pidfile -- create exclusion locks for scripts
- New: fetcher/mcweb_api.py
- scripts/update_feeds.py:
- use fetcher.mcweb_api
- change defaults
- use fetcher.pidfile
- --sleep-seconds takes float
- add --reset-next-url
- require created_at
- fetcher/database/property.py: add logging
- Update User-Agent to old system string plus "+" in front of URL (rssfeeds.usatoday.com returning HTML w/ browser U-A string)
- Accept up to 1000 HTTP headers in responses (www.lexpress.fr was sometimes sending more than 100?)
- Add /api/sources/N/fetch-soon (randomizes next_fetch_attempt)
- Add Feed.rss_title column, always update Feed.name from mcweb
- Add MAXIMUM_INTERVAL_MINS from meatspace review
- Add properties.py: section/key/value store
- scripts/update_feeds.py:
- update to use mcweb API
- improve logging
- use properties
- add --reset-last-modified
- autopep8.sh: ignore venv*
- runtime.txt: update to 3.9.15 due to vulnerability
- /api/stories/by-source endpoint
- Honor SQLALCHEMY_ECHO for debug
- Fix exception in parse exception handler!
- dokku-scripts/push.sh:
- fix push.log
- check for push errors
- add --force-push
- start of feed syncing scripts (not ready)
- scripts/queue_feeds.py: fix queuing feeds by number
- fetcher/tasks.py:
- ignore feedparser charset (etc) errors
- detect "temporary" DNS errors, treat as softer than SOFT!
- use HTTP Retry-After values (sketch after this list)
- only randomize 429 errors, after range checks & backoff scaling
- don't round failure count multiplier
- log prev_system_status when clearing last_fetch_failures (to see/understand what errors are transient)
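A hedged sketch of honoring Retry-After (it can be a delay in seconds or an HTTP-date); the range checks and backoff scaling mentioned above are represented only by hypothetical min/max clamping.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MIN_RETRY_MINS = 10        # illustrative range-check bounds
MAX_RETRY_MINS = 24 * 60

def retry_after_minutes(header: str | None) -> float | None:
    """Return a clamped delay in minutes from a Retry-After header, if usable."""
    if not header:
        return None
    try:
        mins = float(header) / 60                 # delta-seconds form
    except ValueError:
        try:
            when = parsedate_to_datetime(header)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        mins = (when - datetime.now(timezone.utc)).total_seconds() / 60
    return min(max(mins, MIN_RETRY_MINS), MAX_RETRY_MINS)
```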
- Add Feed.last_new_stories column
- Set system_status to Working when same hash or no change
- fetcher/rss/item.template: output one RSS item per line
- dashboards -- NEW: json files for Grafana dashboards
- scripts/import_feeds.py: add --delete-{fetch-events,stories}
- dokku-scripts/instance.sh: add per-user .pw file
- server/auth.py -- NEW HTTP Basic authentication
- server/{feeds,sources,stories}.py: add authentication
- server/feeds.py: add /api/feeds/ID/fetch-soon
- fetcher/config.py: fix comments
- doc/deployment.md: update
- add/use conf.LOG_BACKUP_COUNT
- fetcher/tasks.py: add "clearing failure count" log message
- treat HTTP 504 (gateway timeout) as a soft error
- scripts/db_archive.py: fix log message
- dokku-scripts/instance.sh: remove obsolete RSS_FILE_PATH variable
- fetcher/config.py: drop TASK_TIMEOUT_SECONDS back to 180
- fetcher/logargparse.py: fix --logger-level/-L
- fetcher/tasks.py: clean up exception handling (pull up to fetch_feed); use fresh session for update_feeds; sentry.io issue BACKUP-RSS-FETCHER-67M
- fetcher/tasks.py: fetches_per_minute returns float
- fetcher/tasks.py: handle 'always' in _feed_update_period_mins & catch KeyErrors, log exceptions, log unknown period names
- dokku-scripts/push.sh: fix VERSION extraction; make more verbose; require staging & prod to be pushed only to mediacloud
- scripts/db_archive.py: compress stories on the fly, fix headers, add .csv
- scripts/queue_feeds.py: refactor to allow more command line params and fix command line feeds; move FetchEvent creation & feed update to queue_feeds; multiply fetches_per_minute before rounding (used to truncate then multiply)
- scripts/db_archive.py: use max(RSS_OUTPUT_DAYS, NORMALIZED_TITLE_DAYS) for story_days default. Display default values in help message.
- NEW: dokku-scripts/randomize-feeds.sh: randomize feed.next_fetch_attempt times
- NEW: dokku-scripts/clone-db.sh: clone production database & randomize
- doc/deployment.md: update
- scripts/queue_feeds.py: if qlen==0 but db_queue!=0, clear queued feeds (fix leakage).
- fetcher/tasks.py: clear queued on insane feeds (stop leakage).
Major raking by Phil Budne
- runtime.txt updated to python-3.9.13 (security fixes)
- autopep.sh runs autopep8 -a -i on all files (except fetcher/database/versions/*.py)
- mypy.sh installs and runs mypy in a virtual env. RUNS CLEANLY!
- All scripts take uniform command line arguments for logging, initialization, help and version (in "fetcher.logargparse"):

  ```
  -h, --help            show this help message and exit
  --verbose, -v         set default logging level to 'DEBUG'
  --quiet, -q           set default logging level to 'WARNING'
  --list-loggers        list all logger names and exit
  --log-config LOG_CONFIG_FILE
                        configure logging with .json, .yml, or .ini file
  --log-file LOG_FILE   log file name (default: main.pid.310509.log)
  --log-level {critical,fatal,error,warn,warning,info,debug,notset}, -l {critical,fatal,error,warn,warning,info,debug,notset}
                        set default logging level to LEVEL
  --no-log-file         don't log to a file
  --logger-level LOGGER:LEVEL, -L LOGGER:LEVEL
                        set LOGGER (see --list-loggers) verbosity to LEVEL (see --level)
  --set VAR=VALUE, -S VAR=VALUE
                        set config/environment variable
  --version, -V         show program's version number and exit
  ```
- fetcher.queue abstraction: all queue access abstracted to fetcher.queue; using "rq" for the work queue (only redis needed, allows length monitoring). Saving of "result" (i.e. celery backend) data is disabled, since we only queue jobs "blind" and never check for function results returned (although queue_feeds in --loop mode could poll for results). See the sketch below.
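A minimal sketch of the rq usage this implies (queue jobs blind, discard results, monitor length); the queue name and task path are illustrative.

```python
from redis import Redis
from rq import Queue

redis = Redis()                      # only Redis is needed (no separate broker/backend)
workq = Queue("feeds", connection=redis)

def queue_feed(feed_id: int) -> None:
    # result_ttl=0 discards the return value immediately: jobs are queued
    # "blind" and nothing ever reads a stored result.
    workq.enqueue("fetcher.tasks.feed_worker", feed_id, result_ttl=0)

def queue_length() -> int:
    return len(workq)                # cheap length monitoring for stats
```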
- All database datetimes stored without timezones.
- "fetcher" module (fetcher/__init__.py) stripped to bare minimum (version string and fetching a few environment variables).
- All config variables in fetcher.config "conf" object; provides mechanisms for specifying optional, boolean, integer params.
- Script startup logging: all script startup logging includes script name and Dokku deployed git hash, followed by ONLY logging the configuration that is referenced.
- All scripts log to BASE/storage/logs/APP.DYNO.log; files are turned over at midnight (to filename.log.YYYY-MM-DD), seven files are kept (see the sketch below).
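The midnight turnover with seven retained files matches the stdlib TimedRotatingFileHandler; a sketch, with the path and format illustrative.

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def add_app_log_file(app: str, dyno: str, base: str = "/app/storage/logs") -> None:
    handler = TimedRotatingFileHandler(
        f"{base}/{app}.{dyno}.log",
        when="midnight",      # roll over at midnight to name.log.YYYY-MM-DD
        backupCount=7,        # keep seven old files (cf. LOG_BACKUP_COUNT)
    )
    handler.setFormatter(logging.Formatter(
        "%(asctime)s | %(levelname)s | %(name)s | %(message)s"))
    logging.getLogger().addHandler(handler)
```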
- All file path information in "fetcher.path".
- Common Sentry integration in "fetcher.sentry": enables passing environment="staging"; enabled fastapi support, rq integration.
- SQLAlchemy "Session" factory moved to "fetcher.database", so db params are only logged if db access is used/needed.
- All Procfile entries invoke existing ./run-....sh scripts, so there is only one place to change how a script is invoked.
- "fetcher" process (scripts/queue_feeds.py) runs persistently (no longer invoked by crontab) [enabled by --loop PERIOD in Procfile] and:
  - reports statistics (queue length, database counts, etc.)
  - queues ready feeds every PERIOD minutes; queues only the number of feeds necessary to cover a day's fetch attempts, divided into equal sized batches (based on active enabled feeds' advertised update rates, and config) -- see the sketch after this list
  - allows any number of feed ids on the command line
  - operates as before (queues MAX_FEEDS feeds) if invoked without feed ids or --loop
  - clears the queue and exits given --clear
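A hedged sketch of the batch-size arithmetic described above: estimate the day's total fetch attempts from each active, enabled feed's advertised update period, then split them evenly across the day's loop iterations. Names and defaults are illustrative.

```python
import math

MINUTES_PER_DAY = 24 * 60

def batch_size(update_periods_mins: list[int], loop_period_mins: int,
               default_interval_mins: int = 12 * 60) -> int:
    """How many feeds to queue this pass, given active enabled feeds'
    advertised update periods (minutes) and the --loop PERIOD."""
    fetches_per_day = sum(
        MINUTES_PER_DAY / (period or default_interval_mins)
        for period in update_periods_mins
    )
    batches_per_day = MINUTES_PER_DAY / loop_period_mins
    return math.ceil(fetches_per_day / batches_per_day)
```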
- Queue "worker" process started by scripts/worker.py takes common logging arguments, stats connection init; runs a single queue worker (need to use dokku ps:scale worker=8). Workers set the process title when active, visible via ps/top (see the sketch below the example output):

  ```
  pbudne@ifill:~$ ps ax | grep pbudne-rss-fetcher
  4121658 ?  Rl  48:13 pbudne-rss-fetcher worker.1 feed 2023073
  4124300 ?  Rl  48:25 pbudne-rss-fetcher worker.2 feed 122482
  4127145 ?  Sl  47:34 pbudne-rss-fetcher worker.3 feed 1461182
  4129593 ?  Sl  49:49 pbudne-rss-fetcher worker.4 feed 459899
  ```
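The changelog does not say how the title is set; one common approach (an assumption here, not necessarily what the repo does) is the setproctitle package:

```python
# pip install setproctitle  -- assumption: the repo may use this or an equivalent
from setproctitle import setproctitle

def mark_busy(app: str, dyno: str, feed_id: int) -> None:
    # Shows up in ps/top as e.g. "pbudne-rss-fetcher worker.1 feed 2023073"
    setproctitle(f"{app} {dyno} feed {feed_id}")
```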
- import_feeds script gives each feed a random "next_fetch_attempt" time to (initially) spread workload throughout the minimum requeue time interval (sketch below).
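A minimal sketch of that randomization; the constant is illustrative (the real value comes from config).

```python
import random
from datetime import datetime, timedelta

DEFAULT_INTERVAL_MINS = 12 * 60   # illustrative minimum requeue interval

def initial_next_fetch_attempt(now: datetime | None = None) -> datetime:
    now = now or datetime.utcnow()
    # Spread imported feeds uniformly across one requeue interval so the
    # fetcher doesn't try to poll everything at once on the first pass.
    return now + timedelta(minutes=random.uniform(0, DEFAULT_INTERVAL_MINS))
```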
- Reorganized /app/storage for non-volatile storage of logs etc.:
  /app/storage/db-archive, /logs, /rss-output-files, /saved-input-files
- Log files are persistent across container instances, available (e.g. for tail) on the host without docker shenanigans in /var/lib/dokku/data/storage/....
- API server:
  - New endpoints implemented:
    - /api/feeds/N returns None or dict
    - /api/sources/N/feeds returns list of dicts
  - Enhanced endpoints:
    - /api/version return data now includes "git_rev"
    - /api/feeds/N/history takes optional limit=N query parameter
  - Non-API endpoint for RSS files:
    - /rss/FILENAME
- New feeds table columns:

  | column             | use                                                  |
  |--------------------|------------------------------------------------------|
  | http_etag          | Saved data from HTTP response ETag: header           |
  | http_last_modified | Saved data from HTTP response Last-Modified: header  |
  | next_fetch_attempt | Next time to attempt to fetch the feed               |
  | queued             | TRUE if the feed is currently in the work queue      |
  | system_enabled     | Set to FALSE by fetcher after excess failures        |
  | update_minutes     | Update period advertised by feed                     |
  | http_304           | HTTP 304 (Not Modified) response seen from server    |
  | system_status      | Human readable result of last fetch attempt          |

  Also: last_fetch_failures is now a float, incremented by 0.5 for "soft" errors that might resolve given some (more) time. A sketch of how the conditional-fetch columns are used follows.
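A sketch of how the http_etag / http_last_modified / http_304 columns plug into conditional fetching; this is standard HTTP If-None-Match / If-Modified-Since usage, with the feed object and session handling illustrative rather than the repo's actual code.

```python
import requests

def conditional_fetch(session: requests.Session, feed) -> requests.Response | None:
    headers = {}
    if feed.http_etag:
        headers["If-None-Match"] = feed.http_etag
    if feed.http_last_modified:
        headers["If-Modified-Since"] = feed.http_last_modified
    resp = session.get(feed.url, headers=headers, timeout=30)
    if resp.status_code == 304:          # not modified since last fetch
        feed.http_304 = True
        return None
    # Remember validators for the next conditional request.
    feed.http_etag = resp.headers.get("ETag")
    feed.http_last_modified = resp.headers.get("Last-Modified")
    return resp
```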
- Archiver process: run from crontab; archives fetch_event and stories rows based on configuration settings.
- Reports statistics via dokku-graphite plugin, displayed by grafana.
Handle some more feed and url parsing errors. Update feed title after fetch. Switch database to merged feeds.
Integrate non-news-domain skiplist from mcmetadata library.
Increase default fetch frequency to twice a day.
Pull in more aggressive URL query param removal for URL normalization.
Disable extra verbose debugging. Also update some requirements.
Fix requirements bug by forcing a minimum version of mediacloud-metadata library.
Skip homepage-like URLs.
Safer normalized title/url queries.
Refactored database code to support testing. Also handling failure counting more robustly now.
Properly save and double-check against normalized URLs for uniqueness.
Better testing of RSS generation.
Better handling of missing dates in output RSS files.
Write out our own feed so we can more closely customize error handling and the fields output. Also fix a small URL validity check bug.
Fix bug in function call
Requirements bump.
Don't allow NULL chars in story titles.
Make Celery Backend a configuration option. We default to RabbitMQ for Broker and Redis for Backend because that is a super common setup that seems to scale well.
Small bug fixes.
Add feed history to help debugging, view new FetchEvents objects.
Fix some date parsing bugs by using built-in approach from feed parsing library. Also add some more unit tests.
Added back in a necessary index for fast querying.
More debug logging.
Pretending to be a browser in order to see if it fixes a 403 bug.
Add fetch_events table for history and debugging. Also move title uniqueness check to software (not DB) to allow for empty title fields.
Rewrite main rss fetching task to make logic more obvious, and also try and streamline database handle usage.
Switch to FastApi for returning counts to help debug. See /redoc or /docs for full API documentation and the Open API specification file.
New option to log RSS info to files on disk, controlled via SAVE_RSS_FILES env-var (1 or 0).
Small tweak to skip relative URLs. Also more debug logging.
Fix bug that was checking for duplicate titles across all sources within last 7 days, instead of just within one media source.
Update requirements and fix bug related to overly aggressive marking failures.
Add in more feeds from production server.
Check a normalized story URL and title for uniqueness before saving, like we do on our production system. This is a critical de-duplication step.
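A hedged sketch of that de-duplication check. The normalization here is a stand-in for the mediacloud-metadata library's rules, and the in-memory sets stand in for the database uniqueness lookups actually used.

```python
import re

def normalize_url(url: str) -> str:
    # Placeholder normalization; the real rules are more aggressive
    # (tracking query params, etc.).
    return re.sub(r"^https?://(www\.)?", "", url.strip().lower()).rstrip("/")

def normalize_title(title: str) -> str:
    return re.sub(r"\s+", " ", title.strip().lower())

def is_duplicate(story: dict, seen_urls: set[str],
                 seen_titles_by_source: dict[int, set[str]]) -> bool:
    """Duplicate if the normalized URL was seen anywhere, or the normalized
    title was seen within the same media source."""
    if normalize_url(story["url"]) in seen_urls:
        return True
    titles = seen_titles_by_source.get(story["sources_id"], set())
    return normalize_title(story["title"]) in titles
```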
Generate files for yesterday (not 2 days ago) because that will make delivered results more timely.
Add in new feed. Prep to show some data on website.
More work on concurrency for prod server and related configurations.
Tweaks to RSS file generation to make it more robust.
Query bug fix.
Handle podcast feeds, which don't have links (they have enclosures instead), by ignoring them in the reporting script.
Deployment work for generating daily rss files.
Retry feeds that we tried but didn't respond (up to 3 times in a row before giving up).
Update dependencies to latest
RSS path loaded from env-var
Ignore a whole bunch of errors that are expected ones
Add title and canonical domain to daily feeds
Move the max-feeds-to-fetch-at-a-time limit to an env var for easier config (MAX_FEEDS defaults to 1000).
Restructured queries to try and solve DB connection leak bug.
Production performance-related tweaks.
Make sure duplicate story urls don't get inserted (no matter where they are from). This is the quick solution to making sure an RSS feed with stories we have already saved doesn't create duplicates.
First release, seems to work.