Change Log

v0.16.9 2024-10-12

  • update to mcmetadata v1.1.0 for:
    • insecure_requests_session()
    • is_non_news_domain()
  • update to sitemap-tools v2.0.0
  • mypy.sh: detect changes to mypy-requirements.txt
  • update dokku-scripts/http-proxy.sh for new version of dokku on tarbell
  • handle scheme-less link URLs
  • ran autopep8.sh
  • bring back improvements from web-search dokku-scripts
    • use private config repo
    • update airtable

v0.16.8 2024-09-08

  • fetcher/config.py: remove REDIS_URL, add UNDEAD_FEED{S,_MAX_DAYS}
  • fetcher/tasks.py: if UNDEAD_FEEDS set:
    1. never disable feeds
    2. use UNDEAD_FEED_MAX_DAYS instead of MAXIMUM_BACKOFF_MINS (if failures > MAX_FAILURES)
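
A minimal sketch of the backoff decision described above, assuming illustrative values for the config variables (the real logic lives in fetcher/tasks.py and may differ in detail):

    import datetime as dt

    MAX_FAILURES = 10                 # default since v0.12.13
    DEFAULT_INTERVAL_MINS = 6 * 60    # default fetch interval since v0.12.8
    MAXIMUM_BACKOFF_MINS = 12 * 60    # assumed value, for illustration only
    UNDEAD_FEEDS = True
    UNDEAD_FEED_MAX_DAYS = 30         # assumed value, for illustration only

    def next_retry_delta(failures: float) -> dt.timedelta:
        """How long to wait before the next fetch attempt for a failing feed."""
        cap_mins = MAXIMUM_BACKOFF_MINS
        if UNDEAD_FEEDS and failures > MAX_FAILURES:
            # the feed is never disabled; allow a much longer backoff cap instead
            cap_mins = UNDEAD_FEED_MAX_DAYS * 24 * 60
        mins = min(DEFAULT_INTERVAL_MINS * failures, cap_mins)
        return dt.timedelta(minutes=mins)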

v0.16.7 2024-09-03

  • scripts/gen_daily_story_rss.py: use date range for faster query
  • fetcher/rss/rsswriter.py: read item template once
  • requirements.txt: update to sitemap-tools v1.1.0 w/ HTML detection
  • fetcher/tasks.py: no need for HTML detection

v0.16.6 2024-08-16

  • RSS sync crontab fix

v0.16.5 2024-08-16

  • update dokku-scripts/instance.sh to disable sync of RSS files to S3
  • remove mcweb network, use MCWEB_URL from .prod

v0.16.4 2024-08-15

  • update runtime.txt to python-3.10.14
  • add sources_id index to stories table

v0.16.3 2024-08-02

  • add sitemap parsing

v0.16.2 2024-04-07

  • add HTTP_KEEP_CONNECTION_HEADER (defaults to off); stops akamai https npr.org feeds from timing out

v0.16.1 2024-03-23

  • server/rss_entries.py: add fetched_at

v0.16.0 2024-03-19

  • dokku-scripts/config.sh: require/pick-up MCWEB_TOKEN
  • scripts/gen_daily_story_rss.py: generate items even if feed_url unavailable
  • new server/rss_entries.py: add /api/rss_entries endpoint

v0.15.1 2024-02-15

  • use mcmetadata (v0.12.0) webpages.MEDIA_CLOUD_USER_AGENT

v0.15.0 2023-08-07

  • updated dashboards directory json files
  • create OUTPUT_RSS_DIR if it doesn't yet exist!
  • cleanup from runs of autopep8.sh and mypy.sh
  • add tag to RSS file

v0.14.9 2023-10-16

  • fix check function in scripts/update_feeds.py: was not picking up new feeds!

v0.14.8 2023-10-14

  • redeploy with latest mcmetadata
  • update to python-3.10.13
  • dashboards/rss-fetcher-alerts.json updated
  • dokku-scripts/http-proxy.sh: add helpful message
  • dokku-scripts/install-dokku.sh: fix apt update command
  • dokku-scripts/instance.sh: remove double echo

v0.14.7 2023-05-18

  • added LICENSE (Apache 2.0)
  • dokku-scripts/config.sh: set RSS_OUTPUT_DAYS=90 for production, removed MAX_FEEDS
  • server/sources.py: use queued.isnot(True)
  • .env.template: removed MAX_FEEDS

v0.14.6 2023-05-05

  • fetcher/config.py: lower AUTO_ADJUST_MIN_POLL_MINUTES default to 10 minutes(!)
  • dokku-scripts/instance.sh: fix MCWEB URL, redis removal

v0.14.5 2023-04-24

  • scripts/db_archive.py: fix for SQLAlchemy 2.0
  • gen_daily_story_rss.py: fix for SQLAlchemy 2.0, write to .tmp file and rename
  • dokku-scripts/configure.sh: lower prod/staging worker count to 16
  • updated runtime.txt to Python 3.10.10 for security fixes

v0.14.4 2023-04-23

  • server (API) fixes for SQLAlchemy 2.0

v0.14.3 2023-04-23

  • Raise staging/prod workers to 32
  • Raise default concurrency to 2
  • Fudge SBItem.next_start to avoid extra waits
  • Log feeds in Manager process

v0.14.2 2023-04-23

  • Update to sqlalchemy 2.0, psycopg 3.1
  • Use "rank" in headhunter ready query (from legacy crawler_provider/init.py)
  • All server methods are async

v0.14.1 2023-04-22

  • dokku-scripts cleanup
  • add RSS_FETCH_READY_LIMIT
  • update prod dashboard .json file

v0.14.0 2023-04-22

NOTE! Untuned!! almost certainly queries database more than needed!

  • fully adaptive fetching (adjusts poll_minutes both up and down)
  • replace work queue with direct process management
    • replace scripts/{queue_feeds,worker}.py with scripts/fetcher.py
    • removed fetcher/queue.py
    • added fetcher/{direct,headhunter,scoreboard}.py
  • use official PyPI mediacloud package for scripts/update_feeds.py
  • dokku-scripts improvements:
    • moved dokku instance configuration to config.sh
    • run config.sh from push.sh
    • instance.sh saves INSTANCE_SH_GIT_HASH, checked by push.sh

v0.13.0 2023-03-14

Implement auto-adjustment of Feed.poll_minutes

v0.12.15 2023-02-20

  • dokku-scripts/push.sh: up prod workers from 10 to 12
  • add server/sources.py: add /api/sources/N/stories/{fetched,published}_by_day

v0.12.14 2023-02-17

  • fetcher/stats.py: add break to loops: fix double increments!
  • fetcher/tasks.py:
    • split "dup" from "skipped"
    • always call mcmetadata.urls.is_homepage_url (to detect bad urls early)
    • keep saved_count
    • report queue length as gauge at end of processing
  • scripts/poll_update.py: handle new "N skipped / N dup / N added" reports
  • scripts/queue_feeds.py: add "added2" counter
  • removed unused dokku-scripts/sync-feeds.sh
  • dokku-scripts/push.sh: complain about unknown arguments
  • dokku-scripts/instance.sh: run scripts using "enter"
  • Procfile: remove generator/archiver/update

v0.12.13 2023-02-12

  • Procfile: queue feeds once a minute
  • fetcher/config.py
    • make MAX_FAILURES default 10 (was 4)
    • add SKIP_HOME_PAGES config (default to off)
  • fetcher/tasks.py
    • re-raise JobTimeout in fetch_and_process_feed
    • honor SKIP_HOME_PAGES
  • scripts/poll_update.py
    • update feeds one at a time
    • add --fetches and --max-urls for experimentation

v0.12.12 2023-02-01

  • add HTTP_CONDITIONAL_FETCH config variable (see the sketch after this list)
  • new doc/columns.md -- explain db columns
  • new dokku-scripts/dburl.sh
  • scripts.poll_update: add options for experimentation
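
The HTTP_CONDITIONAL_FETCH variable above enables conditional GETs using the validators saved from earlier responses. A minimal sketch, assuming the stored ETag and Last-Modified values are passed in (not the actual fetcher code):

    import requests

    def conditional_get(url, etag=None, last_modified=None, timeout=30):
        """Fetch url, telling the server what we saw last time."""
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        resp = requests.get(url, headers=headers, timeout=timeout)
        if resp.status_code == 304:
            return None   # unchanged since the last fetch; nothing to parse
        return resp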

v0.12.11 2023-01-28

Fix more parse errors

  • fetcher/tasks.py: feed response.content to feedparser: response.text decodes utf-16 as utf-8 (w/ bad results)
  • dokku-scripts/test-feeds.psql: add feeds w/ doctype html, html, utf-16; remove www.mbc.mw urls (all HTML)
  • CHANGELOG.md: add dates on 0.12.* versions
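
The content-vs-text distinction above matters because requests may guess the wrong charset; feedparser can sniff the encoding itself when given raw bytes. A minimal illustration (the URL is a placeholder):

    import feedparser
    import requests

    resp = requests.get("https://example.com/feed.xml", timeout=30)

    # Good: raw bytes let feedparser read the XML encoding declaration
    # (e.g. UTF-16) and decode correctly.
    parsed = feedparser.parse(resp.content)

    # Risky: resp.text has already been decoded using requests' guess, which
    # can mangle a UTF-16 feed before feedparser ever sees it.
    # parsed = feedparser.parse(resp.text)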

v0.12.10 2023-01-23

Fix spurious parse errors:

  • dokku-scripts/test-feeds.psql: create feeds table with small test set
  • fetcher/config.py: add SAVE_PARSE_ERRORS param
  • fetcher/path.py: add PARSE_ERROR_DIR
  • fetcher/tasks.py: ignore feedparser "bozo"; check "version" only; honor SAVE_PARSE_ERRORS
  • .env.template: add TZ, SAVE_PARSE_ERRORS

v0.12.9 2023-01-20

  • dokku-scripts/instance.sh: speed up deployment, fix mcweb config
  • dokku-scripts/push.sh: vary workers according to instance type; add git:set
  • fetcher/config.py: add FAST_POLL_MINUTES
  • fetcher/database/models.py: comment
  • scripts/poll_update.py:
    • use FAST_POLL_MINUTES
    • don't overwrite poll period if less than or equal
    • add stats
    • take pidfile lock before gathering candidates
  • scripts/queue_feeds.py:
    • order by id % 1001
    • stats for stray catcher

v0.12.8 2023-01-13

Reduce default fetch interval to 6 hours (from 12):

  • fetcher/config.py: change _DEFAULT_DEFAULT_INTERVAL_MINS to 6 hours!
  • dokku-scripts/randomize-feeds.sh: change from 12 to 6 hours

Implement Feed.poll_minutes override, for feeds that publish uniformly short lists of items, with little overlap when polled normally:

  • fetcher/database/models.py: add poll_minutes (poll period override) (currently only set by scripts/poll_update.py)
  • fetcher/database/versions/20230111_1237_add_poll_minutes.py: migration
  • fetcher/tasks.py: implement policy changes to honor poll_minutes
  • scripts/poll_update.py: script to set poll_minutes for "short fast" feeds
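
A rough sketch of the resulting poll-period selection (everything here other than Feed.poll_minutes and update_minutes is an illustrative assumption; the real rules are in fetcher/tasks.py):

    DEFAULT_INTERVAL_MINS = 6 * 60   # default fetch interval (see above)
    MINIMUM_INTERVAL_MINS = 60       # assumed floor, for illustration only

    def poll_period_minutes(feed) -> int:
        """Pick the poll period for a feed."""
        if feed.poll_minutes is not None:
            # explicit override, set by scripts/poll_update.py for "short fast" feeds
            return feed.poll_minutes
        if feed.update_minutes:
            # period advertised by the feed itself, clamped to a floor
            return max(feed.update_minutes, MINIMUM_INTERVAL_MINS)
        return DEFAULT_INTERVAL_MINS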

Administrivia:

  • dokku-scripts/instance.sh:
    • fix/use app_http_url function
    • save feed update script output
    • add crontab entry for poll_updates

Create/use global /storage/lock directory:

  • fetcher/path.py: add LOCK_DIR
  • fetcher/pidfile: use fetcher.path.LOCK_DIR, create if needed
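
A minimal sketch of what a pidfile-style exclusion lock in LOCK_DIR might look like (fcntl-based; the actual fetcher/pidfile.py may be implemented differently):

    import fcntl
    import os
    import sys

    LOCK_DIR = "/app/storage/lock"   # cf. fetcher.path.LOCK_DIR

    def take_lock(name: str):
        """Exit if another copy of this script already holds the lock."""
        os.makedirs(LOCK_DIR, exist_ok=True)   # create the directory if needed
        fd = open(os.path.join(LOCK_DIR, name + ".pid"), "a+")
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            sys.exit(f"{name}: another instance is already running")
        fd.seek(0)
        fd.truncate()
        fd.write(str(os.getpid()))
        fd.flush()
        return fd   # caller keeps this open for the life of the script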

Cleanup:

  • scripts/update_feeds.py: import LogArgumentParser in main

v0.12.7 2023-01-03

  • Procfile: add "update" for update_feeds.py
  • instance.sh:
    • configure rss-fetcher AND mcweb Dokku app networking
    • install crontab entry for update in production
  • Fix mypy complaint about _MAXHEADERS
  • Add MAX_URL -- max URL length to accept
  • Add /api/{feeds,sources}/ID/stories endpoints
  • New: fetcher.pidfile -- create exclusion locks for scripts
  • New: fetcher/mcweb_api.py
  • scripts/update_feeds.py:
    • use fetcher.mcweb_api
    • change defaults
    • use fetcher.pidfile
    • --sleep-seconds takes float
    • add --reset-next-url
    • require created_at
  • fetcher/database/property.py: add logging

v0.12.6 2022-12-26

  • Update User-Agent to the old system string, plus a "+" in front of the URL (rssfeeds.usatoday.com was returning HTML w/ the browser U-A string)
  • Accept up to 1000 HTTP headers in responses (www.lexpress.fr was sometimes sending more than 100?)
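
The 100-header default lives in Python's http.client and is usually raised with a module-level override like the one below; the _MAXHEADERS mention in v0.12.7 suggests this approach, though the exact code isn't shown in the changelog:

    import http.client

    # http.client refuses to parse responses with more than 100 headers by
    # default; some servers (e.g. www.lexpress.fr) send more.
    http.client._MAXHEADERS = 1000   # type: ignore[attr-defined]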

v0.12.5 2022-12-21

  • Add /api/sources/N/fetch-soon (randomizes next_fetch_attempt)
  • Add Feed.rss_title column, always update Feed.name from mcweb
  • Add MAXIMUM_INTERVAL_MINS from meatspace review
  • Add properties.py: section/key/value store
  • scripts/update_feeds.py:
    • update to use mcweb API
    • improve logging
    • use properties
    • add --reset-last-modified
  • autopep8.sh: ignore venv*
  • runtime.txt: update to 3.9.15 due to vulnerability

v0.12.4 2022-12-10

  • /api/stories/by-source endpoint
  • Honor SQLALCHEMY_ECHO for debug
  • Fix exception in parse exception handler!
  • dokku-scripts/push.sh:
    • fix push.log
    • check for push errors
    • add --force-push
  • start of feed syncing scripts (not ready)

v0.12.3 2022-12-10?

  • scripts/queue_feeds.py: fix queuing feeds by number
  • fetcher/tasks.py:
    • ignore feedparser charset (etc) errors
    • detect "temporary" DNS errors, treat as softer than SOFT!
    • use HTTP Retry-After values (see the parsing sketch after this list)
    • only randomize 429 errors, after range checks & backoff scaling
    • don't round failure count multiplier
    • log prev_system_status when clearing last_fetch_failures (to see/understand what errors are transient)
    • Add Feed.last_new_stories column
    • Set system_status to Working when same hash or no change
  • fetcher/rss/item.template: output one RSS item per line
  • dashboards -- NEW: json files for Grafana dashboards
  • scripts/import_feeds.py: add --delete-{fetch-events,stories}
  • dokku-scripts/instance.sh: add per-user .pw file
  • server/auth.py -- NEW HTTP Basic authentication
  • server/{feeds,sources,stories}.py: add authentication
  • server/feeds.py: add /api/feeds/ID/fetch-soon
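
Retry-After can carry either delta-seconds or an HTTP-date; a hedged sketch of turning it into a wait time (illustrative only, not the exact fetcher/tasks.py code):

    import datetime as dt
    from email.utils import parsedate_to_datetime
    from typing import Optional

    def retry_after_seconds(value: str) -> Optional[float]:
        """Return seconds to wait, or None if the header can't be parsed."""
        if value.isdigit():                      # delta-seconds form
            return float(value)
        try:                                     # HTTP-date form
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=dt.timezone.utc)
        now = dt.datetime.now(dt.timezone.utc)
        return max(0.0, (when - now).total_seconds())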

v0.12.2 2022-11-19

  • fetcher/config.py: fix comments
  • doc/deployment.md: update
  • add/use conf.LOG_BACKUP_COUNT
  • fetcher/tasks.py: add "clearing failure count" log message
  • treat HTTP 504 (gateway timeout) as a soft error
  • scripts/db_archive.py: fix log message
  • dokku-scripts/instance.sh: remove obsolete RSS_FILE_PATH variable

v0.12.1 2022-11-09

  • fetcher/config.py: drop TASK_TIMEOUT_SECONDS back to 180
  • fetcher/logargparse.py: fix --logger-level/-L
  • fetcher/tasks.py: clean up exception handling (pull up to fetch_feed); use fresh session for update_feeds; sentry.io issue BACKUP-RSS-FETCHER-67M
  • fetcher/tasks.py: fetches_per_minute returns float
  • fetcher/tasks.py: handle 'always' in _feed_update_period_mins & catch KeyErrors, log exceptions, log unknown period names
  • dokku-scripts/push.sh: fix VERSION extraction; make more verbose; require staging & prod to be pushed only to mediacloud
  • scripts/db_archive.py: compress stories on the fly, fix headers, add .csv
  • scripts/queue_feeds.py: refactor to allow more command line params and fix command line feeds; move FetchEvent creation & feed update to queue_feeds; multiply fetches_per_minute before rounding (used to truncate, then multiply).
  • scripts/db_archive.py: use max(RSS_OUTPUT_DAYS, NORMALIZED_TITLE_DAYS) for story_days default. Display default values in help message.
  • NEW: dokku-scripts/randomize-feeds.sh: randomize feed.next_fetch_attempt times
  • NEW: dokku-scripts/clone-db.sh: clone production database & randomize
  • doc/deployment.md: update
  • scripts/queue_feeds.py: if qlen==0 but db_queue!=0, clear queued feeds (fix leakage).
  • fetcher/tasks.py: clear queued on insane feeds (stop leakage).

v0.12.0 2022-11-07

Major raking by Phil Budne

  • runtime.txt updated to python-3.9.13 (security fixes)

  • autopep8.sh runs autopep8 -a -i on all files (except fetcher/database/versions/*.py)

  • mypy.sh installs and runs mypy in a virtual env. RUNS CLEANLY!

  • All scripts take uniform command line arguments for logging, initialization, help and version (in "fetcher.logargparse"):

    -h, --help            show this help message and exit
    --verbose, -v         set default logging level to 'DEBUG'
    --quiet, -q           set default logging level to 'WARNING'
    --list-loggers        list all logger names and exit
    --log-config LOG_CONFIG_FILE
                          configure logging with .json, .yml, or .ini file
    --log-file LOG_FILE   log file name (default: main.pid.310509.log)
    --log-level {critical,fatal,error,warn,warning,info,debug,notset}, -l {critical,fatal,error,warn,warning,info,debug,notset}
                          set default logging level to LEVEL
    --no-log-file         don't log to a file
    --logger-level LOGGER:LEVEL, -L LOGGER:LEVEL
                          set LOGGER (see --list-loggers) verbosity to LEVEL (see --level)
    --set VAR=VALUE, -S VAR=VALUE
                          set config/environment variable
    --version, -V         show program's version number and exit
    
  • fetcher.queue abstraction

    All queue access is abstracted into fetcher.queue, using "rq" for the work queue (only redis is needed, and queue length can be monitored). Saving of "result" (i.e. celery backend) data is disabled, since we only queue jobs "blind" and never check for returned function results (although queue_feeds in --loop mode could poll for results).
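
    A minimal example of this rq pattern (queue name, function path, and feed id below are illustrative, not the project's actual values):

        from redis import Redis
        from rq import Queue

        queue = Queue("feeds", connection=Redis())

        # Jobs are queued "blind": result_ttl=0 discards the return value,
        # so no result-backend storage is needed.
        queue.enqueue("fetcher.tasks.fetch_and_process_feed", 12345, result_ttl=0)

        # Queue length is directly observable for monitoring:
        print(queue.count)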

  • All database datetimes stored without timezones.

  • "fetcher" module (fetcher/init.py) stripped to bare minimum

    (version string and fetching a few environment variables)
    
  • All config variables in fetcher.config "conf" object

    provides mechanisms for specifying optional, boolean, integer params.

  • Script startup logging

    All script startup logging includes the script name and the Dokku-deployed git hash, followed by logging ONLY the configuration values that are actually referenced.

  • All scripts log to BASE/storage/logs/APP.DYNO.log

    Files are rotated at midnight (to filename.log.YYYY-MM-DD); seven files are kept.

  • All file path information in "fetcher.path"

  • Common Sentry integration in "fetcher.sentry"

      enables passing environment="staging", fastapi support, and rq integration
    
  • SQLAlchemy "Session" factory moved to "fetcher.database"

      so db params only logged if db access used/needed
    
  • All Procfile entries invoke existing ./run-....sh scripts

      Only one place to change how a script is invoked.
    
  • "fetcher" process (scripts/queue_feeds.py) runs persistently (no longer invoked by crontab) [enabled by --loop PERIOD in Proctab] and: reports statistics (queue length, database counts, etc)

    • queues ready feeds every PERIOD minutes.

      queues only the number of feeds necessary to cover a day's fetch attempts, divided into equal-sized batches (based on the advertised update rate of active, enabled feeds, and on config); see the arithmetic sketch after this list

    • Allows any number of feed id's on command line.

    • Operates as before (queues MAX_FEEDS feeds) if invoked without feed ids or --loop.

    • Clears queue and exits given --clear
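
    A back-of-the-envelope version of that batch sizing (purely illustrative; the real calculation in scripts/queue_feeds.py also accounts for config limits):

        def batch_size(fetches_per_day: float, period_minutes: float) -> int:
            """Feeds to queue each period so a day's fetches are spread evenly."""
            batches_per_day = 24 * 60 / period_minutes
            return round(fetches_per_day / batches_per_day)

        # e.g. 100,000 fetch attempts/day queued every 5 minutes:
        # 288 batches/day -> roughly 347 feeds per batch
        print(batch_size(100_000, 5))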

  • Queue "worker" process started by scripts/worker.py takes common logging arguments, stats connection init runs a single queue worker (need to use dokku ps:scale worker=8).

    workers set process title when active, visible by ps, top:

    pbudne@ifill:~$ ps ax | grep pbudne-rss-fetcher
    4121658 ?        Rl    48:13 pbudne-rss-fetcher worker.1 feed 2023073
    4124300 ?        Rl    48:25 pbudne-rss-fetcher worker.2 feed 122482
    4127145 ?        Sl    47:34 pbudne-rss-fetcher worker.3 feed 1461182
    4129593 ?        Sl    49:49 pbudne-rss-fetcher worker.4 feed 459899
    
  • import_feeds script gives each feed a random "next_fetch_attempt" time to (initially) spread workload throughout the minimum requeue time interval.

  • Reorganized /app/storage for non-volatile storage of logs etc.:

    /app/storage/db-archive
                /logs
                /rss-output-files
                /saved-input-files
    
  • Log files are persistent across container instances, and available (e.g. for tail) on the host without docker shenanigans in /var/lib/dokku/data/storage/....

  • API server:

    • New endpoints implemented:
      • /api/feeds/N returns None or dict
      • /api/sources/N/feeds returns list of dicts
    • Enhanced endpoints:
      • /api/version return data now includes "git_rev"
      • /api/feeds/N/history takes optional limit=N query parameter
    • Non-API endpoint for RSS files:
      • /rss/FILENAME
  • New feeds table columns

    column              use
    http_etag           Saved data from HTTP response ETag: header
    http_last_modified  Saved data from HTTP response Last-Modified: header
    next_fetch_attempt  Next time to attempt to fetch the feed
    queued              TRUE if the feed is currently in the work queue
    system_enabled      Set to FALSE by fetcher after excess failures
    update_minutes      Update period advertised by feed
    http_304            HTTP 304 (Not Modified) response seen from server
    system_status       Human readable result of last fetch attempt

    Also: last_fetch_failures is now a float, incremented by 0.5 for "soft" errors that might resolve given some (more) time.
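
    In SQLAlchemy terms the additions look roughly like the sketch below (column types here are assumptions; the authoritative definitions are in fetcher/database/models.py):

        from sqlalchemy import Boolean, Column, DateTime, Float, Integer, String
        from sqlalchemy.orm import declarative_base

        Base = declarative_base()

        class Feed(Base):
            __tablename__ = "feeds"
            id = Column(Integer, primary_key=True)
            http_etag = Column(String)             # saved ETag: response header
            http_last_modified = Column(String)    # saved Last-Modified: header
            next_fetch_attempt = Column(DateTime)  # next time to try this feed
            queued = Column(Boolean)               # currently in the work queue?
            system_enabled = Column(Boolean)       # cleared after excess failures
            update_minutes = Column(Integer)       # update period advertised by feed
            http_304 = Column(Boolean)             # server returned 304 Not Modified
            system_status = Column(String)         # human-readable last-fetch result
            last_fetch_failures = Column(Float)    # +0.5 for "soft" errors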

  • Archiver process

    Run from crontab: archives fetch_event and stories rows based on configuration settings.

  • Reports statistics via dokku-graphite plugin, displayed by grafana.

v0.11.12 2022-08-22

Handle some more feed and url parsing errors. Update feed title after fetch. Switch database to merged feeds.

v0.11.11 2022-08-12

Integrate non-news-domain skiplist from mcmetadata library.

v0.11.10 2022-08-04

Increase default fetch frequency to twice a day.

v0.11.9 2022-08-02

Pull in more aggressive URL query param removal for URL normalization.

v0.11.8 2022-08-02

Disable extra verbose debugging. Also update some requirements.

v0.11.7 2022-08-02

Fix requirements bug by forcing a minimum version of mediacloud-metadata library.

v0.11.6 2022-07-31

Skip homepage-like URLs.

v0.11.5 2022-07-27

Safer normalized title/url queries.

v0.11.4 2022-07-27

Refactored database code to support testing. Also handling failure counting more robustly now.

v0.11.3 2022-07-27

Properly save and double-check against normalized URLs for uniqueness.

v0.11.2 2022-07-27

Better testing of RSS generation.

v0.11.1 2022-07-27

Better handling of missing dates in output RSS files.

v0.11.0 2022-07-27

Write out our own feed so we can more closely customize error handling and which fields are output. Also a small URL validity check bug fix.

v0.10.5 2022-07-25

Fix bug in function call

v0.10.4 2022-07-25

Requirements bump.

v0.10.3 2022-07-19

Don't allow NULL chars in story titles.

v0.10.2 2022-07-19

Make Celery Backend a configuration option. We default to RabbitMQ for Broker and Redis for Backend because that is a super common setup that seems to scale well.
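
For context, this broker/backend split is wired up in Celery roughly as below (URLs and env-var names are placeholders, not the project's actual settings):

    import os

    from celery import Celery

    app = Celery(
        "rss-fetcher",
        broker=os.getenv("BROKER_URL", "amqp://guest:guest@localhost:5672//"),
        backend=os.getenv("BACKEND_URL", "redis://localhost:6379/0"),
    )

    @app.task
    def fetch_feed(feed_id: int) -> None:
        ...  # the real task lives in fetcher/tasks.py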

v0.10.1 2022-07-18

Small bug fixes.

v0.10.0 2022-07-15

Add feed history to help debugging, view new FetchEvents objects.

v0.9.4 2022-07-15

Fix some date parsing bugs by using built-in approach from feed parsing library. Also add some more unit tests.

v0.9.3 2022-07-14

Added back in a necessary index for fast querying.

v0.9.2 2022-07-14

More debug logging.

v0.9.1 2022-07-14

Pretending to be a browser in order to see if it fixes a 403 bug.

v0.9.0 2022-07-14

Add fetch_events table for history and debugging. Also move title uniqueness check to software (not DB) to allow for empty title fields.

v0.8.1 2022-07-14

Rewrite main rss fetching task to make logic more obvious, and also try and streamline database handle usage.

v0.8.0 2022-07-14

Switch to FastApi for returning counts to help debug. See /redoc, or /docs for full API documentation and Open API specification file.

v0.7.5 2022-07-11

New option to log RSS info to files on disk, controlled via SAVE_RSS_FILES env-var (1 or 0)

v0.7.4 2022-07-07

Small tweak to skip relative URLs. Also more debug logging.

v0.7.3 2022-07-06

Fix bug that was checking for duplicate titles across all sources within last 7 days, instead of just within one media source.

v0.7.2 2022-07-06

Update requirements and fix bug related to overly aggressive marking failures.

v0.7.1 2022-06-02

Add in more feeds from production server.

v0.7.0 2022-05-26

Check a normalized story URL and title for uniqueness before saving, like we do on our production system. This is a critical de-duplication step.

v0.6.1 2022-05-20

Generate files for yesterday (not 2 days ago) because that will make delivered results more timely.

v0.6.0 2022-05-16

Add in new feed. Prep to show some data on website.

v0.5.5 2022-04-28

More work on concurrency for prod server and related configurations.

v0.5.4 2022-04-27

Tweaks to RSS file generation to make it more robust.

v0.5.3 2022-04-27

Query bug fix.

v0.5.2 2022-04-27

Handle podcast feeds, which don't have links (they have enclosures instead), by ignoring them in the reporting script.

v0.5.1 2022-04-27

Deployment work for generating daily rss files.

v0.5.0 2022-04-27

Retry feeds that we tried but didn't respond (up to 3 times in a row before giving up).

v0.4.0 2022-04-27

Update dependencies to latest

v0.3.2 2022-03-25

RSS path loaded from env-var

v0.3.1 2022-03-11

Ignore a whole bunch of errors that are expected ones

v0.3.0 2022-03-11

Add title and canonical domain to daily feeds

v0.2.1 2022-02-19

Move the limit on how many feeds to fetch at a time to an env var for easier config (MAX_FEEDS, defaults to 1000).

v0.2.0 2022-02-19

Restructured queries to try and solve DB connection leak bug.

v0.1.2 2022-02-18

Production performance-related tweaks.

v0.1.1 2022-02-18

Make sure duplicate story urls don't get inserted (no matter where they are from). This is the quick solution to making sure an RSS feed with stories we have already saved doesn't create duplicates.
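
A common way to enforce this at the database layer is a unique index on the story URL plus a conflict-ignoring insert; a sketch under that assumption (not necessarily how this release implemented it):

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Story(Base):
        __tablename__ = "stories"
        id = Column(Integer, primary_key=True)
        url = Column(String, unique=True)   # duplicates are rejected here
        title = Column(String)

    def save_story(session: Session, url: str, title: str) -> None:
        """Insert a story, silently skipping URLs we have already saved."""
        stmt = (
            insert(Story)
            .values(url=url, title=title)
            .on_conflict_do_nothing(index_elements=["url"])
        )
        session.execute(stmt)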

v0.1.0 2022-02-18

First release, seems to work.