
2025 Q1 Roadmap #174

Open
6 of 24 tasks
Mr0grog opened this issue Feb 19, 2025 · 0 comments
We’re already halfway through Q1, but I wanted to write out for myself a quick roadmap of priorities and critical issues for the short term, since this project was in hibernation for a while and is now back in active use.

I’ve also made a GH project tracking all ongoing work. This document outlines critical priorities and buckets of work, while the project tracks detailed work in progress: https://github.com/orgs/edgi-govdata-archiving/projects/32/views/1

Overall Status

EDGI’s web monitoring efforts and most of the technical codebases here have been more-or-less shut down for the past 2+ years, but are now back in active use. The last month and a half has been a bit of a scramble to get everything running well again and working within our current constraints. Additionally, there is no technical team — it’s just me (@Mr0grog). We previously operated with several active technical team members, which influenced a lot about how these codebases are broken up and organized. I don’t currently expect this to change.

Given that, prioritization is pretty important. There’s a lot that probably won’t get done or at least won’t get much focus.

Core Near-Term Goals

Critical Issues:

  • Work out data sourcing with IA. We used to rely on them to do all the canonical crawling/archiving, but there have been a whole host of issues with how that’s been managed this time around. We are currently doing our own crawling in addition.
    • Support importing data directly from WARCs (both for our own crawls and as a more expedient way of loading IA data). (Warcs have useful content too web-monitoring-processing#858)
    • Automate and operationalize our crawls if needed. Need to script and schedule all the pieces I am doing manually M/W/F nights:
      • Keep seed lists updated.
      • Crawl seeds with Browsertrix-Crawler.
      • Resume crawls that died in the middle.
      • Upload crawl output to S3.
      • Upload crawl output to IA.
      • Import crawl data to web-monitoring-db.
      • Analyze logs for errors that need to be handled specially because the server is not responding over HTTP (see below).
  • Automate generation of weekly task sheets. This needs to not rely on me being around and able to run the script.
  • Re-evaluate current deployment for cost optimization. Can we arrange things differently in AWS? Can we use reservations to get cheaper pricing on some resources? etc. Funding is overall down, and this infra is not free.
  • Record versions where the server is no longer responding via HTTP (could be DNS resolution issues, TLS issues, server prematurely closing connections, etc.). We previously had some loose ways of recording page load errors when we were based on Versionista and PageFreezer; we hadn’t realized we were losing the ability to discover and record network errors when we switched to pure Internet Archive as a data source (we had stopped seeing servers just disappear at that point). Now that we’re seeing servers disappear again, this is a big issue.
    • Support recording these network errors in the DB/API. (Add network_error field to Version model web-monitoring-db#1184)
    • Write a tool to check for these (since we don’t proactively find out about them from IA) and import them to the DB.
      • I am currently adding records by hand with the source type edgi_statuscheck_v0 after examining the logs from our crawls.
      • A tool will need to be slow and careful about how it checks (maybe make repeated checks) since these errors are as likely to be bot blocking, firewall rules, proxy issues, etc. as they are to be legit servers going down.
      • The tool could check the crawl logs for candidates like I am currently doing (as long as we keep doing our own crawls).
      • Or it could schedule checks based on seeds with no data coming from IA when we do nightly imports.
      • Or it could check the DB for active pages with no new versions in the last N days.
      • Or some separate system for listing things it should check?
    • Schedule the tool to run automatically.
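
    As a rough sketch (not part of any existing codebase), the careful-checking behavior described above might look like the following. The function and category names here are made up for illustration; the actual values stored in the DB would come from the `network_error` field work in web-monitoring-db#1184. The key ideas from the list above are encoded directly: an HTTP error status still counts as the server responding, and only a failure on every spaced-out attempt is reported, to reduce the odds of mistaking bot blocking or transient proxy/firewall issues for a dead server.

    ```python
    import socket
    import ssl
    import time
    import urllib.error
    import urllib.request


    def classify_network_error(exc):
        """Map an exception from urllib into a coarse error category."""
        reason = getattr(exc, "reason", exc)  # URLError wraps the real cause
        if isinstance(reason, socket.gaierror):
            return "dns_failure"
        if isinstance(reason, ssl.SSLError):
            return "tls_error"
        if isinstance(reason, (ConnectionResetError, ConnectionRefusedError)):
            return "connection_failed"
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unknown_network_error"


    def check_url(url, attempts=3, delay=60, timeout=30):
        """Probe a URL several times; report an error only if *every* attempt fails.

        Any response over HTTP (even a 4xx/5xx status) means the server is still
        responding, so we return None rather than record a network error.
        """
        last_error = None
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout):
                    return None
            except urllib.error.HTTPError:
                return None  # An HTTP error status is still an HTTP response.
            except (urllib.error.URLError, OSError) as exc:
                last_error = classify_network_error(exc)
            if attempt < attempts - 1:
                time.sleep(delay)  # Space out retries to dodge transient blocks.
        return last_error
    ```

    A scheduled job could feed this candidates from any of the sources listed above (crawl logs, seeds with no IA data, stale pages in the DB) and write the surviving failures to the DB with a source type like the hand-entered edgi_statuscheck_v0 records.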

Cleanup:

Important but Non-Critical:

Plus ongoing bug-fixing for analysts as they find problems.

Other Nice Stuff or Ideas

Not sure it’s likely I’ll get to any of this given how big the core stuff above is. But some things on my mind:

  • Minimize web-monitoring-ui or even merge it into web-monitoring-db.

    The way these are split up makes a lot of things very hard to do or get working, makes infrastructure harder (more things to deploy, more CPU + memory requirements), and invites all kinds of weird little problems (CORS, cross-origin caching, login complexity, etc.). We originally designed things as microservices because of the way the team was structured and the skills people had, but that turned out to be over-ambitious in practice (in my opinion). Today, it’s an even more active problem when just one person is maintaining them all.

    We also had all kinds of ambitious ideas about the UI project giving analysts a direct interface to their task lists, being able to post their annotations/analysis directly to the DB, and so on. This never got done, and would require a lot more work, both in the UI and in ways for the analysts to get their data back out or query it, before it would ever be better than the analysts working directly with spreadsheets as they do today. As things stand today, this stuff would be neat, but I don’t think it is ever going to get done.

    If we drop all these ideas, the UI really doesn’t need to be nearly as complex or special as it is. It also doesn’t need its own server. At the simplest, it could be served as a static site (via GH pages, from CloudFront/S3, or even just from the API server as an asset). It could also just be normal front-end code in the web-monitoring-db server, but that requires a lot more rewriting and rethinking (it does pave a nicer, more monolithic path back towards including annotations/analysis forms for analysts, though).

    This would be some nice cleanup, but could turn into a big project. So a bit questionable.

  • Consider whether web-monitoring-db should be rewritten in Python, and be more monolithic. The above stuff about merging away web-monitoring-ui feeds directly into this. Web-monitoring-db is really the odd duck here, written in Ruby and Rails while everything else is Python (or JS if it’s front-end). This was originally done because the first stuff I helped out with at EDGI was Ruby-based, and I thought there was a crew of Ruby folks who would be contributing. That turned out not to be true. I think Rails is fantastic, but the plethora of languages and frameworks here has historically made contributing to this project very hard. Rewriting it in Python would also make it easier to pull other pieces (e.g. the differ, all the import and processing scripts, all the task sheet stuff) together, and would reduce some code duplication.

    I don’t expect this to go anywhere — this project is probably much too big and unrealistic at this point. But I want to log it.

  • Get rid of Kubernetes. It’s been clear to me for several years now that managing your own Kubernetes cluster is not worthwhile for a project of this size. (I’m not sure it’s worthwhile for any org that cannot afford a dedicated (dev)ops/SRE person to own it.) Managed Kubernetes (AWS EKS, Google GKE, etc.) is better, but also still tends to be more complicated and obtuse than an infrastructure provider’s own stuff (e.g. AWS ECS+Fargate).

    This is also a big project on its own that probably won’t happen. Additionally, it could end up more expensive than the current situation (we have our services very efficiently and tightly packed into 3 EC2 instances, and you can’t make packing decisions that are quite as granular on ECS, for example), although there are other management tradeoffs.

    Note that a simplified, more monolithic structure as discussed above also makes it easier to run this project on other systems/services/infrastructure types. BUT we are probably somewhat coupled to AWS at this point, where all our data is.

@Mr0grog Mr0grog self-assigned this Feb 19, 2025
@Mr0grog Mr0grog moved this from Inbox to In Progress in Web Monitoring Feb 19, 2025
@Mr0grog Mr0grog pinned this issue Feb 19, 2025
Mr0grog added a commit that referenced this issue Feb 20, 2025
The project is operational again, so the not-actively-maintained banner in the README is no longer accurate. This also updates the project board link to the correct URL.

Part of #174.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Feb 20, 2025
The project is operational again, so the not-actively-maintained banner in the README is no longer accurate. This also updates the project board link to the correct URL.

Part of edgi-govdata-archiving/web-monitoring#174.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Feb 20, 2025
The project is operational again, so the not-actively-maintained banner in the README is no longer accurate. This also updates the project board link to the correct URL.

Part of edgi-govdata-archiving/web-monitoring#174.