
Solid Queue Integration #199

Merged
merged 9 commits into main from ca-solid-queue
Apr 25, 2024

Conversation

Member

@carlosantoniodasilva commented Apr 22, 2024

This adds an adapter to integrate with SolidQueue, reporting job queue time and busy metrics (if enabled) to Judoscale for autoscaling.

SolidQueue is currently on v0.3, still pretty early on, and there are still some things being figured out, but there's early adoption and we expect more as it becomes a Rails recommendation / default in the future. It only works with Rails 7.1+ and Ruby 2.7+, so that's what this adapter will support initially.

We'll be collecting queue time / latency via the "ready executions" table, and busy via the "claimed executions" table.

SolidQueue moves jobs between different tables as they change "status". In other words, while every job has a representation in the main "jobs" table, it also gets a record in an associated table that represents what's happening to it: when it's ready to be picked up for work, it goes to "ready executions"; when it's claimed by a worker process to be performed, it goes to "claimed executions"; and if there's a failure (that's not retried by Active Job), it goes to "failed executions". If it's scheduled to run in the future, it goes to "scheduled executions" (the same happens when it's being retried by Active Job, which essentially re-schedules it in the future until it succeeds, or gives up retrying and blows up back to SolidQueue.)

When a job finishes successfully, it's flagged with a "finished_at" value in the main "jobs" table. As a job moves from one "execution" status to the next in the workflow, its previous record is destroyed, so there should only ever be one of those "execution" representations at any point in time. (i.e. a job is either scheduled, ready, claimed, or failed)
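
To make the lifecycle concrete, here's a minimal sketch of how a job's current state maps to those tables, assuming the `has_one` associations SolidQueue defines for each execution type (the exact association names are worth double-checking against the SolidQueue source):

    # Sketch only: infer a job's state from which execution record exists.
    job = ::SolidQueue::Job.last

    state =
      if job.finished_at.present?
        :finished   # done; only the "jobs" row remains until cleanup
      elsif job.claimed_execution
        :claimed    # currently being worked by a worker process
      elsif job.ready_execution
        :ready      # waiting to be picked up
      elsif job.scheduled_execution
        :scheduled  # due in the future (includes Active Job retries)
      elsif job.failed_execution
        :failed     # errored outside of Active Job's retries
      end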

There's also the concept of recurring executions, which are created via config (a cron-like setup) and eventually get added to "ready executions" on every recurrence.

And finally, there's one thing I need to look into a bit more: blocked executions. It seems you can add a concurrency limit to jobs, which may lock certain jobs from running (if they're concurrency-limited by a certain condition) and will move them to a separate "blocked executions" table. I'd like to test this more, because I'm wondering whether we need to check this table as well when calculating queue time.

Todo / Questions

  • Investigate "blocked executions" / concurrency limits, to determine whether they should be added to the queue time / latency.
    • I've been playing with this some, and it works as you'd expect: you can set up a job with a concurrency limit, i.e. run only one job at a time, or one job with a given set of arguments, or up to X jobs concurrently, etc. If more jobs are enqueued, instead of going to "ready" they go to "blocked". When jobs finish, they check for blocked jobs to unblock, and there's also an additional dispatcher that checks for blocked jobs on a schedule. (See the sketch after the sample query below.)
    • While initially I thought it'd make sense to consider these for the latency calculation, the more I thought about it and played with it, the more I realized that having a big list of blocked jobs doesn't mean a need to autoscale: you might simply be limiting the concurrency of those jobs to the point where many get enqueued at certain moments but only a few get processed, due to the limits imposed. This can cause the blocked executions table to grow temporarily, giving those blocked jobs an "increased latency", but autoscaling up would be wrong in this case, since more processing power won't make those jobs complete any faster -- they're still bound by their concurrency setup. In other words, I'm thinking that autoscaling should only look at jobs in the "ready executions" table initially.
Sample query I was playing with, for reference

          ::SolidQueue::Job
            .left_joins(:blocked_execution, :ready_execution)
            .where("#{::SolidQueue::BlockedExecution.table_name}.id IS NOT NULL OR #{::SolidQueue::ReadyExecution.table_name}.id IS NOT NULL")
            .group("#{::SolidQueue::Job.table_name}.queue_name")
            .minimum("coalesce(#{::SolidQueue::BlockedExecution.table_name}.created_at, #{::SolidQueue::ReadyExecution.table_name}.created_at)")
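
For reference, here's a sketch of the kind of concurrency-limited job described above, using SolidQueue's `limits_concurrency` macro (the job class, arguments and limits are made up for illustration):

    # Sketch: at most one of these jobs per account runs at a time; extra
    # enqueued jobs land in "blocked executions" until the lock is released.
    class DeliverNotificationJob < ApplicationJob
      queue_as :default

      limits_concurrency to: 1, key: ->(account) { account.id }, duration: 5.minutes

      def perform(account)
        # ...
      end
    end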

It was pointing to a non-existent `good_job_active_record`; the library
is `good_job`, which is currently available on v3+.
@carlosantoniodasilva force-pushed the ca-solid-queue branch 2 times, most recently from e276263 to 7916eab on April 22, 2024 at 20:23
This is a copy of the other sample apps, but upgraded to Rails v7.1 in
order to actually install Solid Queue.
The sample setup helps ensure the reporting works.

We can install mission control manually if we want to inspect jobs, but
I've left it commented out because right now it relies on Rails and
sprockets and we don't really need those dependencies most of the time.
@carlosantoniodasilva changed the title from "[WIP] Solid Queue Integration" to "Solid Queue Integration" on Apr 23, 2024
@carlosantoniodasilva marked this pull request as ready for review on April 23, 2024 at 21:23
Member Author

@carlosantoniodasilva left a comment

@adamlogic the SolidQueue integration seems to be working well so far, sending it your way for an initial look.

gemfile:
- Gemfile
ruby:
- "2.7"
Member Author

SolidQueue only works with Rails 7.1+, and so only Ruby 2.7+.

@@ -10,6 +10,7 @@
require "action_controller"

class TestRailsApp < Rails::Application
config.load_defaults "#{Rails::VERSION::MAJOR}.#{Rails::VERSION::MINOR}"
Member Author

Setting defaults from the current Rails version eliminates a warning about the old cache format version.

spec.required_ruby_version = ">= 2.7.0"

spec.add_dependency "judoscale-ruby", Judoscale::SolidQueue::VERSION
spec.add_dependency "solid_queue", ">= 0.3"
Member Author

0.3 is the latest version at the moment; I think it's best to use it as the requirement.

super

queue_names = run_silently do
::SolidQueue::Job.distinct.pluck(:queue_name)
Member Author

Querying from the job model will include queues from already finished or failed jobs -- basically all known SolidQueue jobs that haven't been deleted yet. (They have documented a way to run a cleanup task that deletes finished jobs after a day, but it isn't run automatically yet.)
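
If we ever want to narrow that down, a sketch of a query that skips finished jobs (not what the adapter does today) would be:

    # Finished jobs keep their row in the jobs table (with finished_at set)
    # until the cleanup task deletes them, so filter on finished_at.
    ::SolidQueue::Job.where(finished_at: nil).distinct.pluck(:queue_name)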

time = Time.now.utc

oldest_execution_time_by_queue = run_silently do
::SolidQueue::ReadyExecution.group(:queue_name).minimum(:created_at)
Member Author

Jobs move to the "ready execution" table when they're ready to be picked up by a worker and processed. (So if you enqueue a job right now, it creates both a "job" and a "ready execution" record, but if you schedule one in the future, it creates a "job" and a "scheduled execution" instead -- which later gets moved to "ready execution" when it's time to run it.)
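
For illustration, a sketch of how that per-queue minimum becomes a queue-time value in seconds (variable names mirror the adapter code above; the exact rounding and reporting are elided):

    time = Time.now.utc
    oldest_execution_time_by_queue = ::SolidQueue::ReadyExecution.group(:queue_name).minimum(:created_at)

    latency_by_queue = oldest_execution_time_by_queue.transform_values do |oldest_created_at|
      time - oldest_created_at  # seconds the oldest ready job has been waiting
    end
    # e.g. { "default" => 12.3, "mailers" => 0.4 }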


if track_busy_jobs?
busy_count_by_queue = run_silently do
::SolidQueue::Job.joins(:claimed_execution).group(:queue_name).count
Member Author

Jobs move from "ready execution" to "claimed execution" when they're picked up by a worker, which deletes the "ready" record. Once finished, the "job" is tagged with a "finished_at" value and the "claimed" record is deleted. If it failed, a "failed execution" record is also created.
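
A small sketch of how the busy counts can be combined with the known queue names, so queues with nothing claimed still report zero (how exactly the adapter merges these is left out here):

    queue_names = ::SolidQueue::Job.distinct.pluck(:queue_name)
    busy_count_by_queue = ::SolidQueue::Job.joins(:claimed_execution).group(:queue_name).count

    queue_names.each do |queue_name|
      busy = busy_count_by_queue.fetch(queue_name, 0)
      # ...report `busy` for `queue_name` to Judoscale
    end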

# It seems we can't only set it on `DatabaseTasks` as expected, need to set on the `Migrator` directly instead.
ActiveRecord::Migrator.migrations_paths += SolidQueue::Engine.config.paths["db/migrate"].existent
# ActiveRecord::Tasks::DatabaseTasks.migrations_paths += SolidQueue::Engine.config.paths["db/migrate"].existent
ActiveRecord::Tasks::DatabaseTasks.migrate
Member Author

I was playing with a way to get the migrations to automatically run up to the latest version and came up with this... not great, but it seems better than copying & pasting the whole migration. (We can replicate it to the other sample apps later, if this doesn't cause any trouble.)
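
If we do replicate it later, it could live in a tiny shared helper along these lines (the helper name is hypothetical):

    # Append an engine's migrations to the migration paths and migrate to latest.
    def migrate_engine(engine)
      ActiveRecord::Migrator.migrations_paths += engine.config.paths["db/migrate"].existent
      ActiveRecord::Tasks::DatabaseTasks.migrate
    end

    migrate_engine(::SolidQueue::Engine)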

# (A `/jobs` route is added via config/routes.rb if `MissionControl` is detected.)
# Note: mission control requires assets, so we also need sprockets-rails here for now.
# gem "mission_control-jobs"
# gem "sprockets-rails"
Member Author

I opted to leave Mission Control commented out; it's nice, but it's not necessary to test the sample app, and it adds the whole rails gem and sprockets as dependencies.
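
For context, the conditional route mentioned in the diff above looks roughly like this (the `defined?` guard is an assumption about how the detection is done):

    # config/routes.rb (sketch)
    Rails.application.routes.draw do
      mount MissionControl::Jobs::Engine, at: "/jobs" if defined?(MissionControl::Jobs)
    end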


# Require only the frameworks we currently use instead of loading everything.
%w(activerecord actionpack actionview railties activejob activemodel).each { |rails_gem|
gem rails_gem, "~> 7.1.0"
Member Author

We can upgrade the other sample apps to 7.1 later. (they're on 7.0)

Collaborator

@adamlogic left a comment

This is working great! I noted one small annoyance in the sample app setup, but I'm not sure how to get around it. Let's not get too hung up on it if you don't have any immediate ideas.

I created a couple of small PRs from this one to consider:

Comment on lines +14 to +20
create_schema "_timescaledb_cache"
create_schema "_timescaledb_catalog"
create_schema "_timescaledb_config"
create_schema "_timescaledb_internal"
create_schema "timescaledb_experimental"
create_schema "timescaledb_information"
create_schema "toolkit_experimental"
Collaborator

These were causing me trouble locally when running db:prepare (via bin/setup):

solid_queue-sample $ bin/rails db:prepare
   (1.5ms)  CREATE SCHEMA "_timescaledb_cache"
bin/rails aborted!
ActiveRecord::StatementInvalid: PG::DuplicateSchema: ERROR:  schema "_timescaledb_cache" already exists (ActiveRecord::StatementInvalid)
/Users/adam/Projects/judoscale-ruby/sample-apps/solid_queue-sample/db/schema.rb:14:in `block in <top (required)>'
/Users/adam/Projects/judoscale-ruby/sample-apps/solid_queue-sample/db/schema.rb:13:in `<top (required)>'

Caused by:
PG::DuplicateSchema: ERROR:  schema "_timescaledb_cache" already exists (PG::DuplicateSchema)
/Users/adam/Projects/judoscale-ruby/sample-apps/solid_queue-sample/db/schema.rb:14:in `block in <top (required)>'
/Users/adam/Projects/judoscale-ruby/sample-apps/solid_queue-sample/db/schema.rb:13:in `<top (required)>'
Tasks: TOP => db:prepare
(See full trace by running task with --trace)

I deleted these lines and ran db:prepare successfully, but the lines were automatically added back to schema.rb.

This is only a problem on first-time setup, but it's annoying. It's also Timescale-specific, which our sample apps don't require at all.

I'm not really sure of any way around it, though. If you try to set up a sample app while connected to Postgres with Timescale enabled, you'll get these lines in your schema. 🤔

Member Author

I think I saw some Timescale stuff dumped into the schema before; some output was also in the good_job sample app and I didn't pay much attention, but now I can see that create_schema only shows up in this one... it turns out it's something that was added in Rails 7.1:

So when recreating the DB, it will try to recreate these schemas and fail... it works with enable_extension because I believe that adds an "IF NOT EXISTS", but create_schema apparently does not do anything like that... maybe it should.

There are some potentially related changes:

Other than monkey-patching Rails / schema dumper, I can't really think of an option right now 🤔
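
For completeness, the kind of monkey-patch I mean would look something like this; not proposing we ship it, and it assumes the PostgreSQL adapter's `quote_schema_name` helper and the `active_record_postgresqladapter` load hook:

    # Sketch: make create_schema calls from schema.rb tolerate existing schemas.
    module CreateSchemaIfNotExists
      def create_schema(schema_name, **options)
        execute("CREATE SCHEMA IF NOT EXISTS #{quote_schema_name(schema_name)}")
      end
    end

    ActiveSupport.on_load(:active_record_postgresqladapter) do
      prepend CreateSchemaIfNotExists
    end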

Collaborator

Thanks for the digging! Let's just leave it for now. I think we're the only ones using these sample apps.

@adamlogic
Collaborator

While initially I thought it'd make sense to consider these for the latency calculation, the more I thought about it and played with it, the more I realized that having a big list of blocked jobs doesn't mean a need to

I agree with your analysis of the scheduled executions table.

I guess one thing we'll need to consider is that if someone scales their workers down to zero, their scheduled/blocked executions will never run. That's more of an app UX consideration... just thinking aloud here.

This enables the "/jobs" endpoint for viewing more details about jobs.
This maps the queue priority to the queues we use for jobs in the sample app.
@carlosantoniodasilva
Member Author

I guess one thing we'll need to consider is that if someone scales their workers down to zero, their scheduled/blocked executions will never run. That's more of an app UX consideration... just thinking aloud here.

Makes sense, but I think that's a general consideration for jobs/workers that's not specific to SolidQueue; one could argue the same is true for Sidekiq, for example, and its unique enterprise features.

@carlosantoniodasilva merged commit 30d95b9 into main on Apr 25, 2024
120 checks passed
@carlosantoniodasilva deleted the ca-solid-queue branch on April 25, 2024 at 18:46