[batch] Compact And Drop Records from job_group_inst_coll_cancellable_resources
#14645
Conversation
…e_resources` Resolves: hail-is#14623

```sql
) AS R ON TRUE
WHERE G.time_completed IS NOT NULL
  AND C.id IS NULL
LIMIT 1000;
```
Calling out the limit on both queries. Other queries also seem to limit to 1000, but I'm not sure where that number comes from. Without compacting, the query to find compacted rows takes forever as it scans through a large chunk of the db. On the other hand, there are millions of rows, so reducing this number would make the background task take longer to churn through records. Suggestions?
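To make the trade-off concrete, here's a minimal sketch of the batched-cleanup pattern being discussed: each pass touches at most `limit` rows so the transaction stays small, looping until a pass finds nothing. The table name, schema, and `usage = 0` predicate are illustrative stand-ins, not the real Batch schema.

```python
import sqlite3

def cleanup_in_batches(conn, limit=1000):
    """Delete fully-drained rows in chunks of `limit` until none remain."""
    total = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM resources WHERE usage = 0 LIMIT ?", (limit,))]
        if not ids:
            return total
        conn.executemany("DELETE FROM resources WHERE id = ?",
                         [(i,) for i in ids])
        conn.commit()
        total += len(ids)

# Toy data: 2500 drained rows to delete, 10 live rows to keep.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, usage INTEGER)")
conn.executemany("INSERT INTO resources (usage) VALUES (?)",
                 [(0,)] * 2500 + [(1,)] * 10)
deleted = cleanup_in_batches(conn, limit=1000)
```

With `limit=1000` this takes three passes (1000, 1000, 500) plus one empty pass to terminate; a smaller limit means smaller transactions but more passes.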
FWIW, my 2-week-old prod snapshot has 173,561,655 rows in job_group_inst_coll_cancellable_resources and 8,567,769 job_groups. Assuming (incorrectly) instant execution, it'll take ~100 days to churn through the db.
339c9fe to af92cd6
Ok. I'm convinced that the SQL works as advertised. I'd love to see an easy query to just delete all the unnecessary records from job_group_inst_coll_cancellable_resources, but I certainly haven't spent enough time understanding the stored procedures and triggers that cause this table to be updated and what invariants hold for it. Guess that's a future project that will only be relevant if this table grows faster than we can delete records from it.
Updated queries to return job groups that do not have an ancestor or self job group that has been cancelled. This logic now mirrors that of …
newest SQL update LGTM
@daniel-goldstein @jigold I believe I've implemented this faithfully to the issue, but I'm not confident about any fallout if I've got something wrong. Would you mind taking a look (sorry to drag you into this)? I've grepped through the codebase as @daniel-goldstein suggested and AFAICT, these records are unused after a job group finishes.
I can take a look at this on Monday probably, but AFAIK we don't and it wouldn't, hence the copious amounts of garbage.
Fixes #14660 by using the GraphQL API to query GitHub directly. Replaces our current parallel interpretation of reviews into a review decision, which is brittle if we ever change review requirements in GitHub again. Tested by manually updating the live CI to use the test batch generated image. Results:
- Review decisions are correctly fetched from GitHub, not based on CI's parallel interpretation of individual reviews: ![image](https://github.com/user-attachments/assets/67c03aa9-000a-44e7-91aa-3a42d04238dc)
- No merge candidate was being incorrectly nominated (in particular, #14645 is now considered pending, rather than approved, which is what we are currently, incorrectly, calculating)
Records in the job_group_inst_coll_cancellable_resources table are dead once a job group completes. We already compact records when a job group is cancelled; we have yet to do this for finished job groups. See the linked issue for a more detailed motivation.

This change adds two background tasks:
- one that finds records belonging to finished job groups and compacts them by summing across the token field.
- one that deletes records for finished job groups once all associated resources are 0.
The results of both tasks converge to a fixed point where the only remaining records are for job groups that are unfinished, cancelled, or have resources outstanding.
I've taken care to optimise the underlying SQL queries as best I can. Both make heavy use of lateral joins to avoid row explosion; the natural implementation of both is prohibitively expensive.
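A rough Python analogue of why the lateral-join shape helps, under my reading of the optimisation (the data and names here are hypothetical): instead of joining every job group against all of its resource rows and then aggregating, which materialises the full product, a correlated `LATERAL ... LIMIT 1`-style probe looks at each group's rows only until it finds one disqualifying row.

```python
# group -> usage of each of its resource rows (toy data)
rows_by_group = {1: [0, 0, 0], 2: [0, 5], 3: []}
finished = [1, 2, 3]

# Naive shape: aggregate over every row of every group before filtering,
# analogous to JOIN + GROUP BY over the whole table.
naive = [g for g in finished
         if all(u == 0 for u in rows_by_group.get(g, []))]

def has_outstanding(g):
    # Lateral/EXISTS shape: any() short-circuits at the first nonzero row,
    # so a group with early outstanding usage costs O(1), not O(rows).
    return any(u != 0 for u in rows_by_group.get(g, []))

lateral = [g for g in finished if not has_outstanding(g)]
```

Both shapes select the same groups; the difference is how much of each group's data gets scanned before a group can be ruled out.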
I've tested these tasks in a dev deploy, where I created a number of batches and observed that records from this table were indeed compacted and destroyed on completion. It's not immediately obvious to me how to automate testing for these: AFAICT, we lack any automated integration testing for these background tasks.
Resolves: #14623