[batch] Drop dead rows from job_group_inst_coll_cancellable_resources
#14623
Hi @daniel-goldstein. I've taken a look at a few of the records in this table in prod. Do you know what it means to have negative values in the running columns, as in the rows below?
select * from job_group_inst_coll_cancellable_resources where batch_id = X;
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| X | 0 | 0 | 0 | 0 | 0 | standard | 0 | 1 | 0 |
| X | 18 | 0 | 0 | -1 | -2000 | standard | 0 | 1 | 0 |
| X | 59 | 0 | 0 | 0 | 0 | standard | 0 | 1 | 0 |
| X | 60 | 0 | 0 | -1 | -2000 | standard | 0 | 1 | 0 |
| X | 74 | 0 | 0 | 0 | 0 | standard | 0 | 1 | 0 |
| X | 172 | 0 | 0 | 0 | 0 | standard | 0 | 1 | 0 |
| X | 185 | 0 | 0 | -1 | -1000 | standard | 0 | 1 | 0 |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
Good question, looks like a bug to me as these records should eventually settle to sum to 0. Are these from recent batches? Hopefully this is some bug that has already been fixed but these rows were never cleaned up. I'd also check the code around |
This particular example is from quite an early batch.
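The expectation stated above, that the token rows of a finished job group should settle to sum to 0, can be checked with a grouped query. The following is a minimal sketch and not Batch code: it uses Python's built-in `sqlite3` as a stand-in for the production MySQL database, the schema is an assumption modeled on the columns in the output above, `unsettled_groups` and `demo_connection` are hypothetical helper names, and `batch_id = 1` stands in for the redacted `X`.

```python
import sqlite3

TABLE = "job_group_inst_coll_cancellable_resources"
KEY = "batch_id, update_id, job_group_id, inst_coll"
COUNT_COLUMNS = [
    "n_ready_cancellable_jobs",
    "ready_cancellable_cores_mcpu",
    "n_creating_cancellable_jobs",
    "n_running_cancellable_jobs",
    "running_cancellable_cores_mcpu",
]


def unsettled_groups(conn):
    """Return key + per-column sums for groups whose token rows do not sum to 0."""
    sums = ", ".join(f"SUM({c})" for c in COUNT_COLUMNS)
    nonzero = " OR ".join(f"SUM({c}) != 0" for c in COUNT_COLUMNS)
    query = (
        f"SELECT {KEY}, {sums} FROM {TABLE} "
        f"GROUP BY {KEY} HAVING {nonzero} ORDER BY {KEY}"
    )
    return conn.execute(query).fetchall()


def demo_connection():
    """Build an in-memory table (assumed schema) with rows like those shown above."""
    conn = sqlite3.connect(":memory:")
    counts = ", ".join(f"{c} INTEGER NOT NULL DEFAULT 0" for c in COUNT_COLUMNS)
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        "batch_id INTEGER NOT NULL, update_id INTEGER NOT NULL, "
        "job_group_id INTEGER NOT NULL, inst_coll TEXT NOT NULL, "
        f"token INTEGER NOT NULL, {counts}, "
        "PRIMARY KEY (batch_id, update_id, job_group_id, inst_coll, token))"
    )
    insert = (
        f"INSERT INTO {TABLE} (batch_id, update_id, job_group_id, inst_coll, token, "
        "n_running_cancellable_jobs, running_cancellable_cores_mcpu) "
        "VALUES (?, 1, 0, 'standard', ?, ?, ?)"
    )
    # Batch 1 mirrors the rows in the output above (token -> running jobs, mcpu).
    batch_1 = {0: (0, 0), 18: (-1, -2000), 59: (0, 0), 60: (-1, -2000),
               74: (0, 0), 172: (0, 0), 185: (-1, -1000)}
    # Batch 2 is a made-up healthy group whose token rows cancel out.
    batch_2 = {0: (1, 1000), 5: (-1, -1000)}
    for batch_id, rows in ((1, batch_1), (2, batch_2)):
        conn.executemany(insert, [(batch_id, t, n, c) for t, (n, c) in rows.items()])
    conn.commit()
    return conn


if __name__ == "__main__":
    for row in unsettled_groups(demo_connection()):
        print(row)
```

In this sketch only the group modeled on the example above is reported, because its running columns sum to -3 jobs and -5000 mcpu; the made-up second group cancels out and is skipped.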
…e_resources` Resolves: hail-is#14623
@daniel-goldstein - do you know if compacting or deleting will impact other background processes? Is this table used for anything else after a job group completes? Also, should we compact failed job groups as well?
Here's what I see in that table in a dev deploy when I submit a batch:
After scheduling
When done
After compacting
I don't believe so, this should be fine. I would usually grep the codebase for the table name. IIRC you should see it used in:
Off the top of my head, I see no reason to distinguish between different |
What happened?
See here for context on why this table exists and how it is used. Records are added or updated in this table whenever jobs are added to the database or after an attempt for a job completes. Records are currently only removed when they belong to a cancelled job group. If a job group runs to completion, we end up with many rows in the database that no longer serve any purpose and that (if you sum over the `token` column) have 0s for all the job columns. This does not affect correctness, but it wastes a lot of space in the database. This leads to two points that together would save a lot of space (I've not quantified how much, but `select count(*)` on this table takes longer than I've been willing to wait):

1. Rows with the same `(batch_id, update_id, job_group_id, inst_coll)` but different `token` value can be "compacted" into one row with key `(batch_id, update_id, job_group_id, inst_coll, 0)` (token 0) where all the other columns are summed. This is most useful for cold rows.
2. Rows where all the `n_*_jobs` and `*_cancellable_cores_mcpu` columns are 0 can be deleted.

We already do 1 for the aggregated billing tables. Use tokens for parallelism on hot rows and then compact records so that records from before the current day always end up only using 1 row.
Implementing 1 should be a big win for the size of this table. Following that up with 2 would eliminate what I presume to be the vast majority of data in this table.
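The two steps could be sketched roughly as follows. This is an illustration under stated assumptions rather than the Batch implementation: it runs against Python's built-in `sqlite3` instead of the production MySQL database, the schema is a guess based on the column names in this issue, and `compact_and_prune` and `demo_connection` are hypothetical names. A real version would also need to consider locking against concurrent updates to hot rows, which this sketch ignores.

```python
import sqlite3

TABLE = "job_group_inst_coll_cancellable_resources"
KEY = "batch_id, update_id, job_group_id, inst_coll"
COUNT_COLUMNS = [
    "n_ready_cancellable_jobs",
    "ready_cancellable_cores_mcpu",
    "n_creating_cancellable_jobs",
    "n_running_cancellable_jobs",
    "running_cancellable_cores_mcpu",
]


def compact_and_prune(conn, batch_id):
    """Step 1: sum all token rows into token 0. Step 2: drop all-zero rows."""
    cols = ", ".join(COUNT_COLUMNS)
    sums = ", ".join(f"SUM({c})" for c in COUNT_COLUMNS)
    marks = ", ".join("?" for _ in range(4 + len(COUNT_COLUMNS)))
    all_zero = " AND ".join(f"{c} = 0" for c in COUNT_COLUMNS)
    with conn:  # commits on success, rolls back if any statement raises
        summed = conn.execute(
            f"SELECT {KEY}, {sums} FROM {TABLE} WHERE batch_id = ? GROUP BY {KEY}",
            (batch_id,),
        ).fetchall()
        # Replace every token row of the batch with a single token-0 row per key.
        conn.execute(f"DELETE FROM {TABLE} WHERE batch_id = ?", (batch_id,))
        conn.executemany(
            f"INSERT INTO {TABLE} ({KEY}, {cols}, token) VALUES ({marks}, 0)",
            summed,
        )
        # Dead rows: every count column is 0 after compaction.
        conn.execute(
            f"DELETE FROM {TABLE} WHERE batch_id = ? AND {all_zero}", (batch_id,)
        )


def demo_connection():
    """In-memory table (assumed schema) with one settled and one live batch."""
    conn = sqlite3.connect(":memory:")
    counts = ", ".join(f"{c} INTEGER NOT NULL DEFAULT 0" for c in COUNT_COLUMNS)
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        "batch_id INTEGER NOT NULL, update_id INTEGER NOT NULL, "
        "job_group_id INTEGER NOT NULL, inst_coll TEXT NOT NULL, "
        f"token INTEGER NOT NULL, {counts}, "
        "PRIMARY KEY (batch_id, update_id, job_group_id, inst_coll, token))"
    )
    # Batch 1: running counts cancel out across tokens, so it is fully dead.
    conn.executemany(
        f"INSERT INTO {TABLE} (batch_id, update_id, job_group_id, inst_coll, token, "
        "n_running_cancellable_jobs, running_cancellable_cores_mcpu) "
        "VALUES (1, 1, 0, 'standard', ?, ?, ?)",
        [(0, 1, 2000), (18, -1, -2000), (60, 0, 0)],
    )
    # Batch 2: still has ready jobs spread over two tokens.
    conn.executemany(
        f"INSERT INTO {TABLE} (batch_id, update_id, job_group_id, inst_coll, token, "
        "n_ready_cancellable_jobs, ready_cancellable_cores_mcpu) "
        "VALUES (2, 1, 0, 'standard', ?, ?, ?)",
        [(3, 1, 1000), (7, 2, 2000)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for batch_id in (1, 2):
        compact_and_prune(conn, batch_id)
    print(conn.execute(f"SELECT * FROM {TABLE}").fetchall())
```

In the demo data, the first batch's token rows sum to 0 and disappear entirely, while the second batch's two token rows collapse into a single token-0 row carrying the summed counts. Doing the compaction and the deletion in one transaction keeps the per-key sums unchanged for any reader that aggregates over `token`.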
Version
0.2.132
Relevant log output
No response