[batch] Drop dead rows from job_group_inst_coll_cancellable_resources #14623

Open
daniel-goldstein opened this issue Jul 17, 2024 · 8 comments · May be fixed by #14645

@daniel-goldstein
Contributor

daniel-goldstein commented Jul 17, 2024

What happened?

See here for context on why this table exists and how it is used. Records are added or updated in this table whenever jobs are added to the database or after an attempt for a job completes. Records are currently removed only when they belong to a cancelled job group. If a job group runs to completion, we end up with many rows in the database that no longer serve any purpose and that (if you sum over the token column) have 0s for all the job columns. This does not affect correctness, but it wastes a lot of space in the database. Two changes together would save a lot of that space (I've not quantified how much, but select count(*) on this table takes longer than I've been willing to wait):

  1. Rows in this table with the same key (batch_id, update_id, job_group_id, inst_coll) but different token values can be "compacted" into one row with key (batch_id, update_id, job_group_id, inst_coll, 0) (token 0) where all the other columns are summed. This is most useful for cold rows.
  2. Rows whose n_*_jobs and *_cancellable_cores_mcpu columns are 0 can be deleted.

We already do 1 for the aggregated billing tables: tokens are used for parallelism on hot rows, and records are then compacted so that records from before the current day always end up using only 1 row.
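A minimal sketch of the compaction in point 1, assuming a simplified schema that keeps only the two "ready" counters. It runs against an in-memory SQLite database purely so that it is self-contained; the real table lives in MySQL, and the `compact` helper and the sample rows are hypothetical, not the Batch implementation.

```python
import sqlite3

TABLE = "job_group_inst_coll_cancellable_resources"
KEY = "batch_id = ? AND update_id = ? AND job_group_id = ? AND inst_coll = ?"

conn = sqlite3.connect(":memory:")
# Simplified stand-in schema (assumption): only the two "ready" counters are kept.
conn.execute(f"""
    CREATE TABLE {TABLE} (
        batch_id INTEGER, update_id INTEGER, job_group_id INTEGER,
        inst_coll TEXT, token INTEGER,
        n_ready_cancellable_jobs INTEGER NOT NULL DEFAULT 0,
        ready_cancellable_cores_mcpu INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (batch_id, update_id, job_group_id, inst_coll, token)
    )
""")
# Hypothetical sample data: three token rows sharing one key.
conn.executemany(
    f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (30, 1, 0, "standard", 13, 10, 10000),
        (30, 1, 0, "standard", 11, -1, -1000),
        (30, 1, 0, "standard", 39, -1, -1000),
    ],
)
conn.commit()


def compact(conn, batch_id, update_id, job_group_id, inst_coll):
    """Replace every token row for one key with a single summed token-0 row."""
    key = (batch_id, update_id, job_group_id, inst_coll)
    with conn:  # commits on success, rolls back on error
        n_jobs, mcpu = conn.execute(
            f"SELECT COALESCE(SUM(n_ready_cancellable_jobs), 0), "
            f"COALESCE(SUM(ready_cancellable_cores_mcpu), 0) "
            f"FROM {TABLE} WHERE {KEY}",
            key,
        ).fetchone()
        conn.execute(f"DELETE FROM {TABLE} WHERE {KEY}", key)
        conn.execute(
            f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, 0, ?, ?)",
            key + (n_jobs, mcpu),
        )


compact(conn, 30, 1, 0, "standard")
compacted_rows = conn.execute(
    f"SELECT token, n_ready_cancellable_jobs, ready_cancellable_cores_mcpu FROM {TABLE}"
).fetchall()
print(compacted_rows)
```

Here the three token rows collapse into a single token-0 row holding the summed counters (8 ready jobs, 8000 mCPU). Doing the read, delete, and insert in one transaction is what keeps the per-key sums unchanged for any concurrent reader.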

Implementing 1 should be a big win for the size of this table. Following that up with 2 would eliminate what I presume to be the vast majority of data in this table.
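Point 2 could then look something like the sketch below, which deletes rows whose counters are all 0. The five counter column names are taken from the table output in this thread; the in-memory SQLite database, the `delete_dead_rows` helper, and the sample rows are assumptions for illustration. A real implementation would presumably also restrict the delete to completed job groups.

```python
import sqlite3

TABLE = "job_group_inst_coll_cancellable_resources"
COUNTERS = [
    "n_ready_cancellable_jobs",
    "ready_cancellable_cores_mcpu",
    "n_running_cancellable_jobs",
    "running_cancellable_cores_mcpu",
    "n_creating_cancellable_jobs",
]

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{c} INTEGER NOT NULL DEFAULT 0" for c in COUNTERS)
conn.execute(
    f"CREATE TABLE {TABLE} (batch_id INTEGER, update_id INTEGER, "
    f"job_group_id INTEGER, inst_coll TEXT, token INTEGER, {columns})"
)
# Hypothetical sample data: batch 30 is fully settled, batch 31 still has ready jobs.
conn.executemany(
    f"INSERT INTO {TABLE} (batch_id, update_id, job_group_id, inst_coll, token, "
    f"n_ready_cancellable_jobs, ready_cancellable_cores_mcpu) "
    f"VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (30, 1, 0, "standard", 0, 0, 0),
        (31, 1, 0, "standard", 0, 4, 4000),
    ],
)
conn.commit()


def delete_dead_rows(conn):
    """Delete rows in which every counter column is 0; return how many were removed."""
    all_zero = " AND ".join(f"{c} = 0" for c in COUNTERS)
    with conn:
        cursor = conn.execute(f"DELETE FROM {TABLE} WHERE {all_zero}")
    return cursor.rowcount


n_deleted = delete_dead_rows(conn)
remaining = conn.execute(f"SELECT batch_id FROM {TABLE}").fetchall()
print(n_deleted, remaining)
```

Removing an all-zero row never changes the per-key sums, which is why this step is safe to run after compaction.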

Version

0.2.132

Relevant log output

No response

@ehigham
Member

ehigham commented Jul 26, 2024

Hi @daniel-goldstein. I've taken a look at a few of the records in this table in prod. Do you know what it means to have negative values for n_running_cancellable_jobs and/or running_cancellable_cores_mcpu? Here's an example:

select * from job_group_inst_coll_cancellable_resources where batch_id = X;
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
|        X |     0 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|        X |    18 |                        0 |                            0 |                         -1 |                          -2000 | standard  |                           0 |         1 |            0 |
|        X |    59 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|        X |    60 |                        0 |                            0 |                         -1 |                          -2000 | standard  |                           0 |         1 |            0 |
|        X |    74 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|        X |   172 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|        X |   185 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+

@ehigham removed the needs-triage label Jul 26, 2024
@ehigham

This comment was marked as outdated.

@daniel-goldstein
Contributor Author

> Hi @daniel-goldstein. I've taken a look at a few of the records in this table in prod. Do you know what it means to have negative values for n_running_cancellable_jobs and/or running_cancellable_cores_mcpu? Here's an example:
>
> select * from job_group_inst_coll_cancellable_resources where batch_id = X;
> +----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
> | batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
> +----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
> |        X |     0 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
> |        X |    18 |                        0 |                            0 |                         -1 |                          -2000 | standard  |                           0 |         1 |            0 |
> |        X |    59 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
> |        X |    60 |                        0 |                            0 |                         -1 |                          -2000 | standard  |                           0 |         1 |            0 |
> |        X |    74 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
> |        X |   172 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
> |        X |   185 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
> +----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+

Good question. This looks like a bug to me, as these records should eventually settle so that they sum to 0. Are these from recent batches? Hopefully this is some bug that has already been fixed but these rows were never cleaned up. I'd also check the code around schedule_job to see if any additions to this table aren't fault tolerant.
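One way to look for other keys that never settled is to sum over tokens per key and keep only the non-zero results. The sketch below is a hedged illustration: it uses an in-memory SQLite table with only the two "running" counters, batch_id 1 stands in for the batch X rows quoted above, and batch_id 2 is a made-up settled batch added for contrast. It would only be meaningful for job groups that have already completed.

```python
import sqlite3

TABLE = "job_group_inst_coll_cancellable_resources"

conn = sqlite3.connect(":memory:")
# Simplified stand-in schema (assumption): only the two "running" counters are kept.
conn.execute(f"""
    CREATE TABLE {TABLE} (
        batch_id INTEGER, update_id INTEGER, job_group_id INTEGER,
        inst_coll TEXT, token INTEGER,
        n_running_cancellable_jobs INTEGER NOT NULL DEFAULT 0,
        running_cancellable_cores_mcpu INTEGER NOT NULL DEFAULT 0
    )
""")
rows = [
    # batch_id 1 stands in for "batch X" from the quoted output.
    (1, 1, 0, "standard", 0, 0, 0),
    (1, 1, 0, "standard", 18, -1, -2000),
    (1, 1, 0, "standard", 59, 0, 0),
    (1, 1, 0, "standard", 60, -1, -2000),
    (1, 1, 0, "standard", 74, 0, 0),
    (1, 1, 0, "standard", 172, 0, 0),
    (1, 1, 0, "standard", 185, -1, -1000),
    # batch_id 2 is hypothetical: its rows cancel out across tokens.
    (2, 1, 0, "standard", 7, 1, 1000),
    (2, 1, 0, "standard", 42, -1, -1000),
]
conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Keys whose counters do not sum to 0 across tokens.
unsettled = conn.execute(f"""
    SELECT batch_id, update_id, job_group_id, inst_coll,
           SUM(n_running_cancellable_jobs) AS n_running,
           SUM(running_cancellable_cores_mcpu) AS running_mcpu
    FROM {TABLE}
    GROUP BY batch_id, update_id, job_group_id, inst_coll
    HAVING n_running != 0 OR running_mcpu != 0
    ORDER BY batch_id
""").fetchall()
print(unsettled)
```

For the quoted rows this reports a net of -3 running jobs and -5000 mCPU, so compacting that key would leave a non-zero token-0 row rather than a deletable one.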

@ehigham
Member

ehigham commented Jul 30, 2024

This particular example is from quite an early batch.

@ehigham

This comment was marked as resolved.

@ehigham
Member

ehigham commented Jul 31, 2024

@daniel-goldstein - do you know if compacting or deleting will impact other background processes? Is this table used for anything else after a job group completes?

Also, should we compact failed job groups too?

@ehigham
Member

ehigham commented Jul 31, 2024

Here's what I see in that table in a dev deploy when I submit a batch:

+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
|       30 |    13 |                       10 |                        10000 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+

After scheduling

+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
|       30 |    11 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |    13 |                       10 |                        10000 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|       30 |    39 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |    62 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   121 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   125 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   135 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   152 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   166 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   183 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   195 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+

When done

+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
|       30 |    11 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |    13 |                       10 |                        10000 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|       30 |    18 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |    37 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |    39 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |    60 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |    62 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   121 |                       -1 |                        -1000 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
|       30 |   125 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   128 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |   135 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   136 |                        0 |                            0 |                         -2 |                          -2000 | standard  |                           0 |         1 |            0 |
|       30 |   144 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |   152 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   166 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   177 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |   181 |                        0 |                            0 |                         -1 |                          -1000 | standard  |                           0 |         1 |            0 |
|       30 |   183 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
|       30 |   195 |                       -1 |                        -1000 |                          1 |                           1000 | standard  |                           0 |         1 |            0 |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+

After compacting

+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
| batch_id | token | n_ready_cancellable_jobs | ready_cancellable_cores_mcpu | n_running_cancellable_jobs | running_cancellable_cores_mcpu | inst_coll | n_creating_cancellable_jobs | update_id | job_group_id |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
|       30 |     0 |                        0 |                            0 |                          0 |                              0 | standard  |                           0 |         1 |            0 |
+----------+-------+--------------------------+------------------------------+----------------------------+--------------------------------+-----------+-----------------------------+-----------+--------------+
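As a quick check of the arithmetic, the "When done" rows above can be summed column by column; the result should match the single token-0 row left after compacting. This is plain Python over the values copied from that table, not code from Batch.

```python
# (token, n_ready_jobs, ready_mcpu, n_running_jobs, running_mcpu, n_creating_jobs)
# copied from the "When done" table for batch 30.
when_done = [
    (11, -1, -1000, 1, 1000, 0),
    (13, 10, 10000, 0, 0, 0),
    (18, 0, 0, -1, -1000, 0),
    (37, 0, 0, -1, -1000, 0),
    (39, -1, -1000, 1, 1000, 0),
    (60, 0, 0, -1, -1000, 0),
    (62, -1, -1000, 1, 1000, 0),
    (121, -1, -1000, 0, 0, 0),
    (125, -1, -1000, 1, 1000, 0),
    (128, 0, 0, -1, -1000, 0),
    (135, -1, -1000, 1, 1000, 0),
    (136, 0, 0, -2, -2000, 0),
    (144, 0, 0, -1, -1000, 0),
    (152, -1, -1000, 1, 1000, 0),
    (166, -1, -1000, 1, 1000, 0),
    (177, 0, 0, -1, -1000, 0),
    (181, 0, 0, -1, -1000, 0),
    (183, -1, -1000, 1, 1000, 0),
    (195, -1, -1000, 1, 1000, 0),
]

# Drop the token column, then sum each remaining counter column across all rows.
totals = tuple(sum(column) for column in zip(*(row[1:] for row in when_done)))
print(totals)  # (0, 0, 0, 0, 0)
```

All five counters cancel out across the 19 token rows, which is why the compacted row is all zeros and becomes a candidate for deletion.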

@daniel-goldstein
Contributor Author

daniel-goldstein commented Aug 2, 2024

> @daniel-goldstein - do you know if compacting or deleting will impact other background processes? Is this table used for anything else after a job group completes?
>
> Also, should we also compact failed job groups?

I don't believe so, this should be fine. I would usually grep the codebase for the table name. IIRC you should see it used in:

  • job insertion, clearly no longer relevant
  • triggers which won't fire on cold batches
  • the scheduler/fair share which should only care about rows with >0 values and active job groups

Off the top of my head, I see no reason to distinguish between different complete states.
