Periodically delete jobs in Mettle api server #281

Hamishpk · 2024-02-20T13:37:11Z

At times of high load the current method of scheduling actions for deletion using individual goroutines is causing connection/context deadline exceeded errors.

This change hopes to improve this by managing the deletion of jobs through one routine. Every 10 minutes the server will loop through all the jobs and delete it if:

The job is complete and over the retention time (5 minutes)
The job is incomplete, has no listeners and within the expiry time (60 minutes)
The job is incomplete and outside of two times the expiry time (120 minutes)

I remember we had some nasty bugs a few years ago around this so I've added a bunch of tests for shouldDeleteJobs. I think this should cover all our cases but am slightly worried an empty job (done==false, LastUpdate == beginning of time) would be deleted. We never set an empty job so shouldn't be an issue but could be convinced to add a special case for this)

mettle/api/api.go

fische · 2024-02-20T14:59:27Z

I'm not sure about this approach; at the moment there's a data race that would cause a panic, as you're accessing the jobs map without locking the mutex, but I think locking the mutex to go through the list of jobs might be a bit too expensive. I think it would be nicer to do this with a channel. By that, I mean we send all jobs that we want to expire (the old go s.deleteJob(key, j)) on the channel and the background routine appends those to an internal list. That way it can then assume the first item in that list is the next to get deleted and the implementation would look like this:

for {
  select {
  case job := <-channel:
    buffer = append(buffer, job)
  case <-time.After(time.Until(buffer[0].ExpirationTime)):
    // delete job from map
    buffer = buffer[1:]
  }
}

This is of course oversimplified (ie we shouldnt start a new Timer every time but just reuse the same one) but it's the general idea. What do you think?

peterebden · 2024-02-20T15:57:38Z

I wonder if it might be simpler to just lock around the whole map iteration. I would hope it doesn't take too long to iterate it - we don't have to ask a complex question about each item.

Alternatively we could use a sync.Map but probably easier not to start down that rabbit hole...

Hamishpk · 2024-02-21T09:13:01Z

After an in-person chat with @fische it was decided to go for @peterebden suggestion of locking the whole map. I have added a metric to track the time of this and will refactor if it proves to be troublesome

mettle/api/api.go

* periodically delete jobs in Mettle api server * don't defer mutex unlock * Lock whole map when deleting * Add metric * Version + changelog * Remove newline * lock/unlock mutex correctly * Observe time in correct place

periodically delete jobs in Mettle api server

1cebaac

peterebden reviewed Feb 20, 2024

View reviewed changes

mettle/api/api.go Outdated Show resolved Hide resolved

don't defer mutex unlock

d7922fb

Hamishpk added 4 commits February 20, 2024 17:01

Lock whole map when deleting

a3a910c

Add metric

3338c03

Version + changelog

7fdb2ba

Remove newline

caa1b65

lock/unlock mutex correctly

e0da025

fische requested changes Feb 21, 2024

View reviewed changes

mettle/api/api.go Outdated Show resolved Hide resolved

Observe time in correct place

931f7c5

fische approved these changes Feb 21, 2024

View reviewed changes

Hamishpk merged commit 0b744c9 into thought-machine:master Feb 21, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Periodically delete jobs in Mettle api server #281

Periodically delete jobs in Mettle api server #281

Hamishpk commented Feb 20, 2024

fische commented Feb 20, 2024

peterebden commented Feb 20, 2024

Hamishpk commented Feb 21, 2024

Periodically delete jobs in Mettle api server #281

Periodically delete jobs in Mettle api server #281

Conversation

Hamishpk commented Feb 20, 2024

fische commented Feb 20, 2024

peterebden commented Feb 20, 2024

Hamishpk commented Feb 21, 2024