Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically delete jobs in Mettle api server #281

Merged
merged 8 commits into from
Feb 21, 2024

Conversation

Hamishpk
Copy link
Contributor

At times of high load the current method of scheduling actions for deletion using individual goroutines is causing connection/context deadline exceeded errors.

This change hopes to improve this by managing the deletion of jobs through one routine. Every 10 minutes the server will loop through all the jobs and delete it if:

  • The job is complete and over the retention time (5 minutes)
  • The job is incomplete, has no listeners and within the expiry time (60 minutes)
  • The job is incomplete and outside of two times the expiry time (120 minutes)

I remember we had some nasty bugs a few years ago around this so I've added a bunch of tests for shouldDeleteJobs. I think this should cover all our cases but am slightly worried an empty job (done==false, LastUpdate == beginning of time) would be deleted. We never set an empty job so shouldn't be an issue but could be convinced to add a special case for this)

mettle/api/api.go Outdated Show resolved Hide resolved
@fische
Copy link
Contributor

fische commented Feb 20, 2024

I'm not sure about this approach; at the moment there's a data race that would cause a panic, as you're accessing the jobs map without locking the mutex, but I think locking the mutex to go through the list of jobs might be a bit too expensive. I think it would be nicer to do this with a channel. By that, I mean we send all jobs that we want to expire (the old go s.deleteJob(key, j)) on the channel and the background routine appends those to an internal list. That way it can then assume the first item in that list is the next to get deleted and the implementation would look like this:

for {
  select {
  case job := <-channel:
    buffer = append(buffer, job)
  case <-time.After(time.Until(buffer[0].ExpirationTime)):
    // delete job from map
    buffer = buffer[1:]
  }
}

This is of course oversimplified (ie we shouldnt start a new Timer every time but just reuse the same one) but it's the general idea. What do you think?

@peterebden
Copy link
Collaborator

I wonder if it might be simpler to just lock around the whole map iteration. I would hope it doesn't take too long to iterate it - we don't have to ask a complex question about each item.

Alternatively we could use a sync.Map but probably easier not to start down that rabbit hole...

@Hamishpk
Copy link
Contributor Author

After an in-person chat with @fische it was decided to go for @peterebden suggestion of locking the whole map. I have added a metric to track the time of this and will refactor if it proves to be troublesome

mettle/api/api.go Outdated Show resolved Hide resolved
@Hamishpk Hamishpk merged commit 0b744c9 into thought-machine:master Feb 21, 2024
5 checks passed
Hamishpk added a commit to Hamishpk/please-servers that referenced this pull request Mar 7, 2024
* periodically delete jobs in Mettle api server

* don't defer mutex unlock

* Lock whole map when deleting

* Add metric

* Version + changelog

* Remove newline

* lock/unlock mutex correctly

* Observe time in correct place
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants