-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Periodically delete jobs in Mettle api server #281
Conversation
I'm not sure about this approach; at the moment there's a data race that would cause a panic, as you're accessing the jobs map without locking the mutex, but I think locking the mutex to go through the list of jobs might be a bit too expensive. I think it would be nicer to do this with a channel. By that, I mean we send all jobs that we want to expire (the old
This is of course oversimplified (ie we shouldnt start a new Timer every time but just reuse the same one) but it's the general idea. What do you think? |
I wonder if it might be simpler to just lock around the whole map iteration. I would hope it doesn't take too long to iterate it - we don't have to ask a complex question about each item. Alternatively we could use a |
After an in-person chat with @fische it was decided to go for @peterebden suggestion of locking the whole map. I have added a metric to track the time of this and will refactor if it proves to be troublesome |
* periodically delete jobs in Mettle api server * don't defer mutex unlock * Lock whole map when deleting * Add metric * Version + changelog * Remove newline * lock/unlock mutex correctly * Observe time in correct place
At times of high load the current method of scheduling actions for deletion using individual goroutines is causing connection/context deadline exceeded errors.
This change hopes to improve this by managing the deletion of jobs through one routine. Every 10 minutes the server will loop through all the jobs and delete it if:
I remember we had some nasty bugs a few years ago around this so I've added a bunch of tests for shouldDeleteJobs. I think this should cover all our cases but am slightly worried an empty job (done==false, LastUpdate == beginning of time) would be deleted. We never set an empty job so shouldn't be an issue but could be convinced to add a special case for this)