Update job deletion logic + propagate update times #283
Conversation
Nice find!
Would you mind swapping 573 and 574? There's a slight bug in the metrics where it starts recording before it takes the lock. It's on my to-do list but I haven't found time :/
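(Lines 573 and 574 aren't shown in this thread, so the exact code is an assumption; the shape of the requested swap would be something like the following, with the metrics helper being hypothetical:)
s.mutex.Lock()
start := time.Now() // start measuring only once the lock is held, so lock wait time isn't recorded
j := s.jobs[digest.Hash]
j.LastUpdate = time.Now()
s.mutex.Unlock()
metrics.ObserveUpdateDuration(time.Since(start)) // hypothetical metrics helper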
Nice! Thanks Xander
I actually refactored the metrics generation entirely in the next diff, so I think it would be relevant!
s.mutex.Lock()
j := s.jobs[digest.Hash]
j.LastUpdate = time.Now()
s.mutex.Unlock()
Should we really update LastUpdate here? This is just sending an event to the client, not updating the actual job that we store in memory.
I think that makes sense, right? If I have an operation that is sending progress events, for each one of them I want to make sure the job's liveness property is being refreshed.
If I have a job that takes 10 minutes to run, for example, I don't want LastUpdate to be set only at the start when the job is enqueued, IMO?
Might be that we should make what LastUpdate actually represents clearer.
If I have a job that takes 10 minutes to run for example, I don't want the LastUpdate to be set only at the start when the job is enqueued IMO?
Why not? To me, this is why we have all those conditions in shouldDeleteJob:
// In-progress job with no connected streams: delete after expiryTime without updates.
if !j.Done && len(j.Streams) == 0 && timeSinceLastUpdate > expiryTime {
	return true
}
// In-progress job, possibly with streams still attached: delete after 2*expiryTime.
if !j.Done && timeSinceLastUpdate > 2*expiryTime {
	return true
}
i.e. if the job is still in progress and has no streams left, we keep it for the expiration time (1h); while it still has some streams, we keep it for 2h.
I'm specifically thinking about this condition:
timeSinceLastUpdate := time.Since(j.LastUpdate)
if j.Done && len(j.Streams) == 0 && timeSinceLastUpdate > retentionTime {
	// return true to delete
	return true
}
If my job takes 10 minutes to run and we don't refresh LastUpdate with progress events, then once that job is done we remove it almost immediately (LastUpdate still points at enqueue time, so timeSinceLastUpdate may already exceed retentionTime the moment Done is set), even if a client might attempt to reconnect.
My understanding is that we are retaining the jobs to smooth that out.
To me, once the job is done, i.e. when we update the Done field in the job object, we should update the LastUpdate field as well, because that's an update to the job, right?
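(A minimal sketch of that suggestion, reusing the field names from the snippets above; the scheduler type and the markDone helper are hypothetical:)
// markDone flips the job to done and refreshes its liveness timestamp in one
// locked section, so the retention clock starts from completion rather than enqueue.
func (s *scheduler) markDone(hash string) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	j := s.jobs[hash]
	j.Done = true
	j.LastUpdate = time.Now() // completing the job counts as an update to the job
}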
I do see Maxime's point here. Streaming an update isn't really an update. Checking for listeners and setting the last update when we reassign a job should fix the issue without this bit anyway, right?
The stream check was only on the expiration path, not on the deletion routine. Not sure if we're expecting any remaining stream once the action has been completed, but it's probably harmless, even safer, just in case.
* Change update logic to match previous behaviour, update time in a few locations.
* Version + Changelog
---------
Co-authored-by: Hamish Pitkeathly <[email protected]>
The previous deletion logic would always check that the streams were zeroed before deleting the job: 0b744c9#diff-030e4836d83d50a740c86eb47a2780e7a81f170cc5902279ff28ad5122081d37L593
This diff updates the logic to match the previous behaviour, and updates the LastUpdate time in a few places so it's a more accurate measurement.
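(Putting the conditions quoted in the review thread together, the resulting check presumably looks roughly like this; it's a sketch assembled from the snippets above, and the job type, field layout, and threshold names are assumptions:)
func shouldDeleteJob(j *job, expiryTime, retentionTime time.Duration) bool {
	timeSinceLastUpdate := time.Since(j.LastUpdate)
	// Finished job with no remaining streams: delete once the retention window has passed.
	if j.Done && len(j.Streams) == 0 && timeSinceLastUpdate > retentionTime {
		return true
	}
	// In-progress job with no streams: delete after expiryTime without updates.
	if !j.Done && len(j.Streams) == 0 && timeSinceLastUpdate > expiryTime {
		return true
	}
	// In-progress job, possibly with streams still attached: delete after 2*expiryTime.
	if !j.Done && timeSinceLastUpdate > 2*expiryTime {
		return true
	}
	return false
}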