Update job deletion logic + propagate update times #283

Merged: 3 commits into thought-machine:master on Mar 6, 2024

Conversation

@Garbett1 (Contributor) commented Mar 3, 2024

The previous deletion logic would always check that the streams were zeroed before deleting the job: 0b744c9#diff-030e4836d83d50a740c86eb47a2780e7a81f170cc5902279ff28ad5122081d37L593

This diff updates the deletion logic to match that previous behaviour, and sets the LastUpdate time in a few more places so it's a more accurate measurement of the job's liveness.
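A minimal sketch of the restored check, using the field and variable names from the snippets quoted later in this thread (the type and function shape here are assumed, not the actual diff; the real shouldDeleteJob has further cases for in-progress jobs, quoted below):

package scheduler

import "time"

// job mirrors only the fields referenced in this thread; the real struct has more.
type job struct {
	Done       bool
	Streams    []chan struct{} // placeholder element type; the real one isn't shown
	LastUpdate time.Time
}

// shouldDelete shows only the restored condition for finished jobs.
func shouldDelete(j *job, retentionTime time.Duration) bool {
	// A finished job is reaped only once every client stream has gone away
	// AND it has been quiet for the full retention period.
	return j.Done && len(j.Streams) == 0 && time.Since(j.LastUpdate) > retentionTime
}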

@peterebden (Collaborator) previously approved these changes Mar 3, 2024, commenting:

Nice find!

@Hamishpk (Contributor) left a comment:

Would you mind swapping lines 573 and 574? There's a slight bug in the metrics where it starts recording before it takes the lock. It's on my to-do list but I haven't found time :/
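For reference, a sketch of the requested swap with assumed names (deleteDurations and the prometheus.NewTimer usage are illustrative guesses; the actual lines 573/574 aren't shown in this thread):

package scheduler

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// deleteDurations is an illustrative metric; the real one isn't named here.
var deleteDurations = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "job_delete_duration_seconds",
})

type server struct{ mutex sync.Mutex }

func (s *server) deleteJobs() {
	// Take the lock first, then start the timer, so lock-wait time isn't
	// recorded as deletion work (the ordering bug described above).
	s.mutex.Lock()
	timer := prometheus.NewTimer(deleteDurations)
	defer timer.ObserveDuration()
	defer s.mutex.Unlock()
	// ... actual deletion logic ...
}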

@Hamishpk (Contributor) previously approved these changes Mar 4, 2024, commenting:

Nice! Thanks Xander

@Garbett1 (Contributor, Author) commented Mar 4, 2024

I actually refactored the metrics generation entirely in the next diff, so I think it would be relevant!

Comment on lines +375 to +378
s.mutex.Lock()
j := s.jobs[digest.Hash]
j.LastUpdate = time.Now()
s.mutex.Unlock()
Reviewer (Contributor) commented:

Should we really update LastUpdate here? This is just sending an event to the client, not updating the actual job that we store in memory.

@Garbett1 (Contributor, Author) replied:

I think that makes sense, right? If I have an operation that is sending progress events, then for each one of them I want to make sure the job's liveness property is being refreshed.

If I have a job that takes 10 minutes to run, for example, I don't want LastUpdate to be set only at the start when the job is enqueued, IMO.

It might be that we should make clearer what LastUpdate actually represents.
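For instance, the field's doc comment could spell the intended semantics out. Hypothetical wording, reusing the job type from the snippets in this thread:

	// LastUpdate is a liveness timestamp rather than a pure record-mutation
	// time: it is refreshed on progress events and state transitions alike,
	// so a long-running job isn't reaped mid-run by the deletion routine.
	LastUpdate time.Time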

Reviewer (Contributor) replied:

> If I have a job that takes 10 minutes to run, for example, I don't want LastUpdate to be set only at the start when the job is enqueued, IMO.

Why not? To me, this is why we have all those conditions in shouldDeleteJob:

	if !j.Done && len(j.Streams) == 0 && timeSinceLastUpdate > expiryTime {
		return true
	}
	if !j.Done && timeSinceLastUpdate > 2*expiryTime {
		return true
	}

i.e. if the job is still in progress, we keep it for the expiry time (1h) once it has no streams left, and for up to 2h (2*expiryTime) while it still has some.

@Garbett1 (Contributor, Author) replied:

I'm specifically thinking about this condition:

timeSinceLastUpdate := time.Since(j.LastUpdate)
if j.Done && len(j.Streams) == 0 && timeSinceLastUpdate > retentionTime {
	// return true to delete
	return true
}

If my job takes 10 minutes to run and we never refresh LastUpdate with progress events, then as soon as that job is done it's already past the retention window, and we remove it immediately (even though a client might still attempt to reconnect).

My understanding is that we are retaining the jobs to smooth that out.

Reviewer (Contributor) replied:

To me, once the job is done (i.e. when we update the Done field on the job object) we should update the LastUpdate field as well, because that's an update to the job, right?
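A sketch of that suggestion, using the s.mutex / s.jobs names from the snippet above (markDone itself is a hypothetical helper, and s.jobs is assumed to be a map[string]*job):

func (s *server) markDone(hash string) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	if j := s.jobs[hash]; j != nil {
		j.Done = true
		// Flipping Done is a mutation of the stored job record, so stamp
		// LastUpdate in the same critical section.
		j.LastUpdate = time.Now()
	}
}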

A second reviewer (Contributor) replied:

I do see Maxime's point here. Streaming an update isn't really an update to the job. Checking for listeners and setting the last update when we reassign a job should fix the issue without this bit anyway, right?
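Roughly, that alternative might look like this (a sketch under assumed names; reassign is hypothetical and the real reassignment path isn't shown in this thread):

func (s *server) reassign(hash string) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	// Only refresh liveness if someone is still listening.
	if j := s.jobs[hash]; j != nil && len(j.Streams) > 0 {
		j.LastUpdate = time.Now()
		// ... hand the job back out to a worker here ...
	}
}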

@fische (Contributor) commented Mar 4, 2024

The stream check was only in the expiration path, not in the deletion routine. I'm not sure we expect any remaining streams once the action has completed, but the check is probably harmless, even safer, just in case.

@Hamishpk dismissed stale reviews from peterebden and themself via 2fa6651 on March 6, 2024 at 16:47.
@Hamishpk merged commit fa2b0e1 into thought-machine:master on Mar 6, 2024 (5 checks passed).
Hamishpk added a commit to Hamishpk/please-servers referencing this pull request on Mar 7, 2024:

* Change update logic to match previous behaviour; update time in a few locations.
* Version + Changelog

Co-authored-by: Hamish Pitkeathly <[email protected]>