Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: ability to kick failed jobs back onto queue. #3

Closed
wants to merge 1 commit into from

Conversation

pda
Copy link
Contributor

@pda pda commented Aug 28, 2014

Background: #2

TL;DR: kicking failed jobs causes them to be instantly re-buried based on their stats.

This change means each kick buys a job one more release or timeout before it's auto-buried.

/cc @lwc.


For reference, here's a sample stats-job output:

---
id: 3
tube: cmdstalk-test-465396aa983f063a
state: buried
pri: 10
age: 34
delay: 0
ttr: 1
time-left: 0
file: 0
reserves: 2
timeouts: 1
releases: 0
buries: 1
kicks: 0

@pda
Copy link
Contributor Author

pda commented Aug 28, 2014

Scenario with job timing out:

Fresh job:
kicks: 0, releases: 0, timeouts: 0
Job takes too long:
kicks: 0, releases: 0, timeouts: 1 (auto-buried on next reserve)
Job kicked:
kicks: 1, releases: 0, timeouts: 1
Now (timeouts - kicks) >= 1 so it'll process.
Job times out again:
kicks: 1, releases: 0, timeouts: 2
Kick again:
kicks: 2, releases: 0, timeouts: 2
Again (timeouts - kicks) < 1 so it'll process.

@pda
Copy link
Contributor Author

pda commented Aug 28, 2014

Scenario with job failing:

Fresh job:
kicks: 0, releases: 0, timeouts: 0
Repeated exit(1):
kicks: 0, releases: 1, timeouts: 0
kicks: 0, releases: 2, timeouts: 0

kicks: 0, releases: 10, timeouts: 0 (auto-buried on next reserve)
Job kicked:
kicks: 1, releases: 10, timeouts: 0
Now (releases - kicks) < 10 so it'll process.
Fails again:
kicks: 1, releases: 11, timeouts: 0 (auto-buried on next reserve)
Kicked again:
kicks: 2, releases: 11, timeouts: 0
Again (releases - kicks) < 10 so it'll process.

t, err := job.Timeouts()
if err != nil {
b.log.Panic(err)
}
if t >= TimeoutTries {
b.log.Printf("job %d has %d timeouts, burying", job.Id, t)
if (t - kicks) >= TimeoutTries {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i reckon this condition and the one on L130 should be pulled out into predicate methods as intent isn't super clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep definitely. WIP :)
That said, the problem was already present, this WIP commit just makes it worse.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, should have paid more attention to subject line.

@lwc
Copy link
Member

lwc commented Aug 28, 2014

So, the only weird case here is if a task has been buried a few times because it was failing, and it then times out:

New job:
kicks: 0, timeouts: 0, releases: 0

Fails until buried:
kicks: 0, timeouts: 0, releases: 1

kicks: 0, timeouts: 0, releases: 10

Is kicked:
kicks: 1, timeouts: 0, releases: 10

Fails again, kicked again:
kicks: 2, timeouts: 0, releases: 11

It now times out:
kicks: 2, timeouts: 1, releases: 11

Reserved again:
(releases - kicks) = 9 so it doesn't get buried for failures, correct.
(timeouts - kicks) = -1 so it doesn't get buried for timeout.

Job runs, eventually either timeouts or releases reaches the threshold.

Is this even a big deal though?


(edited by @pda)

@pda
Copy link
Contributor Author

pda commented Aug 28, 2014

(timeouts - kicks) = -1

Side-note: if we use this approach, some things need to be treated as signed integers.

@pda pda closed this Apr 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants