
Early stopping #2125

Open
vdbergh opened this issue Aug 1, 2024 · 10 comments
@vdbergh
Contributor

vdbergh commented Aug 1, 2024

Lots of tests are stopped early these days, even at LLR = -0.5. One should be aware that this drastically alters the error probabilities of the SPRT compared to their design value of 0.05.

[Image: alpha_beta]

@peregrineshahin
Contributor

Stopping happens for many reasons: the test is bugged, the developer wants to submit other tests or ideas they think are more interesting/promising instead, the submitter believes the test will be a waste of time anyway, or they want to save resources for other promising ideas that are running.
The idea that tests should not be early-stopped is, to me, identical to the idea that we should build a random monkey-patch generator, run 30 random tests that compile each day, and never stop them.
I'm not sure why anyone would believe in a patch that the authors themselves, or anyone else, don't believe in. Thanks.

@vdbergh
Contributor Author

vdbergh commented Aug 1, 2024

If a developer does not believe in a test, why submit it to Fishtest in the first place? I can imagine a developer sometimes changing their mind while a test is running, but this should be quite rare.

Besides, the stopping of tests is strongly correlated with the LLR value, so the latter is the main driver.

@vdbergh
Contributor Author

vdbergh commented Aug 1, 2024

Rather than stopping early, developers should be able to indicate the beta value they are happy with. For example, if a developer routinely stops a test at LLR = -1.0, this means they are happy with beta = 0.37. So they should submit tests with alpha = 0.05, beta = 0.37. That would make the trade-off clear...
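For context, a small sketch of where that 0.37 comes from, using Wald's classical SPRT boundary approximation (accept H0 when LLR <= ln(beta / (1 - alpha))). Inverting the lower boundary gives the type-II error implied by abandoning a test at a given LLR. The 0.37 above uses the cruder beta ≈ e^{LLR}; the version below keeps the (1 - alpha) factor, which lands slightly lower:

```python
import math

def effective_beta(stop_llr, alpha=0.05):
    # Wald's lower boundary is ln(beta / (1 - alpha)); solve for the beta
    # that makes stop_llr an "accept H0" boundary.
    return (1 - alpha) * math.exp(stop_llr)

print(round(effective_beta(-2.94), 3))  # design bound: ~0.05
print(round(effective_beta(-1.0), 3))   # early stop at -1.0: ~0.35
```

In other words, walking away at LLR = -1.0 is roughly a sevenfold increase over the nominal beta of 0.05.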

@peregrineshahin
Contributor

Well, the way I look at it is that most users submit related tests at once: the idea of the patch is the same, but the parameters are tweaked in different directions, so there is a relation between the tests a user submits when they touch the same code.
In such cases they assume the effect of the parameters is continuous, which is probably a valid assumption, and they make the judgment based on the LLR of a whole direction, stopping at a negative value for similar tweaks, especially when the other direction already looks good. So a single test might indeed not be statistically valid, but many of them together point to a fail while the other direction behaves differently, so it's not a judgment based on the LLR of that specific test alone.

@peregrineshahin
Contributor

Note that users are limited to 6 active tests per user. This also leads to a lot of early stopping, so as not to get less TP. Do you think such a limitation is a bad thing?

@vdbergh
Contributor Author

vdbergh commented Aug 1, 2024

I think we should remain focused. The fact is that many tests get stopped at LLR ~ -1.0. All I did in the OP was point out that stopping early drastically changes beta.

We had this discussion about beta before, but then Vondele noted that instead of changing beta one may equivalently change the Elo bounds.

This is mathematically true, but now I realize that there is a psychological difference. With an SPRT[-2.94, 2.94] people will always be tempted to stop a test at LLR=-1.0, no matter what the Elo bounds are.

With an asymmetric SPRT (e.g. SPRT[-0.983, 2.557] which has alpha=0.05 and the same beta value as stopping an SPRT[-2.94, 2.94] at LLR=-1.0) there would be less temptation - I think....
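As a rough check on those asymmetric bounds, Wald's classical approximation (lower bound ln(beta / (1 - alpha)), upper bound ln((1 - beta) / alpha)) lands close to the quoted values; the exact figures (-0.983, 2.557) presumably come from a more precise calculation, so treat this only as a sanity-check sketch:

```python
import math

def wald_bounds(alpha, beta):
    # Accept H0 at the lower bound, accept H1 at the upper bound.
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    return lower, upper

# Effective beta of an SPRT[-2.94, 2.94] abandoned at LLR = -1.0:
beta = (1 - 0.05) * math.exp(-1.0)           # ~0.35
lo, hi = wald_bounds(alpha=0.05, beta=beta)
print(round(lo, 3), round(hi, 3))            # near the quoted (-0.983, 2.557)
```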

@vdbergh
Contributor Author

vdbergh commented Aug 1, 2024

BTW I think the idea of picking a "promising" patch among related patches depending on how the LLR evolves is strongly flawed. The LLR is a much too noisy metric for that.

@peregrineshahin
Copy link
Contributor

If LOS is too noisy and LLR is too noisy, maybe we can fix one, or introduce one that is not noisy?

@vdbergh
Contributor Author

vdbergh commented Aug 1, 2024

> If LOS is too noisy and LLR is too noisy, maybe we can fix one, or introduce one that is not noisy?

As I see it, noisy optimization is intrinsically difficult.

@dubslow
Contributor

dubslow commented Aug 1, 2024

I complained about this practice on Discord a couple of days ago and was greeted mostly with silence, plus one comment that running tests to -2.94 LLR is too dogmatic (implicitly corroborated by pere's first comment here).

I'm of the opinion that stopping early too frequently actually wastes more computation than letting tests run to the bounds. I agree that if devs really intend to have a higher false-negative rate, they should use an alternate beta before the test starts (rather than spamming the tests page with inconclusive noise, as at present). Choosing the stopping rule during the test adds extra bias compared to choosing it before the test and sticking to it.
