
Server still allows timelosses to cause incorrect test results #1968

Open
dubslow opened this issue Apr 29, 2024 · 4 comments
Comments

@dubslow
Contributor

dubslow commented Apr 29, 2024

I've just had a concrete example of this effect.

Test 1 https://tests.stockfishchess.org/tests/view/662db69c6115ff6764c8065a
This failed, quite surprisingly, in 12k games, with more than 1% timelosses across the whole test and some tasks reaching nearly 10%.

By contrast, Test A https://tests.stockfishchess.org/tests/view/662db6b36115ff6764c80667 had passed in 50k games, and several other tests in the family had indicated that Tests 1 and A should have passed or failed together.

On that basis I rescheduled Test 1 as Test 2 after the timelossing workers had been fixed: https://tests.stockfishchess.org/tests/view/662edf05e1ff56336c0223d1
This time it went as I expected. It passed (considerably slower than I'd hoped but whatever), which is a stark contrast to failing in 12k games.

Is the difference in these two runs attributable to luck? Certainly, a large portion of it is. However, a large part is also clearly the direct impact of timelosses: for example, the WW/LL rate in the bad test was much, much higher than in the reschedule, or indeed in STCs generally. That can only be attributed directly to the server accepting bad data.

Please note that last year I submitted a PR (#1571) to (among other things) tighten up the acceptable data quality; however, it was entirely ignored. That PR is now out of date, but as these tests demonstrate, the issue is clearly still a problem on the server and should be mitigated as soon as possible.

Ideally the server should be able to reject only the affected game pairs, but rejecting tasks with more than 3-5 timelosses would be a good place to start (and ideally such rejection should occur even before a manual or auto purge).
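A minimal sketch of the kind of server-side guard proposed here, in Python since that is what fishtest is written in. The field names (`time_losses`, `games`) and the function are assumptions for illustration, not fishtest's actual schema or API:

```python
# Hypothetical sketch: reject a task's results when its timeloss count
# exceeds a fixed threshold, before any purging happens.
# Field names below are assumptions, not fishtest's real schema.

TIMELOSS_REJECT_THRESHOLD = 5  # the 3-5 range suggested above


def should_reject_task(task: dict) -> bool:
    """Return True if this task's results should be discarded as bad data."""
    return task.get("time_losses", 0) > TIMELOSS_REJECT_THRESHOLD


# A task with 8 timelosses would be rejected; one with 2 would be kept.
print(should_reject_task({"time_losses": 8, "games": 500}))
print(should_reject_task({"time_losses": 2, "games": 500}))
```

A per-task absolute threshold is the simplest possible rule; a production version might instead scale the threshold with the number of games in the task.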

Given the impact of this issue, I'd recommend that anyone who ran STCs while the bad data was being accepted by the server (roughly the day before this issue was created; see the bad test's timestamps) consider rerunning those tests, as the results may differ without the bad data.

@dubslow
Contributor Author

dubslow commented Apr 29, 2024

(I hope it goes without saying that I am most definitely a million percent happy, pleased, deeply thankful and profoundly grateful that we had a new contributor throwing oodles and oodles of new cores at fishtest. I have absolutely no problem with people adding new cores, this is strictly a server problem as far as I'm concerned.)

@peregrineshahin
Contributor

Timelosses are part of development: you can have a perfectly reflective result if you mess up Stockfish's time-management code and keep losing on time. The timelosses are not generated by the server itself but by real-world phenomena (Stockfish being run on some hardware). I don't think the matter is as trivial as outright rejecting all timelosses.

@dubslow
Contributor Author

dubslow commented Apr 29, 2024

Sure, but patches which touch TM are a very small fraction, and the fact remains that non-engine timelosses directly cause incorrect results on non-TM patches.

@dubslow
Contributor Author

dubslow commented May 3, 2024

I suppose that, ultimately, discerning between "this patch affects TM" and "this patch doesn't affect TM" is something that can't be automated, which means we need some manual way to mark the difference in the fishtest UI.

Adding a second checkbox, "clear timelosses", next to "auto purge" would do the trick in a non-intrusive way. It would also mean that autopurge's timeloss barriers could be loosened, since even at present autopurge may "false purge" on TM patches (although they're so rare it hasn't mattered so far).

And with this manual checkbox enabled, the server would remove any timeloss pair whatsoever from the results (ideally it would remove individual pairs rather than whole tasks).
