
Untested try at rejecting workers submitting bad data #1394

Closed
wants to merge 2 commits

Conversation

dubslow (Contributor) commented Aug 2, 2022

Marked as draft because it needs review/testing, but in principle it should be merged as soon as possible.

This adds server-side filtering of problematic tasks caused by #1393. Obviously we should also prevent this worker-side problem in the first place, but in the short run this should at least keep fishtest runs from being polluted by garbage, as they currently are.

@dubslow dubslow mentioned this pull request Aug 2, 2022
dubslow (Contributor, Author) commented Aug 2, 2022

Well, at least it passes CI now. I still can't say what it might do in production; I hope someone more knowledgeable can work on this and merge it.

@dubslow dubslow marked this pull request as ready for review August 2, 2022 21:05
vdbergh (Contributor) commented Aug 7, 2022

One issue with this PR is that it is run after every update. So if a task starts with a time loss, the task will be cancelled, even if the total number of time losses turns out to be less than 1% (easily fixable, of course).

As an alternative to this PR, we could also simply turn on auto purge again... Note that auto purge rejects tasks with time losses > 10%, but that would be easy to change.
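The cumulative check described above can be sketched as follows. This is a hypothetical illustration, not fishtest's actual code; the minimum-games guard and both thresholds are assumptions chosen to show why a 1-out-of-1 time loss should not cancel a task.

```python
# Hypothetical sketch (not fishtest's actual code): reject a task only
# once it has played enough games AND its cumulative time-loss ratio
# exceeds a threshold, so a single early time loss is not fatal.

MIN_GAMES = 100            # assumed guard: don't judge a task earlier
MAX_TIMELOSS_RATIO = 0.01  # the 1% threshold discussed in this thread

def should_reject(num_games: int, num_timelosses: int) -> bool:
    """Return True if the task's time-loss rate is conclusively bad."""
    if num_games < MIN_GAMES:
        return False  # too early to tell; 1 loss in 1 game is not evidence
    return num_timelosses / num_games > MAX_TIMELOSS_RATIO
```

With this shape, a task that opens with a time loss survives, while a task running at 2.5% time losses over 200 games is flagged.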

vdbergh (Contributor) commented Aug 7, 2022

An advantage of the current PR would be that the offending workers would be flagged in the event log. The owner could then split the worker into several lower-core ones (this would just involve copying the worker directory and starting the subworkers once with a lower concurrency value).

Of course doing this internally in fishtest would be preferable but it is a substantial task.

dubslow (Contributor, Author) commented Aug 7, 2022

Ah, I see. Thanks for the review. Yes, I concur that rejecting a 1/1 time loss isn't ideal, although it would have been an improvement when many workers were producing too many of these.

So purge_run, https://github.com/glinscott/fishtest/blob/d5ea015623a94f95295fa9b71443705a507ba75e/server/fishtest/rundb.py#L1062 — when does this function get called? How does a user trigger a purge on fishtest? Is that admin-only? Certainly I think dropping that 10% time-loss threshold down to 1% would be a great improvement.

Why was autopurging disabled? I am quite ignorant about fishtest and its history, as you can tell.

(And of course now seeing that, I should also have simply used crash_or_time in this PR.)
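To make the purge-threshold idea concrete, here is an illustrative sketch of such a pass over a run's tasks. The field names and the dict-based task shape are assumptions for illustration, not fishtest's actual schema, and the 1% default is the value proposed above.

```python
# Hypothetical purge-style pass (field names are assumed, not fishtest's
# actual schema): collect tasks whose combined crash and time-loss
# fraction exceeds a configurable threshold.

def purge_bad_tasks(tasks, threshold=0.01):
    """Return the subset of tasks whose crash/time-loss rate exceeds threshold."""
    bad = []
    for task in tasks:
        games = task.get("num_games", 0)
        failures = task.get("crashes", 0) + task.get("time_losses", 0)
        if games > 0 and failures / games > threshold:
            bad.append(task)
    return bad
```

A lower threshold only changes which tasks land in the returned list; the run's remaining statistics would then be recomputed from the surviving tasks.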

vondele (Member) commented Aug 7, 2022

Both purging and deleting these batches are a hack... this needs to be solved at the root, IMO. Cutechess simply can't deal with the concurrency. Even if there are no time losses, it could still severely distort time management. The only real solution, IMO, is to internally split the worker.

We know the purging doesn't work well with NNUE, as machines with different hardware architecture are different.

dubslow (Contributor, Author) commented Aug 7, 2022

> both purging and deleting these batches is a hack... this needs to be solved at the root IMO.

Installing bad-data filters on fishtest isn't a hack, it's simply good engineering.

That said, any sources of bad data should also be solved at the root, prevented from occurring in the first place. By no means are the two approaches mutually exclusive; both strategies should be pursued to completion.

> We know the purging doesn't work well with NNUE, as machines with different hardware architecture are different.

Is this a "doesn't run properly" sort of thing, or simply results out of whack with the test average? If the latter, perhaps we could separate the purging of atypical results from the purging of crashes/time losses. They are definitely different categories of error; perhaps we should handle them separately.

vdbergh (Contributor) commented Aug 7, 2022

@vondele

> We know the purging doesn't work well with NNUE, as machines with different hardware architecture are different.

I think this problem is gone now. At least the chi^2 test does not indicate systematically deviating workers (other than time losses).

Perhaps this has to do with the switch to the UHO book. The noob book with its extremely high draw ratio was perhaps more sensitive to small hardware differences.
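The chi^2 test mentioned above compares each worker's results against the run-wide distribution. A minimal illustrative version (not fishtest's implementation; the W/D/L counts below are made up) might look like:

```python
# Illustrative sketch (not fishtest's implementation): Pearson chi-squared
# statistic of one worker's win/draw/loss counts against expected counts
# derived from the run-wide pooled W/D/L distribution.

def chi2_stat(worker_wdl, pooled_wdl):
    """Larger values indicate a worker deviating from the pooled results."""
    n = sum(worker_wdl)          # games played by this worker
    total = sum(pooled_wdl)      # games in the whole run
    stat = 0.0
    for observed, pooled in zip(worker_wdl, pooled_wdl):
        expected = n * pooled / total
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat  # compare against a chi-squared quantile with 2 dof
```

A worker whose W/D/L proportions match the pooled run exactly scores 0; a worker winning every game against a balanced pool scores far above any reasonable chi-squared cutoff.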

vondele (Member) commented Aug 7, 2022

I think this would still show for tests of new NNUE architectures.

However, if we don't fix the problem with concurrency, we'll just remove all contributions from linrock with his 95 core servers, and sebastronomy with 48 cores.

dubslow (Contributor, Author) commented Aug 7, 2022

> However, if we don't fix the problem with concurrency, we'll just remove all contributions from linrock with his 95 core servers, and sebastronomy with 48 cores.

We should of course fix the cause of the symptom, but nevertheless bad data is bad data and should be removed from test statistics.

That said, I now believe this PR isn't the best way to go about filtering the bad data. I think being able to separately purge crashes/time losses from "standard" residuals is the way to go. (I'm fine with keeping autopurging off; it's not a very future-proof tool.) The main problem is I don't know where the purge button is for existing tests on the site.

vondele (Member) commented Aug 7, 2022

[screenshot of the run page]

The red 'purge' button

dubslow (Contributor, Author) commented Aug 7, 2022

Ahhhhhh. I think, back when I was first poking around, I assumed that meant "purge from the DB", not anything relating to individual tasks... and thereafter it was forever filtered from my vision whenever I loaded a page...

@dubslow dubslow closed this Aug 7, 2022
vdbergh (Contributor) commented Aug 7, 2022

> I think this would still show for tests of new NNUE architectures.
>
> However, if we don't fix the problem with concurrency, we'll just remove all contributions from linrock with his 95 core servers, and sebastronomy with 48 cores.

@vondele We would encourage them to split up their workers into several lower-core workers. This is a completely trivial operation (just copy the worker directory and restart the workers with a lower number of cores).

I agree this is not ideal but doing the splitting internally in the worker is not a trivial operation and a volunteer needs to step up to do it.

Note: while the linrock workers seem to genuinely suffer from the high core cutechess issue, I am not so sure about sebastronomy. Technologov has several 56 core workers and these seem to run without issues.
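The manual split described above still leaves the owner to pick per-subworker core counts. A small hypothetical helper (not part of fishtest; the 32-core cap is an assumed "safe" value, not a documented limit) could divide the cores evenly:

```python
# Hypothetical helper (not part of fishtest): split a high-core worker
# into subworkers, each at or below an assumed safe per-worker core cap.

def split_concurrency(total_cores: int, max_per_worker: int = 32) -> list:
    """Divide total_cores as evenly as possible into chunks <= max_per_worker."""
    n_workers = -(-total_cores // max_per_worker)  # ceiling division
    base, extra = divmod(total_cores, n_workers)
    # the first `extra` subworkers get one extra core each
    return [base + 1] * extra + [base] * (n_workers - extra)
```

For the machines mentioned in this thread, a 95-core server would split into 32 + 32 + 31 and a 48-core one into 24 + 24; each copied worker directory would then be started with its lower concurrency value.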

dav1312 (Contributor) commented Aug 7, 2022

> I assumed that meant "purge from the DB"

I thought so too... maybe the name should be changed to something like "Clear tasks" or just "Purge tasks"?
