
Untested try at rejecting workers submitting bad data #1394

Closed
wants to merge 2 commits

Conversation

dubslow (Contributor) commented Aug 2, 2022

Marked as draft because it needs review/testing, but in principle it should be merged as soon as possible.

This adds server-side filtering of problematic tasks caused by #1393. Obviously we should also prevent this worker-side problem in the first place, but in the short run this should at least keep fishtest runs from being polluted by garbage, as they currently are.

@dubslow dubslow mentioned this pull request Aug 2, 2022
dubslow (Contributor, Author) commented Aug 2, 2022

Well, at least it passes CI now. I still can't say what it might do in production; I hope someone more knowledgeable can work on this and merge it.

@dubslow dubslow marked this pull request as ready for review August 2, 2022 21:05
vdbergh (Contributor) commented Aug 7, 2022

One issue with this PR is that it is run after every update. So if a task starts with a time loss, the task will be cancelled, even if the total number of time losses turns out to be less than 1% (easily fixable, of course).

As an alternative to this PR, we could also simply turn on auto purge again... Note that auto purge rejects tasks with time losses > 10%, but that would be easy to change.
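The cumulative check described above can be sketched as follows. This is a hypothetical illustration, not fishtest's actual code; the minimum-games guard and both thresholds are assumptions chosen to show why a 1-out-of-1 time loss should not cancel a task.

```python
# Hypothetical sketch (not fishtest's actual code): reject a task only
# once it has played enough games AND its cumulative time-loss ratio
# exceeds a threshold, so a single early time loss is not fatal.

MIN_GAMES = 100            # assumed guard: don't judge a task earlier
MAX_TIMELOSS_RATIO = 0.01  # the 1% threshold discussed in this thread

def should_reject(num_games: int, num_timelosses: int) -> bool:
    """Return True if the task's time-loss rate is conclusively bad."""
    if num_games < MIN_GAMES:
        return False  # too early to tell; 1 loss in 1 game is not evidence
    return num_timelosses / num_games > MAX_TIMELOSS_RATIO
```

With this shape, a task that opens with a time loss survives, while a task running at 2.5% time losses over 200 games is flagged.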

vdbergh (Contributor) commented Aug 7, 2022

An advantage of the current PR would be that the offending workers would be flagged in the event log. The owner could then split the worker into several lower-core ones (this would just involve copying the worker directory and starting the subworkers once with a lower concurrency value).

Of course doing this internally in fishtest would be preferable but it is a substantial task.

dubslow (Contributor, Author) commented Aug 7, 2022

Ah, I see. Thanks for the review. Yes, I concur that rejecting a 1/1 time loss isn't ideal, although it would have been an improvement when many workers were producing too many of these.

So purge_run, https://github.com/glinscott/fishtest/blob/d5ea015623a94f95295fa9b71443705a507ba75e/server/fishtest/rundb.py#L1062 — when does this function get called? How does a user trigger a purge on fishtest? Is that admin-only? Certainly I think dropping that 10% time-loss threshold down to 1% would be a great improvement.

Why was autopurging disabled? I am quite ignorant about fishtest and its history, as you can tell.

(And of course now seeing that, I should also have simply used crash_or_time in this PR.)
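To make the purge-threshold idea concrete, here is an illustrative sketch of such a pass over a run's tasks. The field names and the dict-based task shape are assumptions for illustration, not fishtest's actual schema, and the 1% default is the value proposed above.

```python
# Hypothetical purge-style pass (field names are assumed, not fishtest's
# actual schema): collect tasks whose combined crash and time-loss
# fraction exceeds a configurable threshold.

def purge_bad_tasks(tasks, threshold=0.01):
    """Return the subset of tasks whose crash/time-loss rate exceeds threshold."""
    bad = []
    for task in tasks:
        games = task.get("num_games", 0)
        failures = task.get("crashes", 0) + task.get("time_losses", 0)
        if games > 0 and failures / games > threshold:
            bad.append(task)
    return bad
```

A lower threshold only changes which tasks land in the returned list; the run's remaining statistics would then be recomputed from the surviving tasks.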

vondele (Member) commented Aug 7, 2022

Both purging and deleting these batches are a hack... this needs to be solved at the root, IMO. Cutechess simply can't deal with the concurrency. Even if there are no time losses, it could still severely distort time management. The only real solution, IMO, is to internally split the worker.

We know the purging doesn't work well with NNUE, as machines with different hardware architecture are different.

dubslow (Contributor, Author) commented Aug 7, 2022

> both purging and deleting these batches is a hack... this needs to be solved at the root IMO.

Installing bad-data filters on fishtest isn't a hack, it's simply good engineering.

That said, any sources of bad data should also be solved at the root, prevented from occurring in the first place. By no means are the two approaches mutually exclusive; both strategies should be pursued to completion.

> We know the purging doesn't work well with NNUE, as machines with different hardware architecture are different.

Is this a "doesn't run properly" sort of thing, or simply results out of whack with the test average? If the latter, perhaps we could separate the purging of atypical results from the purging of crashes/time losses. They are definitely different categories of error; perhaps we should handle them separately.

vdbergh (Contributor) commented Aug 7, 2022

@vondele

> We know the purging doesn't work well with NNUE, as machines with different hardware architecture are different.

I think this problem is gone now. At least the chi^2 test does not indicate systematically deviating workers (other than time losses).

Perhaps this has to do with the switch to the UHO book. The noob book with its extremely high draw ratio was perhaps more sensitive to small hardware differences.
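The chi^2 test mentioned above compares each worker's results against the run-wide distribution. A minimal illustrative version (not fishtest's implementation; the W/D/L counts below are made up) might look like:

```python
# Illustrative sketch (not fishtest's implementation): Pearson chi-squared
# statistic of one worker's win/draw/loss counts against expected counts
# derived from the run-wide pooled W/D/L distribution.

def chi2_stat(worker_wdl, pooled_wdl):
    """Larger values indicate a worker deviating from the pooled results."""
    n = sum(worker_wdl)          # games played by this worker
    total = sum(pooled_wdl)      # games in the whole run
    stat = 0.0
    for observed, pooled in zip(worker_wdl, pooled_wdl):
        expected = n * pooled / total
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat  # compare against a chi-squared quantile with 2 dof
```

A worker whose W/D/L proportions match the pooled run exactly scores 0; a worker winning every game against a balanced pool scores far above any reasonable chi-squared cutoff.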

vondele (Member) commented Aug 7, 2022

I think this would still show for tests of new NNUE architectures.

However, if we don't fix the problem with concurrency, we'll just remove all contributions from linrock with his 95 core servers, and sebastronomy with 48 cores.

dubslow (Contributor, Author) commented Aug 7, 2022

> However, if we don't fix the problem with concurrency, we'll just remove all contributions from linrock with his 95 core servers, and sebastronomy with 48 cores.

We should of course fix the cause of the symptom, but nevertheless bad data is bad data and should be removed from test statistics.

That said, I now believe this PR isn't the best way to go about filtering the bad data. I think being able to separately purge crashes/time losses from "standard" residuals is the way to go. (I'm fine with keeping autopurging off; it's not a very future-proof tool.) The main problem is I don't know where the purge button is for existing tests on the site.

vondele (Member) commented Aug 7, 2022

[screenshot of the run page]

The red 'purge' button

dubslow (Contributor, Author) commented Aug 7, 2022

Ahhhhhh. I think, back when I was first poking around, I assumed that meant "purge from the DB", not anything relating to individual tasks... and thereafter it was forever filtered from my vision whenever I loaded a page...

@dubslow dubslow closed this Aug 7, 2022
vdbergh (Contributor) commented Aug 7, 2022

> I think this would still show for tests of new NNUE architectures.
>
> However, if we don't fix the problem with concurrency, we'll just remove all contributions from linrock with his 95 core servers, and sebastronomy with 48 cores.

@vondele We would encourage them to split up their workers into several lower-core workers. This is a completely trivial operation (just copy the worker directory and restart the workers with a lower number of cores).

I agree this is not ideal but doing the splitting internally in the worker is not a trivial operation and a volunteer needs to step up to do it.

Note: while the linrock workers seem to genuinely suffer from the high core cutechess issue, I am not so sure about sebastronomy. Technologov has several 56 core workers and these seem to run without issues.
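The manual split described above still leaves the owner to pick per-subworker core counts. A small hypothetical helper (not part of fishtest; the 32-core cap is an assumed "safe" value, not a documented limit) could divide the cores evenly:

```python
# Hypothetical helper (not part of fishtest): split a high-core worker
# into subworkers, each at or below an assumed safe per-worker core cap.

def split_concurrency(total_cores: int, max_per_worker: int = 32) -> list:
    """Divide total_cores as evenly as possible into chunks <= max_per_worker."""
    n_workers = -(-total_cores // max_per_worker)  # ceiling division
    base, extra = divmod(total_cores, n_workers)
    # the first `extra` subworkers get one extra core each
    return [base + 1] * extra + [base] * (n_workers - extra)
```

For the machines mentioned in this thread, a 95-core server would split into 32 + 32 + 31 and a 48-core one into 24 + 24; each copied worker directory would then be started with its lower concurrency value.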

dav1312 (Contributor) commented Aug 7, 2022

> I assumed that meant "purge from the DB"

I thought so too... maybe the name should be changed to something like "Clear tasks" or just "Purge tasks"?
