Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify base signature before new #1893

Closed
wants to merge 1 commit into from
Closed

Verify base signature before new #1893

wants to merge 1 commit into from

Conversation

peregrineshahin
Copy link
Contributor

so that we know if master passed

so that we know if master passed
@ppigazzini
Copy link
Collaborator

I don't grasp what is the improvement introduced by this change:

  • the worker already checks the signature of both "base" and "test"
  • "base" and "test" are not bound to be set to the "master" branch, even if that is the usual use case on fishtest

@peregrineshahin
Copy link
Contributor Author

Are you sure the worker checks both?
suppose the first verify raised an error does the worker continue to verify master?
If so you can close the PR, but at that time looking at the log with the exit failure,, I didn't see a verification happening for master.

The goal was to verify that it's not master that is bugged with unstable bench or exit, since the new patch is based on master most of the time.

@ppigazzini
Copy link
Collaborator

Either with master or PR code the worker raises an Exception if the bench is wrong. The "base" usually is master, with the bench set automatically by the server, while the bench of "test" is written by the user and has a probability of being wrong. So, failing fast verifying first the "test" bench is a better practice IMO.

@peregrineshahin
Copy link
Contributor Author

https://tests.stockfishchess.org/actions?action=&user=&text=EXIT_FAILURE
Does these include any master commit?
See here no EXIT_FAILURE for master that left us clueless, you need real case senario here is the real world scenario, the goal wasn't cosmatics

@peregrineshahin
Copy link
Contributor Author

if as you say these massages does include master commits which I'm lazy to check since it requires looking back to master commits SHA, then I don't see the point of my PR.
i.e. if really we verify both nevertheless

@ppigazzini
Copy link
Collaborator

I suppose that was due to workers without wget/curl, unable to download the net(s) required by master through the makefile.
And in any case, the worker verified the bench of "base:master" and raised that exception.

@peregrineshahin
Copy link
Contributor Author

but still, teach me why it continues to verify master

is it because RunException, and WorkerExcpetion continues the next code normally?

@peregrineshahin
Copy link
Contributor Author

peregrineshahin commented Mar 3, 2024

    if p.returncode != 0:
        if p.returncode == 1:  # EXIT_FAILURE
            raise RunException(
                "Bench of {} exited with EXIT_FAILURE".format(os.path.basename(engine))
            )
        else:  # Signal? It could be user generated so be careful.
            raise WorkerException(
                "Bench of {} exited with error code {}".format(
                    os.path.basename(engine), format_return_code(p.returncode)
                )
            )

@ppigazzini
Copy link
Collaborator

The code verifies the benches of the branch of "test" and the branch of "base", on fishtest we can schedule a test "sf_11" vs "sf_12". No code is verifying expressly the bench of "master", neither your proposed change.

@peregrineshahin
Copy link
Contributor Author

I never said that it verifies master, it's just the most probable case

@Disservin
Copy link
Member

Disservin commented Mar 3, 2024

I suppose that was due to workers without wget/curl, unable to download the net(s) required by master through the makefile.

I don't think it was, I remember checking a workers log and there was no error from make saying that the download failed, it (iirc) even said it downloaded it successfully. My guess at that time was the net was deleted after the download from the makefile. I've attached the log from techno at that time, if you want to take a look.

fishtest-client.zip

@peregrineshahin
Copy link
Contributor Author

Explain how you expect the worker to verify both
when RunException is triggered

@ppigazzini
Copy link
Collaborator

ppigazzini commented Mar 3, 2024

Explain how you expect the worker to verify both when RunException is triggered

Explain you how your code is verifying "master", please.

@ppigazzini ppigazzini closed this Mar 3, 2024
@peregrineshahin
Copy link
Contributor Author

Explain you how your code is verifying "master", please.

If we have the verify of base first, which in most of the time is the master, and even if not it was master it's most probably an older version of the new base
we can assume if the verify of base is done first and no error is raised but the error is raised on the new afterwards then we know as Stockfish developers that master or base is clean and the oroblem is in the new.

Why you took the master so religously?

@peregrineshahin
Copy link
Contributor Author

And what's hard about that to imagine?
Just say that it verifies both
if it doesn't verify both then you are the clueless one and relgously netpicking the master word

@ppigazzini
Copy link
Collaborator

ppigazzini commented Mar 3, 2024

Are you hinting that devs always set a wrong bench for "new", skipping the verification of "base"?
This log is the proof that the bench of "base" is verified.
https://tests.stockfishchess.org/actions?action=&user=&text=EXIT_FAILURE

@peregrineshahin
Copy link
Contributor Author

It's not about devs putting the wrong bench
it's about the binary itself could hold unstable bench because of undefined behavior

@peregrineshahin
Copy link
Contributor Author

once in a while we have this problem of undefined behavior if the base does not get verified before new, then we are cluless.
If it doesn't verify both

@peregrineshahin
Copy link
Contributor Author

At any rate arguing with you is useless at this point.

@ppigazzini
Copy link
Collaborator

Show me the code in this PR that verifies the benches of both "base" and "test" before raising the exception.
Check here how I collected all the missing nets before raising the exception.
#1902

@peregrineshahin
Copy link
Contributor Author

That wasnt the goal.. the goal is telling SF devs that we in fishtest code do verify base before new.. thus with this code knowledge we can tell other people asking if their undefined behavior is from the patch or base that fishtest code if it was base that is bugged then you know it's base and noth their patch that is bugged.

@ppigazzini
Copy link
Collaborator

I suppose that was due to workers without wget/curl, unable to download the net(s) required by master through the makefile.

I don't think it was, I remember checking a workers log and there was no error from make saying that the download failed, it (iirc) even said it downloaded it successfully. My guess at that time was the net was deleted after the download from the makefile. I've attached the log from techno at that time, if you want to take a look.

fishtest-client.zip

If there is the suspect of an instability of the bench of master, IMO is better to write a little script to loop on: building master and verifying the bench.

@Disservin
Copy link
Member

Disservin commented Mar 3, 2024

Dunno how you came to that conclusion, in essence fishtest is already is kinda doing this because multiple workers run a patch. Back then I suspected that the fishtest worker during cleanup deleted the small net after it was downloaded, but that was just a guess

@peregrineshahin
Copy link
Contributor Author

Do you get the idea now?
This PR provides the neccessary knowledge to tell the dev that their patch is bugged and not master.
I lost it at this point not gonnas argue further

@peregrineshahin
Copy link
Contributor Author

And I emphasize on "master" ;)

@ppigazzini
Copy link
Collaborator

Dunno how you came to that conclusion, in essence fishtest is already is kinda doing this because multiple workers run a patch. Back then I suspected that the fishtest worker during cleanup deleted the small net after it was downloaded, but that was just a guess

Ask to him #1893 (comment)

@ppigazzini
Copy link
Collaborator

So master become master passing STC and LTC never failing the bench or on fishtest or in the CI, and then it starts failing the bench.

@Disservin
Copy link
Member

Disservin commented Mar 3, 2024

So master become master passing STC and LTC never failing the bench or on fishtest or in the CI, and then it starts failing the bench.

Well the case was that suddenly tests were failing and new people might think that the test failed because of a change from them. Not everything is tested and sometimes stuff like compiler issues also pop up which are unrelated to the patch, while our CI and any previous test passed successfully.

So pere tried to make it obvious that master has failed first instead of the patch, to signal that there is perhaps a larger problem present. If the code actually does that I have no idea, I didn't check.

@vdbergh
Copy link
Contributor

vdbergh commented Mar 3, 2024

So master become master passing STC and LTC never failing the bench or on fishtest or in the CI, and then it starts failing the bench.

Well the case was that suddenly tests were failing and new people might think that the test failed because of a change from them. Not everything is tested and sometimes stuff like compiler issues also pop up which are unrelated to the patch, while our CI and any previous test passed successfully.

So pere tried to make it obvious that master has failed first instead of the patch, to signal that there is perhaps a larger problem present. If the code actually does that I have no idea, I didn't check.

I guess that makes sense.

@peregrineshahin peregrineshahin deleted the verify branch March 3, 2024 18:19
@ppigazzini
Copy link
Collaborator

ppigazzini commented Mar 3, 2024

I guess that makes sense.

see here:
#1897

a short recap:

  • the first dual net test was submitted on fishtest without asking for fishtest proper changes to support it, crashing for days all the workers missingcurl/wget
  • the EXIT_FAILURE events started with the dual nets tests, and are stopped after adding the proper support to the worker and to the server for the multiple nets

With this PR in place, instead than the false flag of a wrong SF dev, we had people panicking for a big issue in SF master :)
We are already too good in chasing ghosts, instead that searching for the simple root cause: our fragile code or our bad habits.

As I already wrote few messages ago #1893 (comment) to have a clear view of the issue it's necessary to have both the verification of the benches, not only one, and only then raise the exception with all the information. I added a link to a recent PR to be taken as example.

@peregrineshahin
Copy link
Contributor Author

I agree that my PR was lazy and adhoc that doesn't negate the fact that you were not listening to the purpose of the PR and not suggesting code changes.

Yes verifying both and then raise the error is better but that also addresses the goal

@ppigazzini
Copy link
Collaborator

I agree that my PR was lazy and adhoc that doesn't negate the fact that you were not listening to the purpose of the PR and not suggesting code changes.

Yes verifying both and then raise the error is better but that also addresses the goal

No, it simply starts the chase of the big ghost "master is bugged!" when the root cause is nearly 100% in fishtest code/process/worker.

I already linked a code that raises the exception after all the checks, feel free to take inspiration.

@vdbergh
Copy link
Contributor

vdbergh commented Mar 3, 2024

If the build of master suddenly starts failing then this is likely a problem in Fishtest (the idea behind this PR).

But: if all builds suddenly start failing (the case we are discussing) then that also indicates a likely problem in Fishtest....

So from this point of view this PR does not yield new information.

@peregrineshahin
Copy link
Contributor Author

peregrineshahin commented Mar 3, 2024

Well that's not idea of the PR, the idea of the PR is to verify if base has undefined behavior it's not about fishtest.
I wanted to verify that before we exit. but they haven't yet answered my question, does verifying the new patch that can result in a error stops the excution of next code, if yes then we don't know if the problem also exists in master, if no then the PR useless since we will verify both nevertheless.

@vdbergh
Copy link
Contributor

vdbergh commented Mar 3, 2024

Not sure if I understand your question. But with the current code (i.e. not this PR), if verifying the bench of New fails then a RunException is raised and no further verification is done (so base is not checked).

EDIT: assuming the failure is with exit code 1.

@peregrineshahin
Copy link
Contributor Author

Yes, then what my PR does, is giving the benefit of the doubt that the code of the dev is not the bugged one, se we verify master first if it does verify and afterwards his new patch does not verfy then we can say to him your code is bugged and intoduces undefined behavior.
but now without this PR I cannnnnnnot telll anyone that their code is bugged.

@vdbergh
Copy link
Contributor

vdbergh commented Mar 3, 2024

But before master becomes master it has been built many times already. So a problem in master is unlikely and would be at most very intermittent. What we were seeing here is that every DEV build failed. If that were due to a problem in master, then it would mean that suddenly every build of master would also be failing. And as I explained this is unlikely.

@Disservin
Copy link
Member

@vdbergh not sure which case you are discussing right now, but the not all builds failed for the aforementioned patches which resulted in EXIT_FAILURE. Some got to play the games and others didn't.

I see some benefit in easing this up since it's just a minor change and can potentially spot issues. I.e. a compiler miscompilation on very new compilers which only a small % of users have installed, will show us that it first failed for master, thus we have a bug in master.. vs user sees his test code fails for some workers, so they think they did something wrong change X resubmit and issue solved, never to be looked at again.

@peregrineshahin
Copy link
Contributor Author

btw I don't care if this got merged or not, I was only pissed off of the attitude of dealing with it, the only thing I'm surprised about is how being an approver + sf contributor + fishtest contributor helps in being wiser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants