Workers with issues. #1360

Closed · vdbergh opened this issue Jun 14, 2022 · 82 comments · Fixed by #1759

Comments

@vdbergh
Contributor

vdbergh commented Jun 14, 2022

I am creating this issue to report workers with issues. For example, currently:

https://tests.stockfishchess.org/actions?action=failed_task&user=Oakwen&before=1655180683.077&max_actions=100

@vondele
Member

vondele commented Jun 14, 2022

This worker is probably running into a bug in the latest clang: llvm/llvm-project#55377
I see his workers are based on clang 15.

@vdbergh
Contributor Author

vdbergh commented Jun 14, 2022

Yes, but Oakwen-5cores-05c1d913 also runs clang 15 on WSL and seems to be doing fine.

@vondele
Member

vondele commented Jun 14, 2022

For whatever reason, the TLS might be handled differently, or the code might be properly aligned by luck, depending on the OS.

@vdbergh
Contributor Author

vdbergh commented Jun 15, 2022

Oakwen-3cores-8708609b switched to g++, so the problem is solved.

@vdbergh
Contributor Author

vdbergh commented Jun 15, 2022

Another issue: The matches of Dantist-7cores-3e5ab901 always finish with "Finished match uncleanly":

https://tests.stockfishchess.org/actions?action=failed_task&user=Dantist&before=1655265828.358&max_actions=100

This has been going on since forever. I have no idea how this is possible.

@vondele
Member

vondele commented Jun 15, 2022

Hard to guess; we would need a more detailed error message.

@vdbergh
Contributor Author

vdbergh commented Jun 16, 2022

The matches with "Finished match uncleanly" have no games but also no crashes. This suggests that cutechess-cli failed to start the engine(s). It would be good to have access to the worker output of Dantist-7cores-3e5ab901 so that we can see what's going on.

@vondele
Member

vondele commented Jun 16, 2022

@Dantist can you provide such output?

@vdbergh
Contributor Author

vdbergh commented Jun 17, 2022

@noobpwnftw Your worker ChessDBCN-16cores-f3dad03d is now suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655459532.996&max_actions=33 .

Note: The hexadecimal number f3dad03d is the first 8 characters of the UUID (which are constant). It can be found as a comment in the config file and also in uuid.txt.
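
To make the naming scheme concrete, here is a minimal sketch (not the actual fishtest code; the helper name worker_name is invented for illustration) of how a worker name like ChessDBCN-16cores-f3dad03d is assembled:

```python
import uuid

def worker_name(username, cores, worker_uuid):
    # The first 8 hex characters of the worker's UUID are constant
    # across runs, so they identify the worker in the action logs.
    return f"{username}-{cores}cores-{str(worker_uuid)[:8]}"

print(worker_name("ChessDBCN", 16, uuid.uuid4()))
```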

@Dantist

Dantist commented Jun 17, 2022

@vdbergh @vondele

This suggests that cutechess-cli failed to start the engine(s).

That actually looks correct: run.log


I have to say that I have a somewhat unusual setup. I planned to deploy the worker to many servers, so I dockerized it with the alpine:edge base image.

In general, it worked great: the latest GCC, Python, and cutechess-cli (compiled from source).
The worker's auto-update feature, however, often broke things (sometimes it auto-updated normally, and sometimes it pulled a cutechess-cli binary that did not work with musl). I just had to monitor the workers and rebuild the Docker image so that cutechess-cli was again compiled from source. Sadly, I often noticed this only after a few days of worker inactivity.
It would be cool if the actual version of cutechess-cli were checked prior to updating, if this feature were made optional, or if the updated cutechess-cli binary were verified to be executable before replacing the working one (a sketch of this last check follows below).
I was occasionally reading Discord and saw that someone noticed the issue with cutechess-cli on my setup and correctly identified that my workers were running under Alpine and musl.
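
A minimal sketch of that last check, assuming the updater knows the paths of both binaries (the helper safe_replace is invented for illustration):

```python
import shutil
import subprocess

def safe_replace(new_binary, current_binary):
    """Install new_binary only if it actually runs on this system."""
    try:
        # On a musl system, a glibc-linked download fails right here with
        # an exec/loader error instead of breaking the worker days later.
        subprocess.run([new_binary, "--version"], check=True,
                       capture_output=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False  # keep the known-good binary
    shutil.move(new_binary, current_binary)
    return True
```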

Anyway, I currently see that this Docker image is no longer working, and rebuilding doesn't fix the issue; sadly, I have no time to fix it. Unfortunately, I haven't looked after my workers for some time and have fallen out of normal life, because now I have to defend my country from putin's barbaric invaders, but if I get out alive, I'll definitely fix everything.

I'll attach my Docker setup below for your convenience (in troubleshooting), but you might want to amend it and add this run method as one of the options on the "Running the worker" wiki page, or even make your own official Docker image and push it to hub.docker.com so people can run the worker with a single CLI command, without manually downloading anything.
This should work on any OS/arch where Docker is supported, but there is a drawback: if everyone uses this method, it will reduce the diversity of fishtest workers' setups.

Tiny Alpine Docker image: fishtest-docker.zip

I hope this can be of some help.
Stay safe, take care, and send armor to Ukraine, my western friends. If russia stops fighting, there will be no more war. If Ukrainians stop fighting, there will be no more Ukraine.

@vondele
Member

vondele commented Jun 17, 2022

Best of luck, and stay healthy.

@ppigazzini
Collaborator

I wish you way more than luck @Dantist

@vdbergh
Contributor Author

vdbergh commented Jun 18, 2022

@Dantist Thanks for the logs. They are very helpful. And good luck!

@vdbergh
Contributor Author

vdbergh commented Jun 18, 2022

@noobpwnftw The worker ChessDBCN-16cores-97544138 is also suffering from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1655554389.898&max_actions=100 .

@vdbergh
Contributor Author

vdbergh commented Jun 20, 2022

@ppigazzini
Collaborator

ppigazzini commented Jun 26, 2022

https://tests.stockfishchess.org/actions?actions=failed_task&user=technologov&max_actions=1&before=1656244248.6

See https://stackoverflow.com/questions/71580631/how-can-i-get-code-coverage-with-clang-13-0-1-on-mac

A macOS worker is running fine with clang, perhaps because it has an x86_64 CPU, so for the moment #1370 skips the profiled build only for Apple silicon.

@noobpwnftw
Contributor

I have removed worker f3dad03d. The issues from the others now seem less frequent.

@vdbergh
Contributor Author

vdbergh commented Jul 5, 2022

@ppigazzini I have noticed this AssertionError once before. I did a code review then but could not find what might cause it, so it is a mystery. I suspect it is some kind of race condition...

@vdbergh
Contributor Author

vdbergh commented Jul 23, 2022

@noobpwnftw The worker ChessDBCN-16cores-97544138 still suffers quite heavily from throttling. See https://tests.stockfishchess.org/actions?action=failed_task&user=ChessDBCN&before=1658567399.302&max_actions=100 .

@vdbergh
Contributor Author

vdbergh commented Jul 27, 2022

Worker technologov-28cores-r345 suffers quite badly from throttling

https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&before=1658909315.519&max_actions=100

@vdbergh
Contributor Author

vdbergh commented Jul 27, 2022

technologov-56cores-r101 suffers from "Finished match uncleanly". It plays no games, so this suggests that cutechess is unable to start the engines (the same issue Dantist had, but in that case it was fixed with the new cutechess binary).

https://tests.stockfishchess.org/actions?action=failed_task&user=technologov&max_actions=1&before=1658888465.102

EDIT: However, I checked that technologov-56cores-r101 does not always suffer from this. In many cases it can execute a task.

@ppigazzini
Collaborator

Here are some past analyses of the "Finished match uncleanly" problem:
#1110
#1116

@dubslow
Contributor

dubslow commented Aug 2, 2022

Since it has not yet been discussed here (see Discord): one technologov worker, as well as most or all workers of linrock and sebastronomy, has severe time-loss problems. This is of course yet another symptom of the known cutechess concurrency issues; however, until the worker or cutechess is fixed, it is causing substantial pollution of fishtest data (time losses produce higher-than-nominal pairwise "draws", in the form of 1-0 1-0 game pairs, thereby biasing test Elos towards 0).

See also #1393, for implementing a worker-side workaround of cutechess problems, and #1394 for server-side filtering of bad data.

@vdbergh
Contributor Author

vdbergh commented Aug 2, 2022

@dubslow This issue is specifically for documenting ill-behaved workers, so it is best to refer to a worker by its full name (as has been done in the earlier comments).

For documentation purposes it would be nice if there were a way in fishtest to link to a task (similar to the way it is possible to link to an event). Currently we can only link to a run.

@vdbergh
Contributor Author

vdbergh commented Jan 23, 2023

@MinetaS @silversolver1 Thanks for reporting. Unfortunately there is currently no real strategy for dealing with time losses (or other undesirable worker behavior). However, since yesterday excessive time losses have been recorded in the event log. This will make it easier to follow such workers.

https://tests.stockfishchess.org/actions?action=crash_or_time&user=&text=

@vdbergh
Contributor Author

vdbergh commented Jan 24, 2023

An excessive number of time losses by Wencey-32cores

https://tests.stockfishchess.org/actions?action=crash_or_time&user=&text=%22Wencey-32cores%22

@vdbergh
Contributor Author

vdbergh commented Jan 24, 2023

I assume that the issue is that on some systems the communication between the engine and cutechess-cli steals time from the engine. There really should be an option in cutechess-cli that makes it trust the time reported by the engines. Perhaps someone can implement this?
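
As far as I know cutechess-cli has no such option today; a rough sketch of the idea, assuming the match runner can see the engine's UCI output for each move (the function and its parameters are invented for illustration):

```python
import re

# UCI engines report their own elapsed search time in "info ... time <ms>".
INFO_TIME = re.compile(r"\binfo\b.*\btime (\d+)")

def charge_move(clock_ms, engine_lines, wall_elapsed_ms, trust_engine=True):
    """Return the engine's remaining clock after one move."""
    reported = None
    for line in engine_lines:  # everything the engine printed for this move
        match = INFO_TIME.search(line)
        if match:
            reported = int(match.group(1))  # the engine's own measurement
    if trust_engine and reported is not None:
        return clock_ms - reported
    # Otherwise charge wall time measured by the GUI, which also includes
    # the pipe-communication overhead that steals time from the engine.
    return clock_ms - wall_elapsed_ms
```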

@vdbergh
Contributor Author

vdbergh commented Feb 8, 2023

This task has more time losses than played games: https://tests.stockfishchess.org/tests/view/63df697473223e7f52ad5d79?show_task=1097

Must be a bug...

@vdbergh
Contributor Author

vdbergh commented Apr 8, 2023

@noobpwnftw Currently many of your workers have the same UUID prefix, which is not really desirable. I assume they were all started with a config file containing the same private section. The private section looks like this:

[private]
hw_seed = 2564186689

The private section is generated once and then saved in the config file. If it is deleted, it is regenerated. The hw_seed, which is simply a random number, is the last line of defense for distinguishing workers that are otherwise completely identical (e.g. running from the same virtual OS image).
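
A sketch of how such a section could be generated and regenerated with configparser; this illustrates the mechanism described above rather than quoting the actual worker code:

```python
import configparser
import random

def ensure_private_section(config_path):
    """Return the hw_seed, creating the [private] section if missing."""
    config = configparser.ConfigParser()
    config.read(config_path)
    if not config.has_section("private"):
        config.add_section("private")
        # A random 32-bit number: the last line of defense to tell
        # otherwise identical workers apart.
        config.set("private", "hw_seed", str(random.randint(0, 2**32 - 1)))
        with open(config_path, "w") as f:
            config.write(f)
    return config.getint("private", "hw_seed")
```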

@dav1312
Contributor

dav1312 commented Apr 20, 2023

@vondele
Member

vondele commented Apr 20, 2023

I have written an email to sebastronomy.

@mcbastian

Hi, that's me. It seems to be the NUMA issue discussed a while ago on Discord. I had to update the kernel and OS of the machine and restarted the worker without numactl. I have now restarted it with numactl and will keep an eye on it.

@mcbastian

Fixed now. Faulty CPU cooling on one CPU. I reduced the workload on that CPU and locked the workers to certain CPU cores (numactl). No more crashes since then. I will have to call DELL to fix this. The fans are all OK; it looks like a problem with the cooler itself.

@vdbergh
Contributor Author

vdbergh commented May 8, 2023

@vondele
Member

vondele commented May 8, 2023

That's an issue where the newest clang compiler doesn't recognize one of the options we used in older versions of SF. The newer Makefile works, but not for the regression test against SF 15.

@vdbergh
Contributor Author

vdbergh commented Aug 11, 2023

technologov-56cores-r116 appears to be generating a large number of dead tasks

https://tests.stockfishchess.org/actions?action=dead_task&user=&text=%22technologov-56cores-r116%22

Edit: the tasks don't have any games. After each reported dead task the worker restarts.

@vdbergh
Contributor Author

vdbergh commented Aug 12, 2023

technologov-56cores-r116 is still generating a large number of dead tasks. Perhaps someone can write an email?

@ppigazzini
Collaborator

ppigazzini commented Aug 12, 2023

technologov-56cores-r116 is still generating a large number of dead tasks. Perhaps someone can write an email?

I wrote to him on Discord; the user has already answered.

@vdbergh
Contributor Author

vdbergh commented Aug 23, 2023

There seem to be two workers named okrout-28cores-0a0cde5b, i.e. they have the same UUID prefix (but different UUIDs). This is harmless but not nice. It is also not easy to achieve accidentally.

There should be a server-side mechanism to avoid duplicate UUID prefixes, but this is not so easy, as one has to take, for example, dead tasks into account, which can linger for quite some time.

vdbergh added a commit to vdbergh/fishtest that referenced this issue Aug 24, 2023
During request_task, we check if there are active tasks
for workers with the same uuid prefix, which have recently been
updated (i.e. they are not dead). If so then we return an
error.

Should fix official-stockfish#1360.
ppigazzini pushed a commit that referenced this issue Aug 25, 2023
During request_task, we check if there are active tasks
for workers with the same name, which have recently been
updated (i.e. they are not dead). If so then we return an
error.

Should fix #1360 (comment).
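
For reference, a minimal sketch of the check described in the commit message; the 30-minute threshold and the task layout are assumptions, not the actual fishtest implementation:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)  # assumed threshold for "dead"

def request_task_allowed(worker_name, active_tasks):
    """active_tasks: iterable of dicts with tz-aware 'last_updated' times."""
    now = datetime.now(timezone.utc)
    for task in active_tasks:
        if (task["worker"] == worker_name
                and now - task["last_updated"] < STALE_AFTER):
            return False  # a live worker with this name already has a task
    return True
```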
@vdbergh
Contributor Author

vdbergh commented Sep 7, 2023

This worker has a problem with its make installation: https://tests.stockfishchess.org/actions?action=&user=&text=+%22maximmasiutin-2cores-f21151d3%22

@vdbergh
Contributor Author

vdbergh commented Sep 7, 2023

I can block him if you give me approver rights.

@ppigazzini
Collaborator

I can block him if you give me approver rights.

Done :)

@vdbergh
Contributor Author

vdbergh commented Oct 9, 2023

The strange case of okrout-28cores-ca072243.

Despite having been blocked, and having been sent an email about the blocking 3 days ago, the worker keeps dutifully reconnecting every 15 minutes, trying to get a task (see https://tests.stockfishchess.org/workers/show).

Presumably the worker runs unattended and the email address is stale.

It is not a problem. Just strange.

EDIT: The worker has been taken offline now.

EDIT2: The worker came briefly back online again and was then replaced by okrout-28cores-ba47b84b, which suffers from the same problem (I blocked it again). It seems the owner reads neither email nor the console messages.

@vdbergh
Contributor Author

vdbergh commented Jan 11, 2024

I suspect some users create multiple workers whose total number of cores is larger than the number of cores in the system, with predictably bad results. Perhaps this is happening here: https://tests.stockfishchess.org/actions?action=&user=tolkki963&text= ?

It seems feasible to count the total number of cores in use by the workers using (named) shared memory (available since Python 3.8); see the sketch after the list below. For Python <= 3.12 this requires a monkey patch to work around some bugs in resource tracking.

python/cpython#82300 (comment)

I tested this code on Python 3.10 and it works.

As usual, there are two annoying issues to deal with:

  • workers that are terminated using SIGKILL (they are unable to clean up properly)
  • synchronization
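
A minimal sketch of the shared-memory idea, with an invented segment name and a fixed one-byte-per-worker layout; it deliberately ignores the two issues above (no locking, no recovery of slots left behind by SIGKILLed workers) and the resource-tracker monkey patch:

```python
from multiprocessing import shared_memory

SEGMENT = "fishtest_cores"  # hypothetical name, shared by all workers
MAX_WORKERS = 64            # one byte per worker slot

def attach():
    """Create the per-machine segment, or attach to an existing one."""
    try:
        return shared_memory.SharedMemory(name=SEGMENT, create=True,
                                          size=MAX_WORKERS)
    except FileExistsError:
        return shared_memory.SharedMemory(name=SEGMENT)

def claim_slot(shm, cores):
    """Record this worker's core count in the first free slot."""
    for i in range(MAX_WORKERS):
        if shm.buf[i] == 0:
            shm.buf[i] = cores
            return i
    raise RuntimeError("no free worker slot")

def total_cores(shm):
    return sum(shm.buf[:MAX_WORKERS])

def release_slot(shm, slot):
    shm.buf[slot] = 0  # never runs if the worker is SIGKILLed
```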

@vdbergh
Contributor Author

vdbergh commented Jan 11, 2024

I am thinking that it is easier for the server to sort this out (if the workers send the necessary information).

The server needs to know which workers run on the same machine, but this can be done by using a random number stored in some fixed temporary file.
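
A sketch of that scheme, with an assumed file location and format; each worker would send this value to the server so it can group workers by machine:

```python
import secrets
import tempfile
from pathlib import Path

# Assumed fixed location; every worker on the machine reads the same file.
MACHINE_ID_FILE = Path(tempfile.gettempdir()) / "fishtest_machine_id"

def machine_id():
    """Return this machine's random identifier, creating it on first use."""
    try:
        return MACHINE_ID_FILE.read_text().strip()
    except FileNotFoundError:
        ident = secrets.token_hex(8)
        MACHINE_ID_FILE.write_text(ident)
        return ident
```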
