split worker for reduced concurrency #1393

Open
vondele opened this issue Jul 30, 2022 · 10 comments
@vondele
Member

vondele commented Jul 30, 2022

It seems like the time loss problem is getting more acute with multiple large-core workers.
See e.g. https://tests.stockfishchess.org/tests/view/62e523e2b383a712b1386193
We know this is probably due to cutechess not being able to deal with a large concurrency, and our best workaround is probably to split the worker internally (so it is not visible from the user side), running multiple cutechess processes, each with reduced concurrency.

@dubslow
Contributor

dubslow commented Jul 30, 2022

No matter what the prevention solution is -- fixing cutechess or making workarounds inside the worker -- fishtest should be able to manage tasks with time losses separately from high-residual tasks (e.g. rejecting them, purging them, etc.).
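As a rough illustration only (the field names and thresholds below are invented, not fishtest's actual server code or task schema), such a separate time-loss check could look like:

# Hypothetical server-side check: flag tasks with too many time losses
# independently of the existing high-residual check. Field names are assumptions.
TIME_LOSS_LIMIT = 5      # assumed threshold
RESIDUAL_LIMIT = 3.0     # assumed threshold for the residual check

def purge_reason(task):
    """Return why a task should be rejected/purged, or None to keep it."""
    if task.get("time_losses", 0) > TIME_LOSS_LIMIT:
        return "too many time losses"
    if abs(task.get("residual", 0.0)) > RESIDUAL_LIMIT:
        return "high residual"
    return None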

@vdbergh
Contributor

vdbergh commented Jul 31, 2022

technologov-28cores-r345 has lots of issues. See e.g. #1360.

@vondele
Member Author

vondele commented Jul 31, 2022

I've pinged him on Discord. But there are at least two other workers with a lot of losses.

dubslow added a commit to dubslow/fishtest that referenced this issue Aug 2, 2022
Marked as draft because it needs review/testing, but in principle it should be merged as soon as possible.

This addresses server-side filtering of the problematic tasks caused by official-stockfish#1393. Obviously we should also prevent this worker-side problem in the first place, but in the short run this should at least prevent fishtest tests from being polluted by garbage, as they currently are.
@ppigazzini
Collaborator

Script for Linux:

  • creates the "fishtest" user
  • creates 5 copies of the "worker" directory, to be able to run 5 workers
  • creates a systemd unit file (named "fishtest-worker") that takes the worker ID as a parameter
  • start, stop, or check the status of any single worker with sudo systemctl start fishtest-worker@3
  • start or stop all the workers with sudo systemctl start fishtest-worker@{0..4}
  • enable auto start for the workers with sudo systemctl enable fishtest-worker@{0..4}.service

To revert:

  • sudo systemctl disable fishtest-worker@{0..4}.service
  • sudo rm /etc/systemd/system/[email protected]
  • sudo deluser --remove-home fishtest (this deletes the fishtest user and all of its files/folders)
#!/bin/bash
# setup_workers.sh
# to set up fishtest workers on Ubuntu 20.04, simply run:
# sudo bash setup_workers.sh 2>&1 | tee setup_workers.sh.log

# print CPU information
cpu_model=$(grep "^model name" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
n_cpus=$( grep "^physical id" /proc/cpuinfo | sort | uniq | wc -l)
online_cores=$(grep "^bogo" /proc/cpuinfo | wc -l)
n_siblings=$(grep "^siblings" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
n_cpu_cores=$(grep "^cpu cores" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
total_siblings=$((${n_cpus} * ${n_siblings}))
total_cpu_cores=$((${n_cpus} * ${n_cpu_cores}))
printf "CPU model : ${cpu_model}\n"
printf "CPU       : %3d  -  Online cores    : %3d\n" ${n_cpus} ${online_cores}
printf "Siblings  : %3d  -  Total siblings  : %3d\n" ${n_siblings} ${total_siblings}
printf "CPU cores : %3d  -  Total CPU cores : %3d\n" ${n_cpu_cores} ${total_cpu_cores}

# read the fishtest credentials and the number of cores to be contributed
echo
echo "Write your fishtest username:"
read usr_name
echo "Write your fishtest password:"
read usr_pwd
echo "Write the number of cores to be contributed to fishtest:"
echo "(max suggested 'Total CPU cores - 1')"
read n_cores

# install required packages
apt update && apt full-upgrade -y && apt autoremove -y && apt clean
apt install -y python3 python3-venv git build-essential 

# new linux account used to run the worker
worker_user='fishtest'
# create user for fishtest
useradd -m -s /bin/bash ${worker_user}

# add the bash variable for the python virtual env
sudo -i -u ${worker_user} << 'EOF'
echo export VENV=${HOME}/fishtest/worker/env >> .profile
EOF

# download fishtest
sudo -i -u ${worker_user} << EOF
git clone --single-branch --branch master https://github.com/glinscott/fishtest.git
cd fishtest
git config user.email "[email protected]"
git config user.name "your_name"
EOF

# fishtest worker setup and first start only to write the "fishtest.cfg" configuration file
sudo -i -u ${worker_user} << EOF
python3 -m venv \${VENV}
\${VENV}/bin/python3 -m pip install --upgrade pip setuptools wheel
\${VENV}/bin/python3 -m pip install requests

\${VENV}/bin/python3 \${HOME}/fishtest/worker/worker.py --concurrency ${n_cores} ${usr_name} ${usr_pwd} --only_config True && echo "concurrency successfully set" || echo "Restart the script using a proper concurrency value"
EOF

# copy the worker directory N=5 times (change according to your needs)
sudo -i -u ${worker_user} << 'EOF'
cd fishtest
for ((k=0; k<=4; k++)); do
  cp -r worker worker${k}  
done 
EOF

echo
echo "Setup fishtest-worker as a service"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."

# install fishtest-worker as systemd service
# start/stop the worker with:
# sudo systemctl start fishtest-worker@{0..4}
# sudo systemctl stop fishtest-worker@{0..4}
# check the log with:
# sudo journalctl -u [email protected]
# the service uses the worker configuration file "fishtest.cfg"

# get the worker_user $HOME
worker_user_home=$(sudo -i -u ${worker_user} << 'EOF'
echo ${HOME}
EOF
)

cat << EOF > /etc/systemd/system/[email protected]
[Unit]
Description=Fishtest worker %i
After=multi-user.target

[Service]
Type=simple
StandardOutput=file:${worker_user_home}/fishtest/worker%i/worker.log
StandardError=inherit
ExecStart=${worker_user_home}/fishtest/worker%i/env/bin/python3 ${worker_user_home}/fishtest/worker%i/worker.py
User=${worker_user}
WorkingDirectory=${worker_user_home}/fishtest/worker%i

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

echo
echo "Start fishtest-worker service"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."
systemctl start fishtest-worker@{0..4}.service

echo
echo "Enable fishtest-worker service auto start"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."
systemctl enable fishtest-worker@{0..4}.service

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

I think the cleanest way to achieve splitting is to create a --pool N argument for the worker. Default N=1 which does nothing (current behaviour). N=0 means that the worker is a clone worker and N>=2 means that the worker is a master (see below).

If N>=2 the worker would quietly create N clone copies of itself (in subdirectories), with slightly adapted fishtest.cfg (the memory, concurrency and uuid_prefix options) and then start these clones using popen (with --pool 0).

For each clone there should probably be a controlling thread in the master to manage its life cycle... I am a bit worried about Ctrl-C handling though (we want the clones to quit if the master worker receives Ctrl-C).

Clone workers do not upgrade. If the master upgrades then the clone workers are stopped and deleted. They will be recreated when the master restarts.

The main reason for doing it this way would be to keep the error handling manageable.

If instead we were to start multiple copies of cutechess within a single worker, the error handling would, I think, be a nightmare.
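As a very rough sketch of this master/clone flow (the --pool wiring, helper names, and config handling below are all assumptions for illustration, not the real worker.py):

# pool_sketch.py - hypothetical illustration of the proposed --pool behaviour;
# nothing here is taken from the actual worker.py, all names are assumptions.
import argparse, shutil, signal, subprocess, sys
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--concurrency", type=int, default=3)
    parser.add_argument("--pool", type=int, default=1)  # 1: normal, 0: clone, >=2: master
    args = parser.parse_args()

    if args.pool <= 1:
        run_games(args.concurrency)   # current behaviour, also used by clones (--pool 0)
        return

    # Master: create N clone copies in subdirectories and start them with --pool 0
    # so they never try to spawn clones (or upgrade) themselves.
    here = Path(__file__).resolve().parent
    base, rest = divmod(args.concurrency, args.pool)
    clones = []
    for i in range(args.pool):
        clone_dir = here / f"clone{i}"
        if clone_dir.exists():
            shutil.rmtree(clone_dir)  # clones are deleted and recreated, they never upgrade
        shutil.copytree(here, clone_dir, ignore=shutil.ignore_patterns("clone*"))
        # here the master would also rewrite the clone's fishtest.cfg
        # (memory, concurrency, uuid_prefix)
        cores = base + (1 if i < rest else 0)
        clones.append(subprocess.Popen(
            [sys.executable, str(clone_dir / "worker.py"),
             "--pool", "0", "--concurrency", str(cores)],
            cwd=clone_dir))

    def shutdown(signum, frame):      # forward Ctrl-C / SIGTERM to all clones
        for p in clones:
            p.terminate()

    signal.signal(signal.SIGINT, shutdown)
    signal.signal(signal.SIGTERM, shutdown)
    for p in clones:
        p.wait()

def run_games(concurrency):
    print(f"running games with concurrency {concurrency}")

if __name__ == "__main__":
    main()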

@vondele
Member Author

vondele commented Aug 9, 2022

yes, I agree that this could be done at a higher level like you describe.

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

Allowing the clone workers to update would lead to pretty bad race conditions. So I adapted the proposal accordingly.

@ppigazzini
Collaborator

Problems:

  • the user's willingness to start N workers, whether with --pool or with systemd. The big workers are contributed by people with a high skill set, surely able to set up a systemd unit, but they never did it (even when provided with the script above)
  • how to stop or restart only 1-2 workers without stopping or restarting all the workers, thus losing the games played

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

to stop or restart only 1-2 workers without stopping or restarting all the workers, thus losing the games played

I was thinking that, from the point of view of the user, the result of --pool would be a single worker. So the clones live and die together. If one has a 95-core worker, one also cannot restart just 20 of its cores.

The two solutions (--pool and the user manually splitting the worker) are not mutually exclusive.

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

We can set the default value of pool to ceil(concurrency/32). In that way nothing would change for workers with <= 32 cores.

A 33-core worker would split up as a 16-core worker and a 17-core worker.
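A small worked example of that default and the resulting core split (illustrative arithmetic only; the function names are made up, not worker code):

# pool_default_sketch.py - illustrative only; function names are assumptions.
import math

def default_pool(concurrency, max_per_clone=32):
    # proposed default: one clone per 32 cores, rounded up
    return math.ceil(concurrency / max_per_clone)

def split_cores(concurrency, pool):
    # spread the cores as evenly as possible over the clones
    base, rest = divmod(concurrency, pool)
    return [base + (1 if i < rest else 0) for i in range(pool)]

for c in (16, 32, 33, 95):
    p = default_pool(c)
    print(c, p, split_cores(c, p))
# 16 -> pool 1, [16]
# 32 -> pool 1, [32]
# 33 -> pool 2, [17, 16]   (a 33-core worker splits into 17 + 16 cores)
# 95 -> pool 3, [32, 32, 31]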
