split worker for reduced concurrency #1393

Open
vondele opened this issue Jul 30, 2022 · 10 comments
@vondele
Member

vondele commented Jul 30, 2022

It seems like the time loss problem is getting more acute with multiple large-core workers.
See e.g. https://tests.stockfishchess.org/tests/view/62e523e2b383a712b1386193
We know this is probably due to cutechess not being able to deal with a large concurrency, and our best workaround is probably to split the worker internally (so it is not visible from the user side), running multiple cutechess processes, each with reduced concurrency.

@dubslow
Contributor

dubslow commented Jul 30, 2022

No matter what the prevention solution is -- fixing cutechess or making workarounds inside the worker -- fishtest should be able to manage tasks with time losses separately from high-residual tasks (e.g. rejecting them, purging them, etc.).
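As a rough illustration only (the field names and thresholds below are invented, not fishtest's actual server code or task schema), such a separate time-loss check could look like:

# Hypothetical server-side check: flag tasks with too many time losses
# independently of the existing high-residual check. Field names are assumptions.
TIME_LOSS_LIMIT = 5      # assumed threshold
RESIDUAL_LIMIT = 3.0     # assumed threshold for the residual check

def purge_reason(task):
    """Return why a task should be rejected/purged, or None to keep it."""
    if task.get("time_losses", 0) > TIME_LOSS_LIMIT:
        return "too many time losses"
    if abs(task.get("residual", 0.0)) > RESIDUAL_LIMIT:
        return "high residual"
    return None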

@vdbergh
Contributor

vdbergh commented Jul 31, 2022

technologov-28cores-r345 has lots of issues. See e.g. #1360.

@vondele
Member Author

vondele commented Jul 31, 2022

I've pinged him on Discord. But there are at least two other workers with a lot of losses.

dubslow added a commit to dubslow/fishtest that referenced this issue Aug 2, 2022
Marked as draft because it needs review/testing, but in principle it should be merged as soon as possible.

This addresses server-side filtering of the problematic tasks caused by official-stockfish#1393. Obviously we should also prevent this worker-side problem in the first place, but in the short run this should at least prevent fishtest tests from being polluted by garbage, as they currently are.
@ppigazzini
Collaborator

Script for Linux:

  • creates the "fishtest" user
  • creates 5 copies of the "worker" directory, to be able to run 5 workers
  • creates a systemd unit file (named "fishtest-worker") that takes the worker ID as a parameter
  • start, stop, or check the status of any single worker with sudo systemctl start fishtest-worker@3
  • start or stop all the workers with sudo systemctl start fishtest-worker@{0..4}
  • enable auto start for the workers with sudo systemctl enable fishtest-worker@{0..4}.service

To revert:

  • sudo systemctl disable fishtest-worker@{0..4}.service
  • sudo rm /etc/systemd/system/[email protected]
  • sudo deluser --remove-home fishtest (this deletes the fishtest user and all of its files/folders)
#!/bin/bash
# setup_workers.sh
# to set up fishtest workers on Ubuntu 20.04, simply run:
# sudo bash setup_workers.sh 2>&1 | tee setup_workers.sh.log

# print CPU information
cpu_model=$(grep "^model name" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
n_cpus=$( grep "^physical id" /proc/cpuinfo | sort | uniq | wc -l)
online_cores=$(grep "^bogo" /proc/cpuinfo | wc -l)
n_siblings=$(grep "^siblings" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
n_cpu_cores=$(grep "^cpu cores" /proc/cpuinfo | sort | uniq | cut -d ':' -f 2)
total_siblings=$((${n_cpus} * ${n_siblings}))
total_cpu_cores=$((${n_cpus} * ${n_cpu_cores}))
printf "CPU model : ${cpu_model}\n"
printf "CPU       : %3d  -  Online cores    : %3d\n" ${n_cpus} ${online_cores}
printf "Siblings  : %3d  -  Total siblings  : %3d\n" ${n_siblings} ${total_siblings}
printf "CPU cores : %3d  -  Total CPU cores : %3d\n" ${n_cpu_cores} ${total_cpu_cores}

# read the fishtest credentials and the number of cores to be contributed
echo
echo "Write your fishtest username:"
read usr_name
echo "Write your fishtest password:"
read usr_pwd
echo "Write the number of cores to be contributed to fishtest:"
echo "(max suggested 'Total CPU cores - 1')"
read n_cores

# install required packages
apt update && apt full-upgrade -y && apt autoremove -y && apt clean
apt install -y python3 python3-venv git build-essential 

# new linux account used to run the worker
worker_user='fishtest'
# create user for fishtest
useradd -m -s /bin/bash ${worker_user}

# add the bash variable for the python virtual env
sudo -i -u ${worker_user} << 'EOF'
echo export VENV=${HOME}/fishtest/worker/env >> .profile
EOF

# download fishtest
sudo -i -u ${worker_user} << EOF
git clone --single-branch --branch master https://github.com/glinscott/fishtest.git
cd fishtest
git config user.email "[email protected]"
git config user.name "your_name"
EOF

# fishtest worker setup and first start only to write the "fishtest.cfg" configuration file
sudo -i -u ${worker_user} << EOF
python3 -m venv \${VENV}
\${VENV}/bin/python3 -m pip install --upgrade pip setuptools wheel
\${VENV}/bin/python3 -m pip install requests

\${VENV}/bin/python3 \${HOME}/fishtest/worker/worker.py --concurrency ${n_cores} ${usr_name} ${usr_pwd} --only_config True && echo "concurrency successfully set" || echo "Restart the script using a proper concurrency value"
EOF

# copy the worker directory N=5 times (change according to your needs)
sudo -i -u ${worker_user} << 'EOF'
cd fishtest
for ((k=0; k<=4; k++)); do
  cp -r worker worker${k}  
done 
EOF

echo
echo "Setup fishtest-worker as a service"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."

# install fishtest-worker as systemd service
# start/stop the worker with:
# sudo systemctl start fishtest-worker@{0..4}
# sudo systemctl stop fishtest-worker@{0..4}
# check the log with:
# sudo journalctl -u [email protected]
# the service uses the worker configuration file "fishtest.cfg"

# get the worker_user $HOME
worker_user_home=$(sudo -i -u ${worker_user} << 'EOF'
echo ${HOME}
EOF
)

cat << EOF > /etc/systemd/system/[email protected]
[Unit]
Description=Fishtest worker %i
After=multi-user.target

[Service]
Type=simple
StandardOutput=file:${worker_user_home}/fishtest/worker%i/worker.log
StandardError=inherit
ExecStart=${worker_user_home}/fishtest/worker%i/env/bin/python3 ${worker_user_home}/fishtest/worker%i/worker.py
User=${worker_user}
WorkingDirectory=${worker_user_home}/fishtest/worker%i

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

echo
echo "Start fishtest-worker service"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."
systemctl start fishtest-worker@{0..4}.service

echo
echo "Enable fishtest-worker service auto start"
read -p "Press <Enter> to continue or <CTRL+C> to exit ..."
systemctl enable fishtest-worker@{0..4}.service

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

I think the cleanest way to achieve splitting is to create a --pool N argument for the worker. Default N=1 which does nothing (current behaviour). N=0 means that the worker is a clone worker and N>=2 means that the worker is a master (see below).

If N>=2 the worker would quietly create N clone copies of itself (in subdirectories), with slightly adapted fishtest.cfg (the memory, concurrency and uuid_prefix options) and then start these clones using popen (with --pool 0).

For each clone there should probably be a controlling thread in the master to manage its life cycle... I am a bit worried about Ctrl-C handling though (we want the clones to quit if the master worker receives Ctrl-C).

Clone workers do not upgrade. If the master upgrades then the clone workers are stopped and deleted. They will be recreated when the master restarts.

The main reason for doing it this way would be to keep the error handling manageable.

If instead we were to start multiple copies of cutechess within a single worker, the error handling would, I think, be a nightmare.
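As a very rough sketch of this master/clone flow (the --pool wiring, helper names, and config handling below are all assumptions for illustration, not the real worker.py):

# pool_sketch.py - hypothetical illustration of the proposed --pool behaviour;
# nothing here is taken from the actual worker.py, all names are assumptions.
import argparse, shutil, signal, subprocess, sys
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--concurrency", type=int, default=3)
    parser.add_argument("--pool", type=int, default=1)  # 1: normal, 0: clone, >=2: master
    args = parser.parse_args()

    if args.pool <= 1:
        run_games(args.concurrency)   # current behaviour, also used by clones (--pool 0)
        return

    # Master: create N clone copies in subdirectories and start them with --pool 0
    # so they never try to spawn clones (or upgrade) themselves.
    here = Path(__file__).resolve().parent
    base, rest = divmod(args.concurrency, args.pool)
    clones = []
    for i in range(args.pool):
        clone_dir = here / f"clone{i}"
        if clone_dir.exists():
            shutil.rmtree(clone_dir)  # clones are deleted and recreated, they never upgrade
        shutil.copytree(here, clone_dir, ignore=shutil.ignore_patterns("clone*"))
        # here the master would also rewrite the clone's fishtest.cfg
        # (memory, concurrency, uuid_prefix)
        cores = base + (1 if i < rest else 0)
        clones.append(subprocess.Popen(
            [sys.executable, str(clone_dir / "worker.py"),
             "--pool", "0", "--concurrency", str(cores)],
            cwd=clone_dir))

    def shutdown(signum, frame):      # forward Ctrl-C / SIGTERM to all clones
        for p in clones:
            p.terminate()

    signal.signal(signal.SIGINT, shutdown)
    signal.signal(signal.SIGTERM, shutdown)
    for p in clones:
        p.wait()

def run_games(concurrency):
    print(f"running games with concurrency {concurrency}")

if __name__ == "__main__":
    main()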

@vondele
Member Author

vondele commented Aug 9, 2022

yes, I agree that this could be done at a higher level like you describe.

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

Allowing the clone workers to update would lead to pretty bad race conditions. So I adapted the proposal accordingly.

@ppigazzini
Collaborator

Problems:

  • the user's willingness to start N workers, whether with --pool or with systemd. The big workers are contributed by people with a high skill set, surely able to set up a systemd unit, but they never did it (even when provided with the script above)
  • how to stop or restart only 1-2 workers without stopping or restarting all the workers, thus losing the games played

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

to stop or restart only 1-2 workers without stopping or restarting all the workers, thus losing the games played

I was thinking that, from the point of view of the user, the result of --pool would be a single worker. So the clones live and die together. If one has a 95-core worker, one also cannot restart just 20 of its cores.

The two solutions (--pool and the user manually splitting the worker) are not mutually exclusive.

@vdbergh
Contributor

vdbergh commented Aug 9, 2022

We can set the default value of pool to ceil(concurrency/32). In that way nothing would change for workers with <= 32 cores.

A 33-core worker would split up as a 16-core worker and a 17-core worker.
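A small worked example of that default and the resulting core split (illustrative arithmetic only; the function names are made up, not worker code):

# pool_default_sketch.py - illustrative only; function names are assumptions.
import math

def default_pool(concurrency, max_per_clone=32):
    # proposed default: one clone per 32 cores, rounded up
    return math.ceil(concurrency / max_per_clone)

def split_cores(concurrency, pool):
    # spread the cores as evenly as possible over the clones
    base, rest = divmod(concurrency, pool)
    return [base + (1 if i < rest else 0) for i in range(pool)]

for c in (16, 32, 33, 95):
    p = default_pool(c)
    print(c, p, split_cores(c, p))
# 16 -> pool 1, [16]
# 32 -> pool 1, [32]
# 33 -> pool 2, [17, 16]   (a 33-core worker splits into 17 + 16 cores)
# 95 -> pool 3, [32, 32, 31]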
