Setup instructions
Please note that this project is an active work in progress; documentation and features are subject to frequent changes.
The following guide was produced using the Ubuntu 22.04 Server Linux distribution for the cluster setup. Please adapt commands and configuration parameters to your own distribution. The Controller and Workers should use the same version of Slurm to ensure they are compatible (this is easily achieved by using the same Linux distribution across all cluster members).
TODO: Lay down some basic notations: configuration parameters, commands to run, configuration file contents...
The pakupi setup guide refers to a number of roles assumed by machines in the cluster. We recommend the use of distinct machines for each role, to mitigate the impact of reboots or issues during installation. This guide identifies the following roles:
- Host: the machine used to configure the cluster
- Controller: the main node for providing cluster services and interacting with the cluster
- Worker: any node whose processing power is made available to the cluster
The Host is the main entry point to the cluster. It will ensure all machines in the cluster are correctly configured through Ansible. The Host should have access to the Internet to download the required packages, and network access to the cluster.
- Install Ansible: Ansible is an IT automation framework which allows users to manage an inventory of machines, configure systems, and deploy software. The pakupi setup process relies on Ansible to check the configuration of the cluster, and run additional scripts if required. For more information on how to set up Ansible, see the Ansible installation guide. Under Ubuntu 22.04 you can install Ansible using the command `apt install ansible`.
  CHECK: Run the following command in your terminal to ensure Ansible is correctly installed: `ansible --version`
- Install pakupi Ansible requirements: The pakupi setup process uses community-provided Ansible roles and collections (see the Ansible user guide on roles). The requirements are described in the `requirements-galaxy.yml` file at the root of the repository. The packages are available for download and review through Ansible Galaxy. You can install the required packages using the command `ansible-galaxy install -r requirements-galaxy.yml`.
  CHECK: Running the installation command `ansible-galaxy install -r requirements-galaxy.yml` a second time should report `Nothing to do`.
- Prepare an SSH key: SSH allows remote connections to the machines in the cluster, and it is used by Ansible to configure the system. Authentication through SSH can be password-based, using the target account password, or key-based, using a list of authorised public keys. Key-based authentication is recommended as it does not prompt the user for a password. To generate a new SSH key, use the `ssh-keygen` command. The created key will be placed in your home directory under `~/.ssh/id_rsa.pub` for the public key, and `~/.ssh/id_rsa` for the private key (which should not be distributed). If you reuse an existing key, or specify a different target file for the generated key using the `-f` flag of `ssh-keygen`, please remember to add the key to your SSH agent (see `man ssh-add`).
  CHECK: Your public and private key files exist (respectively under `~/.ssh/id_rsa.pub` and `~/.ssh/id_rsa` by default).
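The key check above can be scripted. The following is a minimal sketch, not part of the pakupi scripts: the function name `check_ssh_keypair` is hypothetical, and it assumes the public key sits next to the private key with a `.pub` suffix, as `ssh-keygen` does by default.

```shell
# Report whether an SSH key pair exists at the given private key path.
# The path defaults to ~/.ssh/id_rsa, the default location described above.
# check_ssh_keypair is a hypothetical helper, not provided by pakupi.
check_ssh_keypair() {
    key_path="${1:-$HOME/.ssh/id_rsa}"
    if [ -f "$key_path" ] && [ -f "$key_path.pub" ]; then
        echo "key pair found: $key_path"
        return 0
    fi
    echo "key pair missing: $key_path" >&2
    return 1
}
```

For instance, `check_ssh_keypair || ssh-keygen -f ~/.ssh/id_rsa` would only generate a key when none is found at the default location.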
The Controller is the main node for interacting with the cluster. It schedules work on the Workers, and provides the services required for cluster operation, such as file sharing, job submission, and a dashboard. During its configuration, the Controller should have access to the Internet to download the required packages, and network access to the cluster.
The pakupi setup scripts assume the Controller is running the Ubuntu 22.04 Server Linux distribution. The scripts may be adapted to your favourite flavour of Linux, but please make sure the same version of Slurm will be available on the Controller and the Workers. There are no minimum requirements on the Controller, but it should have enough computing power, memory, and storage to provide the required services. Please consult the Ubuntu Server Installation Guide for more information. Note that a Raspberry Pi could be devoted to that role, preferably with external storage. The Raspberry Pi setup process is discussed in the Worker section of this guide.
The following need to be considered during the setup:
- Configure a fixed IP address for the Controller on the interface used to interact with the cluster. The Controller needs a fixed address to generate the Workers' configuration. It will provide dynamic IP allocation for the other cluster machines.
  TODO: Test behaviour of `network-config` for Raspberry Pi and x86 Ubuntu Server setups (see https://cloudinit.readthedocs.io/en/latest/topics/network-config-format-v2.html#network-config-v2) and document if required
  TODO: Document configuration through `/etc/netplan/...` if required
- Create a user with administrative rights. This should be the default for the user created during installation. The pakupi setup will need to install new packages on the Controller and configure services, which requires administrative rights. Please make a note of the created username and password, as those will be required to run Ansible commands.
- Enable SSH access on the Controller. SSH allows remote access to the Controller for configuration and monitoring purposes. If your Controller is a Raspberry Pi, the service can be enabled from the Raspberry Pi Imager (see How to Install Ubuntu Server on your Raspberry Pi). Alternatively, you can modify the `user-data` file to include the clause `ssh_pwauth: true`. `user-data` is a `cloud-init` file used to initialise the system upon boot (see the [documentation on user-data]), especially on headless systems. It should be accessible on the SD card for a Raspberry Pi installation.
TODO: Check if user-data enabled by default on non-pi machines
TODO: Check if IP address range restricted by setup scripts
- Add the Host SSH key to the authorised keys to allow password-less access from the Host to the Controller. As with enabling SSH access, this can be configured either:
  - in the settings of the Raspberry Pi Imager, by copying the contents of the `id_rsa.pub` file into the SSH authorised keys field;
  - in the `user-data` file, by adding an `ssh_authorized_keys` entry for the user, e.g. from the examples of user-data:

        users:
          - name: foobar
            ssh_authorized_keys:
              - <CONTENTS OF THE id_rsa.pub FILE>

  - or once the Controller is started, using the `ssh-copy-id` command (run `man ssh-copy-id` for more information).
CHECK: Run the command `ping -c 5 CONTROLLER_IP` on the Host to check the Controller is running and accessible, replacing `CONTROLLER_IP` with the address set during setup.
CHECK: Run the `ssh CONTROLLER_USER@CONTROLLER_IP` command from the Host to connect to the Controller, replacing `CONTROLLER_USER` with the user set during the setup. You might be prompted to accept the identity of the Controller upon the first connection.
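Both checks can be combined into a single helper run from the Host. This is a minimal sketch under stated assumptions: the function name `check_controller` is hypothetical, and the `BatchMode=yes` option is an addition of this example so that `ssh` fails rather than prompting when key-based login is not yet working.

```shell
# Check that the Controller answers pings and accepts key-based SSH logins.
# Arguments: the Controller IP address and the Controller user.
# check_controller is a hypothetical helper, not provided by pakupi.
check_controller() {
    controller_ip="$1"
    controller_user="$2"
    # Stop early if the Controller is not reachable on the network.
    ping -c 5 "$controller_ip" || return 1
    # BatchMode=yes disables interactive prompts, so a missing authorised
    # key (or a host identity not yet accepted) results in a failure.
    ssh -o BatchMode=yes "$controller_user@$controller_ip" true
}
```

Because batch mode also refuses to ask about an unknown host identity, connect once manually with `ssh CONTROLLER_USER@CONTROLLER_IP` before relying on this helper.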
Once the Controller is ready, it can be added to the Ansible inventory for the cluster. The inventory describes the machines managed through Ansible. It provides additional facts required for connection and configuration, as well as the groups to which each machine belongs. We recommend setting `ansible_host` and `ansible_user` for the Controller in the inventory to the values defined during configuration.
Groups are used to specify roles or capabilities for a machine, and different groups might be configured differently. The pakupi setup scripts identify the Controller using the `mocha-master` group or host identifier. More information on how to build an inventory is available in the Ansible User Guide on Inventory.
Consider the following `inventory.ini` file as an example:
controller ansible_host=CONTROLLER_IP ansible_user=CONTROLLER_USER
[mocha-master]
controller
[other-group]
controller
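An inventory of this shape can also be generated from the values chosen during the Controller setup. The sketch below is illustrative only: the function name `write_inventory` is hypothetical, and it writes just the `controller` host and the `mocha-master` group from the example.

```shell
# Write a minimal Ansible inventory for the Controller.
# Arguments: Controller IP, Controller user, and an optional output path
# (defaults to inventory.ini, the file name used in the example above).
# write_inventory is a hypothetical helper, not provided by pakupi.
write_inventory() {
    controller_ip="$1"
    controller_user="$2"
    inventory_path="${3:-inventory.ini}"
    cat > "$inventory_path" <<EOF
controller ansible_host=${controller_ip} ansible_user=${controller_user}

[mocha-master]
controller
EOF
}
```

After running `write_inventory CONTROLLER_IP CONTROLLER_USER`, the command `ansible -i inventory.ini mocha-master -m ping` uses Ansible's `ping` module to confirm that Ansible can reach and log into the Controller.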
Run `01-dhcp`
Optionally specify the `-K` flag
Identify commands to check the DHCP server is running
- Keep user/password
- Setup worker for SSH access
- Setup DHCP on Worker (`/boot/firmware/network-config`)
Add SSH key
Add SSH key if not during OS setup
- Run NIS setup playbook (02)
- Run slurm setup playbook (03)
- Run NFS setup playbook (04)
- Run worker configuration playbook (05)
- Run grafana setup playbook (06)
- Log into Grafana at `http://<master>:3000` (`admin:admin`)
  - Add slurm dashboard (https://grafana.com/grafana/dashboards/4323)
  - Add mocha dashboard (???)