
Setup instructions

Gaudeval edited this page Jul 25, 2022 · 18 revisions

Please note that this project is an active work in progress, documentation and features are subject to frequent changes.

The following guide was produced using the Ubuntu 22.04 Server Linux distribution for the cluster setup. Please adapt commands and configuration parameters to your own distribution. The Controller and the Workers should use the same version of Slurm to ensure they are compatible (this is easily achieved by using the same Linux distribution across all cluster members).

TODO: Lay down some basic notations: configuration parameters, commands to run, configuration file contents...

TODO: Document which image for the Pi (64 vs 32) and how to reach it from the Imager

TODO: Check how to get ssh to automatically accept a fingerprint for a new machine/changed info on a machine

Cluster roles

The pakupi setup guide refers to a number of roles assumed by machines in the cluster. We recommend the use of distinct machines for each role, to mitigate the impact of reboots or issues during installation. This guide identifies the following roles:

  • Host: the machine used to configure the cluster
  • Controller: the main node for providing cluster services and interacting with the cluster
  • Worker: any node whose processing power is made available to the cluster

Preparing the Host

The Host is the main entry point to the cluster. It will ensure all machines in the cluster are correctly configured through Ansible. The Host should have access to the Internet to download the required packages, and network access to the cluster.

  • Install Ansible: Ansible is an IT automation framework which allows users to manage an inventory of machines, configure systems, and deploy software. The pakupi setup process relies on Ansible to check the configuration of the cluster, and run additional scripts if required. For more information on how to setup Ansible, see the Ansible installation guide. Under Ubuntu 22.04 you can install Ansible using the command: apt install ansible.

    CHECK: Run the following command in your terminal to ensure Ansible is correctly installed: ansible --version

  • Install pakupi Ansible requirements: The pakupi setup process uses community-provided Ansible roles and collections (see the Ansible user guide on roles). The requirements are described in the requirements-galaxy.yml at the root of the repository. The packages are available for download and review through Ansible galaxy. You can install the required packages using the command: ansible-galaxy install -r requirements-galaxy.yml.

    CHECK: Running the installation command: ansible-galaxy install -r requirements-galaxy.yml should report Nothing to do

  • Prepare an SSH key: SSH allows remote connections to the machines in the cluster, and it is used by Ansible to configure the system. Authentication through SSH can be password-based, using the target account password, or key-based, using a list of authorised public keys. Key-based authentication is recommended as it does not prompt the user for a password. To generate a new SSH key, use the ssh-keygen command. By default, the key pair is placed in your home directory: the public key under ~/.ssh/id_rsa.pub, and the private key (which should not be distributed) under ~/.ssh/id_rsa.

    If you reuse an existing key, or specify a different target file for the generated key using the -f flag of ssh-keygen, please remember to add the key to your SSH agent (see man ssh-add).

    CHECK: Your public and private key files exist (respectively under ~/.ssh/id_rsa.pub and ~/.ssh/id_rsa by default).
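The Host checks above can be gathered in a small script. The following is a minimal sketch, not part of the pakupi repository: the helper names have_command and have_key_pair and the KEY_FILE variable are hypothetical, and the key path assumes the ssh-keygen default mentioned above.

```shell
#!/bin/sh
# Sketch of the Host CHECK steps. KEY_FILE is an assumption (default ssh-keygen path).
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# Succeeds if the given command is available on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if both the private key and its .pub counterpart exist.
have_key_pair() {
    [ -f "$1" ] && [ -f "$1.pub" ]
}

if have_command ansible; then
    ansible --version | head -n 1
else
    echo "ansible not found: install it, e.g. with 'sudo apt install ansible'"
fi

if have_key_pair "$KEY_FILE"; then
    echo "SSH key pair found at $KEY_FILE"
else
    echo "no key pair at $KEY_FILE: run ssh-keygen, or set KEY_FILE"
fi
```

Set KEY_FILE before running the script if the key was generated with the -f flag of ssh-keygen.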

Preparing the Controller

The Controller is the main node for interacting with the cluster. It schedules work on the workers, and provides the services required for cluster operation, such as file sharing, job submission, and the dashboard. During its configuration, the Controller should have access to the Internet to download the required packages, and network access to the cluster.

Setup the Operating System

The pakupi setup scripts assume the Controller is running the Ubuntu 22.04 Server Linux distribution. The scripts may be adapted to your favourite flavour of Linux, but please make sure the same version of Slurm will be available on the Controller and the Workers. There are no minimum requirements on the Controller, but it should have enough computing power, memory, and storage to provide the required services. Please consult the Ubuntu Server Installation Guide for more information. Note that a Raspberry Pi could be devoted to that role, preferably with external storage. We discuss the setup process for a Raspberry Pi in the Worker section of this guide.

The following need to be considered during the setup:

  • Configure a fixed IP for the Controller on the interface used to interact with the cluster. The Controller needs a fixed address to generate the workers' configuration. It will provide dynamic IP allocation for the other cluster machines.

TODO: Test behaviour of network-config for Raspberry Pi and x86 Ubuntu Server setups (see https://cloudinit.readthedocs.io/en/latest/topics/network-config-format-v2.html#network-config-v2) and document if required

network: 
  version: 2
  ethernets:
    eth0:
      addresses:
      - 192.168.64.1/24
      gateway4: 192.168.64.2
      nameservers:
        addresses:
        - 8.8.8.8
        search: []

TODO: Document configuration through /etc/netplan/... if required

version: 2
ethernets:
  eth0:
    addresses:
    - 192.168.64.1/24
    gateway4: 192.168.64.2
    nameservers:
      addresses:
      - 8.8.8.8
      search: []
  • Create a user with administrative rights. This should be the default for the user created during installation. The pakupi setup will need to install new packages on the Controller and configure services which require administrative rights. Please make note of the created username and password as those will be required to run Ansible commands.
  • Enable SSH access on the Controller. SSH allows remote access to the Controller for configuration and monitoring purposes. If your Controller is a Raspberry Pi, the service can be enabled from the Raspberry Pi Imager (see How to Install Ubuntu Server on your Raspberry Pi). Alternatively, you can modify the user-data file to include the clause: ssh_pwauth: true. user-data is a cloud-init file used to initialise the system upon boot (see the documentation on user-data), especially on headless systems. It should be accessible on the SD card for a Raspberry Pi installation (under /boot/firmware/user-data).

TODO: Check if user-data enabled by default on non-pi machines (it is through cloud-init)

TODO: Check if IP address range restricted by setup scripts (they are)

  • Add the Host SSH Key to authorised keys to allow password-less access from the Host to the Controller. As for enabling the SSH access, this can be configured either:
    • in the settings of the Raspberry Pi Imager, by copying the contents of the id_rsa.pub in the SSH authorised keys field;
    • in the user-data file, by adding a ssh_authorized_keys for the user e.g. from the examples of user-data:
      users:
      - name: foobar
        ssh_authorized_keys:
        - <CONTENTS OF THE id_rsa.pub FILE>
    • or once the Controller is started using the ssh-copy-id command (run man ssh-copy-id for more information).
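For the user-data route, the two clauses discussed above (ssh_pwauth and ssh_authorized_keys) can be produced from the Host public key. This is a minimal sketch under stated assumptions: the function name write_user_data_fragment and the output file are hypothetical, and the result is a fragment to review and merge into the existing user-data file, not a complete cloud-init configuration.

```shell
#!/bin/sh
# Sketch only: write_user_data_fragment is a hypothetical helper, not part of pakupi.
# Usage: write_user_data_fragment USER PUBLIC_KEY_FILE OUTPUT_FILE
write_user_data_fragment() {
    user="$1"
    pubkey_file="$2"
    out="$3"
    # The heredoc is unquoted so that $user and the key contents are expanded.
    cat > "$out" <<EOF
ssh_pwauth: true
users:
- name: $user
  ssh_authorized_keys:
  - $(cat "$pubkey_file")
EOF
}

# Example call, assuming the default key location and a user called foobar:
#   write_user_data_fragment foobar ~/.ssh/id_rsa.pub user-data.fragment
```

Once the Controller is running, ssh-copy-id achieves the same result without editing the file.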

CHECK: Run the command ping -c 5 CONTROLLER_IP on the Host to check the Controller is running and accessible, replacing CONTROLLER_IP with the address set during setup.

CHECK: Run the ssh CONTROLLER_USER@CONTROLLER_IP command from the Host to connect to the Controller, replacing CONTROLLER_USER with the user set during the setup. You might be prompted to accept the identity of the Controller upon the first connection.

Adding the Controller to the inventory

Once the Controller is ready, it can be added to the Ansible inventory for the cluster. The inventory describes the machines managed through Ansible. It provides additional facts required for connection and configuration, as well as the groups to which each machine belongs. We recommend setting the ansible_host and ansible_user variables for the Controller in the inventory to the values defined during configuration.

Groups are used to assign specific roles or capabilities to a machine, and different groups might be configured differently. The pakupi setup scripts identify the Controller using the mocha-master group or host identifier. More information on how to build an inventory is available in the Ansible User Guide on Inventory.

Consider the following inventory.ini file as an example:

controller ansible_host=CONTROLLER_IP ansible_user=CONTROLLER_USER

[mocha-master]
controller 

[other-group]
controller

CHECK: Run the following command in your Host terminal to ensure Ansible can read your inventory: ansible-inventory -i inventory.ini --list

CHECK: Run the following command in your Host terminal to ensure Ansible can access the Controller: ansible -m ping -i inventory.ini mocha-master

Setup the DHCP server

The Controller allocates IP addresses to workers in the cluster by acting as a DHCP server. This ensures all workers are correctly listed in the inventory and have a uniquely allocated IP address. The Controller also provides a default IP address range to allow for connecting to new workers, and collecting the required information for their inclusion in the inventory.

The playbook 01-dhcp.yaml is in charge of setting up and configuring the Controller as a DHCP server. It uses information from the inventory to assign an IP address to each worker and generate the server configuration. If workers are present in the inventory, they should have a MAC address assigned before applying the playbook. The following command applies the playbook:

ansible-playbook -i inventory.ini 01-dhcp.yaml

Note that Ansible might complain about a missing administrator password. You can add the -K flag to force Ansible to prompt for the password on the Controller, or use a vault to store it. As an alternative, the playbook util-nopasswd.yaml configures the administrator account on the target machines such that no password is required on sudo commands (applying util-nopasswd.yaml should require the password once).

CHECK: Run the following Ansible command to check if the DHCP server is running. In check mode (-C), it should report no change required if the service is already started: ansible -C -i inventory.ini -m service -a "state=started name=isc-dhcp-server" mocha-master

CHECK: Alternatively, run the systemctl status isc-dhcp-server command from the Controller to assess if the service is active and running.

CHECK: If any worker is configured (and assuming an interface called eth0), check it has an IP address using ip addr show eth0 and ensuring the inet field is correct. You might need to run the dhclient command on the worker to refresh the DHCP lease, e.g. dhclient eth0
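The DHCP setup steps of this section can be chained as sketched below. This is an illustration rather than a script shipped with pakupi: the INVENTORY and DRY_RUN variables and the run helper are hypothetical names. By default the script only prints the commands; set DRY_RUN=0 to execute them from the root of the repository.

```shell
#!/bin/sh
# Sketch only: INVENTORY, DRY_RUN and run() are illustrative names, not part of pakupi.
INVENTORY="${INVENTORY:-inventory.ini}"
DRY_RUN="${DRY_RUN:-1}"   # 1 prints the commands, 0 executes them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Optional: allow password-less sudo; -K prompts once for the sudo password.
run ansible-playbook -i "$INVENTORY" -K util-nopasswd.yaml

# Configure the Controller as a DHCP server.
run ansible-playbook -i "$INVENTORY" 01-dhcp.yaml

# Check mode (-C): state=started reports no change when the service already runs.
run ansible -C -i "$INVENTORY" -m service \
    -a "state=started name=isc-dhcp-server" mocha-master
```

The dry-run default is a design choice for a sketch: it lets you review the exact commands before they touch the Controller.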

Preparing a worker

Workers offer their computation power and resources to the cluster, using the services provided by the Controller to ease interactions with the cluster.

Setup the Operating System

The pakupi setup scripts assume the Workers are running the Ubuntu 22.04 Server Linux distribution. The scripts may be adapted to your favourite flavour of Linux, but please make sure the same version of Slurm will be available on the Controller and the Workers. The following discusses the setup of a Raspberry Pi worker, but the configuration should follow similar steps on other platforms.

As for the Controller setup (see the Controller Operating System setup guide), the following steps need to be considered:

  • Create a user with administrative rights.
  • Enable SSH access on the Worker.
  • Add the Host SSH Key to authorised keys if the option is available during setup. If not, consider using the ssh-copy-id command once the Worker has been added to the inventory.

Configuring the network connection

The Worker configuration relies on DHCP to retrieve an IP address. Workers using the same platform can thus rely on a single, unified configuration process. As an alternative, you can set up each worker with a unique address, either manually during setup or by generating a configuration for each worker. The following considers the configuration of the worker as a DHCP client.

Network configuration on Ubuntu 22.04 Server uses the netplan utility. A YAML configuration file specifies how an interface should be configured (see the cloud-init documentation on Network configuration). Consider using one of these options to define the network configuration file:

  • /boot/firmware/network-config on a Raspberry Pi is a cloud-init configuration file applied on system boot. It is useful on headless systems as it is accessible by mounting the SD card on another host.
  • The /etc/netplan/ folder contains one or more configuration files. The files are read and applied by netplan in alphanumerical order, for example as listed by the ls /etc/netplan command. You can create a new file in the directory to configure the desired interface. Use the netplan apply command or reboot to apply changes to the configuration.
  • The /etc/cloud/cloud.cfg.d/ folder contains one or more configuration files, like the netplan option. The configuration will be applied by cloud-init on boot. Note however that cloud-init must be active, as well as its network configuration, and netplan configurations might override these settings. See the cloud-init documentation for more information.

The following example configures the Ethernet interface eth0 to use DHCP to get an IP address:

network: # <-- remove this line if using netplan directly
  version: 2
  ethernets:
    eth0:
      dhcp4: true

CHECK: With the worker started, run the command dhcp-lease-list on the Controller, it should show one entry per out-of-inventory worker.

CHECK: If you have access to the worker (and assuming an interface called eth0), check it has an IP address using ip addr show eth0 and ensuring the inet field is correct. You might need to run the dhclient command to refresh the DHCP lease, e.g. dhclient eth0
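The address check above can be scripted on the worker. The sketch below assumes the one-line output format of ip -o -4 addr show and the /24 network used in the Controller example (192.168.64.x); the helper names ipv4_of and in_network are hypothetical.

```shell
#!/bin/sh
# Sketch only: ipv4_of and in_network are hypothetical helpers, not part of pakupi.

# Read `ip -o -4 addr show` output on stdin, print the address without its /prefix.
ipv4_of() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1] }'
}

# Succeeds if address $1 starts with the network prefix $2 (e.g. 192.168.64).
# This simple string comparison only suits networks split on an octet boundary.
in_network() {
    case "$1" in
        "$2".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a worker, assuming the interface is called eth0:
#   addr=$(ip -o -4 addr show eth0 | ipv4_of)
#   in_network "$addr" 192.168.64 && echo "lease OK: $addr"
```

If no address is printed, refresh the lease with dhclient eth0 and run the check again.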

Adding a worker to the inventory

Adding a Worker to the inventory requires knowing its MAC address, which is used to configure its allocated IP address. With the worker started and configured, run dhcp-lease-list on the Controller to list all DHCP clients, their assigned IP, and their MAC address (an identifier made of six pairs of hexadecimal digits, e.g. 11:22:33:44:55:66). Consider adding workers one by one to ease their identification. The inventory should include the following elements for the Worker (see the Ansible User Guide on Inventory):

  • A new entry naming the worker
  • A target IP for the worker, using the ansible_host variable. It should be unique in the cluster, and on the same network as the other machines.
  • A MAC address for the worker, using the macaddress variable. Use the value identified through the DHCP lease.
  • Set the worker as part of the mocha-worker group. It is used to identify workers by the pakupi setup scripts.

Consider the following inventory.ini file as an example:

controller ansible_host=CONTROLLER_IP ansible_user=CONTROLLER_USER
worker ansible_host=WORKER_IP ansible_user=WORKER_USER macaddress=11:22:33:44:55:66

[mocha-master]
controller 

[mocha-worker]
worker

[other-group]
controller

The setup scripts use the inventory information to generate the DHCP allocation configuration and ensure each worker is leased the correct IP:

  • Run the DHCP setup playbook again to complete the Worker setup.
  • Reboot the worker or run the command dhclient eth0 on the Worker to force a new DHCP lease (this might interrupt your remote connection).
  • Copy the Host SSH key on the worker, if this was not done during setup.

CHECK: Run the command ping -c 5 WORKER_IP on the Host to check the Worker is running and accessible, replacing WORKER_IP with the address set in the inventory.

CHECK: Run the following command in your Host terminal to ensure Ansible can access all Workers: ansible -m ping -i inventory.ini mocha-worker
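When several workers are added, the inventory layout described in this section can be generated rather than edited by hand. The following is a minimal sketch: the generate_inventory and is_mac helpers and the "NAME IP MAC" input format are assumptions made for illustration, and the output follows the example inventory.ini above (without the optional other-group).

```shell
#!/bin/sh
# Sketch only: is_mac, generate_inventory and the input format are hypothetical.

# Succeeds if $1 looks like a MAC address (six colon-separated hex pairs).
is_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Usage: generate_inventory CONTROLLER_IP CONTROLLER_USER WORKER_USER < workers.txt
# Each input line describes one worker as: NAME IP MAC
generate_inventory() {
    controller_ip="$1"
    controller_user="$2"
    worker_user="$3"
    names=""
    echo "controller ansible_host=$controller_ip ansible_user=$controller_user"
    while read -r name ip mac; do
        [ -n "$name" ] || continue
        if ! is_mac "$mac"; then
            echo "skipping $name: invalid MAC address '$mac'" >&2
            continue
        fi
        echo "$name ansible_host=$ip ansible_user=$worker_user macaddress=$mac"
        names="$names $name"
    done
    # Group sections: the controller first, then every accepted worker.
    printf '\n[mocha-master]\ncontroller\n\n[mocha-worker]\n'
    for name in $names; do
        echo "$name"
    done
}
```

For example, generate_inventory CONTROLLER_IP CONTROLLER_USER WORKER_USER < workers.txt > inventory.ini writes the file, which can then be verified with ansible-inventory -i inventory.ini --list before running the DHCP playbook again.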

Configure the cluster

  • Run NIS setup playbook (02)
  • Run slurm setup playbook (03)
  • Run NFS setup playbook (04)
  • Run worker configuration playbook (05)

Configure the dashboard
