
HTCondor clusters playbook #951

Merged · 2 commits merged into usegalaxy-eu:master from htcondor_secondary · Oct 31, 2023
Conversation

@kysrpex (Contributor) commented Oct 18, 2023

Rather than hacking together a single-use playbook for the HTCondor migration, after seeing the mess I was creating I decided to build something longer-lasting and, hopefully, easy to manage.

This PR adds a playbook (htcondor.yml) that manages both the primary and secondary HTCondor clusters. The idea is to take advantage of Ansible's group and host vars features to configure both clusters with little verbosity. It is also meant to replace the way HTCondor is being configured at the moment (the sn06.yml playbook). Let's dive a bit deeper into how this works.

The literature on how to write an Ansible inventory is full of examples with group names such as dbservers or webservers. I do not think this is a coincidence or just "one more way to do it". I believe organizing the inventory and playbooks around the roles that machines play in the infrastructure, rather than around the servers themselves, makes more sense and keeps things simpler.

This PR adds the following groups to the inventory,

  • htcondor: machines that are part of the primary HTCondor cluster.
  • htcondor-manager: machine that plays the central manager role in the primary HTCondor cluster.
  • htcondor-submit: machines that play the submit role in the primary HTCondor cluster.
  • htcondor-secondary: machines that are part of the secondary HTCondor cluster.
  • htcondor-secondary-manager: machine that plays the central manager role in the secondary HTCondor cluster.
  • htcondor-secondary-submit: machines that play the submit role in the secondary HTCondor cluster.

with the following parent-child relationships.

  • htcondor
    • htcondor-manager
      • htcondor-secondary-manager
    • htcondor-submit
      • htcondor-secondary-submit
    • htcondor-secondary
      • htcondor-secondary-manager
      • htcondor-secondary-submit

This means the secondary cluster is a copy of the primary cluster, with the few necessary deviations (such as the hostname of the central manager) defined as variables of the htcondor-secondary group and its subgroups. A sketch of how this could look in the inventory is shown below.
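
For illustration, the groups and parent-child relationships above could be expressed in an INI-style inventory roughly as follows. This is only a sketch: the hostnames are hypothetical placeholders and the real inventory may use a different layout or format.

# Sketch only: placeholder hostnames, not the actual machines.
[htcondor-manager]
condor-cm.example.org

[htcondor-submit]
condor-submit.example.org

[htcondor-secondary-manager]
condor-cm-secondary.example.org

[htcondor-secondary-submit]
condor-submit-secondary.example.org

# The secondary role groups are children of both the corresponding
# primary role groups and of htcondor-secondary.
[htcondor:children]
htcondor-manager
htcondor-submit
htcondor-secondary

[htcondor-manager:children]
htcondor-secondary-manager

[htcondor-submit:children]
htcondor-secondary-submit

[htcondor-secondary:children]
htcondor-secondary-manager
htcondor-secondary-submit

The deviations would then live in group_vars files for htcondor-secondary and its subgroups (e.g. overriding the central manager's hostname there), while everything else is inherited from the htcondor group.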

The setup of both clusters is achieved by running a single playbook (htcondor.yml) on the htcondor group. Each machine is configured as part of the appropriate cluster (primary or secondary) and plays the appropriate role within that cluster purely because of the groups it has been assigned to.

Note that, for the moment, management of the headnode is disabled (hosts: htcondor:!sn06.galaxyproject.eu). This is because a couple of group variables have been moved from the sn06 group to the htcondor group, so the headnode has to belong to the htcondor group for those variables to keep applying to it, at least until the migration is complete, since the usegalaxy_eu.htcondor role in sn06.yml still needs to be able to read them. Once the migration is complete, the group variables and inventory can be adjusted in a rather trivial way so that no trace is left in the playbooks or inventory that a secondary cluster ever existed (read just below). A sketch of what the playbook could look like follows.
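
For reference, a minimal sketch of what htcondor.yml could look like, assuming it applies only the usegalaxy_eu.htcondor role; the become settings and anything beyond the hosts pattern and the role are assumptions, not necessarily the PR's exact contents.

---
# Sketch only: details beyond the hosts pattern and the role are assumptions.
- name: HTCondor clusters
  hosts: htcondor:!sn06.galaxyproject.eu  # headnode excluded until the migration is complete
  become: true
  roles:
    - usegalaxy_eu.htcondor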

Advantages of this approach:

  • Adding or removing clusters (e.g. having a secondary or tertiary cluster) is as easy as adding or removing a group (e.g. htcondor-secondary) and its subgroups to/from the inventory and overriding the necessary group variables. No playbooks or roles need to be modified.
  • A single playbook manages three hosts (and can scale to as many as desired), yet it is very short and simple (it runs just a single Ansible role).

In addition, I have sometimes said that if I wanted a test Galaxy instance that is as close as possible to the production one, I would not create another repository and attempt to replicate what is in this one. I have been thinking about this in the background, and it may be achievable in a much simpler way using the following Ansible features:

  • Multiple inventories (check the link, it explicitly talks about having staging and production environments). This allows reusing production variables in staging while also overriding them when needed (see the sketch after this list).
  • Inventory directories enable fine-grained control of how variables are overridden.
  • Dynamic inventories. These would even allow getting rid of the VGCN repo and putting the worker node playbooks in this repo as well. Look at how worker images are provisioned: it's just Ansible groups. That means you can either keep creating images and booting them in OpenStack, skip the images entirely and manage the workers with Ansible, or even do both at the same time: boot a worker from a provisioned image and change its configuration without a reboot, using the same playbook both to provision the image and for the dynamic group.
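
As a rough illustration of the multiple-inventories idea: the same playbook would be run against either environment by pointing it at a different inventory directory, with staging overriding only the group variables that must differ. The inventories/production and inventories/staging paths below are hypothetical.

# Hypothetical invocation; the directory names are placeholders.
ansible-playbook -i inventories/production htcondor.yml
ansible-playbook -i inventories/staging htcondor.yml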

But again, this only holds as long as we organize the playbooks and inventory in terms of the roles that machines play in our infrastructure, not in terms of the machines we have. This PR is just a small step towards this view.

@kysrpex kysrpex self-assigned this Oct 18, 2023
@kysrpex kysrpex changed the title Secondary HTCondor cluster playbook (central manager) Secondary HTCondor cluster playbook Oct 25, 2023
@kysrpex kysrpex changed the title Secondary HTCondor cluster playbook HTCondor cluster playbook Oct 25, 2023
@kysrpex kysrpex changed the title HTCondor cluster playbook HTCondor clusters playbook Oct 25, 2023
This commit adds a playbook (htcondor.yml) that manages both the primary and secondary HTCondor clusters. The idea is to take advantage of Ansible's group and host vars features to configure both clusters with little verbosity and great flexibility. It is also meant to replace the way HTCondor is being configured at the moment (the sn06.yml playbook).

See usegalaxy-eu#951 for an in-depth explanation of the motivation behind the changes.
@kysrpex kysrpex force-pushed the htcondor_secondary branch from 22c49ba to f150b2e Compare October 25, 2023 13:25
@kysrpex kysrpex requested review from sanjaysrikakulam and mira-miracoli and removed request for sanjaysrikakulam and mira-miracoli October 25, 2023 14:29
@kysrpex kysrpex marked this pull request as ready for review October 26, 2023 13:13
@kysrpex (Contributor, Author) commented Oct 26, 2023

I already ran this playbook and added a worker machine to the secondary cluster. I was able to run a "hello world" job (I have not tried running a Galaxy job yet). If small adjustments are needed (e.g. UIDs, GIDs), I guess they can be made later.
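
For context, a minimal HTCondor "hello world" submit description looks roughly like this (illustrative only, not necessarily the exact job that was run; it would be submitted with condor_submit hello.sub from a submit node).

# hello.sub -- minimal HTCondor test job
executable = /bin/echo
arguments  = "hello world"
output     = hello.out
error      = hello.err
log        = hello.log
queue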

I also ran this playbook

---
- name: UseGalaxy.eu
  hosts: sn06
  become: true
  become_user: root
  vars:
    # The full internal name.
    hostname: sn06.galaxyproject.eu
    # The nginx user needed by the galaxyproject.nginx role
    nginx_conf_user: galaxy
    # This server has multiple CNAMEs that are important. Additionally it
    # provides proxying for many of the other services run by Galaxy Europe.
    # These server_names are passed to certbot. They generally should not need
    # to be updated unless you add a new domain. They *only* work with the
    # route53 provider, so if we want to do usegalaxy.xy, it may require
    # refactoring / multiple certbot runs.
    #
    #
    # The best way to expand them is to run the playbook, it will leave a message with the command it would have run (look for `skipped, since /etc/letsencrypt/renewal/usegalaxy.eu.conf exists`)
    #
    # Then take this command to the command line (root@sn04) and run it with `--expand`. E.g. (DO NOT COPY PASTE (in case the config changes))
    #
    # $ /opt/certbot/bin/certbot certonly --non-interactive --dns-route53 \
    #     -m [email protected] --agree-tos -d 'usegalaxy.eu,*.usegalaxy.eu,galaxyproject.eu,*.galaxyproject.eu,*.interactivetoolentrypoint.interactivetool.usegalaxy.eu,*.interactivetoolentrypoint.interactivetool.live.usegalaxy.eu,*.interactivetoolentrypoint.interactivetool.test.usegalaxy.eu' --expand
    # Saving debug log to /var/log/letsencrypt/letsencrypt.log
    # Credentials found in config file: ~/.aws/config
    # ....
    # IMPORTANT NOTES:
    #  - Congratulations! Your certificate and chain have been saved at:
    #
    # And you're done expanding the certs.
    #
    # Server names passed to certbot / nginx (see the notes above)
    server_names:
      - "usegalaxy.eu"
      - "*.usegalaxy.eu"
      - "galaxyproject.eu"
      - "*.galaxyproject.eu"
      - "*.interactivetoolentrypoint.interactivetool.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.live.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.test.usegalaxy.eu"
      - "*.aqua.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.aqua.usegalaxy.eu"
      - "*.ecology.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.ecology.usegalaxy.eu"
      - "*.earth-system.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.earth-system.usegalaxy.eu"
  vars_files:
    - group_vars/tiaas.yml # All of the training infrastructure
    - group_vars/custom-sites.yml # Subdomains are listed here
    - group_vars/gxconfig.yml # The base galaxy configuration
    - group_vars/toolbox.yml # User controlled toolbox
    - secret_group_vars/sentry.yml # Sentry SDK init url
    - secret_group_vars/aws.yml # AWS creds
    - secret_group_vars/pulsar.yml # Pulsar + MQ Connections
    - secret_group_vars/oidc.yml # OIDC credentials (ELIXIR, keycloak)
    - secret_group_vars/object_store.yml # Object Store credentials (S3 etc ...)
    - secret_group_vars/db-main.yml # DB URL + some postgres stuff
    - secret_group_vars/file_sources.yml # file_sources_conf.yml creds
    - secret_group_vars/all.yml # All of the other assorted secrets...
    - secret_group_vars/keys.yml # SSH keys
    - templates/galaxy/config/job_conf.yml
    - mounts/dest/all.yml
    - mounts/mountpoints.yml
  post_tasks:
    - ansible.builtin.debug:
        var: "{{ item }}"
      loop:
        - condor_host
        - condor_fs_domain
        - condor_uid_domain
        - condor_allow_write
        - condor_daemons
        - condor_allow_negotiator
        - condor_allow_administrator
        - condor_system_periodic_hold
        - condor_system_periodic_remove
        - condor_network_interface
        - condor_extra

both on the master branch and on this PR's branch, to make sure that I am not messing up the HTCondor configuration on the headnode (since some variables were moved). Diffing the two outputs yields no difference, so it should be OK to merge this. Nevertheless, I am posting a warning so that none of you merges this at 5 p.m.
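
(A hypothetical way to reproduce the comparison, with made-up file names for the debug playbook above and for the captured output:)

# Hypothetical reproduction; debug-condor-vars.yml and the output paths are placeholders.
git checkout master
ansible-playbook debug-condor-vars.yml > /tmp/condor-vars-master.txt
git checkout htcondor_secondary
ansible-playbook debug-condor-vars.yml > /tmp/condor-vars-pr.txt
diff /tmp/condor-vars-master.txt /tmp/condor-vars-pr.txt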

@kysrpex (Contributor, Author) commented Oct 26, 2023

⚠️ DO NOT MERGE THIS if you do not have time to fix things if the system breaks, IT IS DANGEROUS.

@kysrpex (Contributor, Author) commented Oct 31, 2023

@mira-miracoli @sanjaysrikakulam @bgruening Can any of you say whether you are OK with this? If done right, it does not have any influence on prod stuff; merged with care, it should be fine.

@bgruening (Member)

I have no clue but trust you all!

@kysrpex (Contributor, Author) commented Oct 31, 2023

I have no clue but trust you all!

Let's go then.

@kysrpex kysrpex merged commit 4517c21 into usegalaxy-eu:master Oct 31, 2023
@kysrpex kysrpex deleted the htcondor_secondary branch October 31, 2023 12:37
@kysrpex (Contributor, Author) commented Oct 31, 2023

It seems like it worked fine (I already ran the playbook).
