HTCondor clusters playbook #951
Conversation
This commit adds a playbook (htcondor.yml) that manages both the primary and secondary HTCondor clusters. The idea is to take advantage of Ansible's group and host vars features to configure both clusters with little verbosity and great flexibility. It is also meant to replace the way HTCondor is being configured at the moment (the sn06.yml playbook). See usegalaxy-eu#951 for an in-depth explanation of the motivation behind the changes.
22c49ba to f150b2e
I already ran this playbook and added a worker machine to the secondary cluster. I was able to run a "hello world" job (I did not try to run a Galaxy job yet). If small adjustments are needed (e.g. uids, gids), I guess they can be done later.

I also ran the playbook below both on the master branch and on this PR's branch to make sure that I am not messing up the HTCondor configuration on the headnode (since some variables were moved). Diffing the two outputs yields no difference, so it should be ok to merge this. Nevertheless, I am posting a warning so that none of you merge this at 5 p.m.

```yaml
---
- name: UseGalaxy.eu
  hosts: sn06
  become: true
  become_user: root
  vars:
    # The full internal name.
    hostname: sn06.galaxyproject.eu
    # The nginx user needed by the galaxyproject.nginx role
    nginx_conf_user: galaxy
    # This server has multiple CNAMEs that are important. Additionally it
    # provides proxying for many of the other services run by Galaxy Europe.
    # These server_names are passed to certbot. They generally should not need
    # to be updated unless you add a new domain. They *only* work with the
    # route53 provider, so if we want to do usegalaxy.xy, it may require
    # refactoring / multiple certbot runs.
    #
    # The best way to expand them is to run the playbook; it will leave a
    # message with the command it would have run (look for `skipped, since
    # /etc/letsencrypt/renewal/usegalaxy.eu.conf exists`).
    #
    # Then take this command to the command line (root@sn04) and run it with
    # `--expand`. E.g. (DO NOT COPY PASTE (in case the config changes))
    #
    # $ /opt/certbot/bin/certbot certonly --non-interactive --dns-route53 \
    #     -m [email protected] --agree-tos -d 'usegalaxy.eu,*.usegalaxy.eu,galaxyproject.eu,*.galaxyproject.eu,*.interactivetoolentrypoint.interactivetool.usegalaxy.eu,*.interactivetoolentrypoint.interactivetool.live.usegalaxy.eu,*.interactivetoolentrypoint.interactivetool.test.usegalaxy.eu' --expand
    # Saving debug log to /var/log/letsencrypt/letsencrypt.log
    # Credentials found in config file: ~/.aws/config
    # ....
    # IMPORTANT NOTES:
    #  - Congratulations! Your certificate and chain have been saved at:
    #
    # And you're done expanding the certs.
    #
    server_names:
      - "usegalaxy.eu"
      - "*.usegalaxy.eu"
      - "galaxyproject.eu"
      - "*.galaxyproject.eu"
      - "*.interactivetoolentrypoint.interactivetool.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.live.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.test.usegalaxy.eu"
      - "*.aqua.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.aqua.usegalaxy.eu"
      - "*.ecology.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.ecology.usegalaxy.eu"
      - "*.earth-system.usegalaxy.eu"
      - "*.interactivetoolentrypoint.interactivetool.earth-system.usegalaxy.eu"
  vars_files:
    - group_vars/tiaas.yml                # All of the training infrastructure
    - group_vars/custom-sites.yml         # Subdomains are listed here
    - group_vars/gxconfig.yml             # The base galaxy configuration
    - group_vars/toolbox.yml              # User controlled toolbox
    - secret_group_vars/sentry.yml        # Sentry SDK init url
    - secret_group_vars/aws.yml           # AWS creds
    - secret_group_vars/pulsar.yml        # Pulsar + MQ Connections
    - secret_group_vars/oidc.yml          # OIDC credentials (ELIXIR, keycloak)
    - secret_group_vars/object_store.yml  # Object Store credentials (S3 etc ...)
    - secret_group_vars/db-main.yml       # DB URL + some postgres stuff
    - secret_group_vars/file_sources.yml  # file_sources_conf.yml creds
    - secret_group_vars/all.yml           # All of the other assorted secrets...
    - secret_group_vars/keys.yml          # SSH keys
    - templates/galaxy/config/job_conf.yml
    - mounts/dest/all.yml
    - mounts/mountpoints.yml
  post_tasks:
    - ansible.builtin.debug:
        var: "{{ item }}"
      loop:
        - condor_host
        - condor_fs_domain
        - condor_uid_domain
        - condor_allow_write
        - condor_daemons
        - condor_allow_negotiator
        - condor_allow_administrator
        - condor_system_periodic_hold
        - condor_system_periodic_remove
        - condor_network_interface
        - condor_extra
```
@mira-miracoli @sanjaysrikakulam @bgruening Can any of you say whether you are ok with this? If done right, it does not have any influence on prod stuff; when merged with care it should be ok.
I have no clue but trust you all!
Let's go then.
It seems like it worked fine (I already ran the playbook).
Rather than hacking together a single-use playbook for the HTCondor migration, after seeing the mess I was creating, I decided to make something longer-lasting and hopefully easy to manage.
This PR adds a playbook (htcondor.yml) that manages both the primary and secondary HTCondor clusters. The idea is to take advantage of Ansible's group and host vars features to configure both clusters with little verbosity. It is also meant to replace the way HTCondor is being configured at the moment (the sn06.yml playbook). Let's dive a bit deeper into how this works.
The literature on how to write an Ansible inventory is full of examples with group names such as `dbservers` or `webservers`. I do not think this is a coincidence or just "one more way to do it". I believe organizing the inventory and playbooks in terms of the role the machines play in the infrastructure makes better sense than organizing them around the servers themselves, and makes things simpler.

This PR adds the following groups to the inventory:

- `htcondor`: machines that are part of the primary HTCondor cluster.
- `htcondor-manager`: machine that plays the central manager role in the primary HTCondor cluster.
- `htcondor-submit`: machines that play the submit role in the primary HTCondor cluster.
- `htcondor-secondary`: machines that are part of the secondary HTCondor cluster.
- `htcondor-secondary-manager`: machine that plays the central manager role in the secondary HTCondor cluster.
- `htcondor-secondary-submit`: machines that play the submit role in the secondary HTCondor cluster.

with the following parent-child relationships:
- `htcondor`
  - `htcondor-manager`
  - `htcondor-secondary-manager`
  - `htcondor-submit`
  - `htcondor-secondary-submit`
- `htcondor-secondary`
  - `htcondor-secondary-manager`
  - `htcondor-secondary-submit`
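For illustration only, here is a minimal sketch of how these groups and relationships could be laid out in a YAML inventory. The host names and the override shown below are placeholders, not the actual UseGalaxy.eu inventory, which may use a different format, hosts, and values:

```yaml
# Hypothetical inventory sketch; hosts and values are placeholders.
all:
  children:
    htcondor:
      children:
        htcondor-manager:
          hosts:
            manager1.example.org:
        htcondor-submit:
          hosts:
            submit1.example.org:
        # The secondary subgroups are also children of htcondor, so a single
        # play on "hosts: htcondor" reaches both clusters.
        htcondor-secondary-manager:
        htcondor-secondary-submit:
    htcondor-secondary:
      children:
        htcondor-secondary-manager:
          hosts:
            manager2.example.org:
        htcondor-secondary-submit:
          hosts:
            submit2.example.org:
      vars:
        # Small deviations of the secondary cluster are attached to its group,
        # e.g. the hostname of its central manager.
        condor_host: manager2.example.org
```

Whether such overrides live inline as above or in group_vars files is a detail; the point is that the deviations of the secondary cluster are attached to its group and subgroups rather than to individual playbooks.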
In other words, the secondary cluster is a copy of the primary cluster, just with the small deviations that are needed (such as the hostname of the central manager) defined as variables of the `htcondor-secondary` group and its subgroups.

The setup of both clusters is achieved by running a single playbook (htcondor.yml) on the `htcondor` group. Each machine is configured as part of the appropriate cluster (primary or secondary) and plays the appropriate role within that cluster purely because of the groups it has been assigned to.

Note that at the moment the management of the headnode is disabled (`hosts: htcondor:!sn06.galaxyproject.eu`). This is because a couple of group variables have been moved from the sn06 group to the htcondor group, so the headnode has to belong to the htcondor group for those variables to keep applying to it, at least until the migration is complete: the `usegalaxy_eu.htcondor` role in sn06.yml still needs to be able to read them. Once the migration is complete, the group variables and inventory can be adjusted in a rather trivial way so that no trace is left in the playbooks and inventory that a secondary cluster ever existed (read just below).
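For context, here is a rough sketch of what a play targeting these groups could look like. This is not the actual htcondor.yml; its roles, variables, and tasks may well differ:

```yaml
# Illustrative sketch only, not the real htcondor.yml.
- name: HTCondor clusters
  # One play covers both clusters through the parent group; the headnode is
  # excluded until the migration away from sn06.yml is complete.
  hosts: htcondor:!sn06.galaxyproject.eu
  become: true
  roles:
    # The role picks up condor_host, condor_daemons, etc. from group and host
    # vars, so each machine is configured according to the groups it is in.
    - usegalaxy_eu.htcondor
```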
Advantages of this approach:

- A whole secondary cluster can be added or removed just by adding/removing its group (`htcondor-secondary`) and subgroups to/from the inventory and overriding the necessary group variables. No playbooks nor roles need to be modified.

In addition, I have sometimes said that if I wanted a test Galaxy instance that is as close as possible to the production one, I would not create another repository and attempt to replicate what is in this one. I have been thinking about this in the background, and it may be achievable in a much simpler way using Ansible's inventory and variable features.
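Purely as a sketch of that idea, and under the assumption that a separate inventory plus group and host variable overrides are the features meant here, a test instance could reuse the very same playbooks by assigning test machines to the same role-based groups and overriding only what must differ. File name, hosts, and values below are all hypothetical:

```yaml
# Hypothetical test inventory (hosts.test.yml); everything here is a placeholder.
all:
  vars:
    # Only the deviations from production are overridden.
    hostname: test-galaxy.example.org
  children:
    htcondor:
      children:
        htcondor-manager:
          hosts:
            test-manager.example.org:
        htcondor-submit:
          hosts:
            test-submit.example.org:
```

Running, for example, `ansible-playbook -i hosts.test.yml htcondor.yml` would then configure the test machines with the same playbooks and roles as production.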
But again, this only holds as long as we organize playbooks and the inventory in terms of the roles that machines play in our infrastructure and not in terms of the machines we have in our infrastructure. This PR is just a little step forward towards this view.