Deployment of build.opensuse.org

Dani Donisa edited this page Aug 10, 2021 · 98 revisions

Prerequisites

You should only need to do this once, unless something in the setup changes.

  1. Set the correct SSH hostname in inventory/production.yml. The SSH hostname needs to match the one that points to the reference instance in your SSH config, usually found in ~/.ssh/config. Ask someone in the team if you don't know it.
  2. Set the correct credentials and information regarding the GitHub Deployments and Rocket Chat webhooks as described here.
  3. Ensure obs is configured as one of your hosts, so you can access the reference server via SSH.
  4. If you are not on openSUSE Tumbleweed/Leap, run the following command to install zypper-related commands for Ansible: ansible-galaxy collection install community.general
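For steps 1 and 3, the relevant entry in ~/.ssh/config could look like the following sketch. The hostname and user below are placeholders, not the real values; ask someone in the team for those.

```
# ~/.ssh/config — sketch with placeholder values
Host obs
    HostName reference.example.org   # placeholder; use the real reference instance
    User deployuser                  # placeholder; use your deployment user
```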

Deploying

  1. Check that your VPN is working and you can access the reference server.

  2. Check the diff of the changes you are going to introduce. Knowing upfront whether there are database/data migrations or anything else requiring special care makes it easier to solve any possible failure after deployment. There are two ways to do this, depending on what you prefer:

    • obs_deploy check-diff (this comes from obs_deploy, which is in the bin directory of ansible-obs).
    • Directly on GitHub at https://github.com/openSUSE/open-build-service/compare/commit1...commit2. commit1 is the version currently deployed, you get it from obs_deploy deployed-version. commit2 is the version to be deployed, you get it from obs_deploy available-package by taking the commit SHA from the output (as highlighted here: Available package: obs-api-2.11~alpha.20210623T143055.cb94d5bb1c-12578.1.noarch.rpm)
  3. Check if there is a monkey patch in the server and act accordingly.

  4. Deploy! Use the correct playbook depending on the changes to be deployed:

    • Most of the time we deploy only code changes that don't introduce any changes to the database schema or require stopping the Apache server for other reasons. Ansible will abort the operation in case the package contains one or more database/data migrations. If that is the case, check the other options below.

      ansible-playbook -i inventory/production.yml deploy_without_migration.yml -vv
      
    • Some database/data migrations are non-disruptive and therefore don't cause downtime. The corresponding Ansible playbook will skip the step of putting the Apache server into maintenance mode. You need to check upfront whether the database/data migration is non-disruptive; Ansible is not able to distinguish between the two cases. Once you've confirmed that there won't be downtime, go ahead. Otherwise, see the other option below.

      ansible-playbook -i inventory/production.yml deploy_with_migration_without_downtime.yml -vv
      
    • In many cases, database migrations require stopping all interactions of the application with the database while they are being executed, therefore causing downtime. Database migrations with downtime should run in the maintenance window, Thursday 8 AM - 10 AM CET/CEST.

      ansible-playbook -i inventory/production.yml deploy_with_migration.yml -vv
      

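The compare URL from step 2 above can be built with a small shell sketch. The sed pattern assumes the Available package: filename format shown above (a 10-character commit SHA before the build number); commit1 is a placeholder you would take from obs_deploy deployed-version.

```shell
# Build the GitHub compare URL from the obs_deploy output.
commit1='0000000000'  # placeholder: take this from `obs_deploy deployed-version`
pkg_line='Available package: obs-api-2.11~alpha.20210623T143055.cb94d5bb1c-12578.1.noarch.rpm'
# Extract the commit SHA embedded in the package filename.
commit2=$(printf '%s\n' "$pkg_line" | sed -E 's/.*\.([0-9a-f]{10})-[0-9]+\..*/\1/')
echo "https://github.com/openSUSE/open-build-service/compare/$commit1...$commit2"
```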
Monkey-patching on production

If you need to monkey-patch something, you'll want to lock deployments until you have a proper fix.

To do it, you need to use ObsGithubDeployments.

Ansible will detect any locks set and refuse to deploy, so you will be safe.

Using docker-compose

As an alternative, you can use docker-compose as explained in the README.

Extras

We can use obs_deploy which provides useful information to anyone about to deploy:

  • obs_deploy check-diff displays the changes that are going to be applied.
  • obs_deploy pending-migrations will let you know if there is any pending migration.
  • obs_deploy available-package displays the currently available package version.
  • obs_deploy systemctl --host=<the host to connect to> --user=<username> returns the status of some vital systemd services we use for the OBS service. The host should be the same one configured in inventory/production.yml. The user should be the user configured in your ~/.ssh/config for that host.
  • obs_deploy --help to discover more interesting commands.

We can also use obs_github_deployments to check for deploy locks or to set/unset them. Please read the Usage section to learn how to do it.

Troubleshooting

Delayed Jobs

For delayed jobs, clockworkd and Sphinx indexing there is a systemd target:

systemctl start|stop|status|restart obs-api-support.target

To make sure everything is running fine:

# The systemd target should display as active (running), in green
systemctl status obs-api-support.target
# All dependent services should display active (running), in green
systemctl list-dependencies obs-api-support.target

If one of these services is already running and systemctl stop obs-api-support.target does not stop it, you have to kill it manually (ps aux | grep delayed), otherwise restarting will fail.

To get an overview about remaining jobs, you can run:

run_in_api rails runner script/delayed_job_stats.rb

Assets issues

From time to time we have issues with the CSS/JS assets. If application.css or application.js is missing (you will notice unusual errors in your JavaScript console, especially a 404 when trying to retrieve the file), there is probably more than one sprockets manifest in production. Go to the public folder, check which manifest comes from the package and delete the one that doesn't. After that, reload the application.

cd public/assets
rpm -qf .sprockets-manifest*
rm .sprockets-manifest-$SOMEHASH.json
cd ../..
touch tmp/restart.txt
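The manual cleanup above can be scripted. The following is a sketch with a stubbed ownership check so it runs anywhere; on the real server, the check would be rpm -qf "$m" >/dev/null 2>&1 instead of the stub.

```shell
# Delete every sprockets manifest that does not belong to the installed package.
# owned_by_package is a STUB for demonstration; on production, replace its body
# with: rpm -qf "$1" >/dev/null 2>&1
assets=$(mktemp -d)   # stand-in for public/assets
touch "$assets/.sprockets-manifest-aaa.json" "$assets/.sprockets-manifest-bbb.json"
owned_by_package() { [ "$(basename "$1")" = ".sprockets-manifest-aaa.json" ]; }
for m in "$assets"/.sprockets-manifest-*.json; do
  owned_by_package "$m" || rm "$m"   # keep only the packaged manifest
done
```

After removing the orphaned manifest, remember to run touch tmp/restart.txt so Passenger picks up the change.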

RabbitMQ

When you see errors like SSL_connect SYSCALL returned=5 errno=0 state=unknown state in Errbit, this usually means that there is some issue with the RabbitMQ server or connection. The maintenance window of the RabbitMQ server is Thursday, 8:00 AM to 10:00 AM CET; maintenance during that window can also cause these errors.

Errbit can also report AMQ::Protocol::EmptyResponseError: Empty response received from the server. errors. This has happened not when the reference server was deployed, but when other machines were updated (regular kernel updates, for example). In that case a deployment of the reference server restored the connection; a reload of Apache was not enough.

Airbrake/Errbit

To make sure that Airbrake is working, run run_in_api rake airbrake:test. This sends an Airbrake event to our Errbit; it should show up in the OBS frontend App.

Disabling the rpm test suite

If a deployment caused some breakage, you might need to build new packages quickly. In that case, you can temporarily disable the test suite in our rpm spec by adding %disable_obs_test_suite 1 to the project config of OBS:Server:Unstable. Important: it needs to go into the Macros: ... :Macros section.

Downgrading packages

It might happen that a deployment breaks OBS badly and you need to get it working again quickly. In that case, check if zypper still has the old packages in its cache: run zypper se --details and verify that the package version you want is still available. If so, run zypper in --oldpackage obs-api-$VERSION.$ARCH to downgrade the package.

Disabling OBS integration in GitHub

OBS packages are built whenever a PR gets merged to master. This might delay publishing of built packages. To prevent this, disable the OBS integration in GitHub:

  1. Go to settings of the OBS GitHub project
  2. Select the 'Integration & services' tab and click on 'Edit' in the OBS column
  3. Uncheck the 'Activate' checkbox and 'Update services'
  4. Once the deployment is done activate the checkbox again ;-)

Manual Fixing on Production

As far as possible, when a bug is found we should act as usual: create a Pull Request, wait for it to be reviewed and merged and then wait until the changes can be deployed.

This process usually takes so long that, sometimes, we cannot wait for it to finish. For example, when the bug is blocking someone's work or even the whole application.

Only in such cases, we can apply the changes manually on production (monkey patch) following these steps:

  • Access production server via SSH.
  • Go to the application's directory.
  • Apply the fix manually.
  • Run touch tmp/restart.txt to restart the server (Passenger).
  • Add to /etc/motd the link to the Pull Request that fixes the problem. In the next deployment, that PR is going to be applied and will replace the manual changes.
  • Block the deployment script with ogd lock --repository $GITHUB_REPOSITORY --token $GITHUB_TOKEN --reason "Same information as you added on /etc/motd"

Deployment of the obs-cloud-uploader package

  • ssh to our cloud uploader instance
  • Run zypper up obs-cloud-uploader