
Configuration Management

These are my (@robnagler) thoughts on configuration management as of today (4/27/16).

My History

I've been working with configuration management systems since the early 80s, including distributed system deployment. My first major devops implementation was the V System distribution system on the Stanford network in 1982 to 1983. Over the subsequent three decades, I fell into the role of devops engineer, which was often not my main job. It was very important for me to be effective in my role so that I didn't have to spend my days fixing our devops automation systems.

I've learned a lot during my part-time devops career, including how important security is. The rdist network I created was used during the break-in of 1986. While it may be obvious now, security was not a consideration back then. Also, the problem of securing software distribution networks hasn't changed at all.

I should also note that Stanford is still having break-ins: 2013 and 2016. This too is an important part of the puzzle.

Bivio's History

Bivio has relied on devops automation for 17 years. We have assumed Red Hat/CentOS as our base operating system. Our devops is driven by several modules in BOP:

  • HTTPConf generates multi-tenant Apache proxy and middle tier config
  • LinuxConfig provides configuration editors
  • NamedConf creates Bind config
  • NetConf creates ifconfig files
  • Release builds and distributes RPMs, using a preprocessor to promote sharing between RPMs

Some negatives:

  • Push is not supported
  • RPMs must be built before config can be deployed
  • Acceptance testing is based on application level tests
  • Many of the edits have been ad hoc in the RPMs themselves
  • Some config is declarative while other config is imperative
  • Getting an overview is difficult

Goals

While Bivio started out running OLTP systems, we are now focusing on computational science, which is an OLAP workload. Computational jobs can run for days. We also run third-party applications and codes.

Most modern configuration management (CM) assumes services can be restarted at any time. This greatly simplifies things, but it doesn't work in our case.

We are relying on Docker to distribute all our applications. This allows for greater flexibility on system configuration. For example, if an application needs to run periodic (cron) jobs, it can install cron in a container. There's no need to maintain a centralized cron job. Even the CM client could be run from a (privileged) container.
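As a sketch of that last point, the CM client could be launched in a privileged container via the Docker CLI, roughly as follows; the image name and mounts are assumptions, not something we run today:

```python
# A sketch only: run the CM client from a privileged container using the
# Docker CLI. The image name and mounts are assumptions.
import subprocess

def run_cm_client(image='example/cm-client:latest'):
    subprocess.check_call([
        'docker', 'run', '--rm', '--privileged',
        '-v', '/:/host',                                    # manage the host filesystem
        '-v', '/var/run/docker.sock:/var/run/docker.sock',  # manage other containers
        image,
    ])
```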

We aren't running many nodes, and the nodes are heterogeneous across the fleet, although each OLAP cluster must be homogeneous internally. We also will eventually want to run arbitrary customer code on our clusters, just like supercomputer centers do. We won't be able to vet that code, but we can require it to be compiled into a Docker image.

Our goals are therefore rather different than webapp environments. Here's the list in alphabetical (not priority) order:

  • Atomic updates for clusters
  • Authoritative management (master imposes state on nodes)
  • Automatic bootstrap (PXE)
  • Continuous deployment to alpha and beta
  • Declarative CM state with modular extensions
  • Deferred updates for busy nodes
  • Docker
  • Fedora
  • Pull or push
  • Remote execution: arbitrary commands
  • Staging of OS updates on our schedule
  • Staging: dev, alpha, beta, prod
  • Test: unit and acceptance
  • Undo: delete configuration upon removal from master

Some of this is not required today.

Implementation

Salt is an excellent remote execution engine, and it manages CM state securely. What it doesn't do well is authoritative management: there is no easy "undo" of all (or any) operations, and it doesn't handle deferred updates. Updates are "atomic enough" for clusters, because you can target machines and the clients do all the work.

Another big problem with authoritative management is that the configuration selectors are too complicated and, more importantly, insecure. According to the last SaltStack FAQ entry, you can't trust Grains; instead you should use the Minion ID. This subtle design flaw is critically important to the overall architecture.
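One way to live with that constraint is to select configuration on the master by Minion ID, for example in an external pillar module. This is only a sketch; the role map, role names, and hostnames are hypothetical:

```python
# Minimal sketch of an external pillar that assigns configuration by the
# master-verified Minion ID rather than by grains, which a compromised
# minion could forge. The role map and names are hypothetical.

ROLE_BY_MINION_ID = {
    'web1.example.com': ['apache', 'proxy'],
    'db1.example.com': ['postgresql'],
}


def ext_pillar(minion_id, pillar, *args, **kwargs):
    """Return pillar data derived only from minion_id."""
    return {'roles': ROLE_BY_MINION_ID.get(minion_id, [])}
```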

See our Salt configuration for more details.

The rest of this page is rough notes.

Deferred updates are not necessary today. Neither Salt nor any other CM supports them, as far as I know. We will have to build something on top of the CM to handle this case.
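A rough sketch of what that layer might look like, driven from the master via Salt's Python API; the busy check (SLURM's squeue here) is a placeholder for whatever indicates that long-running computational jobs are still active on a node:

```python
# Sketch of a deferral layer on top of Salt's Python API, run on the master.
# The busy check is a placeholder for the real signal of long-running jobs.
import salt.client

local = salt.client.LocalClient()


def apply_when_idle(minions):
    """Apply state to idle minions; return the ones that were deferred."""
    deferred = []
    for minion in minions:
        out = local.cmd(minion, 'cmd.run', ['squeue --noheader'])
        if out.get(minion, '').strip():
            deferred.append(minion)  # still busy; retry on the next pass
        else:
            local.cmd(minion, 'state.apply')
    return deferred
```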

Continuous deployment has to be built on top of the CM.

CM staging is unclear.

We could use Copr to automate bootstrap and staged OS updates.

Deleting: every state could output a companion file that would reverse the operation of the state that was run; perhaps this could be an sh file. If the state runs again and outputs the same file, nothing happens. If, however, a state run does not refresh the file (old timestamp), it would be run to delete that state. The file name would have to be relative to the object being created, perhaps one object per file, e.g. docker.jupyterhub..sh for the service being installed. Dependencies between the files also need to carry through. We could use the reactor to catch the event that a job (state.apply/highstate) was run; this would then run a cleanup job on the minion to remove the stale configuration.
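A sketch of that cleanup pass; the directory and the per-object naming convention are hypothetical:

```python
# Sketch: undo scripts that were not refreshed during the latest run belong
# to states removed from the master, so execute and discard them.
# UNDO_DIR is a hypothetical location.
import os
import subprocess

UNDO_DIR = '/var/lib/cm-undo'


def cleanup(run_started_at):
    for name in os.listdir(UNDO_DIR):
        path = os.path.join(UNDO_DIR, name)
        if os.path.getmtime(path) < run_started_at:
            subprocess.check_call(['sh', path])  # reverse the removed state
            os.remove(path)
```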

Concepts the CM needs to cover, with the tools that illustrate them:

  • prerequisites (make)
  • triggering events (triggers): before, delayed, immediately (https://docs.chef.io/resource_common.html#notifications)
  • depending on events (Salt "watch" dependencies)
  • executing an operation only once (make)
  • data values: secret and non-secret
  • shared functions

make and Salt states are analogous: the Makefile/YAML is declarative, but the recipe is imperative.

YAML is a side issue, because you need to know how to program; that's what devops is about.

Make has the problem that dependencies are not limited.

The complexity is around stateful servers. Database upgrades may need all transactions to stop, which may require application-level support so the site stays up; ideally, upgrades happen incrementally with the server up. Major schema changes are tricky.
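For illustration, an incremental "expand, backfill, contract" migration might batch its work so the server can stay up; the table and column names here are made up, and sqlite3 is used only to keep the sketch self-contained:

```python
# Illustrative backfill step: copy data to a new column in small batches so
# long-running transactions are never blocked for long.
import sqlite3


def backfill_in_batches(db_path, batch_size=1000):
    conn = sqlite3.connect(db_path)
    while True:
        cur = conn.execute(
            "UPDATE job SET new_status = old_status "
            "WHERE rowid IN (SELECT rowid FROM job "
            "                WHERE new_status IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:  # nothing left to migrate
            break
    conn.close()
```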

Rolling updates need to happen when servers are freed up, which can take days in computational science (see, for example, NERSC's scheduled system outages: http://www.nersc.gov/events/nersc-scheduled-system-outages/).

Inter-node dependencies are not handled except by simple orchestration.

Restarts need scheduling. You can't just restart JupyterHub, because all the Jupyter servers will need to be killed and restarted.
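For example, a JupyterHub restart could be scheduled only after the user servers have drained, by polling the Hub's REST API; the URL, token, and response fields here are assumptions that depend on the deployed JupyterHub version:

```python
# Sketch: wait for JupyterHub's user servers to drain before restarting the
# Hub. URL, token, and the server fields are assumptions.
import time
import requests

HUB_API = 'http://127.0.0.1:8081/hub/api'
TOKEN = 'REPLACE_ME'


def wait_for_drain(poll_seconds=300):
    while True:
        users = requests.get(
            HUB_API + '/users',
            headers={'Authorization': 'token ' + TOKEN},
        ).json()
        active = [u['name'] for u in users if u.get('server') or u.get('servers')]
        if not active:
            return  # safe to restart the Hub now
        time.sleep(poll_seconds)
```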

Paul: dependencies in Sirepo.

From "What Do Operators Do All Day?" (http://www.mattfischer.com/blog/?p=619):

"One of my first concerns that I expressed when interviewing for this job was that we’d have Openstack setup in a year and then we’d be done. That has been far from the truth. In reality, the life of an Openstack operator is always interesting. There’s no shortages of things to fix, things to improve, and things to learn, and that’s why I love it. Although each release of Openstack generally makes things easier and more robust, they also always add more features along the edges to keep us busy."

Other concerns, in brief:

  • major release syndrome; seamless system upgrades
  • quiescence
  • deterministic: eventual consistency is not sufficient
  • idempotent (other): only necessary changes (see the sketch after this list)
  • dependencies are easy: once
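As a small illustration of the idempotent item above: write a file only when its content actually differs, so repeated runs change nothing and trigger no spurious restarts (the path handling and helper name are hypothetical):

```python
# Only make necessary changes: rewrite the file only when its content
# differs, so repeated runs are no-ops.
import os


def install_file(path, content):
    """Return True if the file changed, False if it was already correct."""
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return False  # already correct; don't even touch the mtime
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        f.write(content)
    os.rename(tmp, path)  # atomic replace on POSIX
    return True
```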

Undo is hard, and it is less necessary with Docker around, but it needs to be coordinated. CoreOS and RH Atomic do not resolve this issue for the whole system, just for the OS. OS rollback is much less of an issue now that all apps are deployed with Docker.

Building Docker containers is CM. You need to know which versions are in the container and when to update them. Everything needs to be coordinated and managed for it to be reproducible, and the building and distribution of containers often needs to be coordinated with configuration changes.
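A sketch of treating the image build itself as CM: pin the versions in a manifest and derive the tag from it, so the build is reproducible and can be coordinated with configuration changes. File names and the tagging scheme are assumptions, and the Dockerfile is assumed to COPY versions.json so the image is auditable:

```python
# Sketch: derive an image tag from a pinned version manifest.
import hashlib
import json
import os
import subprocess


def build_image(name, versions, dockerfile_dir):
    manifest = json.dumps(versions, sort_keys=True)
    tag = name + ':' + hashlib.sha256(manifest.encode()).hexdigest()[:12]
    with open(os.path.join(dockerfile_dir, 'versions.json'), 'w') as f:
        f.write(manifest)  # assumed to be COPY'd into the image for auditing
    subprocess.check_call(['docker', 'build', '-t', tag, dockerfile_dir])
    return tag
```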

Docker eliminates many issues, because it is "compile oriented".

VMs increase flexibility, too, but are not a panacea. You don't want to constantly change IP addresses of backend services.

Kubernetes doesn't manage long-running tasks. It assumes processes are ephemeral: OLTP, not OLAP.

Operating system updates (yum update) are not solved by any of this. How do you version an operating system update?

Internal package update dependencies: updates are triggered by builds, but they need to be coordinated across images, which again requires orchestration.

Specific updates: how do you know which operations to execute when a security update is required? You might not want to restart the entire system.

Multi-tenancy: don't bring down all apps when one app needs updating. This comes down to what people are thinking about.

The complexity is in the number of apps, not necessarily in horizontal scaling. For OLAP, horizontal scaling is not typically used anyway.

Atomicity: we need to know that all updates are correct and to tag that set as the current version to be used by the CM.
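One way to make that tagging step atomic is to validate a release directory completely and then flip a single "current" pointer; the directory layout here is hypothetical:

```python
# Sketch: atomically swap a "current" pointer so the CM never sees a
# partially updated release. The paths are hypothetical.
import os


def promote(release_dir, current_link='/srv/cm/current'):
    tmp = current_link + '.new'
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(release_dir, tmp)
    os.rename(tmp, current_link)  # atomic replace of the pointer on POSIX
```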

Continuous deployment in OLAP doesn't work. You have to coordinate updates.

Pipeline management: what's on alpha migrates to beta and then to prod. The migration triggers have to be manual for prod, and probably for beta.
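A minimal sketch of such a gate, assuming a promotion step exists per channel; the channel policy and hand-off are hypothetical placeholders:

```python
# Sketch: alpha promotes automatically, beta and prod need confirmation.
AUTO_PROMOTE = {'alpha': True, 'beta': False, 'prod': False}


def promote_build(build_id, channel, confirm=None):
    if not AUTO_PROMOTE[channel] and confirm != build_id:
        raise RuntimeError(
            'manual confirmation required to promote %s to %s' % (build_id, channel))
    # ...hand the build off to the deployment process for this channel...
    print('promoted', build_id, 'to', channel)
```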

Security of sensitive data like passwords.

Completely automatic bootstrap is not that important, but if it's there, it needs to be coordinated with all the moving parts.
