[toc]
If a swarm is already running, jump ahead to deploying the API; otherwise, use the instructions below to get the swarm running.
- Install Docker on the target EC2 instances (assuming they are running Ubuntu 16.04):
Set up the Docker `apt` repository:
> sudo apt-get update
> sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
> curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Now verify that the key fingerprint for the Docker `apt` repository is correct:
> sudo apt-key fingerprint 0EBFCD88
Now set up the stable repository:
> sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
And install Docker:
> sudo apt-get update
> sudo apt-get install -y docker-ce
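As a quick sanity check that the engine installed correctly, you can print the version and run the small hello-world test image:
> sudo docker --version
> sudo docker run hello-world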
The `docker-compose` binary is, for some horrible reason, installed separately from stand-alone Docker. Find the latest release on the docker-compose releases page, and curl that release:
> sudo curl -L https://github.com/docker/compose/releases/download/1.19.0/docker-compose-Linux-x86_64 -o /usr/local/bin/docker-compose
> sudo chmod +x /usr/local/bin/docker-compose
Replace the `1.19.0` with the latest stable release number.
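To confirm the binary is on the path and executable:
> docker-compose --version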
When setting up a swarm environment on AWS, the following ports will need to be open between all machines in the cluster: 2181, 2377, 4789, 5000, 7946, 9455, and 9555.
The ports used for communication with the API will need to be open as well, so remote clients can connect to the API. This includes ports 17141 and 17504.
Full port list: 2181, 2377, 4789, 5000, 7946, 9455, 9555, 17141, 17504.
The purpose of each of these ports:
- 2181: Used by Zookeeper and Kafka for service management (TCP/UDP)
- 2377: Used by Docker Swarm for cluster management (TCP/UDP)
- 4789: Used by Docker Swarm overlay network (TCP/UDP)
- 5000: Temporary Docker Registry port for container distribution to Docker Swarm nodes (HTTP)
- 7946: Docker Swarm control traffic (TCP/UDP)
- 9455: Swarm internal port for Kafka communication (HTTP/TLS)
- 9555: Swarm external port for Kafka communication (HTTP/TLS)
- 17141: Sensing API Insecure port (HTTP)
- 17504: Sensing API Secure port (HTTP/TLS)
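If you manage the security groups with the AWS CLI, a rule for one of these ports can be opened along the following lines. This is only a sketch: the security group ID `sg-0123456789abcdef0` is a placeholder, and the command should be repeated for each port (using `--protocol udp` as well for the TCP/UDP ports):
> aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 2377 --source-group sg-0123456789abcdef0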
In addition, internal DNS (as handled by Route 53) must include the following records, for now all pointing at the Swarm manager node:
- sensing-api.savior.internal
- sensing-ca.savior.internal
- sensing-kafka.savior.internal
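As a rough sketch, one of these records can be pointed at the manager with the AWS CLI; the hosted zone ID below is a placeholder, and the IP is the example manager address used later in this document:
> aws route53 change-resource-record-sets --hosted-zone-id Z0000000000000 --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"sensing-api.savior.internal","Type":"A","TTL":300,"ResourceRecords":[{"Value":"10.0.4.61"}]}}]}'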
On the machine that will be the swarm manager node, run:
sudo docker swarm init --advertise-addr $(ifconfig | awk '/inet addr/{print substr($2,6)}' | grep 10.)
and record the output of the swarm init, which will look something like:
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-0ktxq9kcuz9t6kvw9559wq5r3i2qsqk5lx0f55y0rilnw719p1-02y6jh213sezys4cwn4uvv2dg 10.0.4.61:2377
On all of the worker nodes, use the following command to join the swarm (replacing the `--token` value with the actual value generated by the manager at startup):
sudo docker swarm join --token SWMTKN-1-0ktxq9kcuz9t6kvw9559wq5r3i2qsqk5lx0f55y0rilnw719p1-02y6jh213sezys4cwn4uvv2dg 10.0.4.61:2377
If you didn't record the join token from the manager when you started the swarm, you can retrieve it on the manager with the command:
sudo docker swarm join-token worker
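Once the workers have joined, you can confirm swarm membership from the manager; every node should show a STATUS of Ready, and the manager should be marked as Leader:
> sudo docker node ls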
Removing a worker node from the swarm is straightforward; the command must be run on the node being removed:
sudo docker swarm leave
Start the external Docker overlay network:
> sudo docker network create --driver overlay --attachable --subnet 192.168.1.0/24 apinet
Notice that we're directly setting a subnet for use in the Swarm network. If we don't do this, the default network used in the swarm conflicts with the default subnet in the AWS VPC (overlapping 10.0.1.0/24 segments), which wreaks havoc with DNS and container routing. The name of this network, `apinet`, must match the external network name defined in the `docker-compose-swarm.yml` compose file.
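To double-check that the subnet actually took effect, inspect the network on the manager; the output should include the 192.168.1.0/24 subnet:
> sudo docker network inspect apinet | grep Subnet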
Our Virtue/SAVIOR repository is a private GitHub repo, so you'll need either a checkout/clone URL with an embedded access token, or you can export your token to the Bash environment with:
export GITHUB_TOKEN=<your token here>
Checkout a copy of the Savior repository:
git clone "https://$GITHUB_TOKEN@github.com/twosixlabs/savior.git"
Make sure you're on the branch you intend to run from.
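For example, to switch onto the branch you intend to deploy (the branch name below is a placeholder):
> cd savior
> git checkout <branch to deploy>
> git status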
Moving containers built with `docker-compose` between the different nodes of a Docker swarm requires a registry. Rather than using the global Docker Hub registry, we spin up our own registry as part of our deploy step. Start the registry with:
> sudo docker service create --name registry --publish 5000:5000 registry:2
You can confirm that the registry is running with:
> curl http://localhost:5000/v2/
{}
The empty JSON dictionary is the expected response.
We need to build our containers and push the results to our local registry:
sudo /usr/local/bin/docker-compose -f docker-compose-swarm.yml build
sudo /usr/local/bin/docker-compose -f docker-compose-swarm.yml push
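After the push completes, the registry's catalog endpoint can be used to confirm the images landed in the local registry; the response should be a JSON object listing the pushed repositories:
> curl http://localhost:5000/v2/_catalog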
Before deploying to the swarm or building the swarm network, source the `swarm_setup.sh` script to prep the host environment:
. ./bin/swarm_setup.sh
Instead of directly invoking the `docker-compose` command, we'll deploy the API as described by the `docker-compose-swarm.yml` compose file using the `docker stack` interface to the Swarm.
Deploy everything with:
> sudo docker stack deploy --compose-file docker-compose-swarm.yml savior-api
You can check what's running in the service with:
sudo docker stack services savior-api
You can generally check that things are running smoothly by looking for errors in the API logs:
> sudo docker service logs -f savior-api_api
For debugging the current state of services, you can get a non-truncated PS result from the stack with:
> sudo docker stack ps savior-api --no-trunc
Tear down the stack with:
> sudo docker stack rm savior-api
Tear down the network:
> sudo docker network rm apinet
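You can confirm both are gone before redeploying; the stack list should no longer show savior-api, and the grep should return nothing:
> sudo docker stack ls
> sudo docker network ls | grep apinet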
If this is the first run for the API on this swarm, the database may need to be seeded with sensor configurations:
> ./bin/load_sensor_configurations.py
Individual services can be restarted/updated with:
> sudo docker service update --force savior-api_api
Where the `savior-api_` prefix is determined by the name we gave the deployed stack, and the suffix is the name of the service in the docker-compose file.
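For example, to list the exact service names and then force-restart one of them (the kafka name below assumes the compose file defines a service called kafka):
> sudo docker stack services savior-api
> sudo docker service update --force savior-api_kafka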
Get logs for individual services:
> sudo docker service logs -f savior-api_api
As above, the `savior-api_` prefix comes from the stack name, and the suffix is the service name from the docker-compose file.
Enter into an interactive bash session on any of the services:
> sudo docker exec -ti savior-api_api.1.$(sudo docker service ps -f 'name=savior-api_api.1' savior-api_api -q) /bin/bash
This is more complicated than a standard `docker exec` command because of the naming format for service deployments.
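The same command broken into two steps, which is easier to debug when the task lookup returns nothing:
> TASK_ID=$(sudo docker service ps -f 'name=savior-api_api.1' savior-api_api -q)
> sudo docker exec -ti savior-api_api.1.$TASK_ID /bin/bash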
The Docker `registry:2` service we deploy on the swarm should be reachable from any node in the swarm at `localhost:5000`, thanks to the load balancing the Swarm does for exposed service ports. Sometimes, though, things go wrong. If you consistently see `no such image localhost:5000/...` errors from `docker service ps` calls against the `savior-api_` services, chances are the Swarm routing overlay network isn't communicating, or is otherwise unable to load balance. The easiest way to restore service (after checking that network ACLs haven't changed) is to remove the Swarm worker nodes with `docker swarm leave` and then have each node re-join the swarm. This resets the overlay routing.
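A rough sketch of that recovery sequence; the token and manager address below are placeholders for the values your own manager prints. On each worker node:
> sudo docker swarm leave
On the manager, retrieve the join token again:
> sudo docker swarm join-token worker
Back on each worker, re-join:
> sudo docker swarm join --token <token from manager> <manager ip>:2377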