High availability in Swarm

High availability in Docker Swarm

In Docker Swarm, the Swarm manager is responsible for the entire cluster and manages the resources of multiple Docker hosts at scale. If the Swarm manager dies, you must create a new one and deal with an interruption of service.

The High Availability feature allows a Docker Swarm to gracefully handle the failover of a manager instance. Using this feature, you can create a single primary manager instance and multiple replica instances.

A primary manager is the main point of contact with the Docker Swarm cluster. You can also create and talk to replica instances that will act as backups. Requests issued on a replica are automatically proxied to the primary manager. If the primary manager fails, a replica takes away the lead. In this way, you always keep a point of contact with the cluster.

Setup primary and replicas

This section explains how to set up Docker Swarm using multiple managers.

Assumptions

You need either a Consul, etcd, or Zookeeper cluster. This procedure is written assuming a Consul server running on address 192.168.42.10:8500. All hosts will have a Docker Engine configured to listen on port 2375. We will be configuring the Managers to operate on port 4000. The sample Swarm configuration has three machines:

manager-1 on 192.168.42.200
manager-2 on 192.168.42.201
manager-3 on 192.168.42.202

Create the primary manager

You use the swarm manage command with the --replication and --advertise flags to create a primary manager.

  user@manager-1 $ swarm manage -H :4000 <tls-config-flags> --replication --advertise 192.168.42.200:4000 consul://192.168.42.10:8500/nodes
  INFO[0000] Listening for HTTP addr=:4000 proto=tcp
  INFO[0000] Cluster leadership acquired
  INFO[0000] New leader elected: 192.168.42.200:4000
  [...]

The --replication flag tells Swarm that the manager is part of a multi-manager configuration and that this primary manager competes with other manager instances for the primary role. The primary manager has the authority to manage cluster, replicate logs, and replicate events happening inside the cluster.

The --advertise option specifies the primary manager address. Swarm uses this address to advertise to the cluster when the node is elected as the primary. As you see in the command’s output, the address you provided now appears to be the one of the elected Primary manager.

Create two replicas

Now that you have a primary manager, you can create replicas.

user@manager-2 $ swarm manage -H :4000 <tls-config-flags> --replication --advertise 192.168.42.201:4000 consul://192.168.42.10:8500/nodes
INFO[0000] Listening for HTTP                            addr=:4000 proto=tcp
INFO[0000] Cluster leadership lost
INFO[0000] New leader elected: 192.168.42.200:4000
[...]

This command creates a replica manager on 192.168.42.201:4000 which is looking at 192.168.42.200:4000 as the primary manager.

Create an additional, third manager instance:

user@manager-3 $ swarm manage -H :4000 <tls-config-flags> --replication --advertise 192.168.42.202:4000 consul://192.168.42.10:8500/nodes
INFO[0000] Listening for HTTP                            addr=:4000 proto=tcp
INFO[0000] Cluster leadership lost
INFO[0000] New leader elected: 192.168.42.200:4000
[...]

Once you have established your primary manager and the replicas, create Swarm agents as you normally would.

List machines in the cluster

Typing docker info should give you an output similar to the following:

user@my-machine $ export DOCKER_HOST=192.168.42.200:4000 # Points to manager-1
user@my-machine $ docker info
Containers: 0
Images: 25
Storage Driver:
Role: Primary  <--------- manager-1 is the Primary manager
Primary: 192.168.42.200
Strategy: spread
Filters: affinity, health, constraint, port, dependency
Nodes: 3
 swarm-agent-0: 192.168.42.100:2375
  └ Containers: 0
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.053 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=3.13.0-49-generic, operatingsystem=Ubuntu 14.04.2 LTS, storagedriver=aufs
 swarm-agent-1: 192.168.42.101:2375
  └ Containers: 0
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.053 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=3.13.0-49-generic, operatingsystem=Ubuntu 14.04.2 LTS, storagedriver=aufs
 swarm-agent-2: 192.168.42.102:2375
  └ Containers: 0
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 2.053 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=3.13.0-49-generic, operatingsystem=Ubuntu 14.04.2 LTS, storagedriver=aufs
Execution Driver:
Kernel Version:
Operating System:
CPUs: 3
Total Memory: 6.158 GiB
Name:
ID:
Http Proxy:
Https Proxy:
No Proxy:

This information shows that manager-1 is the current primary and supplies the address to use to contact this primary.

Test the failover mechanism

To test the failover mechanism, you shut down the designated primary manager. Issue a Ctrl-C or kill the current primary manager (manager-1) to shut it down.

Wait for automated failover

After a short time, the other instances detect the failure and a replica takes the lead to become the primary manager.

For example, look at manager-2’s logs:

user@manager-2 $ swarm manage -H :4000 <tls-config-flags> --replication --advertise 192.168.42.201:4000 consul://192.168.42.10:8500/nodes
INFO[0000] Listening for HTTP                            addr=:4000 proto=tcp
INFO[0000] Cluster leadership lost
INFO[0000] New leader elected: 192.168.42.200:4000
INFO[0038] New leader elected: 192.168.42.201:4000
INFO[0038] Cluster leadership acquired               <--- We have been elected as the new Primary Manager
[...]

Because the primary manager, manager-1, failed right after it was elected, the replica with the address 192.168.42.201:4000, manager-2, recognized the failure and attempted to take away the lead. Because manager-2 was fast enough, the process was effectively elected as the primary manager. As a result, manager-2 became the primary manager of the cluster.

If we take a look at manager-3 we should see those logs:

user@manager-3 $ swarm manage -H :4000 <tls-config-flags> --replication --advertise 192.168.42.202:4000 consul://192.168.42.10:8500/nodes
INFO[0000] Listening for HTTP                            addr=:4000 proto=tcp
INFO[0000] Cluster leadership lost
INFO[0000] New leader elected: 192.168.42.200:4000
INFO[0036] New leader elected: 192.168.42.201:4000   <--- manager-2 sees the new Primary Manager
[...]

At this point, we need to export the new DOCKER_HOST value.

Switch the primary

To switch the DOCKER_HOST to use manager-2 as the primary, you do the following:

user@my-machine $ export DOCKER_HOST=192.168.42.201:4000 # Points to manager-2
user@my-machine $ docker info
Containers: 0
Images: 25
Storage Driver:
Role: Replica  <--------- manager-2 is a Replica
Primary: 192.168.42.200
Strategy: spread
Filters: affinity, health, constraint, port, dependency
Nodes: 3

You can use the docker command on any Docker Swarm primary manager or any replica.

If you like, you can use custom mechanisms to always point DOCKER_HOST to the current primary manager. Then, you never lose contact with your Docker Swarm in the event of a failover.

Links:

https://docs.docker.com/v1.11/swarm/multi-manager-setup/

doc_docker

2025-01-10 15:47:30

Comments