MDEV-25855 Added support for Galera replication with cluster auto bootstrapping #377

tymonx · 2021-05-21T11:44:23Z

This patch adds support for Galera replication. It fixes #28 (Support Galera Replication).

Features:

- It detects whether Galera replication is enabled, using the mysql
  configuration files or the provided mysqld command line arguments.
- By default it enables the cluster auto bootstrap feature.
- By default the first cluster node is used for cluster auto bootstrapping,
  based on the wsrep_cluster_address parameter from the mysql
  configuration files, the mysqld command line arguments, or the
  WSREP_CLUSTER_ADDRESS environment variable.
- The cluster auto bootstrap feature can be disabled by setting the
  WSREP_SKIP_AUTO_BOOTSTRAP environment variable.
- Use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable to explicitly
  choose another node for cluster bootstrapping.
- Cluster node hostnames or IP addresses must be valid for cluster
  auto bootstrapping to work.
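The selection rules in the list above can be sketched roughly as follows. This is only an illustration, not the code from this patch: the function name `bootstrap_target` is hypothetical, and it assumes that any non-empty value of WSREP_SKIP_AUTO_BOOTSTRAP disables the feature.

```sh
#!/usr/bin/env sh
# Hypothetical helper, not the entrypoint's real code.
# Prints the address that would be chosen for bootstrapping, or nothing
# when auto bootstrap is disabled.
# usage: bootstrap_target <wsrep_cluster_address>
bootstrap_target() {
    # Assumed semantics: any non-empty value disables auto bootstrap.
    if [ -n "${WSREP_SKIP_AUTO_BOOTSTRAP:-}" ]; then
        return 0
    fi

    # An explicit override wins over the first listed node.
    if [ -n "${WSREP_AUTO_BOOTSTRAP_ADDRESS:-}" ]; then
        printf '%s\n' "${WSREP_AUTO_BOOTSTRAP_ADDRESS}"
        return 0
    fi

    list="${1#*://}"               # drop the gcomm:// scheme
    printf '%s\n' "${list%%,*}"    # keep everything before the first comma
}
```

For example, `bootstrap_target "gcomm://db_node_1,db_node_2,db_node_3"` would print `db_node_1` when neither environment variable is set.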

How to use it.

Prepare mysql configuration file galera.cnf:

[galera]
wsrep_on                       = ON
wsrep_sst_method               = rsync
wsrep_provider                 = /usr/lib/libgalera_smm.so
bind-address                   = 0.0.0.0
binlog_format                  = row
default_storage_engine         = InnoDB
innodb_doublewrite             = 1
innodb_autoinc_lock_mode       = 2
innodb_flush_log_at_trx_commit = 2

Remove write permission for others (it fixes Warning: World-writable config file):

chmod o-w galera.cnf

Prepare Docker Compose file docker-compose.yml:

services:
    node:
        image: mariadb
        restart: always
        environment:
            WSREP_CLUSTER_ADDRESS: "${WSREP_CLUSTER_ADDRESS:-}"
            MYSQL_ROOT_PASSWORD: example
        volumes:
            - ./galera.cnf:/etc/mysql/conf.d/10-galera.cnf:ro,z
        command:
            - --wsrep-cluster-address=gcomm://db_node_1,db_node_2,db_node_3
        deploy:
            replicas: 3

Start Docker Compose:

docker-compose --project-name db up

To start N MariaDB instances using environment variable:

export WSREP_CLUSTER_ADDRESS="gcomm://db_node_1,db_node_2,db_node_3,db_node_4,db_node_5"
docker-compose --project-name db up --scale node="$(echo "${WSREP_CLUSTER_ADDRESS}" | tr ',' ' ' | wc -w)"

To start N MariaDB instances using mysql configuration file:

docker-compose --project-name db up --scale node="$(grep -i wsrep_cluster_address <name>.cnf | tr -d ' ' | tr ',' ' ' | wc -w)"
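The one-liner above counts words on the whole matching line. A slightly more defensive sketch, assuming a single `key = value` line in the configuration file, could isolate the value first. The function name `count_nodes` is hypothetical and is not part of this patch.

```sh
#!/usr/bin/env sh
# Hypothetical helper: count the nodes listed in wsrep_cluster_address.
# usage: count_nodes <config-file>
count_nodes() {
    # Take the last matching line and remove all whitespace from it.
    value="$(grep -i '^[[:space:]]*wsrep[_-]cluster[_-]address' "$1" | tail -n 1 | tr -d '[:space:]')"
    value="${value#*=}"       # drop the key
    value="${value#*://}"     # drop the gcomm:// scheme
    value="${value%%[?]*}"    # drop trailing ?options, if any

    if [ -z "${value}" ]; then
        echo 0
        return 0
    fi

    # One entry per line, then count the non-empty lines.
    printf '%s\n' "${value}" | tr ',' '\n' | grep -c .
}
```

It could then be used as `docker-compose --project-name db up --scale node="$(count_nodes galera.cnf)"`.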

To start N MariaDB instances using POSIX script helper:

#!/usr/bin/env sh

# usage: scale.sh <project-name> <service-name> <scale>
#  e.g.: scale.sh db node 5

PROJECT_NAME="${1:-db}"
SERVICE_NAME="${2:-node}"
SCALE="${3:-3}"

WSREP_CLUSTER_ADDRESS="gcomm://${PROJECT_NAME}_${SERVICE_NAME}_1"

for i in $(seq 2 "${SCALE}"); do
    WSREP_CLUSTER_ADDRESS="${WSREP_CLUSTER_ADDRESS},${PROJECT_NAME}_${SERVICE_NAME}_${i}"
done

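# export it so that docker-compose can substitute it into docker-compose.yml
export WSREP_CLUSTER_ADDRESS

docker-compose --project-name "${PROJECT_NAME}" up --scale "${SERVICE_NAME}"="${SCALE}"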

Example usage:

./scale.sh db node 5

julienfritsch44 · 2021-05-27T07:24:43Z

@janlindstrom do you think you can review this, please?

janlindstrom · 2021-06-04T10:08:56Z

I must say I do not know much about docker but changes do look reasonable.

grooverdan · 2021-06-23T08:46:51Z

Thanks @janlindstrom.

@tymonx sorry I've been so slow, I am progressing. I've been testing with podman{,-compose}; being userspace-only limits some things, like unique IP addresses per node (there will probably be a way eventually), and I've been reacquainting myself with Galera and Compose to ensure that it's the right design.

I'm pretty happy so far. Just been composing test cases.

Success:

detection of volume state and the initialization

Not Yet (to be fixed eventually):

ports on the cluster address should be ignored (very small change to docker_address_match).

What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1" ? Wouldn't you take direct $1 matches before a resolution?

ChristianCiach · 2021-06-27T00:26:37Z

Hi @tymonx! Thank you for doing this! We are currently evaluating bitnami/mariadb-galera, but we are seeing quite a lot of bugs. Some of these bugs happen because this image is not designed for host-networking --network host and using IP addresses instead of hostnames for the wsrep-cluster-address (even though this is recommended by the galera documentation).

Please make sure that your PR also works in these cases.

Also, you may want to provide an option to force a container into bootstrap mode. When the whole cluster crashes, it may happen that no node is safe_to_bootstrap. When this happens, one node must be forced to bootstrap. On native mariadb installations, you would just run mysqld --wsrep-new-cluster again after editing grastate.dat to set safe_to_bootstrap: 1. The Bitnami image provides the environment variable MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP (see https://github.com/bitnami/bitnami-docker-mariadb-galera/blob/3b93659e7d0647a5bf3810cc204d71d834120266/10.5/debian-10/rootfs/opt/bitnami/scripts/libmariadbgalera.sh#L99).

But after thinking about this for a minute, this is probably not necessary here, because the user could just pass --wsrep-new-cluster as a command to docker run, right? (This is not possible when using the Bitnami image, which is probably why they invented the environment variable).
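For reference, the manual grastate.dat edit mentioned above might look like the sketch below. It assumes the usual `safe_to_bootstrap: <0|1>` line in that file; the function name `force_safe_to_bootstrap` is hypothetical. It should only be run on the node that is known to hold the most recent state.

```sh
#!/usr/bin/env sh
# Hypothetical helper: mark a node as safe to bootstrap.
# usage: force_safe_to_bootstrap <path-to-grastate.dat>
force_safe_to_bootstrap() {
    state_file="$1"
    tmp_file="$(mktemp)" || return 1

    # Rewrite only the safe_to_bootstrap line; uuid and seqno stay untouched.
    # Writing back with cat keeps the owner and mode of the original file.
    sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "${state_file}" > "${tmp_file}" \
        && cat "${tmp_file}" > "${state_file}"
    status=$?

    rm -f "${tmp_file}"
    return "${status}"
}
```

After that, the node would be started once with --wsrep-new-cluster, as described in the comment above.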

ChristianCiach · 2021-06-27T00:30:18Z

It would be nice if you could provide a way to force a node into bootstrap mode just once. In case of a cluster crash, I want a node to force-bootstrap just once to repair the cluster. But when I do docker restart when the cluster is working again, I don't want the container to force-bootstrap again.

Edit: I have no idea how this could be achieved...

grooverdan · 2021-06-27T03:30:49Z

@ChristianCiach thanks for your interest and for describing the requirements/use cases. The number of variants is what is making this take so long to review. While the aim is not to be comprehensive in the first round of functionality, I do aim for an implementation that will be stable.

Yes --wsrep-new-cluster can be passed as an argument as a force option, but like what you mentioned on restart this isn't desired, so a different option/variable is needed.

I'm going to consider this bootstrap first, and then recovery as the next step.

ChristianCiach · 2021-06-27T08:36:38Z

Bitnami's MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP has the same issue, as it also doesn't remove itself. When using this environment variable, you have to remember to re-deploy the container without this variable after the cluster has recovered.

tymonx · 2021-06-27T15:53:59Z

I'm back :)

ports on the cluster address should be ignored (very small change to docker_address_match).

Fixed. I have also added a line for stripping cluster address options ?option1=value1[&option2=value2]:

# it removes URI schemes like gcomm://
address="${address#[[:graph:]]*://}"

# it removes options suffix ?option1=value1[&option2=value2]
address="${address%%\?[[:graph:]]*}"

# it removes port suffix per address (a suffix removal; the previous
# ${address/:[0-9]*//} form replaced the port with a "/")
address="${address%:[0-9]*}"
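As an illustration of how these stripping steps could be applied to every entry of a full cluster address, here is a minimal POSIX sh sketch. The function names `strip_address` and `list_hosts` are hypothetical, and IPv6 literals are not handled.

```sh
#!/usr/bin/env sh
# Hypothetical helpers, IPv4 addresses and hostnames only.

# usage: strip_address <address>
strip_address() {
    address="$1"
    address="${address#*://}"       # drop a URI scheme such as gcomm://
    address="${address%%[?]*}"      # drop ?option1=value1[&option2=value2]
    address="${address%:[0-9]*}"    # drop a trailing :port
    printf '%s\n' "${address}"
}

# usage: list_hosts <wsrep_cluster_address>
# Prints one host per line.
list_hosts() {
    rest="${1#*://}"
    rest="${rest%%[?]*}"
    old_ifs="${IFS}"
    IFS=','
    # Unquoted on purpose: split the list on commas.
    for entry in ${rest}; do
        strip_address "${entry}"
    done
    IFS="${old_ifs}"
}
```

With this sketch, `list_hosts "gcomm://db_node_1:4567,db_node_2:4577,db_node_3"` prints the three host names without ports.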

What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1" ? Wouldn't you take direct $1 matches before a resolution?

I was just randomly hitting my keyboard. No specific reason. I have already changed the order: hostnames first.

I have added new changes after some intense testing on various environments, Docker Compose, Docker Swarm, QEMU, Fedora CoreOS, with/without virtualization or physical machines.

DNS lookups are now done in both directions, IP -> hostname and hostname -> IP. This allows the node's IP address or hostname to be matched correctly.

Reasons:

Docker Compose/Swarm implicitly creates two hostnames: <service-name>-<id>.<network-name> and a random hash. This allows matching against <service-name>-<id>.<network-name> or <service-name>-<id>.
Virtual machines like QEMU hide the guest (the container with MariaDB) in its own network with its own IP. It is possible to set the hostname from Compose/Swarm like this: -netdev user,id=<name>,hostname=$(hostname) -device virtio-net,netdev=<name>, and then use <service-name>-<id>.<network-name> or <service-name>-<id>.
The machine hostname can be any name, including one that is not reachable from the network. A reverse DNS lookup resolves that.
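The short-name versus fully-qualified matching described above can be sketched without any DNS lookup. This is only an illustration under the assumption that names are compared as plain strings; `hostname_matches` is a hypothetical name and is not the docker_hostname_match function from the patch.

```sh
#!/usr/bin/env sh
# Hypothetical helper: does a cluster address entry refer to this host?
# usage: hostname_matches <candidate> [own-hostname]
hostname_matches() {
    candidate="$1"
    own="${2:-$(hostname)}"

    # An exact match always wins.
    [ "${candidate}" = "${own}" ] && return 0

    case "${candidate}" in
        *[!0-9.]*) ;;       # contains a non-digit, non-dot character: a name
        *) return 1 ;;      # looks like an IPv4 address: exact match only
    esac

    # Compare the part before the first dot, so that
    # <service-name>-<id> matches <service-name>-<id>.<network-name>.
    [ "${candidate%%.*}" = "${own%%.*}" ]
}
```

For example, `hostname_matches node-1 node-1.db_default` succeeds, while `hostname_matches node-2 node-1.db_default` fails.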

I have fixed the YAML example in the PR description. The proper SELinux label is :ro,z, not :ro,Z (see "Configure the selinux label").

To Do:

Checking the $wsrepdir/gvwstate.dat file is not enough. On graceful container shutdown this file is removed by the MariaDB daemon, which causes bootstrapping to run again. I'm currently looking into improving this.

ChristianCiach · 2021-06-27T16:26:42Z

To be honest, I don't fully trust your ip/hostname detection logic. There are too many "but what if"s. For example, what happens if the machine has multiple network devices and the container is deployed using "host networking"? Also, I've seen many environments where dns reverse lookup is just not possible.

I would like to be able to explicitly define the node address of the current container. For example, if wsrep_cluster_address is gcomm://172.28.180.96,172.28.180.97,172.28.180.98, I would like to be able to explicitly define the node address of the second node to 172.28.180.97. If you already know the node address of the current node, there is no need to guess anymore. In fact, I already do pass the node address to the container using --wsrep_node_address.

tymonx · 2021-06-27T16:32:47Z

@ChristianCiach no problem, I can add a comparison with the wsrep-node-address value.

It depends on user needs. For example, wsrep-node-address is useless when someone is using replicas or global mode, because it would somehow have to be set separately for each created container.

ChristianCiach · 2021-06-27T16:36:38Z

Yes, of course, I agree with you :) It is not always possible to have different configurations for each node. For example, if you want to scale your cluster up/down dynamically (for example using Docker Swarm services or Kubernetes StatefulSet), then it is very hard or even impossible to set wsrep-node-address.

I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address.

Again, thank you so much for doing this. It already looks very promising!.

tymonx · 2021-06-27T16:41:06Z

I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address

Sure. It is very reasonable to do that. I was thinking about the same.

tymonx · 2021-06-27T18:17:15Z

@ChristianCiach I have already added support for the --wsrep-node-address.

When wsrep-node-address is provided through the configuration files or the command line, the automatic Docker address matching is skipped and that value is used to select the proper node for bootstrapping. By default it is compared to the first value from wsrep-cluster-address. To choose another node, use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable.
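The comparison described above could look roughly like the following sketch. It is not the code from the patch: `is_bootstrap_node` is a hypothetical name, and ports are ignored here, which would not suit setups where all nodes share one IP address.

```sh
#!/usr/bin/env sh
# Hypothetical helper: succeed when this node should bootstrap the cluster.
# usage: is_bootstrap_node <wsrep_node_address> <wsrep_cluster_address>
is_bootstrap_node() {
    node="${1%:[0-9]*}"                     # ignore an optional :port

    if [ -n "${WSREP_AUTO_BOOTSTRAP_ADDRESS:-}" ]; then
        # The explicit override replaces the first cluster entry.
        target="${WSREP_AUTO_BOOTSTRAP_ADDRESS}"
    else
        target="${2#*://}"                  # drop gcomm://
        target="${target%%,*}"              # first entry only
    fi

    target="${target%%[?]*}"                # drop ?options
    target="${target%:[0-9]*}"              # drop :port

    [ -n "${node}" ] && [ "${node}" = "${target}" ]
}
```

With wsrep_cluster_address set to gcomm://172.28.180.96,172.28.180.97,172.28.180.98, only the node started with --wsrep-node-address=172.28.180.96 would pass this check, unless WSREP_AUTO_BOOTSTRAP_ADDRESS names another node.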

grooverdan · 2021-06-28T07:27:31Z

Just to share some rough stuff I've been looking at (which covers other Galera options); I still need to reread the above:

diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh
index 1b10dc2..e51dc02 100755
--- a/docker-entrypoint.sh
+++ b/docker-entrypoint.sh
@@ -359,7 +359,25 @@ docker_ip_match() {
 #    ie: docker_address_match node1
 # it returns true if provided value match with container IP address or container hostname. Otherwise it returns false
 docker_address_match() {
-       local resolved="$(resolveip --silent "$1" 2>/dev/null)" # it converts hostname to ip or vice versa
+       local host=${1%%:*}
+       local port=${1#*:}
+       if [ -n "$port" ]; then
+               local wsrep_provider_options="$(mysql_get_config wsrep_provider_options)"
+               wsrep_provider_options=( ${wsrep_provider_options//,/ } )
+               for opt in "${wsrep_provider_options[@]}"; do
+                       if [[ "$opt" =~ gmcast.listen_addr.* ]]; then
+                               local val="${opt#*=[[:graph:]]*://}"
+                               case "$val" in
+                                       ${host}:${port})        return 1 ;;
+                                       0.0.0.0:${port})        break ;;
+                                       *:${port})              break ;;
+                                       *)                      return 0;;
+                               esac
+                       fi
+               done
+
+       fi
+       local resolved="$(resolveip --silent "$host" 2>/dev/null)" # it converts hostname to ip or vice versa
 
        docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1"
 }

As a crude hack with:

#!/bin/bash
podman pod stop db && podman pod rm db
podman pod create --name=db  --share net
for n in 1 2 3
do
	podman create --name=db_node_$n --pod=db \
		--security-opt label=disable \
		--label io.podman.compose.config-hash=123 \
		--label io.podman.compose.project=db \
		--label io.podman.compose.version=0.0.1 \
		--label com.docker.compose.container-number=$n \
		--label com.docker.compose.service=node \
		-e MARIADB_ROOT_PASSWORD=example \
		--add-host node:127.0.0.1 --add-host db_node_1:127.0.0.1 --add-host db_node_2:127.0.0.1 --add-host db_node_3:127.0.0.1 \
		--restart always \
		mariadb:testgalera --port $(( 3306 - 1 + $n )) \
		--wsrep_cluster_address=gcomm://db_node_1:4567,db_node_2:4577,db_node_3:4587 \
		--wsrep-node-address=127.0.0.1 \
		--wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:$(( 4567 + ( $n - 1 ) * 10 ))" \
		--wsrep-on=1 --wsrep-provider=/usr/lib/libgalera_smm.so --binlog_format=ROW
done

Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir? Anything else is recovery.

Should non-first nodes not initialize with /docker-entrypoint-initdb.d/ (and rely on galera sst)?

tymonx · 2021-06-28T07:35:00Z

Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir?

The Docker daemon (I don't know about Podman) always creates a volume for the container. If the container stops and starts again (including restarts), the files are still present, so bootstrapping will not fire.

I have also tested and confirmed that on graceful shutdown (docker kill --signal SIGTERM <container>) the mysqld daemon removes the gvwstate.dat file.

I'm looking into a more proper solution to handle this.

tymonx · 2021-06-28T08:07:54Z

For Podman I cannot simply strip the port numbers from wsrep-cluster-address. They must also be included in the comparison, because Podman works on 127.0.0.1, whereas Docker always creates each container with its own IP address.

tymonx · 2021-06-28T19:22:40Z

A working Podman example script that starts N containers in the db pod, for commit 45149e2:

#!/usr/bin/env sh

NODES="${1:-3}"

options="--add-host db_node_1:127.0.0.1"
address="db_node_1:4567"

for i in $(seq 2 "${NODES}"); do
    options="${options} --add-host db_node_$i:127.0.0.1"
    address="${address},db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))"
done

podman pod stop db
podman pod rm db
podman pod create --name=db --share net

for i in $(seq 1 "${NODES}"); do
    podman create \
        --pod=db \
        --name=db_node_$i \
        --security-opt label=disable \
        --env MARIADB_ROOT_PASSWORD=example \
        --restart always \
        ${options:+${options}} \
        mariadb:dev \
        --port $(( 3305 + $i )) \
        --wsrep_cluster_address="gcomm://${address}" \
        --wsrep-node-address="db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))" \
        --wsrep-on=on \
        --wsrep-provider=/usr/lib/libgalera_smm.so \
        --binlog_format=row
done

podman pod start db

View logs:

podman logs --follow db_node_1

Output:

View:
  id: b98b33bc-d845-11eb-99df-0245217d5d15:2
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 2
  members(3):
        0: b988176f-d845-11eb-994b-576602aed1c3, c38f5df9273f
        1: b98a6f95-d845-11eb-a223-e6bc7724aaf5, 7268f0eb1373
        2: b98aa0de-d845-11eb-945b-e21ff9f0f09e, 81f7c1cfefbe

tymonx · 2021-06-29T07:02:46Z

Added support for safe_to_bootstrap from the grastate.dat file. This works when all nodes are shut down gracefully, but one at a time. Galera writes 1 to the last node that was shut down gracefully.

Docker Compose users should, after docker-compose up, manually call docker stop db_node_<n> for each node. Invoking docker-compose stop or pressing CTRL + C gracefully shuts down all nodes at the same time, and Galera cannot handle this properly.
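Reading the safe_to_bootstrap flag could be sketched as follows. This assumes the usual `safe_to_bootstrap: <0|1>` line in grastate.dat; the function name is hypothetical and this is not the code from the patch.

```sh
#!/usr/bin/env sh
# Hypothetical helper: succeed only when grastate.dat says the node is
# safe to bootstrap from.
# usage: safe_to_bootstrap <path-to-grastate.dat>
safe_to_bootstrap() {
    # A missing or unreadable state file means there is nothing to trust.
    [ -r "$1" ] || return 1

    # Print only what follows "safe_to_bootstrap:" on its line.
    value="$(sed -n 's/^safe_to_bootstrap:[[:space:]]*//p' "$1")"
    [ "${value}" = "1" ]
}
```

An entrypoint could then, for instance, add --wsrep-new-cluster only when `safe_to_bootstrap "$wsrepdir/grastate.dat"` succeeds.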


grooverdan · 2022-02-10T08:08:39Z

I've rebased and squashed the commits. ShellCheck changed a few things. As a basic bootstrap it's OK. I'm still looking at what crash recovery would look like. We probably need to make our own state transition diagram.

https://galeracluster.com/library/documentation/crash-recovery.html

This patch adds support for Galera replication.

Features:

- It detects if Galera replication was enabled (wsrep_on=ON)
- By default it enables the cluster auto bootstrap feature
- By default the first cluster node is used for cluster auto bootstrapping, based on the wsrep_cluster_address parameter or by setting the `WSREP_CLUSTER_ADDRESS` environment variable
- The cluster auto bootstrap feature can be disabled by setting the `WSREP_SKIP_AUTO_BOOTSTRAP` environment variable
- Use the `WSREP_AUTO_BOOTSTRAP_ADDRESS` environment variable to explicitly choose another node for cluster bootstrapping
- Cluster node hostnames or IP addresses must be valid to enable cluster auto bootstrapping

How to use it.

1. Prepare MariaDB configuration file `galera.cnf`:

```plaintext
[galera]
wsrep_on = ON
wsrep_sst_method = mariabackup
wsrep_provider = /usr/lib/libgalera_smm.so
binlog_format = row
default_storage_engine = InnoDB
innodb_doublewrite = 1
innodb_autoinc_lock_mode = 2
```

2. Make it read-only:

```plaintext
chmod 444 galera.cnf
```

3. Prepare Docker Compose file `docker-compose.yml`:

```yaml
services:
    node:
        image: mariadb
        restart: always
        security_opt:
            - label=disable
        environment:
            WSREP_CLUSTER_ADDRESS: "${WSREP_CLUSTER_ADDRESS:-}"
            MARIADB_ROOT_PASSWORD: example
        volumes:
            - ./galera.cnf:/etc/mysql/conf.d/10-galera.cnf:ro
        command:
            - --wsrep-cluster-address=gcomm://db_node_1,db_node_2,db_node_3
        deploy:
            replicas: 3
```

4. Start Docker Compose:

```plaintext
docker-compose --project-name db up
```

To start N MariaDB instances using environment variable:

```plaintext
export WSREP_CLUSTER_ADDRESS="gcomm://db_node_1,db_node_2,db_node_3,db_node_4,db_node_5"
docker-compose --project-name db up --scale node="$(echo "${WSREP_CLUSTER_ADDRESS}" | tr ',' ' ' | wc -w)"
```

To start N MariaDB instances using MariaDB configuration file:

```plaintext
docker-compose --project-name db up --scale node="$(grep -i wsrep_cluster_address <name>.cnf | tr -d ' ' | tr ',' ' ' | wc -w)"
```

Closes: MariaDB#28

grooverdan · 2022-02-15T10:34:13Z

@ChristianCiach et al. I welcome any summary of the test cases needed, in MDEV-25855 (preferred) or here. I have looked through the bitnami galera issue referenced above and the blog, from which I'll derive some cases too.

jozefrebjak · 2022-07-07T08:00:22Z

Hello, any news on this PR?

julienfritsch44 assigned janlindstrom May 27, 2021

grooverdan changed the title ~~Added support for Galera replication with cluster auto bootstrapping~~ MDEV-25855 Added support for Galera replication with cluster auto bootstrapping Jun 4, 2021

julienfritsch44 assigned grooverdan and unassigned janlindstrom Jun 7, 2021

grooverdan reviewed Jun 29, 2021


grooverdan mentioned this pull request Aug 30, 2021

Can I use this version of Mariadb to deploy clusters? #389

Closed

grooverdan mentioned this pull request Oct 11, 2021

2 nodes of galera cluster (3 nodes totally) restart periodically #398

Closed

grooverdan force-pushed the feature-support-galera-replication branch from 84b089e to 5fbf4c6 Compare February 10, 2022 08:02

tymonx and others added 2 commits February 15, 2022 13:02

MDEV-25667: check gvwstate.dat before bootstrap

38bebe0

grooverdan force-pushed the feature-support-galera-replication branch from 5fbf4c6 to 38bebe0 Compare February 15, 2022 05:02

mmontes11 mentioned this pull request Nov 3, 2022

[Feature] HA via Galera mariadb-operator/mariadb-operator#4

Closed
