VerneMQ cluster partition

Demonstration of issue to aid responders to this issue

Whilst investigating VerneMQ's data loss characteristics in the event of node loss and partitioning, I've encountered Erlang CRASH REPORT occurrences in the debug logs.

CRASH REPORT Process <0.1065.0> with 0 neighbours crashed with reason: no case clause matching {{case_clause,{badrpc,nodedown}},[{vmq_reg,'-register_subscriber/4-fun-0-',2,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,182}]},{vmq_reg,block_until_migrated,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,249}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,194}]},{vmq_reg_sync_action,'-init/1-fun-0-',1,[{file,"/opt/vernemq/distdi..."},...]}]} in vmq_mqtt_fsm:check_user/2 line 553

This occurs when a subscriber reconnects following a node outage (see detailed steps below).

I'd like to understand whether this is expected behaviour or a bug.

Do the occurrences of such reports compromise the integrity and service-level of a cluster in any way?

Prerequisites

To bring up the demonstration cluster via docker-compose you'll need an AWS account and the aws CLI tool.

Whilst everything runs in a local Docker composition, a remote bucket is used for discovery purposes.

Cost footprint should be negligible.

Instructions

Create the bucket into which node discovery information will be placed.

 aws s3api create-bucket --bucket vernemq-discovery --region eu-west-1

Build the vernemq Docker image - this implements a crude form of S3-based service-discovery
```
 make docker-build
```
Bring up the cluster
```
 make docker-up
```

Log output looks sane: https://gist.github.com/codeasone/ff2952f773481e998e46dbc94e46e3e0

HAProxy reports healthy:

Confirm cluster formation

 docker exec -it vernemqdataloss_mqtt-a_1 bash -c 'vmq-admin cluster show'

+------------------+-------+
|       Node       |Running|
+------------------+-------+
|[email protected]| true  |
|[email protected]| true  |
|[email protected]| true  |
+------------------+-------+

Start test subscriber with clean-session=false

 mosquitto_sub -t resume-test -c -i resume-test -q 1 -d

Note: I can see from the HAProxy stats that the subscriber connected to mqtt-c

Publish a test message "A"

 mosquitto_pub -t resume-test -q 1 -d -m "A"

The subscriber receives this as expected:

Client resume-test received PUBLISH (d0, q1, r0, m1, 'resume-test', ... (1 bytes))
Client resume-test sending PUBACK (Mid: 1)
A

Disconnect test subscriber to simulate a temporary device failure (for the sake of discussion lets assume this is many minutes)
```
 <kill mosquitto_sub process>
```
Publish two more test messages which we expect our subscriber to resume on re-connection to the cluster
```
 mosquitto_pub -t resume-test -q 1 -d -m "B"
 mosquitto_pub -t resume-test -q 1 -d -m "C"
```
Now we knock out mqtt-c ungracefully to simulate a real-life node failure or partition
```
 docker-compose kill mqtt-c
```

The loss is acknowledged by the cluster:

vernemqdataloss_mqtt-c_1 exited with code 137
mqtt-b_1       | 2017-05-19 11:21:41.384 [warning] <0.343.0>@vmq_cluster_mon:handle_info:122 cluster node '[email protected]' DOWN
mqtt-a_1       | 2017-05-19 11:21:41.385 [warning] <0.343.0>@vmq_cluster_mon:handle_info:122 cluster node '[email protected]' DOWN

HAProxy shows the node as down:

Reconnect the original test subscriber

mosquitto_sub -t resume-test -c -i resume-test -q 1 -d

Client resume-test sending CONNECT
Client resume-test sending CONNECT
Client resume-test received CONNACK
Client resume-test sending SUBSCRIBE (Mid: 1, Topic: resume-test, QoS: 1)
Client resume-test received SUBACK
Subscribed (mid: 1): 1

Issue: HAProxy first routes the connection to to mqtt-b, which fails, so the subscriber retries and successfully connects to mqtt-a

The node logs are as follows:

mqtt-b_1       | 2017-05-19 11:23:23.133 [error] emulator Error in process <0.1067.0> on node '[email protected]' with exit value:
mqtt-b_1       | {{case_clause,{badrpc,nodedown}},[{vmq_reg,'-register_subscriber/4-fun-0-',2,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,182}]},{vmq_reg,block_until_migrated,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,249}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,194}]},{vmq_reg_sync_action,'-init/1-fun-0-',1,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg_sync_action.erl"},{line,52}]}]}
mqtt-b_1       | 2017-05-19 11:23:23.134 [error] <0.1065.0> CRASH REPORT Process <0.1065.0> with 0 neighbours crashed with reason: no case clause matching {{case_clause,{badrpc,nodedown}},[{vmq_reg,'-register_subscriber/4-fun-0-',2,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,182}]},{vmq_reg,block_until_migrated,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,249}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,194}]},{vmq_reg_sync_action,'-init/1-fun-0-',1,[{file,"/opt/vernemq/distdi..."},...]}]} in vmq_mqtt_fsm:check_user/2 line 553
mqtt-b_1       | 2017-05-19 11:23:23.134 [error] <0.1065.0> Ranch listener {{0,0,0,0},1883} terminated with reason: {case_clause,{{case_clause,{badrpc,nodedown}},[{vmq_reg,'-register_subscriber/4-fun-0-',2,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,182}]},{vmq_reg,block_until_migrated,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,249}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,194}]},{vmq_reg_sync_action,'-init/1-fun-0-',1,[{file,"/opt/v..."},...]}]}}
mqtt-a_1       | 2017-05-19 11:23:28.273 [info] <0.1064.0>@plumtree_metadata_exchange_fsm:exchange:160 completed metadata exchange with '[email protected]'. repaired 0 missing local prefixes, 0 missing remote prefixes, and 1 keys

Question: That CRASH REPORT doesn't look right. Is cluster integrity compromised?

Next, publish two more messages D and E in the test sequence A (B C) D E where (B C) only exist within the LevelDB msgstore of mqtt-c, which is still out of the picture...

As expected, those messages are received just fine:

Client resume-test received PUBLISH (d0, q1, r0, m1, 'resume-test', ... (1 bytes))
Client resume-test sending PUBACK (Mid: 1)
D
Client resume-test received PUBLISH (d0, q1, r0, m2, 'resume-test', ... (1 bytes))
Client resume-test sending PUBACK (Mid: 2)
E

So far, our test subscriber has received A D E.

Restore node mqtt-c

docker-compose run mqtt-c
Starting: 22a11cbb6b3d 172.26.0.2
Registering: 172.26.0.2
Discovering nodes via S3 bucket
[2017-05-19T11:32:19Z]:join [email protected]
Couldn't join cluster due to self_join

[2017-05-19T11:32:19Z]:join [email protected]
Done
[2017-05-19T11:32:19Z]:join [email protected]
Done
2017-05-19 11:32:05.929 [info] <0.31.0> Application hackney started on node '[email protected]'
2017-05-19 11:32:06.728 [info] <0.341.0>@vmq_reg_trie:handle_info:183 loaded 1 subscriptions into vmq_reg_trie
2017-05-19 11:32:06.734 [info] <0.226.0>@vmq_cluster:init:113 plumtree peer service event handler 'vmq_cluster' registered
2017-05-19 11:32:07.299 [info] <0.31.0> Application vmq_acl started on node '[email protected]'
2017-05-19 11:32:07.385 [info] <0.31.0> Application vmq_passwd started on node '[email protected]'
2017-05-19 11:32:07.454 [error] <0.421.0> Failed to start Ranch listener {{127,0,0,1},8888} in ranch_tcp:listen([{ip,{127,0,0,1}},{port,8888},{nodelay,true},{linger,{true,0}},{send_timeout,30000},{send_timeout_close,true}]) for reason eaddrinuse (address already in use)
2017-05-19 11:32:07.454 [error] <0.421.0> CRASH REPORT Process <0.421.0> with 0 neighbours exited with reason: {listen_error,{{127,0,0,1},8888},eaddrinuse} in gen_server:init_it/6 line 352
2017-05-19 11:32:07.454 [error] <0.419.0> Supervisor {<0.419.0>,ranch_listener_sup} had child ranch_acceptors_sup started with ranch_acceptors_sup:start_link({{127,0,0,1},8888}, 10, ranch_tcp, [{ip,{127,0,0,1}},{port,8888},{nodelay,true},{linger,{true,0}},{send_timeout,30000},{send_timeout_close,...}]) at undefined exit with reason {listen_error,{{127,0,0,1},8888},eaddrinuse} in context start_error
2017-05-19 11:32:07.454 [error] <0.247.0>@vmq_ranch_config:reconfigure_listeners_for_type:187 can't reconfigure http listener({127,0,0,1}, 8888) with Options [{max_connections,10000},{nr_of_acceptors,10},{config_mod,vmq_http_config},{config_fun,config},{proxy_protocol,false}] due to {{shutdown,{failed_to_start_child,ranch_acceptors_sup,{listen_error,{{127,0,0,1},8888},eaddrinuse}}},{child,undefined,{ranch_listener_sup,{{127,0,0,1},8888}},{ranch_listener_sup,start_link,[{{127,0,0,1},8888},10,ranch_tcp,[{ip,{127,0,0,1}},{port,8888},{nodelay,true},{linger,{true,0}},{send_timeout,30000},{send_timeout_close,true}],cowboy_protocol,[{env,[{dispatch,[{'_',[],[{[<<"metrics">>],[],vmq_metrics_http,[]},{[<<"api">>,<<"v1">>,'...'],[],vmq_http_mgmt_api,[]}]}]}]}]]},permanent,infinity,supervisor,[ranch_listener_sup]}}
2017-05-19 11:32:07.454 [info] <0.31.0> Application vmq_server started on node '[email protected]'
2017-05-19 11:32:19.329 [info] <0.449.0>@plumtree_peer_service:attempt_join:50 Sent join request to: '[email protected]'
2017-05-19 11:32:19.466 [info] <0.452.0>@plumtree_peer_service:attempt_join:50 Sent join request to: '[email protected]'

(B C) are immediately sent to our test subscriber, but overall order is still compromised A D E B C, which given causal dependencies of the system-under-development may not be viable.

Observations

VerneMQ comes with a default configuration that trades availability for consistency with regard to MQTT's QoS provisions (in CAP terms it's a CP system)
When a node dies, or a cluster experiences a network partition, no new connections or publications are possible until conditions are restored, thus ensuring subscribers receive a consistent linearizable view of the world as publishers intended.
Consistency can be traded off against higher availability without message loss. The following settings allow new connections, subscriptions, and publications to be delivered by a cluster that has lost some capacity through node loss, or partitioning:

allow_register_during_netsplit = on
allow_publish_during_netsplit = on
allow_subscribe_during_netsplit = on
allow_unsubscribe_during_netsplit = on

Increased availability comes at the cost of linearizability.
CRASH REPORT log entries suggest cluster state may be compromised.

Open Issues

What is the period between LevelDB flushes/syncs? Is data loss inevitable if a node dies before having a chance to sync offline message queues?

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
bin		bin
config		config
files		files
images		images
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VerneMQ cluster partition

Prerequisites

Instructions

Observations

Open Issues

About

Releases

Packages

Languages

codeasone/vernemq-cluster

Folders and files

Latest commit

History

Repository files navigation

VerneMQ cluster partition

Prerequisites

Instructions

Observations

Open Issues

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages