Mnesia: binding tables can become inconsistent when a node hosting transient queues is restarted #13030
Describe the bug

TLDR: We had a mismatch of data between rabbit_db_topic_exchange trie records and rabbit_db_binding records after a rolling RabbitMQ server update (we saw some flow-controlled channels, so we decided to use a slightly larger instance size for our nodes).

RabbitMQ server version used: RabbitMQ 3.13.7, Erlang 26.2.5.2

We have the following exchange setup for handling incoming requests:

1: Exchange eve-requests-v2
2: Exchange eve-v3-requests
3: Exchange unroutable-requests

So each of our services creates its own queue and binds it to both eve-requests-v2 and eve-v3-requests.

We started to observe some flow control as our traffic was growing (namely channels and connections being flow controlled, but not the queues themselves). Our guess was that some queue processes couldn't keep up with traffic, so we wanted to change instance types in AWS to give the nodes more CPU resources (specifically more single-core performance).

When verifying the update, we checked with rabbitmqctl whether all the queues and bindings that we expected were in place, and everything seemed good.

Our expected case, with a service that was working after the rolling update:
So the same queue has the same binding on both the eve-requests-v2 and eve-v3-requests exchanges.

A scenario that caused an unroutable message alert in our system:

Since we had no match on eve-v3-requests, the message was handed off to the alternate exchange, which caused our unroutable message alerts to be emitted.

But at the same time, this is what we saw via the web UI or rabbitmqctl:
When I queried the binding information from rabbit_db_binding in a shell attached to the RabbitMQ cluster, I got the same information:
If I queried the trie records though:
I only found trie records for one binding. So it seems the binding information stored in these two places got desynced during the rolling update process. (The trie information for the working service completely matches up with the other binding information; I can provide it if needed.)

Reproduction steps

We cannot consistently reproduce this issue. But when it happened:
Expected behavior

We expected bindings not just to be displayed correctly, but also to route incoming messages correctly.

Additional context

We solved the issue temporarily by executing this code snippet:

However, we're no closer to understanding why this happened in the first place (apart from this likely being an Mnesia synchronization error). Some of our services still rely on classic mirrored queues, so currently we can't upgrade to 4.x and use Khepri just yet.
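For illustration, the check described above (comparing the binding records against the topic trie records) can be modelled as a set difference. This is a minimal Python sketch over made-up data, not RabbitMQ code: the `Binding` tuple layout, the `find_desynced` helper, the queue name `example-service-queue`, and the routing key `example.key` are all assumptions for the example and do not reflect the actual Mnesia record formats.

```python
from typing import NamedTuple


class Binding(NamedTuple):
    # Hypothetical, simplified view of a binding (not the real record layout).
    exchange: str
    routing_key: str
    queue: str


def find_desynced(binding_records, trie_bindings):
    """Return bindings present in the binding table but absent from the trie."""
    return sorted(set(binding_records) - set(trie_bindings))


# Made-up data mirroring the failure described in the post: the binding table
# lists both exchanges, but the trie only knows about eve-requests-v2.
binding_records = [
    Binding("eve-requests-v2", "example.key", "example-service-queue"),
    Binding("eve-v3-requests", "example.key", "example-service-queue"),
]
trie_bindings = [
    Binding("eve-requests-v2", "example.key", "example-service-queue"),
]

missing = find_desynced(binding_records, trie_bindings)
for b in missing:
    print(f"no trie entry for {b.queue} on {b.exchange} ({b.routing_key})")
```

Any binding reported by such a check would be listed by the management tools but would not be matched when routing through the topic exchange, which is the symptom described in this post.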
Replies: 2 comments
@ccp-sandor RabbitMQ 3.13.7 is out of community support. Our team won't spend any meaningful amount of time on Mnesia-related issues going forward, at least for non-paying users. Mnesia will be removed entirely in a future version of RabbitMQ.

Binding inconsistency of that kind is a very well known problem with transient and semi-transient bindings (when the exchange is durable but the queue is transient). Transient queues won't exist in a future version of RabbitMQ, and those who use only durable entities won't run into this class of problems.

Upgrade to 4.0.5 and switch to Khepri.
Here is a somewhat similar problem that was immediately resolved by moving to Khepri: #12927. The only difference is how it manifests itself:
Hopefully this explains why durable queues (and exchanges, and their bindings) won't be affected: when a node hosting a durable queue is restarted, the durable queue won't be deleted, and thus neither will its bindings. Khepri uses a completely different data model and won't exhibit the same behavior.