Queue index file cannot be recovered if it has messages with TTL on 3.8.17+ #3272

annieblomgren · 2021-08-06T20:45:50Z

annieblomgren
Aug 6, 2021

Hi,

We've had two different clusters where the message store got corrupted when upgrading to 3.8.19:

The first case is a single node with Erlang 23.3.1.
After the upgrade, a queue ended up in the 'down' state.
The second case is a 3-node cluster with Erlang 24.0.2.
A whole vhost was down.
We used strace to find the "faulty" .idx file and removed it. After that the vhost could start, so not all messages were lost.

We can share full logs and, if it is of interest, the data folder privately.
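For anyone who needs to locate a suspect index file the same way, here is a minimal sketch of one possible approach. It is not necessarily the exact procedure used above. It assumes strace is attached to the running Erlang VM and that the file of interest shows up in openat calls; the helper name idx_paths_from_strace and the BEAM_PID variable are made up for illustration.

```shell
#!/bin/sh
# Read strace output on stdin and print each distinct quoted path ending in .idx.
idx_paths_from_strace() {
    grep -o '"[^"]*\.idx"' | tr -d '"' | sort -u
}

# Usage (BEAM_PID is a placeholder for the PID of the Erlang VM process;
# strace writes its trace to stderr, hence the 2>&1):
#   strace -f -e trace=openat -p "$BEAM_PID" 2>&1 | idx_paths_from_strace
```

The .idx file opened last before the queue or vhost goes down is the most likely candidate, but that is a heuristic and not a guarantee.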


johanrhodin · 2021-08-06T21:56:09Z

johanrhodin
Aug 6, 2021

More here: #3253 (comment)
and in this rabbitmq-users post: https://groups.google.com/g/rabbitmq-users/c/ekV9tTBRZms/m/kzZKsITSAwAJ

0 replies

annieblomgren · 2021-08-08T13:25:29Z

annieblomgren
Aug 8, 2021
Author

It has happened again, on a single-node cluster running Erlang 24.0.4.
What all 3 cases so far have in common is that the faulty queue used TTL.

0 replies

carlhoerberg · 2021-08-09T09:12:59Z

carlhoerberg
Aug 9, 2021

We at CloudAMQP have stopped provisioning 3.8.19 because so many customers get corrupted message stores from this version ⚠️

0 replies

michaelklishin · 2021-08-09T10:12:07Z

michaelklishin
Aug 9, 2021
Maintainer

Most recent queue index changes were #2954 and #3041 in 3.8.17.

0 replies

michaelklishin · 2021-08-09T10:25:16Z

michaelklishin
Aug 9, 2021
Maintainer

#2954 seems to be a lot more relevant than #3041. As of #2954, index files by default store 2048 entries instead of 16384. This value is stored in the virtual host directory as .config. Existing installations should have the original value stored there. New installations will use rabbit.queue_index_segment_entry_count or 2048 if it's not set. Setting rabbit.queue_index_segment_entry_count to 16384 should effectively undo #2954 for newly created virtual hosts.
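As a rough illustration of the setting mentioned above, the sketch below prints an advanced.config fragment. It assumes the value is read from the rabbit application environment, which is what the key name rabbit.queue_index_segment_entry_count suggests. The helper name segment_entry_count_fragment is made up, and the fragment would need to be merged by hand into any existing advanced.config.

```shell
#!/bin/sh
# Print an advanced.config fragment that sets the queue index segment entry count.
# Assumption: the key lives in the `rabbit` application environment.
segment_entry_count_fragment() {
    count="${1:-16384}"   # defaults to the pre-#2954 value
    cat <<EOF
[
  {rabbit, [
    {queue_index_segment_entry_count, ${count}}
  ]}
].
EOF
}

# Usage:
#   segment_entry_count_fragment 16384
```

As noted above, this should only affect newly created virtual hosts; existing ones keep whatever is stored in their .config file.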

Nodes will log

Setting segment_entry_count for vhost … with … queues to …

on boot. The .config file is currently a single Erlang term and can easily be inspected.

Without knowing if this was an upgrade and from what version, we can't tell what segment entry count value should be in effect. But you can inspect those values both using rabbitmq-diagnostics environment and by inspecting the dot files under the virtual host directory (which would be under the message store root).
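The two inspection routes described above could look roughly like the sketch below. The helper name show_segment_entry_counts and the path in the usage comment are placeholders, and the only assumption about the on-disk layout is the one stated in this comment: a file named .config somewhere under the message store root.

```shell
#!/bin/sh
# Print every per-vhost ".config" file found under a message store root,
# one "path: contents" line per file.
show_segment_entry_counts() {
    root="$1"
    find "$root" -type f -name '.config' | while IFS= read -r f; do
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# Usage (the directory is a placeholder, not a real default):
#   show_segment_entry_counts /path/to/message/store/root
#   rabbitmq-diagnostics environment | grep segment_entry_count
```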

1 reply

michaelklishin Aug 9, 2021
Maintainer

Worth mentioning that #3068 was a follow-up PR that makes us fall back to the original segment count value when the virtual host config file does not exist.

annieblomgren · 2021-08-09T10:37:17Z

annieblomgren
Aug 9, 2021
Author

I have been able to reproduce it on versions 3.8.19, 3.8.18 and 3.8.17.
It does not reproduce on 3.8.16.

To reproduce:

Run PerfTest: bin/runjava com.rabbitmq.perf.PerfTest -x 1 -y 0 -u "throughput-test-1" -a --id "test 1" --message-properties expiration=1 -H $URL -f persistent -s 8000
Restart RabbitMQ.

Setting segment_entry_count for vhost 'X' with 0 queues to '2048'
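The steps above can be wrapped in a small script, sketched here under a few assumptions. The PerfTest flags are the ones given above, except for -z (PerfTest's run duration in seconds), which is added so the publisher stops on its own. The restart command is a guess at a typical package install and can be overridden through RESTART_CMD. The names run and repro, and the DRY_RUN switch, are made up for illustration.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# repro URL [SECONDS]: publish persistent messages with a 1 ms TTL and no
# consumers, then restart the node.
repro() {
    url="$1"
    secs="${2:-60}"   # assumption: 60 seconds of publishing is enough
    run bin/runjava com.rabbitmq.perf.PerfTest -x 1 -y 0 -u "throughput-test-1" -a \
        --id "test 1" --message-properties expiration=1 -H "$url" -f persistent -s 8000 \
        -z "$secs"
    # Assumption: a systemd-managed node; set RESTART_CMD for other setups.
    run ${RESTART_CMD:-systemctl restart rabbitmq-server}
}

# Usage:
#   repro "$URL"              # run it for real
#   DRY_RUN=1 repro "$URL"    # only print the commands
```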

3 replies

michaelklishin Aug 9, 2021
Maintainer

@annieblomgren thanks for the details, so this is with a brand new node and not an upgraded one?

michaelklishin Aug 9, 2021
Maintainer

So this also includes per-message TTL, which makes me wonder if #3041 could be just as relevant.

annieblomgren Aug 9, 2021
Author

@michaelklishin yes tested on brand new nodes too

michaelklishin · 2021-08-09T12:13:02Z

michaelklishin
Aug 9, 2021
Maintainer

We could reproduce it, and so far it seems that #3041 is the root cause. Reverting it seems to reliably make the issue go away.

0 replies

michaelklishin · 2021-08-10T07:24:30Z

michaelklishin
Aug 10, 2021
Maintainer

@annieblomgren can you please try this alpha build? https://github.com/rabbitmq/rabbitmq-server-binaries-dev/releases/tag/v3.8.21-alpha.2

6 replies

michaelklishin Aug 10, 2021
Maintainer

We are ready to produce 3.8.21 but it would be nice to make sure that we are not the only ones who see a behavior change after e63d89a

johanrhodin Aug 10, 2021

We'll take it for a spin during the next 24h

johanrhodin Aug 10, 2021

I haven't been able to reproduce it with RabbitMQ 3.8.21-alpha.2 (I was able to with 3.8.18).
I have a long-running job that has been repeating this over and over for a few hours now.

annieblomgren Aug 10, 2021
Author

yeah @michaelklishin can't reproduce it on v3.8.21-alpha.2 👍
thank you for the fast response.

lhoguin Aug 11, 2021
Maintainer

Thanks for the assist!

michaelklishin · 2021-08-11T16:39:54Z

michaelklishin
Aug 11, 2021
Maintainer

3.9.3 and 3.8.21 are released.

0 replies
