[BUG] The docs.count is not match for leader and follower #1146

q123dog · 2023-09-14T08:43:54Z

What is the bug?
The docs.count of the follower index does not match the docs.count of the leader index.

How can one reproduce the bug?
I am using OpenSearch version 2.9.0.
Steps to reproduce the behavior:

Start a stress test program, which will create a test index in the leader cluster and write 10,000,000 docs to this index.
For example: nohup ./opensearch-stress write --opensearch-address "http://{leader_ip:port}" --index-name "es-bulk-0" --bulk-batch-size 1000 --bulk-times 10000 &
Before the stress test program ends, start the index replication task in the follower cluster.
such as: curl -XPUT -k -H 'Content-Type: application/json' 'http://{follower_ip:port}/_plugins/_replication/es-bulk-0/_start?pretty' -d '
{
"leader_alias": "leader-cluster-opensearch",
"leader_index": "es-bulk-0"
}'
After the stress test program ends, wait for the index replication task to finish. The docs.count of the leader index is 10,000,000, as expected, but the docs.count of the follower index is always less than 10,000,000.
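The comparison in the last step can be scripted. This is only a minimal sketch, assuming both clusters are reachable over plain HTTP without authentication. The `LEADER` and `FOLLOWER` addresses are placeholders and the function names are made up for illustration. It reads the `docs.count` and `docs.deleted` columns from the `_cat/indices` API.

```shell
#!/bin/sh
# Placeholder addresses (assumptions): replace with the real cluster endpoints.
LEADER="${LEADER:-http://localhost:9200}"
FOLLOWER="${FOLLOWER:-http://localhost:9201}"
INDEX="${INDEX:-es-bulk-0}"

# Print "<docs.count> <docs.deleted>" for index $2 on the cluster at $1.
doc_stats() {
    curl -s "$1/_cat/indices/$2?h=docs.count,docs.deleted"
}

# Print how many documents the follower ($2) is missing compared to the leader ($1).
doc_gap() {
    echo $(( $1 - $2 ))
}

# Fetch the stats from both clusters and print them next to the gap.
compare_counts() {
    # Unquoted on purpose: each "<count> <deleted>" pair splits into two arguments.
    set -- $(doc_stats "$LEADER" "$INDEX") $(doc_stats "$FOLLOWER" "$INDEX")
    echo "leader docs.count=$1 docs.deleted=$2"
    echo "follower docs.count=$3 docs.deleted=$4"
    echo "gap=$(doc_gap "$1" "$3")"
}
```

Running `compare_counts` once the stress test and the replication task have both finished prints the two sets of stats and the difference between the two docs.count values.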

What is the expected behavior?
If the index replication task is started while the leader index is still receiving writes, then once the writes stop and the replication task has caught up, the docs.count of the follower index should equal the docs.count of the leader index.

What is your host/environment?

OS: Ubuntu 20.04
Version: OpenSearch 2.9.0
Plugins: cross-cluster replication

Do you have any screenshots?
As far as I know, cross-cluster replication has two stages. In the first stage, the existing data is synchronized: it takes a snapshot of the leader index's segment files, then reads these files and transfers them to the follower cluster. In the second stage, it reads changes from the translog, starting from the local checkpoint, to synchronize incremental data.
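The two stages can be watched through the replication status API. As far as I understand the plugin documentation, `_plugins/_replication/<index>/_status` reports `BOOTSTRAPPING` during the initial copy and `SYNCING` afterwards, with `leader_checkpoint` and `follower_checkpoint` under `syncing_details`. The sketch below assumes that response shape, a follower reachable over plain HTTP without authentication, and a placeholder `FOLLOWER` address. The helper names are illustrative only.

```shell
#!/bin/sh
# Placeholder address (assumption): replace with the real follower endpoint.
FOLLOWER="${FOLLOWER:-http://localhost:9201}"
INDEX="${INDEX:-es-bulk-0}"

# Fetch the replication status document for the follower index.
replication_status() {
    curl -s "$FOLLOWER/_plugins/_replication/$INDEX/_status?pretty"
}

# Extract the first numeric value of field $1 from JSON read on stdin.
json_number() {
    sed -nE "s/.*\"$1\"[[:space:]]*:[[:space:]]*(-?[0-9]+).*/\1/p" | head -n 1
}

# Print both checkpoints; succeed only when the follower has caught up.
caught_up() {
    status_json=$(replication_status)
    leader_cp=$(printf '%s\n' "$status_json" | json_number leader_checkpoint)
    follower_cp=$(printf '%s\n' "$status_json" | json_number follower_checkpoint)
    echo "leader_checkpoint=$leader_cp follower_checkpoint=$follower_cp"
    [ -n "$leader_cp" ] && [ "$leader_cp" = "$follower_cp" ]
}
```

A loop such as `until caught_up; do sleep 5; done` waits until the checkpoints are equal, which is a reasonable point to start comparing docs.count on the two clusters.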

After the first stage finished, I found that the docs.deleted of the follower index was 6561, even though the stress test program only writes docs without specifying _id and does not perform any delete/update/upsert operations.

After the stress test program ended, the docs.count of the leader index was 10,000,000, as expected.

After both stages finished, I found that the docs.count of the follower index was 9993439, which is less than that of the leader index. I executed the refresh and flush APIs, but the docs.count was still lower, and the difference was exactly the docs.deleted value I had observed after the first stage (10000000 - 9993439 = 6561).

Do you have any additional context?
This bug is easy to reproduce: just start the index replication task while data is being written in batches to the leader index.

I have reproduced this bug many times. I have tried other stress testing programs and the bug always appears. I hope I can find help here, thanks.

By the way, if nothing is being written to the leader index when the replication task is started, then after the task ends the docs.count of the follower index equals that of the leader index.

ankitkala · 2024-02-21T06:06:25Z

Can you trigger refresh on follower index and then verify the doc count?
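A check along these lines could look like the sketch below. It assumes the follower is reachable over plain HTTP without authentication. The address in the usage note is a placeholder and the function name is made up for illustration. It calls the `_refresh` API and then reads the live document count from `_cat/count`.

```shell
#!/bin/sh
# Refresh index $2 on the cluster at $1, then print its live document count.
refresh_and_count() {
    # Make recently replicated operations visible to search.
    curl -s -X POST "$1/$2/_refresh" > /dev/null
    # _cat/count with h=count prints only the number of documents.
    curl -s "$1/_cat/count/$2?h=count"
}
```

For example, `refresh_and_count "http://localhost:9201" "es-bulk-0"` prints the follower's count after the refresh, which can then be compared with the same call against the leader.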

q123dog added the bug and untriaged labels Sep 14, 2023

q123dog changed the title ~~[BUG]~~ [BUG] The docs.count is not match for leader and follower Sep 14, 2023

ankitkala removed the untriaged label Feb 21, 2024
