What is the bug?
The docs.count of the follower index does not match the docs.count of the leader index.
How can one reproduce the bug?
I am using OpenSearch 2.9.0.
Steps to reproduce the behavior:
Start a stress test program that creates a test index in the leader cluster and writes 10,000,000 docs to it.
For example: nohup ./opensearch-stress write --opensearch-address "http://{leader_ip:port}" --index-name "es-bulk-0" --bulk-batch-size 1000 --bulk-times 10000 &
Before the stress test program ends, start the index replication task in the follower cluster.
For example: curl -XPUT -k -H 'Content-Type: application/json' 'http://{follower_ip:port}/_plugins/_replication/es-bulk-0/_start?pretty' -d '
{
"leader_alias": "leader-cluster-opensearch",
"leader_index": "es-bulk-0"
}'
After the stress test program ends, wait for the index replication task to catch up. The docs.count of the leader index is 10,000,000, as expected, but the docs.count of the follower index is always less than 10,000,000.
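To compare the two indices in the last step, the counts can be read from the index stats API on each cluster. The following is a minimal Python sketch, not part of the original report. The helper names `doc_stats` and `fetch_doc_stats` are hypothetical, the base URLs are placeholders, and it assumes plain HTTP without authentication and the usual `GET /<index>/_stats/docs` response layout.

```python
import json
from urllib.request import urlopen


def doc_stats(stats_response: dict) -> tuple:
    """Return (docs.count, docs.deleted) for primaries from a _stats/docs response."""
    docs = stats_response["_all"]["primaries"]["docs"]
    return docs["count"], docs["deleted"]


def fetch_doc_stats(base_url: str, index: str) -> tuple:
    """Fetch primary-shard doc stats for one index (assumes no auth, plain HTTP)."""
    with urlopen(f"{base_url}/{index}/_stats/docs") as resp:
        return doc_stats(json.load(resp))


def missing_on_follower(leader_url: str, follower_url: str, index: str) -> int:
    """Number of docs present on the leader but not counted on the follower."""
    leader_count, _ = fetch_doc_stats(leader_url, index)
    follower_count, _ = fetch_doc_stats(follower_url, index)
    return leader_count - follower_count
```

Calling `missing_on_follower("http://{leader_ip:port}", "http://{follower_ip:port}", "es-bulk-0")` with the real addresses filled in should return 0 if replication is complete and correct.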
What is the expected behavior?
If the index replication task is started while the leader index is being written to, then once writing stops and replication finishes, the docs.count of the follower index should equal the docs.count of the leader index.
What is your host/environment?
OS: Ubuntu 20.04
Version: OpenSearch 2.9.0
Plugins: cross-cluster replication
Do you have any screenshots?
As far as I know, cross-cluster replication has two stages. In the first stage, the existing data is synchronized: the plugin takes a snapshot of the leader index's segment files, then reads those files and transfers them to the follower cluster. In the second stage, it reads changes from the translog, based on the local checkpoint (localCheckPoint), to synchronize incremental data.
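One way to tell which stage the follower is in, and whether it has caught up, is the replication status API (`GET /_plugins/_replication/<follower-index>/_status`). The sketch below is an illustration under assumptions, not something from the original report: it assumes the response carries a `status` field (for example `BOOTSTRAPPING` or `SYNCING`) and a `syncing_details` object with `leader_checkpoint` and `follower_checkpoint`, as shown in the plugin documentation. These field names should be verified against 2.9.0, and `caught_up` is a hypothetical helper name.

```python
def caught_up(status: dict) -> bool:
    """True when the follower reports SYNCING and its checkpoint has reached the leader's.

    `status` is the parsed JSON body of the replication _status response
    (field names are assumed from the plugin documentation).
    """
    if status.get("status") != "SYNCING":
        # Still bootstrapping (first stage), paused, or failed.
        return False
    details = status.get("syncing_details", {})
    leader = details.get("leader_checkpoint")
    follower = details.get("follower_checkpoint")
    return leader is not None and follower is not None and follower >= leader
```

Comparing docs.count only after `caught_up` returns True rules out the case where the follower is simply still behind.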
After the first stage finished, I found that the docs.deleted of the follower index was 6561, even though the stress test program only writes docs without specifying _id and performs no delete/update/upsert operations.
After the stress test program ended, the docs.count of the leader index was 10,000,000, as expected.
After both stages finished, I found that the docs.count of the follower index was 9,993,439, which is less than the leader index. I executed the refresh and flush APIs, but the docs.count was still lower, and the difference was exactly the docs.deleted value I had observed after the first stage (10,000,000 - 9,993,439 = 6,561).
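The arithmetic above can be checked directly. This short Python snippet only restates the numbers reported in this issue; the variable names are illustrative.

```python
leader_count = 10_000_000        # docs.count on the leader after the stress test
follower_count = 9_993_439       # docs.count on the follower after both stages
deleted_after_stage_one = 6_561  # docs.deleted on the follower after the first stage

missing = leader_count - follower_count
matches_deleted = missing == deleted_after_stage_one
print(missing, matches_deleted)  # prints: 6561 True
```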
Do you have any additional context?
This bug is easy to reproduce: just start the index replication task while data is being written in batches to the leader index.
I have reproduced this bug many times, including with other stress testing programs, and it always appears. I hope I can find help here, thanks.
By the way, if nothing is writing to the leader index when the replication task starts, then after the task ends the docs.count of the leader index equals that of the follower index.