[BUG] Operator cannot reliably bootstrap a cluster #811

Open
lpeter91 opened this issue May 13, 2024 · 10 comments
Labels: bug Something isn't working

@lpeter91

What is the bug?

The operator sometimes fails to correctly bootstrap/initialize a new cluster; instead it settles into a yellow state, with shards stuck in unassigned and initializing statuses.

How can one reproduce the bug?

Note that this doesn't always happen, so you might have to try multiple times; however, it happens for me more often than not:

Apply the minimal example below. It's basically the first example from the docs, with the now-mandatory TLS added and Dashboards removed. Wait until the bootstrapping finishes.

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-first-cluster
  namespace: default
spec:
  general:
    serviceName: my-first-cluster
    version: 2.13.0
  security:
    tls:
      transport:
        generate: true
        perNode: true
      http:
        generate: true
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: "5Gi"
      nodeSelector:
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
      roles:
        - "cluster_manager"
        - "data"

When the setup process finishes, the bootstrap pod is removed. Around this time the operator also sometimes decides to log the event "Starting to rolling restart" and recreates the first node (pod). If this happens, the cluster sometimes ends up in a yellow state that the operator does not resolve. If at this point I manually delete the cluster_manager pod (usually the second node), it is recreated and the issue seems to resolve itself.
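
For reference, the health check and the manual workaround look roughly like this (a sketch; the service name comes from spec.general.serviceName above, while the admin password is a placeholder):

# In one shell: forward the cluster's HTTP port.
kubectl port-forward svc/my-first-cluster 9200:9200 -n default

# In another shell: check cluster health and shard allocation.
curl -sk -u admin:<admin-password> 'https://localhost:9200/_cluster/health?pretty'
curl -sk -u admin:<admin-password> 'https://localhost:9200/_cat/shards?v'

# Manual workaround: delete the stuck cluster_manager pod; the StatefulSet
# recreates it and the cluster then usually turns green.
kubectl delete pod my-first-cluster-nodes-1 -n default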

What is the expected behavior?

A cluster in a green state after setup, preferably without unnecessary restarts.

What is your host/environment?

minikube v1.33.0 on Opensuse-Tumbleweed 20240511 w/ docker driver

I'm currently evaluating the operator locally. This might be part of the problem, as it forces me to run 3 nodes on a single machine. (However, the machine does have sufficient resources to accommodate the nodes. The issue was also reproduced on a MacBook, albeit also with minikube.)

Do you have any additional context?

See the attached files. Some logs are probably missing since a pod was recreated.
kubectl_describe.txt
operator.log
node-2.log
node-1.log
node-0.log
allocation_explain.json
cat_shards.txt
cat_nodes.txt

@lpeter91 added the bug and untriaged labels on May 13, 2024
@dtaivpp (Contributor) commented May 14, 2024

Going to be honest here: not having enough resources to host the cluster is probably where you are running into issues. OpenSearch gets really unstable when there is not enough memory. I've personally experienced this as well when running OpenSearch in Docker containers.

These logs are concerning, but it's hard to say they are unrelated to OOM-type issues.

Node 0 Log:
[2024-05-13T17:35:44,103][WARN ][o.o.s.SecurityAnalyticsPlugin] [my-first-cluster-nodes-0] Failed to initialize LogType config index and builtin log types

Node 1 Log:

[2024-05-13T17:35:42,014][INFO ][o.o.i.i.MetadataService  ] [my-first-cluster-nodes-1] ISM config index not exist, so we cancel the metadata migration job.
[2024-05-13T17:35:43,296][ERROR][o.o.s.l.LogTypeService   ] [my-first-cluster-nodes-1] Custom LogType Bulk Index had failures:
 
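
If memory pressure is the suspicion, one quick way to rule OOM kills in or out (a sketch, assuming the example cluster from the issue in the default namespace):

# OOMKilled shows up as the reason of the last container termination.
kubectl describe pod my-first-cluster-nodes-0 -n default | grep -i -A 5 'Last State'

# Compare actual usage against the 2Gi requests/limits (needs metrics-server).
kubectl top pods -n default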

@lpeter91 (Author)

Now I can confirm that this issue also happens on an actual production Kubernetes cluster with plenty of resources. The operator erroneously decides to do a rolling restart and fails to complete it, leaving the cluster in a yellow state. It seems like a concurrency issue, as it doesn't always happen.

@jaskeerat789

We are facing this issue too. We have given ample resources to all node groups, but the controller tries a rolling restart and then gets stuck at a yellow cluster state. We are able to use the cluster, but any updates to the manifests are not enforced by the operator due to the yellow cluster state.

@prudhvigodithi (Member)

[Triage]
I was able to deploy the cluster successfully with the operator (also posted the same here: #844 (comment)). @jaskeerat789 @lpeter91, can you please test with the latest version of the operator?
Thank you
@dtaivpp @get

@prudhvigodithi removed the untriaged label on Jun 20, 2024
@dtaivpp (Contributor) commented Jun 20, 2024

Okay, this feels very much like a stability issue I was having as well. @prudhvigodithi I have a feeling this is the same issue I had at re:Invent, where roughly 2 out of 10 clusters wouldn't bootstrap correctly.

It might be worth checking with Kyle Davis, who has the code from that, and testing repeatedly. I can test on a local machine to see if I experience a similar issue.

@jaskeerat789

@prudhvigodithi Cluster deployment is not an issue for us; we are able to bootstrap a cluster. The problem arises when we try to update something in the cluster manifest and apply it. The operator tries to do a rolling restart of the pods in order to enforce the changes, but it is unable to trigger the restart for some reason. The operator then marks the cluster as being in a yellow state, and any further changes to the manifests are ignored since the cluster is yellow. Let me know how I can help you understand this issue in more detail.
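
A few commands that might help pin down where the operator gets stuck (a sketch; the operator deployment name is taken from the default Helm install and may differ in your setup, and the cluster name/namespace are placeholders):

# What the operator has recorded on the custom resource (phase, health, components).
kubectl get opensearchclusters.opensearch.opster.io -A
kubectl describe opensearchcluster <cluster-name> -n <namespace>

# Operator logs around the rolling-restart decision.
kubectl logs deploy/opensearch-operator-controller-manager -n <operator-namespace> | grep -i restart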

@lpeter91 (Author)

@lpeter91 can you please test with the latest version of the operator?

@prudhvigodithi Tried it; it's still reproducible, and it took me only 4 tries.

Updated versions:
OpenSearch operator: v2.6.1
OpenSearch: 2.14.0
K8s: minikube v1.33.1 on Opensuse-Tumbleweed 20240619; Kubernetes v1.30.0 on Docker 26.1.1
Helm (only used for installing the operator): v3.15.2

@prudhvigodithi (Member)

Adding @swoehrl-mw to this conversation to see if this happened while testing the operator. With my EKS setup I haven't seen this issue. Thanks

@swoehrl-mw (Collaborator)

During local testing (running in k3d) I sometimes had the behaviour that the operator thought it needed to do a rolling restart, but it always managed to complete the cycle and produce a green cluster, and I was also not able to find a reason for the restart.

In earlier versions we had a problem where the operator did not always correctly reactivate shard allocation, but AFAIK that was fixed.
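
For anyone stuck in the yellow state after such a restart, it may be worth checking whether shard allocation was left disabled and resetting it by hand (a sketch; endpoint and credentials as in the earlier port-forward example):

# Check for a leftover cluster.routing.allocation.enable setting.
curl -sk -u admin:<admin-password> 'https://localhost:9200/_cluster/settings?pretty'

# Reset it to the default so shards can be allocated again.
curl -sk -u admin:<admin-password> -X PUT 'https://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.enable":null},"persistent":{"cluster.routing.allocation.enable":null}}'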

@evheniyt (Contributor)

I have also experienced unstable cluster bootstrap. I have fully recreated a cluster multiple times, and periodically I saw that the cluster got stuck bootstrapping the second node.
Eventually, I found a correlation between this issue and a recreation of the bootstrap pod.
We are using Karpenter, and sometimes during the bootstrap process it decides to move the bootstrap pod to another node. When that happens, cluster creation gets stuck with this error:

opensearch [2024-10-29T06:38:10,310][WARN ][o.o.c.c.Coordinator      ] [opensearch-primary-bootstrap-0] failed to validate incoming join request from node [{opensearch-primary-nodes-0}{9zZmg5EGRpidHf_0OwLUyA}{kV9e6qUTSsmvj1lUP-2QjA}{opensearch-primary-nodes-0}{10.152.42.19:9300}{dm}{shard_indexing_pressure_enabled=true}]
opensearch org.opensearch.transport.RemoteTransportException: [opensearch-primary-nodes-0][10.152.42.19:9300][internal:cluster/coordination/join/validate_compressed]
opensearch Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid 7QJiU55FRcWvBidZD_MF6A than local cluster uuid U4a0ix4hTwCvij0JF9qoEw, rejecting

I believe this is caused by the fact that the bootstrap pod does not use a persistent disk: if it is restarted, it bootstraps again and gets a new cluster UUID, which does not match the UUID already persisted on node-0.
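
A quick way to confirm the mismatch (a sketch; pod names are taken from the log above, namespace and credentials are placeholders; the root endpoint of every node reports its cluster_uuid):

# Compare the cluster UUID reported by the bootstrap pod and by node-0.
kubectl exec opensearch-primary-bootstrap-0 -n <namespace> -- \
  curl -sk -u admin:<admin-password> https://localhost:9200/ | grep cluster_uuid
kubectl exec opensearch-primary-nodes-0 -n <namespace> -- \
  curl -sk -u admin:<admin-password> https://localhost:9200/ | grep cluster_uuid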
