Showing 9 changed files with 506 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Licensed to Apache Software Foundation (ASF) under one or more contributor | ||
# license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright | ||
# ownership. Apache Software Foundation (ASF) licenses this file to you under | ||
# the Apache License, Version 2.0 (the "License"); you may | ||
# not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
# | ||
|
||
QPS ?= 10 | ||
|
||
GROUP ?= "default" | ||
|
||
.PHONY: up_traffic | ||
up_traffic: | ||
	curl -XPOST 'http://localhost:12800/mock-data/segments/tasks?qps=$(QPS)&group=$(GROUP)' -H'Content-Type: application/json' -d "@segment.tpl.json" | ||
|
||
.PHONY: ls_traffic | ||
ls_traffic: | ||
	curl -XGET 'http://localhost:12800/mock-data/segments/tasks' | ||
|
||
.PHONY: rm_traffic | ||
rm_traffic: | ||
	curl -XDELETE 'http://localhost:12800/mock-data/segments/tasks' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
# Failover and Resilience Test | ||
|
||
## Setup the Cluster | ||
|
||
See [Setup the Cluster](setup.md). | ||
|
||
## Case 1: Liaison Node Failure | ||
|
||
### Steps to simulate a liaison node failure | ||
|
||
1. Add the annotation `failover-try=1` to the Liaison pod to simulate a failure. | ||
2. A new Liaison pod will be created, and the old Liaison pod will enter the `Terminating` state. | ||
3. Check the status of the Liaison pods and the OAP console. | ||
4. Check write and query operations. | ||
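
Step 1 could be scripted roughly as follows. This is a minimal sketch: the pod name `banyandb-liaison-0` and the `default` namespace are placeholders, not values taken from the test, and the `RUN` switch is a hypothetical convenience that prints the command instead of executing it unless `RUN=1` is set.

```shell
# Sketch only: pod name and namespace are placeholders, adjust to your cluster.
# Prints the kubectl command by default; set RUN=1 to execute it.
annotate_liaison() {
  pod="$1"
  namespace="${2:-default}"
  if [ "${RUN:-0}" = "1" ]; then
    kubectl annotate pod "$pod" --namespace "$namespace" failover-try=1 --overwrite
  else
    echo "kubectl annotate pod $pod --namespace $namespace failover-try=1 --overwrite"
  fi
}

annotate_liaison "banyandb-liaison-0"
```

Running `kubectl get pods --watch` in a second terminal is one way to observe the old pod terminating and the new one starting.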
|
||
### Result of the liaison node failure | ||
|
||
- The first Liaison pod is in the `Terminating` state. | ||
- The second Liaison pod is in the `Running` state. | ||
- The cluster is still available. | ||
- The trace and metrics (5 services) write and read operations are still available. | ||
|
||
## Case 2: Data Node Failure | ||
|
||
### Steps to simulate a data node failure | ||
|
||
1. Scale the Data pod to 3 replicas. They are `banyandb-0`, `banyandb-1`, and `banyandb-2`. | ||
2. Scale the Data pod to 2 replicas. The `banyandb-2` pod will be terminated. | ||
3. Check the status of the Data pods, OAP console, and Liaison console. | ||
4. Check write and query operations. | ||
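
The scaling steps above might look like this in shell. The sketch assumes the Data pods are managed by a StatefulSet named `banyandb`, which is only inferred from the pod names `banyandb-0` to `banyandb-2`. The `RUN` switch is a hypothetical dry-run guard.

```shell
# Sketch only: the StatefulSet name "banyandb" is an assumption inferred from
# the pod names; verify it with `kubectl get statefulsets` before use.
# Prints the kubectl command by default; set RUN=1 to execute it.
scale_data() {
  replicas="$1"
  namespace="${2:-default}"
  if [ "${RUN:-0}" = "1" ]; then
    kubectl scale statefulset banyandb --namespace "$namespace" --replicas="$replicas"
    kubectl get pods --namespace "$namespace"
  else
    echo "kubectl scale statefulset banyandb --namespace $namespace --replicas=$replicas"
  fi
}

scale_data 3   # step 1: banyandb-0, banyandb-1, banyandb-2
scale_data 2   # step 2: banyandb-2 is terminated
```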
|
||
### Result of the data node failure | ||
|
||
- The `banyandb-2` pod is in the `Terminating` state. | ||
- The cluster is still available. | ||
- OAP might report the error "fail to execute the query plan for measure events_minute: broadcast errors: failed to publish message to 10.244.0.76:17912: failed to get stream for node 10.244.0.76:17912: rpc error: code = Canceled desc = grpc: the client connection is closing: invalid query message". | ||
- The trace and metrics (5 services) write and read operations are still available. | ||
- Partial data loss might occur because `banyandb-2` is down. | ||
|
||
```yaml | ||
2024-08-15 0609: | ||
  value: 0 | ||
  isemptyvalue: true | ||
2024-08-15 0610: | ||
  value: 0 | ||
  isemptyvalue: true | ||
2024-08-15 0611: | ||
  value: 0 | ||
  isemptyvalue: true | ||
2024-08-15 0612: | ||
  value: 0 | ||
  isemptyvalue: true | ||
2024-08-15 0613: | ||
  value: 549 | ||
  isemptyvalue: false | ||
2024-08-15 0614: | ||
  value: 541 | ||
  isemptyvalue: false | ||
2024-08-15 0615: | ||
  value: 566 | ||
  isemptyvalue: false | ||
2024-08-15 0616: | ||
  value: 546 | ||
  isemptyvalue: false | ||
``` | ||
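
In the output above, entries with `isemptyvalue: true` mark minutes for which no value was returned. A small filter can list those minutes from output of the same shape. This is a sketch that assumes the layout shown above (a timestamp line ending in `:` followed by `value` and `isemptyvalue` lines); `empty_minutes` is a hypothetical helper name.

```shell
# List the timestamps whose data point is flagged as empty.
# Reads YAML-style text of the shape shown above from stdin.
empty_minutes() {
  awk '
    # A timestamp line such as "2024-08-15 0609:" starts a new data point.
    /^[0-9]+-[0-9]+-[0-9]+ [0-9]+:$/ { ts = $0; sub(/:$/, "", ts); next }
    # Report the most recent timestamp when its point is flagged empty.
    /isemptyvalue: true/ { print ts }
  '
}

printf '%s\n' \
  '2024-08-15 0612:' '  value: 0' '  isemptyvalue: true' \
  '2024-08-15 0613:' '  value: 549' '  isemptyvalue: false' \
  | empty_minutes
```

For the two sample points, only `2024-08-15 0612` is printed.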
## Case 3: etcd Node Failure | ||
### Steps to simulate an etcd node failure | ||
1. Scale the etcd pod to 0 replicas. | ||
2. Check the status of the OAP, Data, and Liaison consoles. | ||
3. Check write and query operations. | ||
### Result of the etcd node failure | ||
1. Liaison and Data pods remain available, but they log errors such as: | ||
```json | ||
{"level":"warn","ts":1723709128.2490797,"caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00049e1e0/failover-test-etcd-0.failover-test-etcd-headless.default:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.96.126.15:2379: connect: connection refused\""} | ||
{"level":"error","module":"ETCD","error":"context deadline exceeded","time":"2024-08-15T08:05:28Z","message":"failed to revoke lease 8287064579165108153"} | ||
{"level":"warn","ts":1723709216.6529357,"caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00049e1e0/failover-test-etcd-0.failover-test-etcd-headless.default:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp: lookup failover-test-etcd-0.failover-test-etcd-headless.default.svc.cluster.local on 10.96.0.10:53: no such host\""} | ||
{"level":"info","ts":1723709216.653035,"caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"} | ||
``` | ||
|
||
2. The trace and metrics (5 services) write and read operations are still available. | ||
3. `swctl menu get` | ||
|
||
## Case 4: etcd Node Recovery | ||
|
||
### Steps to recover the etcd node | ||
|
||
1. Scale the etcd pod to 1 replica. | ||
2. Check the status of the OAP, Data, and Liaison consoles. | ||
3. Check write and query operations. | ||
|
||
### Result of the etcd node recovery with the correct data | ||
|
||
1. Liaison and Data pods are available, and their consoles will show: | ||
|
||
```json | ||
{"level":"warn","ts":1723710245.1049383,"caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00049e1e0/failover-test-etcd-0.failover-test-etcd-headless.default:2379","attempt":0,"error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"} | ||
``` | ||
|
||
The message means that the client's token is invalid. The client should re-authenticate with the correct token and reconnect. | ||
|
||
2. The trace and metrics (5 services) write and read operations are still available. | ||
3. `swctl menu get` will return data as expected. | ||
4. After a new Data node is added, the liaison automatically adds it to the route table. |
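
The etcd client logs in Cases 3 and 4 differ mainly in their gRPC status code: `DeadlineExceeded` while etcd is down, `Unauthenticated` after it returns. The sketch below tallies those codes from JSON log lines on stdin. `grpc_codes` is a hypothetical helper, and the sample lines are abbreviated forms of the log lines quoted above, not new output.

```shell
# Tally gRPC status codes found in log lines read from stdin.
# Prints one "CODE COUNT" pair per line, sorted by code name.
grpc_codes() {
  sed -n 's/.*rpc error: code = \([A-Za-z]*\) desc.*/\1/p' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Sample input: abbreviated versions of the log lines shown above.
grpc_codes <<'EOF'
{"level":"warn","error":"rpc error: code = DeadlineExceeded desc = latest balancer error"}
{"level":"warn","error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
EOF
```

In practice the input would come from something like `kubectl logs <pod>`, where `<pod>` stands for a Liaison or Data pod name.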
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
#!/bin/bash | ||
|
||
|
||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
# List of service IDs | ||
# service_0 to service_4 | ||
service_ids=("ImRlZmF1bHQiOjpzZXJ2aWNlXzQ=.1" "ImRlZmF1bHQiOjpzZXJ2aWNlXzM=.1" "ImRlZmF1bHQiOjpzZXJ2aWNlXzI=.1" "ImRlZmF1bHQiOjpzZXJ2aWNlXzE=.1" "ImRlZmF1bHQiOjpzZXJ2aWNlXzA=.1") | ||
|
||
error_num=0 | ||
# Iterate over each service ID | ||
for service_id in "${service_ids[@]}"; do | ||
echo "Checking service ID: $service_id" | ||
|
||
# Run the swctl command | ||
swctl_result=$(swctl --display json metrics linear --name=service_resp_time --service-id="$service_id") | ||
|
||
# Count an error if swctl returned an empty list | ||
if [ "$swctl_result" == "[]" ]; then | ||
echo "The swctl result is an empty list for service ID: $service_id. Skipping jq processing." | ||
error_num=$((error_num + 1)) | ||
else | ||
# Filter with jq | ||
result=$(echo "$swctl_result" | jq 'map(select(.IsEmptyValue == true))') | ||
|
||
# Count an error if any data point has IsEmptyValue set to true | ||
if [ "$result" != "[]" ]; then | ||
echo "The result is not an empty list. Some items have IsEmptyValue set to true for service ID: $service_id." | ||
error_num=$((error_num + 1)) | ||
fi | ||
fi | ||
|
||
trace_result=$(swctl t ls --service-id="$service_id" | jq -e '.traces | length > 0') | ||
|
||
if [ "$trace_result" != "true" ]; then | ||
echo "No traces found for service ID: $service_id." | ||
error_num=$((error_num + 1)) | ||
fi | ||
|
||
echo | ||
done | ||
|
||
if [ "$error_num" -gt 0 ]; then | ||
echo "Some service IDs failed the check." | ||
exit 1 | ||
fi | ||
echo "All service IDs passed the check." |
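
The hard-coded service IDs in the script above can be reproduced rather than copied. Decoding them suggests that each ID is the Base64 encoding of the service name (here `"default"::service_N`, quotes included) followed by `.1`. The sketch below rebuilds them under that assumption; `service_id` is a hypothetical helper and the layout is inferred from the values, not from a documented contract.

```shell
# Rebuild a service ID as base64(name) + ".1".
# Assumption: the name is "default"::service_<n>, with the group in quotes.
service_id() {
  printf '%s' "\"default\"::service_$1" | base64 | tr -d '\n'
  printf '.1\n'
}

for i in 0 1 2 3 4; do
  service_id "$i"
done
```

For `service_id 4` this yields `ImRlZmF1bHQiOjpzZXJ2aWNlXzQ=.1`, which matches the first entry in `service_ids`.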
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
kind: Cluster | ||
apiVersion: kind.x-k8s.io/v1alpha4 | ||
nodes: | ||
- role: control-plane | ||
  extraPortMappings: | ||
  - containerPort: 12800 | ||
    hostPort: 12800 | ||
    protocol: TCP |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
# Licensed to the Apache Software Foundation (ASF) under one or more | ||
# contributor license agreements. See the NOTICE file distributed with | ||
# this work for additional information regarding copyright ownership. | ||
# The ASF licenses this file to You under the Apache License, Version 2.0 | ||
# (the "License"); you may not use this file except in compliance with | ||
# the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  labels: | ||
    component: oap | ||
  name: data-generator | ||
  namespace: default | ||
spec: | ||
  containers: | ||
  - env: | ||
    - name: JAVA_OPTS | ||
      value: -Xmx2g -Xms2g | ||
    - name: SW_STORAGE | ||
      value: banyandb | ||
    - name: SW_STORAGE_BANYANDB_TARGETS | ||
      value: banyandb-grpc:17912 | ||
    image: ghcr.io/apache/skywalking/data-generator:9b17ff1efeab7a20c870839f59eb0e6af485cd3f | ||
    imagePullPolicy: IfNotPresent | ||
    livenessProbe: | ||
      failureThreshold: 3 | ||
      initialDelaySeconds: 5 | ||
      periodSeconds: 10 | ||
      successThreshold: 1 | ||
      tcpSocket: | ||
        port: 12800 | ||
      timeoutSeconds: 1 | ||
    name: oap | ||
    ports: | ||
    - containerPort: 11800 | ||
      name: grpc | ||
      protocol: TCP | ||
    - containerPort: 12800 | ||
      name: rest | ||
      protocol: TCP | ||
      hostPort: 12800 | ||
    readinessProbe: | ||
      failureThreshold: 3 | ||
      initialDelaySeconds: 5 | ||
      periodSeconds: 10 | ||
      successThreshold: 1 | ||
      tcpSocket: | ||
        port: 12800 | ||
      timeoutSeconds: 1 | ||
    resources: {} | ||
    startupProbe: | ||
      failureThreshold: 9 | ||
      periodSeconds: 10 | ||
      successThreshold: 1 | ||
      tcpSocket: | ||
        port: 12800 | ||
      timeoutSeconds: 1 | ||
  dnsPolicy: ClusterFirst | ||
  enableServiceLinks: true | ||
  initContainers: | ||
  - command: | ||
    - sh | ||
    - -c | ||
    - for i in $(seq 1 60); do curl banyandb-http:17913/api/healthz && exit 0 || sleep 5; done; exit 1 | ||
    image: curlimages/curl | ||
    imagePullPolicy: IfNotPresent | ||
    name: wait-for-banyandb | ||
    resources: {} |