merge main

Signed-off-by: Neelanjan Manna <[email protected]>
neelanjan00 committed Dec 7, 2022
2 parents b6f3062 + 79b232e commit 4171b3a
Showing 4,140 changed files with 89,980 additions and 523 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
2 changes: 1 addition & 1 deletion docs/chaos-engineering/chaos-faults/aws/ec2-cpu-hog.md
@@ -63,7 +63,7 @@ stringData:
 ## Fault Tunables
 
 <details>
-<summary>Check the fault tunables</summary>
+<summary>Check the Fault Tunables</summary>
 <h2>Mandatory Fields</h2>
 <table>
 <tr>
2 changes: 1 addition & 1 deletion docs/chaos-engineering/chaos-faults/aws/ec2-io-stress.md
@@ -65,7 +65,7 @@ stringData:
 ## Fault Tunables
 
 <details>
-<summary>Check the fault tunables</summary>
+<summary>Check the Fault Tunables</summary>
 
 <h2>Mandatory Fields</h2>
 
2 changes: 1 addition & 1 deletion docs/chaos-engineering/chaos-faults/aws/ec2-memory-hog.md
@@ -65,7 +65,7 @@ stringData:
 ## Fault Tunables
 
 <details>
-<summary>Check the fault tunables</summary>
+<summary>Check the Fault Tunables</summary>
 <h2>Mandatory Fields</h2>
 <table>
 <tr>
2 changes: 1 addition & 1 deletion docs/chaos-engineering/chaos-faults/aws/ecs-agent-stop.md
@@ -16,7 +16,7 @@ title: ECS Agent Stop
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 Agent stop is a common scenario with ECS clusters: it can break the agent that manages the task containers on the ECS cluster and impact their delivery. Such scenarios can still occur despite whatever availability aids docker provides.
 
@@ -20,7 +20,7 @@ title: ECS Container CPU Hog
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 CPU hogs are a common scenario with containers/applications that can result in the eviction of the application (task container) and impact its delivery. Such scenarios can still occur despite whatever availability aids docker provides. These problems are generally referred to as "Noisy Neighbour" problems.
 
@@ -20,7 +20,7 @@ title: ECS Container IO Hog
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 Filesystem reads and writes are a common scenario with containers/applications that can result in the eviction of the application (task container) and impact its delivery. Such scenarios can still occur despite whatever availability aids docker provides. These problems are generally referred to as "Noisy Neighbour" problems.
 
@@ -20,7 +20,7 @@ title: ECS Container Memory Hog
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 Memory usage within containers is subject to various constraints. If limits are specified in the spec, exceeding them can cause termination of the container (due to an OOMKill of the primary process, often pid 1), followed by a restart of the container by docker, subject to the restart policy specified. For containers with no limits placed, memory usage is uninhibited until the VM-level OOM behaviour takes over. In this case, containers on the instance can be killed based on their oom_score. This evaluation extends to all task containers running on the instance, thereby causing a bigger blast radius.
 
@@ -20,7 +20,7 @@ title: ECS Container Network Latency
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 The fault causes network degradation of the task container without the container being marked unhealthy/unworthy of traffic from outside. The idea of this fault is to simulate issues within your ECS task network, or in communication across services in different availability zones or regions.
 
@@ -20,7 +20,7 @@ title: ECS Container Network Loss
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 The fault causes network degradation of the task container without the container being marked unhealthy/unworthy of traffic from outside. The idea of this fault is to simulate issues within your ECS task network, or in communication across services in different availability zones or regions.
 
@@ -16,7 +16,7 @@ title: ECS Instance Stop
 ## Uses
 
 <details>
-<summary>View the uses of the experiment</summary>
+<summary>View the uses of the fault</summary>
 <div>
 EC2 instance stop is a common scenario with ECS clusters that can break the agent that manages the task containers on the ECS cluster and impact their delivery. Such scenarios can still occur despite whatever availability aids docker provides.
 
148 changes: 148 additions & 0 deletions docs/chaos-engineering/chaos-faults/aws/elb-az-down.md
@@ -0,0 +1,148 @@
---
id: elb-az-down
title: ELB AZ Down
---

## Introduction
- It injects AZ-down chaos on a target ELB for a specified duration, causing access restrictions for the given availability zones.
- It tests the sanity, availability, and recovery workflows of the application pods attached to the load balancer.

:::tip Fault execution flow chart
![ELB AZ Down](./static/images/elb-az-down.png)
:::

## Uses

<details>
<summary>View the uses of the fault</summary>
<div>
An availability zone going down is a common scenario with ELBs: it can break connectivity with the given zones and impact delivery. Such scenarios can still occur despite whatever availability aids AWS provides.

Detaching an AZ from the load balancer disrupts the application's performance and availability. This category of chaos fault helps build immunity in applications against such scenarios.

</div>
</details>

## Prerequisites

:::info
- Kubernetes > 1.17
- AWS access to attach or detach an AZ from the ELB.
- A minimum number of AZs must remain attached to the ELB; otherwise, the fault fails to detach the given AZ.
- Kubernetes secret that has the AWS access configuration (key) in the `CHAOS_NAMESPACE`. A sample secret file looks like:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  cloud_config.yml: |-
    # Add the cloud AWS credentials respectively
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXX
```
- If you change the secret key name (from `cloud_config.yml`), update the `AWS_SHARED_CREDENTIALS_FILE` environment variable value in `fault.yaml` with the same name.
:::
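For instance, if the secret key were renamed from `cloud_config.yml` to a hypothetical `my_config.yml`, the corresponding override might look like this minimal sketch (the `/tmp` mount path is an assumption, not stated above):

```yaml
# hypothetical sketch: the secret key was renamed to my_config.yml,
# so the credentials file path is updated to match (mount path /tmp assumed)
env:
  - name: AWS_SHARED_CREDENTIALS_FILE
    value: "/tmp/my_config.yml"
```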

## Default Validations

:::info
- The ELB is attached to the given availability zones.
:::

## Fault Tunables

<details>
<summary>Check the Fault Tunables</summary>
<h2>Mandatory Fields</h2>
<table>
<tr>
<th> Variables </th>
<th> Description </th>
<th> Notes </th>
</tr>
<tr>
<td> LOAD_BALANCER_NAME </td>
<td> Provide the name of the load balancer whose AZ has to be detached</td>
<td> Eg. <code>elb-name</code> </td>
</tr>
<tr>
<td> ZONES </td>
<td> Provide the target zones that have to be detached from the ELB</td>
<td> Eg. <code>us-east-1a</code> </td>
</tr>
<tr>
<td> REGION </td>
<td> The region name of the target ELB</td>
<td> Eg. <code>us-east-1</code> </td>
</tr>
</table>
<h2>Optional Fields</h2>
<table>
<tr>
<th> Variables </th>
<th> Description </th>
<th> Notes </th>
</tr>
<tr>
<td> TOTAL_CHAOS_DURATION </td>
<td> The time duration for chaos insertion (in seconds) </td>
<td> Defaults to 30s </td>
</tr>
<tr>
<td> CHAOS_INTERVAL </td>
<td> The time duration between the detachment and re-attachment of the zones (sec) </td>
<td> Defaults to 30s </td>
</tr>
<tr>
<td> SEQUENCE </td>
<td> It defines the sequence of chaos execution for multiple zones</td>
<td> Default value: parallel. Supported: serial, parallel </td>
</tr>
<tr>
<td> RAMP_TIME </td>
<td> Period to wait before and after injection of chaos in sec </td>
<td> Eg: 30 </td>
</tr>
</table>
</details>

## Fault Examples

### Common and AWS-specific tunables

Refer to the [common attributes](../common-tunables-for-all-experiments) and [AWS-specific tunables](./aws-experiments-tunables) to tune the common tunables for all faults and the AWS-specific tunables.

### Target Zones

It contains a comma-separated list of target zones. It can be tuned via the `ZONES` environment variable.

Use the following example to tune it:

[embedmd]:# (./static/manifests/elb-az-down/target-zones.yaml yaml)
```yaml
# contains elb az down for given zones
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: elb-az-down
    spec:
      components:
        env:
        # load balancer name for chaos
        - name: LOAD_BALANCER_NAME
          value: 'tes-elb'
        # target zones for the chaos
        - name: ZONES
          value: 'us-east-1a,us-east-1b'
        # region for chaos
        - name: REGION
          value: 'us-east-1'
```
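
### Chaos Duration and Interval

The optional tunables from the table above can be set the same way. The following is an illustrative sketch (the values are examples, not defaults) that tunes the total chaos duration and the interval between zone detachment and re-attachment:

```yaml
# illustrative sketch: tune chaos duration and interval for elb-az-down
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: elb-az-down
    spec:
      components:
        env:
        # total duration of the chaos injection (in seconds)
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        # wait between detaching and re-attaching the zones (in seconds)
        - name: CHAOS_INTERVAL
          value: '30'
        - name: LOAD_BALANCER_NAME
          value: 'tes-elb'
        - name: ZONES
          value: 'us-east-1a'
        - name: REGION
          value: 'us-east-1'
```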
@@ -0,0 +1,143 @@
---
id: lambda-delete-event-source-mapping
title: Lambda Delete Event Source Mapping
---

## Introduction

- It removes the event source mapping from an AWS Lambda function for a certain chaos duration.
- It checks the performance of the running application/service without the event source mapping, which can cause, for example, missing entries in a database.

:::tip Fault execution flow chart
![Lambda Delete Event Source Mapping](./static/images/lambda-delete-event-source-mapping.png)
:::

## Uses

<details>
<summary>View the uses of the fault</summary>
<div>
Deleting an event source mapping from a lambda function is critical. It can lead to scenarios such as failure to update a database on an event trigger, which can break the service and impact its delivery. Such scenarios can occur despite the availability aids provided by AWS or configured by you.

It helps understand whether you have proper error handling or automatic recovery configured for such cases. Hence, this category of chaos fault helps build the immunity of the application.
</div>
</details>

## Prerequisites

:::info

- Kubernetes >= 1.17
- AWS Lambda event source mapping attached to the lambda function.
- Kubernetes secret that has the AWS access configuration (key) in the `CHAOS_NAMESPACE`. A sample secret file looks like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  cloud_config.yml: |-
    # Add the cloud AWS credentials respectively
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXX
```
- If you change the secret key name (from `cloud_config.yml`), update the `AWS_SHARED_CREDENTIALS_FILE` environment variable value in `experiment.yaml` with the same name.

:::
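As with the elb-az-down fault above, a renamed secret key needs a matching override; a hypothetical sketch (again assuming the `/tmp` mount path and an illustrative `my_config.yml` key name):

```yaml
# hypothetical sketch: point the experiment at the renamed credentials key
env:
  - name: AWS_SHARED_CREDENTIALS_FILE
    value: "/tmp/my_config.yml"
```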

## Default Validations

:::info

- The AWS Lambda event source mapping is healthy and attached to the lambda function.

:::

## Fault Tunables

<details>
<summary>Check the Fault Tunables</summary>
<h2>Mandatory Fields</h2>
<table>
<tr>
<th> Variables </th>
<th> Description </th>
<th> Notes </th>
</tr>
<tr>
<td> FUNCTION_NAME </td>
<td> Function name of the target lambda function. It supports a single function name.</td>
<td> Eg: <code>test-function</code> </td>
</tr>
<tr>
<td> EVENT_UUIDS </td>
<td> Provide the UUIDs of the target event source mappings.</td>
<td> You can provide multiple values as comma-separated values. Eg: <code>id1,id2</code> </td>
</tr>
<tr>
<td> REGION </td>
<td> The region name of the target lambda function</td>
<td> Eg: <code>us-east-2</code></td>
</tr>
</table>
<h2>Optional Fields</h2>
<table>
<tr>
<th> Variables </th>
<th> Description </th>
<th> Notes </th>
</tr>
<tr>
<td> TOTAL_CHAOS_DURATION </td>
<td> The total time duration for chaos insertion in seconds </td>
<td> Defaults to 30s </td>
</tr>
<tr>
<td> SEQUENCE </td>
<td> It defines the sequence of chaos execution for multiple event source mappings</td>
<td> Default value: parallel. Supported: serial, parallel </td>
</tr>
<tr>
<td> RAMP_TIME </td>
<td> Period to wait before and after injection of chaos in sec </td>
<td> Eg. 30 </td>
</tr>
</table>
</details>

## Fault Examples

### Common and AWS-specific tunables

Refer to the [common attributes](../common-tunables-for-all-experiments) and [AWS-specific tunables](./aws-experiments-tunables) to tune the common tunables for all faults and the AWS-specific tunables.

### Multiple Event Source Mappings

It can delete multiple event source mappings for a certain chaos duration using the `EVENT_UUIDS` environment variable, which takes the UUIDs of the events as comma-separated values (CSV).

Use the following example to tune it:

[embedmd]:# (./static/manifests/lambda-delete-event-source-mapping/multiple-events.yaml yaml)
```yaml
# contains the removal of multiple event source mappings
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: lambda-delete-event-source-mapping
    spec:
      components:
        env:
        # provide UUIDs of event source mappings
        - name: EVENT_UUIDS
          value: 'id1,id2'
        # provide the function name for the chaos
        - name: FUNCTION_NAME
          value: 'chaos-function'
```
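
### Chaos Sequence

When multiple event source mappings are targeted, the order of chaos execution can also be tuned via the `SEQUENCE` tunable from the table above. The following is an illustrative sketch (values are examples) that deletes the mappings one after another instead of in parallel:

```yaml
# illustrative sketch: delete multiple event source mappings serially
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: lambda-delete-event-source-mapping
    spec:
      components:
        env:
        # run chaos on the given UUIDs one at a time
        - name: SEQUENCE
          value: 'serial'
        - name: EVENT_UUIDS
          value: 'id1,id2'
        - name: FUNCTION_NAME
          value: 'chaos-function'
```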