feat: restart pipeline with recovery (#1853)
* wip

* implement restart method

* implement restart swapping nodes

* fix method calls

* fix some tests

* fix another test

* fix test

* update comment and refactor test

* refactor tests (wip)

* restart test WIP

* remove redundant code

* update test

* fix test

* add new tests and refactor

* fix logger

* cosmetic changes

* move to a separate method

* add tracing and fix max retries

* not needed

* Set -1 as infinite retries for err recovery

* rename flag

* goes to degraded once it exits

* fix const

* simpler implementation

* add test case for 0 max-retries

* fix test

* no need to kill the tomb

* typo and fix

* fix exit-on-degraded when degraded

* log before erroring

* removes restart and uses start

* clearer validation

* handle recovery

* Pipeline recovery: test cases (#1873)


---------

Co-authored-by: Raúl Barroso <[email protected]>

* Pipeline recovery tests: add test pipeline, add test case (#1876)

* need to return err

* assign error

* updated test cases

* reuse method

* use log.AttemptField

---------

Co-authored-by: Haris Osmanagic <[email protected]>
raulb and hariso authored Oct 4, 2024
1 parent 8608f0f commit d7417fd
Showing 13 changed files with 900 additions and 216 deletions.
330 changes: 330 additions & 0 deletions docs/test-cases/pipeline-recovery.md
@@ -0,0 +1,330 @@
<!-- markdownlint-disable MD013 -->
# Test cases for the pipeline recovery feature

## Test Case 01: Recovery triggered on a DLQ write error

**Priority** (low/medium/high):

**Description**:
Recovery is triggered when there is an error writing to a DLQ. As with a normal
destination, a DLQ write error can be a temporary error that is resolved after
a retry.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
version: "2.2"
pipelines:
  - id: file-pipeline
    status: running
    name: file-pipeline
    description: dlq write error
    connectors:
      - id: chaos-src
        type: source
        plugin: standalone:chaos
        name: chaos-src
        settings:
          readMode: error
      - id: log-dst
        type: destination
        plugin: builtin:log
        name: file-dst
    dead-letter-queue:
      plugin: "builtin:postgres"
      settings:
        table: non_existing_table_so_that_dlq_fails
        url: postgresql://meroxauser:meroxapass@localhost/meroxadb?sslmode=disable
      window-size: 3
      window-nack-threshold: 2
```

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 02: Recovery not triggered for fatal error - processor

**Priority** (low/medium/high):

**Description**:
Recovery is not triggered when there is an error processing a record.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```
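
A possible configuration for this case is sketched below. It reuses the
generator and log plugins from the other test cases in this document; the
pipeline-level `processors` block and the builtin `error` processor (which
fails every record passed to it) are assumptions about the processor setup,
not part of the original test case.

```yaml
version: "2.2"
pipelines:
  - id: generator-error-processor
    status: running
    name: generator-error-processor
    description: processor returns an error for every record
    connectors:
      - id: generator-src
        type: source
        plugin: builtin:generator
        name: generator-src
        settings:
          format.type: structured
          format.options.id: int
          format.options.name: string
          rate: "1"
      - id: log-dst
        type: destination
        plugin: builtin:log
        name: log-dst
    processors:
      # assumption: a processor that fails every record, producing a fatal
      # processing error; recovery should not be triggered for it
      - id: fail-every-record
        plugin: "error"
```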

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 03: Recovery not triggered - graceful shutdown

**Priority** (low/medium/high):

**Description**:
Recovery is not triggered when Conduit is shutting down gracefully (i.e. when
pressing Ctrl+C in the terminal where Conduit is running, or sending it a SIGINT).

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```
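
Any healthy pipeline works for this case. A minimal sketch, reusing the
generator and log plugins from the other test cases in this document:

```yaml
version: "2.2"
pipelines:
  - id: generator-to-log
    status: running
    name: generator-to-log
    description: healthy pipeline used to verify graceful shutdown
    connectors:
      - id: generator-src
        type: source
        plugin: builtin:generator
        name: generator-src
        settings:
          format.type: structured
          format.options.id: int
          format.options.name: string
          rate: "1"
      - id: log-dst
        type: destination
        plugin: builtin:log
        name: log-dst
```

Start Conduit with this pipeline, press Ctrl+C (or send SIGINT to the Conduit
process) and check the logs: the pipeline should stop gracefully, with no
recovery or restart attempts.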

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 04: Recovery not triggered - user stopped pipeline

**Priority** (low/medium/high):

**Description**:
Recovery is not triggered if a user stops a pipeline (via the HTTP API's
`/v1/pipelines/pipeline-id/stop` endpoint).

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```
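
The healthy generator-to-log pipeline from Test Case 03 can be reused here; a
sketch with a distinct pipeline id is shown below. The difference is that the
pipeline is stopped through the HTTP API instead of by shutting Conduit down.

```yaml
version: "2.2"
pipelines:
  - id: stop-via-api
    status: running
    name: stop-via-api
    description: healthy pipeline stopped via the HTTP API
    connectors:
      - id: generator-src
        type: source
        plugin: builtin:generator
        name: generator-src
        settings:
          format.type: structured
          format.options.id: int
          format.options.name: string
          rate: "1"
      - id: log-dst
        type: destination
        plugin: builtin:log
        name: log-dst
```

With the pipeline running, stop it via the HTTP API, e.g.
`curl -X POST http://localhost:8080/v1/pipelines/stop-via-api/stop` (the host
and port are assumptions; use the address your Conduit instance exposes). The
pipeline should stop without recovery kicking in.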

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 05: Recovery is configured by default

**Priority** (low/medium/high):

**Description**:
Pipeline recovery is configured by default. A failing pipeline will be restarted
a number of times without any additional configuration.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
version: "2.2"
pipelines:
  - id: chaos-to-log
    status: running
    name: chaos-to-log
    description: chaos source, error on read
    connectors:
      - id: chaos-source-1
        type: source
        plugin: standalone:chaos
        name: chaos-source-1
        settings:
          readMode: error
      - id: destination1
        type: destination
        plugin: builtin:log
        name: log-destination
```

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 06: Recovery not triggered on malformed pipeline

**Priority** (low/medium/high):

**Description**:
Recovery is not triggered for a malformed pipeline, e.g. when a connector is
missing.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
version: "2.2"
pipelines:
  - id: nothing-to-log
    status: running
    name: nothing-to-log
    description: no source
    connectors:
      - id: destination1
        type: destination
        plugin: builtin:log
        name: log-destination
```

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 07: Conduit exits with --pipelines.exit-on-degraded=true and a pipeline failing after recovery

**Priority** (low/medium/high):

**Description**:
Given a Conduit instance with `--pipelines.exit-on-degraded=true`, and a
pipeline that is still failing after the configured maximum number of retries,
Conduit should shut down gracefully.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```
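
A sketch for this case: the chaos source from Test Case 05 keeps the pipeline
failing on every read, and Conduit is started with a small, finite retry
budget. The concrete retry count below is only an example.

```yaml
version: "2.2"
pipelines:
  - id: chaos-to-log-exit
    status: running
    name: chaos-to-log-exit
    description: chaos source, error on read; Conduit exits once retries are exhausted
    connectors:
      - id: chaos-source-1
        type: source
        plugin: standalone:chaos
        name: chaos-source-1
        settings:
          readMode: error
      - id: destination1
        type: destination
        plugin: builtin:log
        name: log-destination
```

Start Conduit with something like
`--pipelines.exit-on-degraded=true --pipelines.error-recovery.max-retries=2`.
Once the last retry fails and the pipeline goes into a degraded state, Conduit
is expected to log the error and shut down gracefully.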

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 08: Conduit doesn't exit with --pipelines.exit-on-degraded=true and a pipeline that recovers after a few retries

**Priority** (low/medium/high):

**Description**:
Given a Conduit instance with `--pipelines.exit-on-degraded=true`, and a
pipeline that recovers after a few retries, Conduit should still be running.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```
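
One way to get a pipeline that fails temporarily and then recovers is to point
a destination at a dependency that is briefly unavailable. The sketch below
reuses the `builtin:postgres` plugin and connection URL from Test Case 01; the
table name is only an example.

```yaml
version: "2.2"
pipelines:
  - id: generator-to-postgres
    status: running
    name: generator-to-postgres
    description: destination recovers once Postgres becomes reachable again
    connectors:
      - id: generator-src
        type: source
        plugin: builtin:generator
        name: generator-src
        settings:
          format.type: structured
          format.options.id: int
          format.options.name: string
          rate: "1"
      - id: pg-dst
        type: destination
        plugin: builtin:postgres
        name: pg-dst
        settings:
          table: recovery_test
          url: postgresql://meroxauser:meroxapass@localhost/meroxadb?sslmode=disable
```

Run Conduit with `--pipelines.exit-on-degraded=true`, start the pipeline while
Postgres is stopped, then bring Postgres back up before the retries are
exhausted. Conduit should keep running and the pipeline should return to a
running state.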

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 09: Conduit exits with --pipelines.exit-on-degraded=true, --pipelines.error-recovery.max-retries=0, and a degraded pipeline

**Priority** (low/medium/high):

**Description**:
Given a Conduit instance with
`--pipelines.exit-on-degraded=true --pipelines.error-recovery.max-retries=0`,
and a pipeline that goes into a degraded state, the Conduit instance will
gracefully shut down. This is because `max-retries=0` disables recovery.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```
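
Any pipeline that fails immediately works here, since `max-retries=0` disables
recovery altogether; the chaos-source pipeline from Test Case 05 is sketched
again below with a distinct id. The interesting part of this test case is the
flag combination, not the pipeline itself.

```yaml
version: "2.2"
pipelines:
  - id: chaos-no-retries
    status: running
    name: chaos-no-retries
    description: chaos source, error on read; recovery disabled via max-retries=0
    connectors:
      - id: chaos-source-1
        type: source
        plugin: standalone:chaos
        name: chaos-source-1
        settings:
          readMode: error
      - id: destination1
        type: destination
        plugin: builtin:log
        name: log-destination
```

Start Conduit with
`--pipelines.exit-on-degraded=true --pipelines.error-recovery.max-retries=0`;
the first failure should send the pipeline straight to degraded and cause
Conduit to shut down gracefully.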

**Steps**:

**Expected Result**:

**Additional comments**:

---

## Test Case 10: Recovery not triggered for fatal error - DLQ threshold exceeded

**Priority** (low/medium/high):

**Description**:
Recovery is not triggered when the DLQ's nack threshold
(`window-nack-threshold`) is exceeded, since this is a fatal error.

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
version: "2.2"
pipelines:
  - id: pipeline1
    status: running
    name: pipeline1
    description: chaos destination with write errors, DLQ threshold specified
    connectors:
      - id: generator-src
        type: source
        plugin: builtin:generator
        name: generator-src
        settings:
          format.type: structured
          format.options.id: int
          format.options.name: string
          rate: "1"
      - id: chaos-destination-1
        type: destination
        plugin: standalone:chaos
        name: chaos-destination-1
        settings:
          writeMode: error
    dead-letter-queue:
      window-size: 2
      window-nack-threshold: 1
```

**Steps**:

**Expected Result**:

**Additional comments**:

---
27 changes: 27 additions & 0 deletions docs/test-cases/test-case-template.md
@@ -0,0 +1,27 @@
<!-- markdownlint-disable MD041 -->

## Test Case 01: Test case title

**Priority** (low/medium/high):

**Description**:

**Automated** (yes/no)

**Setup**:

**Pipeline configuration file**:

```yaml
```

**Steps**:

1. Step 1
2. Step 2

**Expected Result**:

**Additional comments**:

---