📝 docs(res4j): details circuit breaker mechanism (kresil/kresil#19)
franciscoengenheiro committed May 10, 2024
1 parent 99d34f3 commit 6f25ccf
Showing 2 changed files with 175 additions and 22 deletions.
Binary file added docs/imgs/circuit-breaker-state-machine.png
197 changes: 175 additions & 22 deletions resilience4j/README.md
@@ -20,12 +20,17 @@
- [Kotlin Interop](#kotlin-interop)
- [Configuration](#configuration-1)
- [Decorators](#decorators-1)
2. [Circuit Breaker](#circuit-breaker)
- [Configuration](#configuration-2)
3. [Kotlin Multiplatform Design](#kotlin-multiplatform-design)
4. [Flow](#flow)

## Retry

TODO: Add server and downstream retry drawing
The Retry mechanism retries an operation or request that has failed due to transient faults or network issues.
It is designed to enhance the robustness and reliability of the system
by automatically attempting the operation again after a certain period and/or under certain conditions.

[RetryImpl](https://github.com/resilience4j/resilience4j/blob/9ed5b86fa6add55ee32a733e8ed43058a3c9ec63/resilience4j-retry/src/main/java/io/github/resilience4j/retry/internal/RetryImpl.java)

@@ -378,25 +383,155 @@ retry.executeFunction {
// or retry.decorateFunction { ... } and call it later
```

## Circuit Breaker

In electronics, a traditional circuit breaker is an automatically operated electrical switch designed to protect an electrical circuit from damage caused by excess current from an overload or short circuit.
Its basic function is to interrupt current flow after a fault is detected.
Similarly, in resilience engineering,
a circuit breaker is a design pattern
that prevents an application from repeatedly trying to execute an operation that's likely to fail.
This allows the application to continue (fail-fast) without waiting for the fault to be fixed or wasting
CPU cycles while it determines that the fault is long-lasting.
But unlike the electrical circuit breaker, which needs to be manually reset after a fault is fixed, the resilience circuit breaker can also detect whether the fault has been
resolved. If the problem appears to have been fixed, the application is allowed to try to invoke the operation.

### State Machine

The circuit breaker, which acts like a proxy for the underlying operation, can be implemented as a state machine with the following states:
- `Closed`: The request from the application is routed to the operation. The proxy maintains a count of the number of recent failures, and if the call to the operation is unsuccessful, the proxy increments this count. If the number of recent failures exceeds a specified threshold within a given time period (assuming a time-based sliding window), the proxy is placed into the `Open` state. At this point the proxy starts a timeout timer, and when this timer expires the proxy is placed into the `Half-Open` state. The purpose of the timeout timer is to give the system time to fix the problem that caused the failure before allowing the application to try to perform the operation again.
- `Open`: The request from the application fails immediately, and an exception is returned to the application.
- `Half-Open`: A limited number of requests from the application are allowed to pass through and invoke the operation. If these requests are successful, it's assumed that the fault that was previously causing the failure has been fixed and the circuit breaker switches to the `Closed` state (the failure counter is reset). If any request fails, the circuit breaker assumes that the fault is still present so it reverts to the `Open` state and restarts the timeout timer to give the system a further period of time to recover from the failure.
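
The three states above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the Resilience4j implementation: every name (`SimpleCircuitBreaker`, `BreakerState`, `CallNotPermitted`, the constructor parameters) is hypothetical, failures are counted with a plain counter instead of a sliding window, the `Half-Open` state does not limit concurrent probes, and the class is not thread-safe.

```kotlin
// Hypothetical sketch; none of these names come from Resilience4j.
enum class BreakerState { CLOSED, OPEN, HALF_OPEN }

class CallNotPermitted : RuntimeException("circuit breaker is open")

class SimpleCircuitBreaker(
    private val failureThreshold: Int,        // failures tolerated while Closed
    private val openTimeoutMillis: Long,      // timeout timer started on entering Open
    private val permittedHalfOpenCalls: Int,  // successful probes needed to close again
    private val clock: () -> Long = System::currentTimeMillis,
) {
    var state = BreakerState.CLOSED
        private set
    private var failures = 0
    private var halfOpenSuccesses = 0
    private var openedAt = 0L

    fun <T> execute(operation: () -> T): T {
        // Open -> Half-Open once the timeout timer has expired.
        if (state == BreakerState.OPEN && clock() - openedAt >= openTimeoutMillis) {
            state = BreakerState.HALF_OPEN
            halfOpenSuccesses = 0
        }
        // Open: fail immediately without invoking the operation.
        if (state == BreakerState.OPEN) throw CallNotPermitted()

        val result = try {
            operation()
        } catch (e: Exception) {
            onFailure()
            throw e
        }
        onSuccess()
        return result
    }

    private fun onSuccess() {
        if (state == BreakerState.HALF_OPEN && ++halfOpenSuccesses >= permittedHalfOpenCalls) {
            // Enough probes succeeded: close and reset the failure counter.
            state = BreakerState.CLOSED
            failures = 0
        }
    }

    private fun onFailure() {
        when (state) {
            // Closed: open once the failure count exceeds the threshold.
            BreakerState.CLOSED -> if (++failures > failureThreshold) open()
            // Half-Open: any failure reopens and restarts the timer.
            BreakerState.HALF_OPEN -> open()
            BreakerState.OPEN -> Unit
        }
    }

    private fun open() {
        state = BreakerState.OPEN
        openedAt = clock()
    }
}
```

The `clock` parameter is injected only so that the timeout can be exercised deterministically; a real implementation would also need to bound the failure count to a window, as described in the Sliding Window section.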

> [!IMPORTANT]
> The `Half-Open` state is useful to prevent a recovering service from suddenly being flooded with requests. As a service recovers, it might be able to support a limited volume of requests until the recovery is complete, but while recovery is in progress, a flood of work can cause the service to time out or fail again.

### Kotlin Multiplatform Design
| <img src="../docs/imgs/circuit-breaker-state-machine.png" alt="Circuit Breaker State Machine" width="80%"> |
|:----------------------------------------------------------------------------------------------------------:|
| Circuit Breaker State Machine |

From: [Microsoft Azure Docs](https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker)

### Configuration

<table>
<tr>
<th>Config property</th>
<th>Default Value</th>
<th>Description</th>
</tr>
<tr>
<td>failureRateThreshold</td>
<td>50</td>
<td>Configures the failure rate threshold in percentage. When the failure rate is equal to or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls.</td>
</tr>
<tr>
<td>slowCallRateThreshold</td>
<td>100</td>
<td>Configures a threshold in percentage. The CircuitBreaker considers a call as slow when the call duration is greater than slowCallDurationThreshold. When the percentage of slow calls is equal to or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls.</td>
</tr>
<tr>
<td>slowCallDurationThreshold</td>
<td>60000 [ms]</td>
<td>Configures the duration threshold above which calls are considered as slow and increase the rate of slow calls.</td>
</tr>
<tr>
<td>permittedNumberOfCallsInHalfOpenState</td>
<td>10</td>
<td>Configures the number of permitted calls when the CircuitBreaker is half open.</td>
</tr>
<tr>
<td>maxWaitDurationInHalfOpenState</td>
<td>0 [ms]</td>
<td>Configures a maximum wait duration which controls the longest amount of time a CircuitBreaker can stay in the half-open state before it switches to open. A value of 0 means the CircuitBreaker waits indefinitely in the half-open state until all permitted calls have been completed.</td>
</tr>
<tr>
<td>slidingWindowType</td>
<td>COUNT_BASED</td>
<td>Configures the type of the sliding window which is used to record the outcome of calls when the CircuitBreaker is closed. The sliding window can either be count-based or time-based.</td>
</tr>
<tr>
<td>slidingWindowSize</td>
<td>100</td>
<td>Configures the size of the sliding window which is used to record the outcome of calls when the CircuitBreaker is closed.</td>
</tr>
<tr>
<td>minimumNumberOfCalls</td>
<td>100</td>
<td>Configures the minimum number of calls which are required (per sliding window period) before the CircuitBreaker can calculate the error rate or slow call rate.</td>
</tr>
<tr>
<td>waitDurationInOpenState</td>
<td>60000 [ms]</td>
<td>The time that the CircuitBreaker should wait before transitioning from open to half-open.</td>
</tr>
<tr>
<td>automaticTransitionFromOpenToHalfOpenEnabled</td>
<td>false</td>
<td>If set to true, it means that the CircuitBreaker will automatically transition from open to half-open state and no call is needed to trigger the transition. If set to false, the transition to half-open only happens if a call is made, even after waitDurationInOpenState is passed.</td>
</tr>
<tr>
<td>recordExceptions</td>
<td>empty</td>
<td>A list of exceptions that are recorded as a failure and thus increase the failure rate. Any exception matching or inheriting from one in the list counts as a failure, unless explicitly ignored via ignoreExceptions.</td>
</tr>
<tr>
<td>ignoreExceptions</td>
<td>empty</td>
<td>A list of exceptions that are ignored and count as neither a failure nor a success. Any exception matching or inheriting from one in the list counts as neither a failure nor a success, even if the exception is part of recordExceptions.</td>
</tr>
<tr>
<td>recordFailurePredicate</td>
<td>throwable -&gt; true</td>
<td>A custom Predicate which evaluates if an exception should be recorded as a failure. The Predicate must return true if the exception should count as a failure.</td>
</tr>
<tr>
<td>ignoreExceptionPredicate</td>
<td>throwable -&gt; false</td>
<td>A custom Predicate which evaluates if an exception should be ignored and neither count as a failure nor success. The Predicate must return true if the exception should be ignored.</td>
</tr>
</table>

From: [Resilience4j Circuit Breaker Docs](https://resilience4j.readme.io/docs/circuitbreaker#create-and-configure-a-circuitbreaker)
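
As an illustration of how the properties in the table map to code, the fragment below builds a configuration with Resilience4j's `CircuitBreakerConfig` builder, spelling out the default values listed above. It assumes the `resilience4j-circuitbreaker` dependency is on the classpath; the instance name `backendService` and the chosen exception classes are placeholders picked for the example.

```kotlin
import io.github.resilience4j.circuitbreaker.CircuitBreaker
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType
import java.io.IOException
import java.time.Duration

// Values mirror the defaults from the table, written out for readability.
val config: CircuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50f)                            // percent
    .slowCallRateThreshold(100f)                          // percent
    .slowCallDurationThreshold(Duration.ofSeconds(60))
    .permittedNumberOfCallsInHalfOpenState(10)
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .minimumNumberOfCalls(100)
    .waitDurationInOpenState(Duration.ofSeconds(60))
    .automaticTransitionFromOpenToHalfOpenEnabled(false)
    // Placeholder exception choices, not library defaults:
    .recordExceptions(IOException::class.java)
    .ignoreExceptions(IllegalArgumentException::class.java)
    .build()

// "backendService" is a placeholder name for this example.
val circuitBreaker: CircuitBreaker = CircuitBreaker.of("backendService", config)
```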

> [!NOTE]
> `Resilience4j` also provides two more states: `DISABLED` (stopping automatic state transition, metrics and event publishing)
> and `FORCED_OPEN` (same behavior as disabled state, but always returning an exception), as well as manual control
> over the possible state transitions.

> [!NOTE]
> It is worth mentioning that [Polly](https://www.pollydocs.org/strategies/circuit-breaker.html#defaults)'s circuit breaker
> also allows for [manual control](https://www.pollydocs.org/strategies/circuit-breaker.html#defaults) over the circuit breaker's state.
> It presents an additional state, `Isolated`, which can be used to prevent the circuit breaker from automatically transitioning, as it is manually held open (i.e., actions are blocked).
> This is an implementation detail to consider.

### Sliding Window

The CircuitBreaker uses a sliding window to store and aggregate the outcome of calls.
There are two types of sliding windows: `count-based` (aggregates the outcome of the last `N` calls) and `time-based` (aggregates the outcome of the calls of the last `N` seconds).

In more detail:
- `Count Based`: The sliding window is implemented with a circular array of `N` measurements. If the count window size is `10`, the circular array always has `10` measurements. The sliding window incrementally updates a total aggregation. The total aggregation is updated when a new call outcome is recorded. When the oldest measurement is evicted, the measurement is subtracted from the total aggregation and the bucket is reset. (Subtract-on-Evict)

- `Time Based`: The sliding window is implemented with a circular array of `N` partial aggregations (buckets). If the time window size is `10` seconds, the circular array always has `10` partial aggregations (buckets). Every bucket aggregates the outcome of all calls which happen in a certain epoch second. (Partial aggregation). The head bucket of the circular array stores the call outcomes of the current epoch second. The other partial aggregations store the call outcomes of the previous seconds. The sliding window does not store call outcomes (tuples) individually, but incrementally updates partial aggregations (buckets) and a total aggregation. The total aggregation is updated incrementally when a new call outcome is recorded. When the oldest bucket is evicted, the partial total aggregation of that bucket is subtracted from the total aggregation and the bucket is reset. (Subtract-on-Evict)

From [Resilience4j Circuit Breaker Docs](https://resilience4j.readme.io/docs/circuitbreaker#count-based-sliding-window)
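
The count-based window described above can be sketched as follows. This is a simplified illustration rather than Resilience4j's internal class: the name `CountBasedWindow` and its members are hypothetical, only failures are tracked (no slow calls), and no synchronization is attempted.

```kotlin
// Hypothetical sketch of a count-based sliding window (not Resilience4j's internals).
class CountBasedWindow(private val size: Int) {
    init {
        require(size > 0) { "size must be positive" }
    }

    // Circular array holding the last `size` outcomes: true = failed call.
    private val outcomes = BooleanArray(size)
    private var head = 0       // slot that will be overwritten next (the oldest one when full)
    private var recorded = 0   // number of slots filled so far, at most `size`
    private var failures = 0   // total aggregation, updated incrementally

    fun record(failed: Boolean) {
        if (recorded == size) {
            // Subtract-on-Evict: remove the oldest measurement from the total.
            if (outcomes[head]) failures--
        } else {
            recorded++
        }
        outcomes[head] = failed
        if (failed) failures++
        head = (head + 1) % size
    }

    val totalCalls: Int get() = recorded

    // Failure rate in percent over the calls currently in the window.
    fun failureRate(): Float = if (recorded == 0) 0f else failures * 100f / recorded
}
```

Because the total is maintained incrementally, both `record` and `failureRate` run in constant time, and the window uses memory proportional to `size`.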

> [!IMPORTANT]
> For each future sliding window implementation, the time and space complexity of the sliding window should
> be documented.
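
A time-based window can be sketched in the same spirit, with its complexity noted in comments as the note above asks. The sketch makes several assumptions: the names (`TimeBasedWindow`, `Bucket`) are hypothetical, the caller passes the current epoch second explicitly, time never moves backwards, and expired buckets are found with a simple linear scan rather than by advancing a head pointer.

```kotlin
// Hypothetical sketch of a time-based sliding window; assumes non-decreasing epoch seconds.
// Space: O(N) buckets for a window of N seconds, independent of the number of calls.
class TimeBasedWindow(private val windowSeconds: Int) {
    init {
        require(windowSeconds > 0) { "windowSeconds must be positive" }
    }

    // One partial aggregation per epoch second; epochSecond == -1 marks an empty bucket.
    private class Bucket(var epochSecond: Long = -1, var calls: Int = 0, var failures: Int = 0)

    private val buckets = Array(windowSeconds) { Bucket() }
    private var calls = 0      // total aggregation
    private var failures = 0

    // Time: O(N) per call because of the eviction scan below.
    fun record(failed: Boolean, epochSecond: Long) {
        evictExpired(epochSecond)
        val bucket = buckets[(epochSecond % windowSeconds).toInt()]
        bucket.epochSecond = epochSecond
        bucket.calls++
        calls++
        if (failed) {
            bucket.failures++
            failures++
        }
    }

    fun totalCalls(epochSecond: Long): Int {
        evictExpired(epochSecond)
        return calls
    }

    // Failure rate in percent over the last `windowSeconds` seconds.
    fun failureRate(epochSecond: Long): Float {
        evictExpired(epochSecond)
        return if (calls == 0) 0f else failures * 100f / calls
    }

    // Subtract-on-Evict: buckets older than the window are subtracted from the total and reset.
    private fun evictExpired(now: Long) {
        for (b in buckets) {
            if (b.epochSecond != -1L && b.epochSecond <= now - windowSeconds) {
                calls -= b.calls
                failures -= b.failures
                b.calls = 0
                b.failures = 0
                b.epochSecond = -1
            }
        }
    }
}
```

Individual call outcomes are never stored, only per-second aggregates, which is what keeps the memory footprint bounded by the window size.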

### Additional Details

Just like the [Retry](#retry) mechanism, the Circuit Breaker mechanism also provides:
- [Registry](#registry) for managing Circuit Breaker instances and configurations;
- [Decorators](#decorators) for wrapping functions with the Circuit Breaker logic;
- [Events](#events) for monitoring the Circuit Breaker's state transitions and outcomes;
- [Kotlin Interop](#kotlin-interop) for accessing the Circuit Breaker mechanism in Kotlin that compiles to JVM bytecode.

## Kotlin Multiplatform Design

Resilience4j is compatible with Kotlin but only for the JVM environment.
Some considerations for multiplatform design found are:
@@ -410,17 +545,35 @@ Some considerations for multiplatform design found are:
2. `Duration`
- **Problem**: The library uses Java's [Duration](https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html)
to represent time intervals.
- **Potential solution**: Use `kotlinx-datetime` for multiplatform compatibility.
3. `Delay`
- **Problem**: The library uses
Java's [Thread.sleep](https://github.com/resilience4j/resilience4j/blob/9ed5b86fa6add55ee32a733e8ed43058a3c9ec63/resilience4j-retry/src/main/java/io/github/resilience4j/retry/internal/RetryImpl.java#L48)
for delay as the default delay provider.
- **Potential solution**: Use [kotlinx-coroutines](https://github.com/Kotlin/kotlinx.coroutines) for
delay and other asynchronous operations in a multiplatform environment.

> [!IMPORTANT]
> If a JavaScript target is required,
> a Kotlin Multiplatform implementation of the Retry mechanism cannot use a synchronous [context](#context) because of
> the [single-threaded
> nature of JavaScript](https://medium.com/@hibaabdelkarim/javascript-synchronous-asynchronous-single-threaded-daaa0bc4ad7d).
> The implementation should use an asynchronous context only.
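
To make the `Duration` and `Delay` considerations more concrete, here is a rough sketch of a retry loop that suspends instead of blocking a thread. Everything in it is an assumption made for illustration: `DelayProvider` and `retrySuspending` are hypothetical names, the sketch uses the standard library's `kotlin.time.Duration` for the wait time, and the delay itself is injected so that a real target could pass `kotlinx.coroutines.delay` while the function stays free of platform-specific calls.

```kotlin
import kotlin.coroutines.cancellation.CancellationException
import kotlin.time.Duration

// Hypothetical abstraction: a suspending, platform-neutral delay provider.
// With kotlinx-coroutines available this could be `{ kotlinx.coroutines.delay(it) }`.
typealias DelayProvider = suspend (Duration) -> Unit

// Hypothetical helper (not part of Resilience4j): runs `block` up to `maxAttempts` times,
// suspending for `waitDuration` between attempts instead of calling Thread.sleep.
suspend fun <T> retrySuspending(
    maxAttempts: Int,
    waitDuration: Duration,
    delayProvider: DelayProvider,
    block: suspend () -> T,
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block()
        } catch (e: CancellationException) {
            // Cancellation is not a transient fault; never retry it.
            throw e
        } catch (e: Exception) {
            lastError = e
            // No delay after the final failed attempt.
            if (attempt < maxAttempts) delayProvider(waitDuration)
        }
    }
    throw checkNotNull(lastError)
}
```

Since nothing here blocks, the same function shape can work on a single-threaded JavaScript target as well as on the JVM.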

## Flow

The library also provides several extensions for the asynchronous
primitive [Flow](https://kotlinlang.org/docs/flow.html)
to work with all provided mechanisms.
Such extensions are not terminal operators and can be chained with others.

```kotlin
val retry = Retry.ofDefaults()
val rateLimiter = RateLimiter.ofDefaults()

flowOf(1, 2, 3)
.rateLimiter(rateLimiter)
.map { it * 2 }
.retry(retry)
.collect { println(it) } // terminal operator
```
