Fault Tolerance

During the review process, add the following fields as needed:

Introduction

It is increasingly important to build fault-tolerant microservices. Fault tolerance is about leveraging different strategies to guide the execution and result of some logic. Retry policies, bulkheads, and circuit breakers are popular concepts in this area. They dictate whether and when executions should take place, and fallbacks offer an alternative result when an execution does not complete successfully.

The Fault Tolerance proposal focuses on the following aspects: Timeout, RetryPolicy, Fallback, Bulkhead, and CircuitBreaker.

  • Timeout: define a maximum duration for an execution.

  • RetryPolicy: define the criteria for when to retry.

  • Fallback: provide an alternative solution for a failed execution.

  • Bulkhead: isolate failures in part of the system so that the rest of the system can still function.

  • CircuitBreaker: offer a way to fail fast by automatically failing executions, to prevent system overload and indefinite waits or timeouts on the client side.

The main design is to separate execution logic from execution. The execution can be configured with fault tolerance policies, such as RetryPolicy, Fallback, Bulkhead, and CircuitBreaker.

Hystrix and Failsafe are two popular libraries for handling failures. This proposal defines a standard API and approach for applications to follow in order to achieve fault tolerance.

The requirements are as follows:

  • Loose coupling: Execution logic should not know anything about the execution status or fault tolerance.

  • Failure handling strategy should be configured when the execution takes place.

  • Support for synchronous and asynchronous execution

  • Integration with 3rd party asynchronous APIs. This is necessary to handle executions that are completed at some time in the future, where retries will need to be explicitly scheduled from within the asynchronous execution. This is common when working with various 3rd party asynchronous tools such as Netty, RxJava, Vert.x, etc.

  • Require immutable failure handling policy configuration

  • Some failure policy configurations, e.g. CircuitBreaker and RetryPolicy, can be used standalone. For example, it has been very useful for circuit breakers to be standalone constructs that can be plugged into, and intentionally shared across, multiple executions. The same applies to retry policies. Additionally, an Execution construct can be offered that allows retry policies to be applied to some logic in a standalone, manually controlled way.

Motivation

Currently there are at least two libraries that provide fault tolerance. It is best to unify these technologies and define a standard that microservice applications can adopt, so that the implementation of fault tolerance can be provided by the containers where possible.

Proposed solution

Separate the responsibility of executing logic (Runnables/Callables/etc) from guiding when execution should take place (through retry policies, bulkheads, circuit breakers). In this way, failure handling strategies become configuration that can influence executions, and the execution API itself is just responsible for receiving some configuration and performing executions.

By default, a failure handling strategy could assume, for example, that any exception is a failure. This is what the RetryPolicy's retryOn and abortOn clauses are about: defining what counts as a failure.
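The separation described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the API of this proposal: the class names `RetryPolicySketch` and `RetryExecutorSketch` are hypothetical, and `Supplier` stands in for the Runnable/Callable logic to keep the example free of checked exceptions. The policy is immutable configuration that defines what a failure is; the executor only receives that configuration and performs the execution.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical immutable policy: pure configuration, no execution logic.
final class RetryPolicySketch {
    final int maxRetries;
    final Predicate<Throwable> retryOn;  // failures that may be retried
    final Predicate<Throwable> abortOn;  // failures that stop retrying at once

    RetryPolicySketch(int maxRetries, Predicate<Throwable> retryOn, Predicate<Throwable> abortOn) {
        this.maxRetries = maxRetries;
        this.retryOn = retryOn;
        this.abortOn = abortOn;
    }

    // Default mentioned in the text: any exception counts as a failure.
    static RetryPolicySketch anyException(int maxRetries) {
        return new RetryPolicySketch(maxRetries, t -> true, t -> false);
    }
}

// Hypothetical executor: receives the policy and performs the execution.
final class RetryExecutorSketch {
    static <T> T call(Supplier<T> logic, RetryPolicySketch policy) {
        int retriesUsed = 0;
        while (true) {
            try {
                return logic.get();
            } catch (RuntimeException e) {
                boolean retryable = policy.retryOn.test(e) && !policy.abortOn.test(e);
                if (!retryable || retriesUsed >= policy.maxRetries) {
                    throw e;  // not a retryable failure, or retries exhausted
                }
                retriesUsed++;
            }
        }
    }
}
```

With `RetryPolicySketch.anyException(2)`, the logic runs at most three times: the initial attempt plus two retries. The logic itself never sees the policy, which is the loose coupling the requirements ask for.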

Standardise the Fallback, Bulkhead and CircuitBreaker APIs and provide implementations.

  • CDI-first approach to apply RetryPolicy, Fallback, Bulkhead, and CircuitBreaker using annotations

Detailed design (one example implementation)

This specification utilises CDI to simplify the programming model.

CDI-based approach

Use interceptor bindings to specify the execution and policy configuration. The Asynchronous annotation must be specified for any asynchronous call; otherwise, synchronous execution is assumed.

RetryPolicy: A policy to define the retry criteria

An annotation to specify the maximum number of retries, the delay, maxDuration, the duration unit, jitter, retryOn, etc.

CircuitBreaker: a rule to fail fast, in order to avoid repeated timeouts

An annotation to specify when to open, half-open, and close the circuit.
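The open, half-open, and closed states can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the class `CircuitBreakerSketch`, the consecutive-failure threshold, and the injected clock are hypothetical and are not the annotation parameters of this proposal.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical standalone circuit breaker; names and rules are illustrative only.
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that open the circuit
    private final long openDelayMillis;  // time to stay open before allowing trial calls
    private final LongSupplier clock;    // injected so the delay can be controlled in tests
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreakerSketch(int failureThreshold, long openDelayMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openDelayMillis = openDelayMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openDelayMillis) {
            state = State.HALF_OPEN;  // delay elapsed: let trial calls through
        }
        return state;
    }

    <T> T call(Supplier<T> logic) {
        if (state() == State.OPEN) {
            // Fail fast without invoking the protected logic.
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = logic.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;  // a successful call closes the circuit
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;  // a failed trial call, or too many failures, opens it
            openedAt = clock.getAsLong();
        }
    }
}
```

Because the breaker holds its own state and knows nothing about the logic it guards, one instance of such a construct could be shared across several executions that depend on the same backend.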

Fallback

Define the fallback method or fallback handler for a failed execution.
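As a rough illustration of the idea, a fallback handler can be modelled as a function from the failure to an alternative result. The class `FallbackSketch` and its signature are hypothetical and only show the behaviour, not the annotation-based API.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper showing fallback behaviour in isolation.
final class FallbackSketch {
    // Runs the primary logic; if it throws, asks the handler for an alternative result.
    static <T> T call(Supplier<T> primary, Function<Throwable, T> fallbackHandler) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallbackHandler.apply(e);
        }
    }
}
```

The caller receives a result in both cases and does not need to know whether it came from the primary logic or from the handler.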

Timeout: specify the maximum time for an execution

An annotation to specify the maximum time allowed for a particular execution.
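One way to picture a timeout, using only the JDK, is to run the logic on a worker thread and wait for the result with a bound. This is a minimal sketch under assumptions: `TimeoutSketch` is a hypothetical name, a new single-thread executor per call is chosen for brevity rather than efficiency, and an unchecked `IllegalStateException` is used to signal the timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper that bounds how long an execution may take.
final class TimeoutSketch {
    static <T> T call(Supplier<T> logic, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<T> task = logic::get;
            Future<T> future = executor.submit(task);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("execution exceeded " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("execution failed", e.getCause());
        } finally {
            executor.shutdownNow();  // interrupts the worker if it is still running
        }
    }
}
```

Note that the worker is only interrupted, so logic that ignores interruption would keep running in the background even after the caller has been told about the timeout.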

Bulkhead - threadpool or semaphore style

Use this annotation without the Asynchronous annotation for a semaphore-style bulkhead. When used together with Asynchronous, it means a thread pool style of bulkhead.

Usage

The annotations can be applied to a bean or to methods, and they can be used together. For instance, @Retry can be used with @Fallback in order to trigger the fallback when the Retry policy fails.

@ApplicationScoped
public class FaultToleranceBean {
   @Retry(maxRetries = 2)
   public Runnable doWork() {
      // serviceA() is an unreliable service: it sometimes succeeds but
      // sometimes throws a RuntimeException
      Runnable mainService = () -> serviceA();
      return mainService;
   }
}

Configuration

The annotation parameters can be configured via MicroProfile Config. For example, to set maxRetries to 6 for the following Retry policy, define the property org.microprofile.readme.FaultToleranceBean/doWork/Retry/maxRetries=6. Alternatively, to set maxRetries to 6 for every Retry in the application, specify the property Retry/maxRetries=6.

package org.microprofile.readme;

@ApplicationScoped
public class FaultToleranceBean {
   @Retry(maxRetries = 2)
   public Runnable doWork() {
      // serviceA() is an unreliable service: it sometimes succeeds but
      // sometimes throws a RuntimeException
      Runnable mainService = () -> serviceA();
      return mainService;
   }
}
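How the two property keys relate can be sketched as a simple lookup. This is an illustration only: `RetryConfigSketch` is a hypothetical helper, a plain `Map` stands in for the MicroProfile Config source, and the lookup order (method-specific key first, then the application-wide key, then the value on the annotation) is an assumption stated here rather than something this README spells out.

```java
import java.util.Map;

// Hypothetical helper showing how the property keys could be resolved.
final class RetryConfigSketch {
    static int maxRetries(Map<String, String> config, String className,
                          String methodName, int annotationValue) {
        String specificKey = className + "/" + methodName + "/Retry/maxRetries";
        String globalKey = "Retry/maxRetries";
        // Assumed order: the most specific key wins, then the global key.
        String value = config.containsKey(specificKey)
                ? config.get(specificKey)
                : config.get(globalKey);
        // With no configured value, the annotation parameter applies.
        return value == null ? annotationValue : Integer.parseInt(value);
    }
}
```

For the bean above, an empty configuration yields the annotation value 2, while either of the two properties from the text yields 6.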

Impact on existing code

n/a

Alternatives considered

n/a