Consistency #17

Open
nbelzer opened this issue May 15, 2020 · 4 comments

Comments

@nbelzer
Collaborator

nbelzer commented May 15, 2020

After our discussion today I took a look at the different articles discussing Saga:

Below I've taken some notes on how this would work for the payment service (which should be extendable to the order service).

Saga notes

  • The choreography approach sends events directly to the other services that are subscribed.
    • Since each service is a set of replicas, we need a place to store all the services that are subscribed to us.
      • This could be Cassandra, Postgres, or some other service (Redis?).
    • Upon an event (say, creating the order) we emit this event (through HTTP) to all subscribers, along with some details (like the transaction id).
    • The choreography approach (pushing messages to the services) is preferable because we can be sure that only one replica answers an event. The K8s load balancer will make sure that this replica is healthy.
    • These events trigger actions on the subscribed services.
      • For example, triggering a reservation.
      • They could trigger another event.
    • Upon success, the next event will trigger the next step in the process.
    • Upon failure, all services that performed some action can run the appropriate code to roll back the actions made.
      • What happens when the next event doesn't arrive, for example because a service actually failed?
        - Make sure that sending an event waits for a 2xx code or reports failure on timeout.
    • This means we should clearly document the events and when they happen in the process, as the number of events can become quite large.
    • The main benefit of this is that we don't need to implement all rollback logic in a single service and that we don't need to store any state in the services themselves.
      • It also provides a clear distinction between the public API, for the actual behaviour that we want our clients to use, and an internal API for communication between services to handle reservations or failures.
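
To make the "emit to all subscribers and wait for a 2xx code or a timeout" idea concrete, here is a minimal sketch. The registry contents, the URLs, and the `send` callable are assumptions for illustration only; in practice the registry would live in the shared store mentioned above and `send` would wrap a real HTTP client.

```python
from typing import Callable, Dict, List

# Hypothetical registry: event name -> subscriber URLs. In the notes above this
# would be kept in a shared store so that every replica sees the same list.
SUBSCRIBERS: Dict[str, List[str]] = {
    "PAYMENT_INITIATED": ["http://user-service/events"],
    "CREDITS_RESERVED": ["http://stock-service/events"],
}

# A sender takes (url, payload) and returns the HTTP status code.
# It is expected to raise TimeoutError when the subscriber does not answer in time.
Sender = Callable[[str, dict], int]


def emit(event: str, transaction_id: str, send: Sender) -> List[str]:
    """Deliver `event` to every subscriber; return the URLs that did not acknowledge it."""
    payload = {"event": event, "transaction_id": transaction_id}
    failed = []
    for url in SUBSCRIBERS.get(event, []):
        try:
            status = send(url, payload)
        except TimeoutError:
            failed.append(url)
            continue
        if not 200 <= status < 300:
            failed.append(url)
    return failed
```

A non-empty return value is the signal that the emitting service should start the rollback for that transaction.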

Example

The payment service needs to reserve stock and credits before subtracting them and completing the order.

  1. Payment receives the payment request
  2. Payment creates a payment entry with status INITIATED and creates an event PAYMENT_INITIATED with the order id and a transaction id (uuid4?) (not sure if we need the transaction id)
  3. User receives event PAYMENT_INITIATED and reserves credits for the transaction, emits CREDITS_RESERVED event for the same transaction id.
  4. Stock receives event CREDITS_RESERVED and reserves the stock for the transaction, emits the STOCK_RESERVED event for the same transaction id.
  5. Payment receives event STOCK_RESERVED and changes the payment status to RESERVED. Emits PAYMENT_RESERVED event.
  6. User receives PAYMENT_RESERVED and applies the reservation for the transaction, emits CREDITS_SUBTRACTED for the transaction.
  7. Stock receives event CREDITS_SUBTRACTED and applies the stock reservation. Emits STOCK_SUBTRACTED for the transaction.
  8. At this point the payment is complete.
  9. Payment receives the event STOCK_SUBTRACTED and updates the status of the payment to PAID.
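
The chain above can be simulated with an in-memory event bus. This is only a sketch: the `Bus` class stands in for the HTTP delivery between services, and each service is reduced to the event it emits in response, so no credits or stock are actually tracked.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for the HTTP event delivery between services."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every event published, in order

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, tx):
        self.log.append(event)
        for handler in self.handlers[event]:
            handler(tx)


def run_payment(bus, tx):
    status = {tx: "INITIATED"}  # step 2: the payment entry

    def on_stock_reserved(t):  # step 5
        status[t] = "RESERVED"
        bus.publish("PAYMENT_RESERVED", t)

    def on_stock_subtracted(t):  # step 9
        status[t] = "PAID"

    # User service (steps 3 and 6)
    bus.subscribe("PAYMENT_INITIATED", lambda t: bus.publish("CREDITS_RESERVED", t))
    bus.subscribe("PAYMENT_RESERVED", lambda t: bus.publish("CREDITS_SUBTRACTED", t))
    # Stock service (steps 4 and 7)
    bus.subscribe("CREDITS_RESERVED", lambda t: bus.publish("STOCK_RESERVED", t))
    bus.subscribe("CREDITS_SUBTRACTED", lambda t: bus.publish("STOCK_SUBTRACTED", t))
    # Payment service
    bus.subscribe("STOCK_RESERVED", on_stock_reserved)
    bus.subscribe("STOCK_SUBTRACTED", on_stock_subtracted)

    bus.publish("PAYMENT_INITIATED", tx)
    return status[tx]
```

Calling `run_payment(Bus(), "tx-1")` walks through all six events and leaves the payment in the PAID state.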

At any point there are also failed responses. For example:

  1. User Failure: sends INSUFFICIENT_CREDITS, which is received by the payment service and stops the transaction.
  2. Stock Failure: sends INSUFFICIENT_STOCK, which is received by the user service (which cancels its reservation) and the payment service (which returns failure on the transaction).
  3. User Failure: not sure how this could happen, but in that case both reservations should be removed using a FAILURE event for the transaction.
  4. Stock Failure: not sure how this could happen, but in that case the user service receives the event and credits the payment back using a FAILURE event for the transaction.
  5. Payment Failure: again not sure how this could happen, but in that case we return the stock and credit using a FAILURE event for the transaction.
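
As a rough sketch of case 2 above, each service can keep its own compensation handler keyed by transaction id. The class names, the `on_failure` method, and the `FAILED` status are hypothetical and only meant to show the shape of the rollback.

```python
class UserService:
    def __init__(self, credits):
        self.credits = credits
        self.reserved = {}  # transaction id -> reserved amount

    def reserve(self, tx, amount):
        if self.credits - sum(self.reserved.values()) < amount:
            return "INSUFFICIENT_CREDITS"
        self.reserved[tx] = amount
        return "CREDITS_RESERVED"

    def on_failure(self, tx):
        # Compensation: drop the reservation if there is one. Safe to call twice.
        self.reserved.pop(tx, None)


class PaymentService:
    def __init__(self):
        self.status = {}  # transaction id -> payment status

    def on_failure(self, tx):
        self.status[tx] = "FAILED"  # hypothetical status name


def handle_stock_result(event, tx, user, payment):
    """Route a stock-side failure to the services that already acted on `tx`."""
    if event in ("INSUFFICIENT_STOCK", "STOCK_FAILURE"):
        user.on_failure(tx)
        payment.on_failure(tx)
```

Because the handlers only look at the transaction id, receiving the same failure event twice leaves the state unchanged.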

The only problem I see here is when an event is emitted but not acknowledged (no 2xx status code): the transaction would halt forever. This should be solvable with some sort of at-least-once delivery logic that waits for a 2xx status code. That could in turn break if the emitting node fails or is replaced, which could be avoided by using a highly available message broker.

The only requirements for this message broker are that it is highly available, sends messages through their channels, and waits for a response. If no response or an error comes back, we send a general FAILURE event for that transaction, which should roll back the actions on the other systems. As a result, the system should stay consistent unless a machine actually shuts down unexpectedly.
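
The retry-then-FAILURE behaviour described here could look roughly like the sketch below. The function names, the `send` callable, and the default of three attempts are assumptions; a real version would also wait between attempts, and since at-least-once delivery can deliver the same event more than once, receivers would need to handle duplicates.

```python
from typing import Callable, Iterable

# (url, payload) -> HTTP status code; expected to raise TimeoutError on no answer.
Sender = Callable[[str, dict], int]


def deliver(url: str, payload: dict, send: Sender, attempts: int = 3) -> bool:
    """Try to deliver `payload` until a 2xx is returned or the attempts run out."""
    for _ in range(attempts):
        try:
            if 200 <= send(url, payload) < 300:
                return True
        except TimeoutError:
            pass  # treated the same as a non-2xx answer
    return False


def deliver_or_fail(url: str, payload: dict, participants: Iterable[str],
                    send: Sender, attempts: int = 3) -> bool:
    """Deliver an event; if that fails, send FAILURE to the other participants."""
    if deliver(url, payload, send, attempts):
        return True
    failure = {"event": "FAILURE", "transaction_id": payload["transaction_id"]}
    for participant in participants:
        deliver(participant, failure, send, attempts)
    return False
```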

The original request

An additional problem I see here is that, because of this chain of messages, we will need to keep the original request from the user to the payment service open until we receive either a failure or a STOCK_SUBTRACTED event.

If the payment service fails within this time, we will not be able to let the user know whether the payment failed or succeeded.
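
One way to picture "keeping the request open" is a per-transaction wait with a timeout, sketched below with `threading.Event`. The `PendingPayments` class and the `TIMEOUT` result are hypothetical. Note that this state lives in the memory of a single replica, so it shows the limitation described above rather than solving it: if that replica dies, or the final event is delivered to a different replica, the waiting request never learns the outcome.

```python
import threading


class PendingPayments:
    """Tracks open payment requests that wait for a terminal event."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}   # transaction id -> threading.Event
        self._results = {}  # transaction id -> final result

    def open(self, tx):
        with self._lock:
            self._events[tx] = threading.Event()

    def resolve(self, tx, result):
        """Called by the event handler for STOCK_SUBTRACTED or a failure event."""
        with self._lock:
            event = self._events.get(tx)
            if event is None:
                return  # nobody is waiting (anymore)
            self._results[tx] = result
        event.set()

    def wait(self, tx, timeout):
        """Block the request handler until the outcome is known or `timeout` passes."""
        event = self._events[tx]
        event.wait(timeout)
        with self._lock:
            self._events.pop(tx, None)
            return self._results.pop(tx, "TIMEOUT")
```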

@nbelzer
Collaborator Author

nbelzer commented May 15, 2020

While I discuss using the message broker above, it is optional and depends on our assumptions about the system:

If the emitting service checks whether the receiving service succeeds or fails (and emits a failure in case it fails), we can detect failures in other services and prevent these forever-halted transactions, assuming that two services will not fail at the same time. The message broker could provide an extra guarantee on top of this if we do not want to make that assumption.

@nbelzer
Collaborator Author

nbelzer commented May 19, 2020

A quick summary of the different types of events:

Payment Service

  • PAYMENT_INITIATED
  • PAYMENT_RESERVED
  • PAYMENT_FAILURE

User Service

  • CREDITS_RESERVED
  • CREDITS_SUBTRACTED
  • CREDITS_FAILURE

Stock Service

  • STOCK_RESERVED
  • STOCK_SUBTRACTED
  • STOCK_FAILURE
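
If the event names end up in code, one option is a single enum shared by the services, as sketched here. The `Event` enum and the `is_failure` helper are illustrative names, not something the thread settled on.

```python
from enum import Enum


class Event(str, Enum):
    # Payment service
    PAYMENT_INITIATED = "PAYMENT_INITIATED"
    PAYMENT_RESERVED = "PAYMENT_RESERVED"
    PAYMENT_FAILURE = "PAYMENT_FAILURE"
    # User service
    CREDITS_RESERVED = "CREDITS_RESERVED"
    CREDITS_SUBTRACTED = "CREDITS_SUBTRACTED"
    CREDITS_FAILURE = "CREDITS_FAILURE"
    # Stock service
    STOCK_RESERVED = "STOCK_RESERVED"
    STOCK_SUBTRACTED = "STOCK_SUBTRACTED"
    STOCK_FAILURE = "STOCK_FAILURE"


def is_failure(event: Event) -> bool:
    """True for the three *_FAILURE events, which should trigger a rollback."""
    return event.name.endswith("_FAILURE")
```

Parsing an incoming name with `Event(raw_name)` raises `ValueError` for anything not in the list, which catches typos in event names early.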

@plammerts
Collaborator

plammerts commented May 27, 2020

Great work Nick! I agree to take the choreography approach. However, I have some short notes/questions.

  • Transaction IDs are definitely necessary, as the event listeners at each service should know which transaction has to be rolled back.
  • You talked about reserving, such as the stock service that reserves the stock for the transaction. Are we just updating the database immediately, such as subtracting stock? And in case of a failure, do we just roll back the database by updating it again, so adding the subtracted stock back?
  • I became a bit confused about the message broker part, as this belongs to the orchestration approach of Saga. In the choreography approach, the services themselves should listen to each other in a chain instead of through a message broker. An example of this was provided in the first blog:
    1. Stock Service produces PRODUCT_OUT_OF_STOCK_EVENT;
    2. Both Order Service and Payment Service listen to the previous message:
      • Payment Service refunds the client
      • Order Service sets the order state as failed
  • And about your problem, 'when an event is emitted but not responded to': can't we just use a timeout for this?

@nbelzer
Collaborator Author

nbelzer commented May 29, 2020

@plammerts

You talked about reserving, such as the stock service that reserves the stock for the transaction. Are we just updating the database immediately, such as subtracting stock? And in case of a failure, do we just roll back the database by updating it again, so adding the subtracted stock back?

We would be adding an extra space (table) per 'thing' that can be reserved (credits or stock) that keeps track of reservations and when they are made (such that we can automatically release them again after some time). On our routes that show the amount of stock available you would return available = stock - reserved for a specific item.
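
A minimal in-memory sketch of that reservation idea follows. The `StockItem` class, its method names, and the 300-second default lifetime are assumptions; in the real service the reservations would be rows in the extra table.

```python
import time


class StockItem:
    def __init__(self, stock, ttl_seconds=300.0):
        self.stock = stock
        self.ttl = ttl_seconds
        self.reservations = {}  # transaction id -> (amount, created_at)

    def _expire(self, now):
        # Automatically release reservations that are older than the TTL.
        self.reservations = {
            tx: (amount, created)
            for tx, (amount, created) in self.reservations.items()
            if now - created < self.ttl
        }

    def available(self, now=None):
        """available = stock - reserved, ignoring expired reservations."""
        now = time.time() if now is None else now
        self._expire(now)
        return self.stock - sum(amount for amount, _ in self.reservations.values())

    def reserve(self, tx, amount, now=None):
        now = time.time() if now is None else now
        if self.available(now) < amount:
            return False
        self.reservations[tx] = (amount, now)
        return True

    def apply(self, tx):
        """Turn a reservation into an actual subtraction; KeyError if it is gone."""
        amount, _ = self.reservations.pop(tx)
        self.stock -= amount

    def release(self, tx):
        """Roll back a reservation without touching the stock."""
        self.reservations.pop(tx, None)
```

With this split, a rollback before the subtraction step only removes a reservation row and never has to add stock back.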

I became a bit confused about the message broker part, as this belongs to the orchestration approach of Saga. In the choreography approach, the services themselves should listen to each other in a chain instead of through a message broker.

Yeah this is where I started moving away from the pattern to solve some problems I was seeing. I guess the exact thing I describe above is a mix between the two types.

And about your problem, 'when an event is emitted but not responded to': can't we just use a timeout for this?

Yes that is exactly what I was thinking to do.
