Many micro-services have only a few dependencies, and it's pretty obvious that when the back-end's database connection has failed, the micro-service is completely down. Things aren't nearly so clear when the hierarchy of micro-services is more than one layer deep. I've added #51 in an effort to allow components to incorporate the top-level health response object as a check, but this hierarchy obviously also impacts how you calculate the top-level status. The failure of some components might only result in a top-level warn status. Or maybe a full-text indexer's failure wouldn't change a pass status, since you could conceivably catch up later (and operate without it until then).
This is even more obvious when you consider the recommended UI practice of graceful degradation. For instance, our user account management system relies on about 20 different back-end systems but the UI only outright fails for a few of them. In such cases, we need a way for the UI's healthcheck end-point to report on the impact that each degraded or failed component has on the UI's back-end. We'd propose something like:
Three important notes about the format above:
The impactId field is primarily for debugging and log analysis. While a random UUID is suggested, the important aspect of this ID is that it's a unique, constant string.
The impactDetail field is a human-readable description that details what is currently non-functional.
The recommendsStatus field is NOT the status returned by the component but is rather the status that will be used to calculate the top-level status. This is a subtle difference, but in many cases it allows the top-level status to be calculated as the most severe of the impacts' recommendsStatus values.
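The "most severe wins" rule described here can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: it uses the recommendsStatus key as it appears in the JSON example, and the function name and the severity ordering (pass < warn < fail) are assumptions.

```python
# Assumed severity ordering; the proposal only implies that "warn" outranks "pass".
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def top_level_status(impacts):
    """Return the most severe recommendsStatus among the impacts.

    An empty impacts list means nothing is degraded, so the result is "pass".
    An unknown status value raises KeyError rather than being silently ignored.
    """
    statuses = [impact["recommendsStatus"] for impact in impacts]
    return max(statuses, key=SEVERITY.__getitem__, default="pass")
```

Calculating the status from the recommendations, rather than from the raw component statuses, is what lets a failed but non-critical dependency surface as a warn at the top level.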
Using the account UI healthcheck as an example, a Kerberos outage and an SMS outage would produce the following impacts section:
...
"impacts": [
  {
    "impactId": "47619208-2556-41a4-a72c-801209b8ed9e",
    "checkKey": "kerberos:connection",
    "impactDetail": "The user will be unable to change their password",
    "recommendsStatus": "warn"
  },
  {
    "impactId": "85ad165d-9edf-4da5-8d95-93d299673680",
    "checkKey": "sms:connection",
    "impactDetail": "The user will be unable to perform self-service account recovery",
    "recommendsStatus": "warn"
  }
],
...
Receiving these impacts allows the UI to adopt a couple of very useful behaviors:
The UI can use the impactDetail information to tell the user precisely which functions are not available (as simple as putting a toast at the top of a screen).
The UI can use the unique, constant impactId value to conditionally disable or hide the control elements for those functions.
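As a rough sketch of those two behaviors, a client could turn the impacts list into user-facing notices plus a set of controls to disable. The control names and the mapping from impactId to controls are hypothetical; only the impact fields and the two example entries come from the example above.

```python
def degrade_ui(impacts, controls_by_impact):
    """Return (notices, disabled_controls) for the given impacts.

    notices: the impactDetail strings, suitable for a toast or banner.
    disabled_controls: control names looked up by the constant impactId.
    """
    notices = [impact["impactDetail"] for impact in impacts]
    disabled = set()
    for impact in impacts:
        # Impacts the UI does not know about simply disable nothing.
        disabled.update(controls_by_impact.get(impact["impactId"], ()))
    return notices, disabled


# Hypothetical mapping maintained by the UI team; the control names are made up.
CONTROLS_BY_IMPACT = {
    "47619208-2556-41a4-a72c-801209b8ed9e": ["change-password-form"],
    "85ad165d-9edf-4da5-8d95-93d299673680": ["account-recovery-link"],
}

impacts = [
    {
        "impactId": "47619208-2556-41a4-a72c-801209b8ed9e",
        "checkKey": "kerberos:connection",
        "impactDetail": "The user will be unable to change their password",
        "recommendsStatus": "warn",
    },
    {
        "impactId": "85ad165d-9edf-4da5-8d95-93d299673680",
        "checkKey": "sms:connection",
        "impactDetail": "The user will be unable to perform self-service account recovery",
        "recommendsStatus": "warn",
    },
]

notices, disabled = degrade_ui(impacts, CONTROLS_BY_IMPACT)
```

Because the impactId is constant, the mapping can be hard-coded in the UI and keeps working even if the wording of impactDetail changes.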
Two more benefits of this format are:
The top-level status field can be calculated as warn based on the severity of the two underlying failures.
A human looking at the health response object can determine which checks contributed to the calculation of the top-level status.
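The second benefit can also be automated. The sketch below assumes a health response with a top-level status field and an impacts list shaped like the example above; the function name is hypothetical.

```python
def contributing_checks(health_response):
    """List the checkKeys whose recommendsStatus equals the top-level status."""
    status = health_response["status"]
    return [
        impact["checkKey"]
        for impact in health_response.get("impacts", [])
        if impact["recommendsStatus"] == status
    ]


response = {
    "status": "warn",
    "impacts": [
        {"checkKey": "kerberos:connection", "recommendsStatus": "warn"},
        {"checkKey": "sms:connection", "recommendsStatus": "warn"},
    ],
}
culprits = contributing_checks(response)
```

For the Kerberos and SMS outage, this returns both check keys, since each one recommends the warn status that the top level reports.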