monitoring guidelines #13

pvdbosch · 2020-01-06T14:46:42Z

In GitLab by @wsalembi on Feb 23, 2018, 09:04

Because of the low priority, monitoring.adoc is removed from the styleguide at the moment, but still exists in the git repo.

eHealth: HealthCheck

Monitoring returns the XML monitoring message in order to be interpretable by foglight. This is legacy…
Should become JSON

CBSS:

most of monitoring section contains checks internal to the organization which shouldn't be exposed in the public API. Priority should be on guidelines which allow external consumers of an API to check health.
Public health checks should not expose internals of applications, and should be lightweight (e.g. HTTP get, with health status cached server-side for a little while)
might be a good idea to offer a single health check API and web page per organization, returning the health of all it's APIs

pvdbosch · 2020-01-06T14:46:46Z

In GitLab by @pvdbosch on Apr 24, 2018, 20:25

assigned to @wsalembi

pvdbosch · 2020-01-06T14:46:49Z

In GitLab by @wsalembi on Jun 28, 2018, 09:15

swagger: '2.0'
info:
  title: Demo Service with healthCheck
  description: healthCheck allows to retrieve the status of the API. 
      Also see https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#production-ready
  version: '1.0'
basePath: /demo/v1
tags:
  - name: Monitoring
schemes:
  - https
paths:
  
  /health:
    get:
      tags:
        - Monitoring
      summary: Check health of this REST API
      description: This resource is only available to supervision users
      operationId: healthCheck
      produces:
        - application/json
      responses:
        '200':
          description: The service is UP
          schema:
            $ref: '#/definitions/HealthStatus'
        '503':
          description: The service is DOWN or OUT_OF_SERVICE
          schema:
            $ref: '#/definitions/HealthStatus'
          examples: {
            'application/json': {
            'status': 'DOWN'
            }
          }
          
 
definitions:
  HealthStatus:
    description: Response message for the API health 
    type: object
    required:
      - status
    properties:
      status:
        description: Status from https://github.com/spring-projects/spring-boot/blob/v2.0.3.RELEASE/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/health/Status.java
        type: string
        enum:
        - UP
        - DOWN
        - UNKNOWN
        - OUT_OF_SERVICE      
      details:
        type: object
        additionalProperties:
          $ref: '#/definitions/HealthStatus'

pvdbosch · 2020-01-06T14:46:52Z

In GitLab by @wsalembi on Jun 28, 2018, 10:49

After review

swagger: '2.0'
info:
  title: Demo Service with healthCheck
  description: healthCheck allows to retrieve the status of the API. 
      Also see https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#production-ready
  version: '1.0'
basePath: /demo/v1
tags:
  - name: Monitoring
schemes:
  - https
paths:
  
  /health:
    get:
      tags:
        - Monitoring
      summary: Check health of this REST API
      description: This resource is only available to supervision users
      operationId: health
      produces:
        - application/json
      responses:
        '200':
          description: The service is UP
          schema:
            $ref: '#/definitions/HealthStatus'
        '503':
          description: The service is DOWN or OUT_OF_SERVICE
          schema:
            $ref: '#/definitions/HealthStatus'
          examples: {
            'application/json': {
            'status': 'DOWN',
            'details': {
              'datastore': {
                'status': 'OUT_OF_SERVICE'
              }
            }
            }
          }
          
 
definitions:
  HealthStatus:
    description: Response message for the API health 
    type: object
    required:
      - status
    properties:
      status:
        $ref: '#/definitions/HealthLevel'
      details:
        type: object
        additionalProperties:
          $ref: '#/definitions/ComponentStatus'
  ComponentStatus:
    description: Status of an API component 
    type: object
    required:
      - status
    properties:
      status:
        $ref: '#/definitions/HealthLevel'
  HealthLevel:
    description: Status from https://github.com/spring-projects/spring-boot/blob/v2.0.3.RELEASE/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/health/Status.java
    type: string
    enum:
      - UP
      - DEGRADED
      - DOWN
      - UNKNOWN
      - OUT_OF_SERVICE

pvdbosch · 2020-01-06T14:46:55Z

In GitLab by @pvdbosch on Dec 21, 2018, 12:17

mentioned in December WG meeting:

enum values to be converted to lowerCamelCase
component level details are typically hidden from clients outside the organization for security reasons (can be done using scopes)
when running on OpenShift/k8s, this check cannot be used as a readinessProbe, because when status isn't up, it would make the health check itself unreachable for external clients

pvdbosch · 2020-01-06T14:46:59Z

In GitLab by @pvdbosch on Jan 8, 2019, 10:40

Added schema (with lowerCamelCase enum values) to common-v1beta in f2f3842

pvdbosch · 2020-01-06T14:47:02Z

In GitLab by @pvdbosch on Jan 21, 2019, 15:41

mentioned in merge request !3

pvdbosch · 2020-01-06T14:47:05Z

In GitLab by @pvdbosch on Jan 22, 2019, 10:48

Discussed on WG of January:

Http status code for failing health checks: 200 or 503?

503:
- some basic monitoring could be done on http status code only
- 503 is also used in (spring boot)https://docs.spring.io/spring-boot/docs/2.1.2.RELEASE/reference/htmlsingle/#_writing_custom_healthindicators and (microprofile health)https://github.com/eclipse/microprofile-health/blob/master/spec/src/main/asciidoc/protocol-wireformat.adoc#response-codes-and-status-mappings
200: makes /health behave consistent to other REST resources:
- the GET on the /health resource itself is OK (status 200), even if status is down
- On 4xx and 5xx codes, we usually return a Problem json instead of the normal resource representation

Conclusion: use status code 503. For 'degraded' status, use 200.

'unknown' status can be removed, a common use case can't be found to use it

Steve mentioned VAS uses variants of the /health resource for OpenShift, e.g.:

/health?type=healthCheck
/health?type=readiness

pvdbosch · 2020-01-06T14:47:08Z

In GitLab by @wsalembi on May 10, 2019, 09:23

mentioned in commit 5dfd323

pvdbosch · 2020-01-06T14:47:11Z

In GitLab by @wsalembi on May 10, 2019, 09:35

mentioned in commit 27fe765

pvdbosch · 2020-01-06T14:47:14Z

In GitLab by @wsalembi on May 10, 2019, 09:36

mentioned in merge request !4

pvdbosch · 2020-01-06T14:47:17Z

In GitLab by @wsalembi on May 13, 2019, 09:29

mentioned in commit 537e010

pvdbosch · 2020-01-06T14:47:20Z

In GitLab by @wsalembi on May 13, 2019, 10:21

mentioned in commit c328943

pvdbosch · 2020-01-06T14:47:23Z

In GitLab by @pvdbosch on Jun 18, 2019, 21:51

How the microprofile health standard compares:

status UP and DOWN, no support for DEGRADED or other values, same http codes
/health/live and /health/ready for kubernetes liveness and readiness checks, /health for all checks
individual checks in a "checks" array property, with "name" property as identifier.
- each individual check can have a "data" object with key/value properties. Values are simple types only.
spec recommends implementations to provide reasonable default health check procedures
adopted in recent releases by JBoss/Wildfly, WebSphere, Quarkus, ...

Spring Boot actuator health:

status DOWN, UP, UNKNOWN or OUT_OF_SERVICE, same http codes
readiness or liveness probes requires some custom configuration using health groups
detailed health information "components" JSON object, with property as identifier.
- individual component can also be consulted as subresource, e.g. 'http://localhost:8080/actuator/health/db'
- a component can have a 'details' JSON Object with addtional free-form info
supports quite some builtin healthchecks for common components

I'd like to discuss our options:

go with our own format like in the pull request
adopt one of microprofile or spring boot standards, gaining reuse of 3rd party libs and server support but losing some customization possibilities
only adopt the least common denonimator of spring boot and microprofile: a /health endpoint with status DOWN or UP and status codes
- the rest of response is up to internal organization guidelines; internal details should be hidden often anyway from external clients
... other ideas?

pvdbosch · 2020-01-06T14:47:26Z

In GitLab by @wsalembi on Dec 11, 2019, 10:21

Logging at the specs, I think we should just formalize the common parts. Top-level is UP (200) or DOWN (503).
Then the service provider has the choice to either implement it with microprofile, spring or custom code

pvdbosch · 2020-01-07T13:32:01Z

I agree with @wsalembi , adoption of only the common parts in gCloud REST guide.

bertvannuffelen · 2020-01-22T14:38:45Z

A draft of an upcomming agreement:
https://tools.ietf.org/id/draft-inadarei-api-health-check-01.html

pvdbosch · 2020-01-22T14:43:24Z

latest working version: https://inadarei.github.io/rfc-healthcheck/

We'll check if this RFC has some (planned) support in implementations. If not, we'll use the microprofile Health/spring-boot based one.

pvdbosch · 2020-01-29T16:55:38Z

About the RFC, I found following related github issues

alignment with spring-boot: Consider supporting health check RFC spring-projects/spring-boot#11706
alignment with microprofile:

So Spring Boot rejected it because lack of apparent traction on the RFC. Issues in microprofile are still open, but discussion focuses more on the health check details (not the global state), which we were thinking to not regulate in the styleguide. The global state ("status": "UP" / "DOWN") is considered compliant for the RFC, because the RFC allows it as "acceptable aliases".

The RFC is quite lenient in order to be compatible with Spring Boot and microprofile, but this makes it weaker as a standard IMO. For instance, multiple status value aliases are accepted and they are case-insensitive, and for http status codes only 2xx-3xx (pass) or 4xx-5xx (fail) are mentioned.

pvdbosch · 2020-03-25T14:23:49Z

decision on WG: continue based on common parts of the microprofile Health/spring-boot standards

pvdbosch · 2020-04-16T13:49:28Z

FYI, Spring Boot 2.3 will add specific support for liveness and readiness health checks: https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot

pvdbosch · 2020-05-27T13:46:59Z

degraded/warn status is being discussed in microprofile health: eclipse/microprofile-health#130

derekm · 2020-07-21T21:03:54Z

Hi! I was happy to stumble across this thread about monitoring guidelines for REST standards within Belgium!

Everyone is still tending to hand roll their own Health systems, and common elements beyond HTTP status codes and liveness/readiness groups (thanks, k8s!) have been hard to come by or slow to emerge.

There is some momentum to standardize Health check responses, and I think it is much needed, so I would be happy to join this conversation from the United States!

I think the RFC could be important, if people can have their interoperability/alignment conversations there to get appropriate adjustments made for a more viable standard.

pvdbosch · 2020-09-08T16:59:18Z

Hi @derekm , we're seeing three use cases requiring different types of health checks:

for use by k8s liveness (is the pod healthy) and readiness probes (should traffic be routed to the application)
report to a monitoring system if the application is alive and functioning properly
- if down or degraded, it can still be useful to report UP to the k8s readiness probe, so the application can return a user-friendly error message and provide proper fallback behavior
- includes information about the availability of any external services required by the application, taking their impact on the application's global health status into account e.g. the application may provide fallback behavior when a dependency is down
make a service's health status available to its users, e.g. to integrate into their own monitoring system. Requirements are similar to 2), but internal details should be hidden in this case.

Our REST guide mainly focuses on the third case.

pvdbosch · 2020-09-08T17:14:52Z

I updated the PR #51 a bit:

mentioned that status may be extended with other values than the three standardized ones
explained the bit about component-level details being not included in the standard (because internal) and removed to "scope:health" because that's not in the Swagger anymore and internal to an organization

The PR for the Swagger/OpenAPI files is in belgif/openapi-common#2

If OK on WG tomorrow, we can merge.

#13 health check guidelines Co-authored-by: wisa <[email protected]> Co-authored-by: wsalembi <[email protected]>

pvdbosch · 2020-09-09T15:31:01Z

both PRs are merged. Will be published in next version of guide; probably by end or this week or beginning of next week

smals-chlo · 2020-09-14T07:50:39Z

Hello @pvdbosch ,
Thank you for publishing the healthcheck guidelines.
However, it was pointed out by Wouter Roose that this link is broken:
common-v1.yaml (https://github.com/belgif/openapi-common/blob/master/src/main/swagger-2.0/common/v1/common-v1.yaml)
Can you please adapt?

pvdbosch · 2020-09-14T08:14:56Z

Hi @smals-chlo, thanks for reporting. The links for all swagger files were broken because the directories "swagger-2.0" were renamed to "swagger" (#27). I've updated and published the rest guide to fix this.

smals-chlo · 2020-09-17T14:13:08Z

Hi @pvdbosch,
During the Rest group today, we discovered that 500 as response code is not documented in the guidelines for "18. Health".
It is present in the common-v1.yaml.
Can you also add it?

    "500":
      description: The service cannot generate a health response.
      schema:
        $ref: "../../problem/v1/problem-v1.yaml#/definitions/Problem"

pvdbosch · 2020-09-18T14:37:58Z

I think we can generalize in a broader scope than Health: any API operation can always fail unexpectedly and return 500 response. Not being able to generate a health response is just a specific occurrence of this.

And then we can discuss if such unexpected failures should or shouldn't be represented in OpenAPI/Swagger => created #67 on this.

pvdbosch added the needs discussion label Jan 6, 2020

pvdbosch assigned wsalembi Jan 6, 2020

This was referenced Jan 6, 2020

gCloud schemas for common business data #26

Closed

Documentation health resource #50

Closed

Completes issue #13 health check #51

Merged

pvdbosch added this to the in progress milestone Jan 9, 2020

pvdbosch added needs work and removed needs discussion labels Apr 16, 2020

pvdbosch pushed a commit that referenced this issue Aug 3, 2020

#13 integrate 200/503 status codes

fad9e7a

pvdbosch pushed a commit that referenced this issue Aug 3, 2020

#13 integrate 200/503 status codes, correction introduction

4059fc9

pvdbosch pushed a commit that referenced this issue Aug 3, 2020

#13 review health page

e2a2b63

pvdbosch pushed a commit that referenced this issue Sep 8, 2020

#13 integrate 200/503 status codes

ca6cee5

pvdbosch pushed a commit that referenced this issue Sep 8, 2020

#13 integrate 200/503 status codes, correction introduction

fda19c4

pvdbosch pushed a commit that referenced this issue Sep 8, 2020

#13 review health page

ed6a82e

pvdbosch added a commit that referenced this issue Sep 9, 2020

Completes issue #13 health check (#51)

ac31a1b

#13 health check guidelines Co-authored-by: wisa <[email protected]> Co-authored-by: wsalembi <[email protected]>

pvdbosch closed this as completed Sep 9, 2020

pvdbosch mentioned this issue May 15, 2023

Standardize /health?type parameter #128

Closed

pvdbosch pushed a commit that referenced this issue May 28, 2024

#13 integrate 200/503 status codes

107a9f7

pvdbosch pushed a commit that referenced this issue May 28, 2024

#13 integrate 200/503 status codes

33318a7

pvdbosch pushed a commit that referenced this issue May 28, 2024

#13 integrate 200/503 status codes, correction introduction

00494de

pvdbosch pushed a commit that referenced this issue May 28, 2024

#13 review health page

895a052

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

monitoring guidelines #13

monitoring guidelines #13

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 7, 2020

bertvannuffelen commented Jan 22, 2020

pvdbosch commented Jan 22, 2020

pvdbosch commented Jan 29, 2020 •

edited

Loading

pvdbosch commented Mar 25, 2020

pvdbosch commented Apr 16, 2020

pvdbosch commented May 27, 2020

derekm commented Jul 21, 2020

pvdbosch commented Sep 8, 2020

pvdbosch commented Sep 8, 2020

pvdbosch commented Sep 9, 2020

smals-chlo commented Sep 14, 2020

pvdbosch commented Sep 14, 2020

smals-chlo commented Sep 17, 2020

pvdbosch commented Sep 18, 2020

monitoring guidelines #13

monitoring guidelines #13

Comments

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 6, 2020

pvdbosch commented Jan 7, 2020

bertvannuffelen commented Jan 22, 2020

pvdbosch commented Jan 22, 2020

pvdbosch commented Jan 29, 2020 • edited Loading

pvdbosch commented Mar 25, 2020

pvdbosch commented Apr 16, 2020

pvdbosch commented May 27, 2020

derekm commented Jul 21, 2020

pvdbosch commented Sep 8, 2020

pvdbosch commented Sep 8, 2020

pvdbosch commented Sep 9, 2020

smals-chlo commented Sep 14, 2020

pvdbosch commented Sep 14, 2020

smals-chlo commented Sep 17, 2020

pvdbosch commented Sep 18, 2020

pvdbosch commented Jan 29, 2020 •

edited

Loading