Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monitoring guidelines #13

Closed
pvdbosch opened this issue Jan 6, 2020 · 29 comments
Closed

monitoring guidelines #13

pvdbosch opened this issue Jan 6, 2020 · 29 comments
Assignees
Milestone

Comments

@pvdbosch
Copy link
Contributor

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on Feb 23, 2018, 09:04

Because of the low priority, monitoring.adoc is removed from the styleguide at the moment, but still exists in the git repo.

eHealth: HealthCheck

  • Monitoring returns the XML monitoring message in order to be interpretable by foglight. This is legacy…
  • Should become JSON

CBSS:

  • most of monitoring section contains checks internal to the organization which shouldn't be exposed in the public API. Priority should be on guidelines which allow external consumers of an API to check health.
  • Public health checks should not expose internals of applications, and should be lightweight (e.g. HTTP get, with health status cached server-side for a little while)
  • might be a good idea to offer a single health check API and web page per organization, returning the health of all it's APIs
@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @pvdbosch on Apr 24, 2018, 20:25

assigned to @wsalembi

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on Jun 28, 2018, 09:15

swagger: '2.0'
info:
  title: Demo Service with healthCheck
  description: healthCheck allows to retrieve the status of the API. 
      Also see https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#production-ready
  version: '1.0'
basePath: /demo/v1
tags:
  - name: Monitoring
schemes:
  - https
paths:
  
  /health:
    get:
      tags:
        - Monitoring
      summary: Check health of this REST API
      description: This resource is only available to supervision users
      operationId: healthCheck
      produces:
        - application/json
      responses:
        '200':
          description: The service is UP
          schema:
            $ref: '#/definitions/HealthStatus'
        '503':
          description: The service is DOWN or OUT_OF_SERVICE
          schema:
            $ref: '#/definitions/HealthStatus'
          examples: {
            'application/json': {
            'status': 'DOWN'
            }
          }
          
 
definitions:
  HealthStatus:
    description: Response message for the API health 
    type: object
    required:
      - status
    properties:
      status:
        description: Status from https://github.com/spring-projects/spring-boot/blob/v2.0.3.RELEASE/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/health/Status.java
        type: string
        enum:
        - UP
        - DOWN
        - UNKNOWN
        - OUT_OF_SERVICE      
      details:
        type: object
        additionalProperties:
          $ref: '#/definitions/HealthStatus'

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on Jun 28, 2018, 10:49

After review

swagger: '2.0'
info:
  title: Demo Service with healthCheck
  description: healthCheck allows to retrieve the status of the API. 
      Also see https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#production-ready
  version: '1.0'
basePath: /demo/v1
tags:
  - name: Monitoring
schemes:
  - https
paths:
  
  /health:
    get:
      tags:
        - Monitoring
      summary: Check health of this REST API
      description: This resource is only available to supervision users
      operationId: health
      produces:
        - application/json
      responses:
        '200':
          description: The service is UP
          schema:
            $ref: '#/definitions/HealthStatus'
        '503':
          description: The service is DOWN or OUT_OF_SERVICE
          schema:
            $ref: '#/definitions/HealthStatus'
          examples: {
            'application/json': {
            'status': 'DOWN',
            'details': {
              'datastore': {
                'status': 'OUT_OF_SERVICE'
              }
            }
            }
          }
          
 
definitions:
  HealthStatus:
    description: Response message for the API health 
    type: object
    required:
      - status
    properties:
      status:
        $ref: '#/definitions/HealthLevel'
      details:
        type: object
        additionalProperties:
          $ref: '#/definitions/ComponentStatus'
  ComponentStatus:
    description: Status of an API component 
    type: object
    required:
      - status
    properties:
      status:
        $ref: '#/definitions/HealthLevel'
  HealthLevel:
    description: Status from https://github.com/spring-projects/spring-boot/blob/v2.0.3.RELEASE/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/health/Status.java
    type: string
    enum:
      - UP
      - DEGRADED
      - DOWN
      - UNKNOWN
      - OUT_OF_SERVICE      

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @pvdbosch on Dec 21, 2018, 12:17

mentioned in December WG meeting:

  • enum values to be converted to lowerCamelCase
  • component level details are typically hidden from clients outside the organization for security reasons (can be done using scopes)
  • when running on OpenShift/k8s, this check cannot be used as a readinessProbe, because when status isn't up, it would make the health check itself unreachable for external clients

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @pvdbosch on Jan 8, 2019, 10:40

Added schema (with lowerCamelCase enum values) to common-v1beta in f2f3842

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @pvdbosch on Jan 21, 2019, 15:41

mentioned in merge request !3

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @pvdbosch on Jan 22, 2019, 10:48

Discussed on WG of January:

Http status code for failing health checks: 200 or 503?

Conclusion: use status code 503. For 'degraded' status, use 200.

'unknown' status can be removed, a common use case can't be found to use it

Steve mentioned VAS uses variants of the /health resource for OpenShift, e.g.:

  • /health?type=healthCheck
  • /health?type=readiness

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on May 10, 2019, 09:23

mentioned in commit 5dfd323

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on May 10, 2019, 09:35

mentioned in commit 27fe765

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on May 10, 2019, 09:36

mentioned in merge request !4

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on May 13, 2019, 09:29

mentioned in commit 537e010

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on May 13, 2019, 10:21

mentioned in commit c328943

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @pvdbosch on Jun 18, 2019, 21:51

How the microprofile health standard compares:

  • status UP and DOWN, no support for DEGRADED or other values, same http codes
  • /health/live and /health/ready for kubernetes liveness and readiness checks, /health for all checks
  • individual checks in a "checks" array property, with "name" property as identifier.
    • each individual check can have a "data" object with key/value properties. Values are simple types only.
  • spec recommends implementations to provide reasonable default health check procedures
  • adopted in recent releases by JBoss/Wildfly, WebSphere, Quarkus, ...

Spring Boot actuator health:

  • status DOWN, UP, UNKNOWN or OUT_OF_SERVICE, same http codes
  • readiness or liveness probes requires some custom configuration using health groups
  • detailed health information "components" JSON object, with property as identifier.
  • supports quite some builtin healthchecks for common components

I'd like to discuss our options:

  • go with our own format like in the pull request
  • adopt one of microprofile or spring boot standards, gaining reuse of 3rd party libs and server support but losing some customization possibilities
  • only adopt the least common denonimator of spring boot and microprofile: a /health endpoint with status DOWN or UP and status codes
    • the rest of response is up to internal organization guidelines; internal details should be hidden often anyway from external clients
  • ... other ideas?

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 6, 2020

In GitLab by @wsalembi on Dec 11, 2019, 10:21

Logging at the specs, I think we should just formalize the common parts. Top-level is UP (200) or DOWN (503).
Then the service provider has the choice to either implement it with microprofile, spring or custom code

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 7, 2020

I agree with @wsalembi , adoption of only the common parts in gCloud REST guide.

@pvdbosch pvdbosch added this to the in progress milestone Jan 9, 2020
@bertvannuffelen
Copy link

A draft of an upcomming agreement:
https://tools.ietf.org/id/draft-inadarei-api-health-check-01.html

@pvdbosch
Copy link
Contributor Author

latest working version: https://inadarei.github.io/rfc-healthcheck/

We'll check if this RFC has some (planned) support in implementations. If not, we'll use the microprofile Health/spring-boot based one.

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Jan 29, 2020

About the RFC, I found following related github issues

So Spring Boot rejected it because lack of apparent traction on the RFC. Issues in microprofile are still open, but discussion focuses more on the health check details (not the global state), which we were thinking to not regulate in the styleguide. The global state ("status": "UP" / "DOWN") is considered compliant for the RFC, because the RFC allows it as "acceptable aliases".

The RFC is quite lenient in order to be compatible with Spring Boot and microprofile, but this makes it weaker as a standard IMO. For instance, multiple status value aliases are accepted and they are case-insensitive, and for http status codes only 2xx-3xx (pass) or 4xx-5xx (fail) are mentioned.

@pvdbosch
Copy link
Contributor Author

decision on WG: continue based on common parts of the microprofile Health/spring-boot standards

@pvdbosch
Copy link
Contributor Author

FYI, Spring Boot 2.3 will add specific support for liveness and readiness health checks: https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot

@pvdbosch
Copy link
Contributor Author

degraded/warn status is being discussed in microprofile health: eclipse/microprofile-health#130

@derekm
Copy link

derekm commented Jul 21, 2020

Hi! I was happy to stumble across this thread about monitoring guidelines for REST standards within Belgium!

Everyone is still tending to hand roll their own Health systems, and common elements beyond HTTP status codes and liveness/readiness groups (thanks, k8s!) have been hard to come by or slow to emerge.

There is some momentum to standardize Health check responses, and I think it is much needed, so I would be happy to join this conversation from the United States!

I think the RFC could be important, if people can have their interoperability/alignment conversations there to get appropriate adjustments made for a more viable standard.

pvdbosch pushed a commit that referenced this issue Aug 3, 2020
pvdbosch pushed a commit that referenced this issue Aug 3, 2020
pvdbosch pushed a commit that referenced this issue Sep 8, 2020
pvdbosch pushed a commit that referenced this issue Sep 8, 2020
@pvdbosch
Copy link
Contributor Author

pvdbosch commented Sep 8, 2020

Hi @derekm , we're seeing three use cases requiring different types of health checks:

  1. for use by k8s liveness (is the pod healthy) and readiness probes (should traffic be routed to the application)
  2. report to a monitoring system if the application is alive and functioning properly
    • if down or degraded, it can still be useful to report UP to the k8s readiness probe, so the application can return a user-friendly error message and provide proper fallback behavior
    • includes information about the availability of any external services required by the application, taking their impact on the application's global health status into account e.g. the application may provide fallback behavior when a dependency is down
  3. make a service's health status available to its users, e.g. to integrate into their own monitoring system. Requirements are similar to 2), but internal details should be hidden in this case.

Our REST guide mainly focuses on the third case.

@pvdbosch
Copy link
Contributor Author

pvdbosch commented Sep 8, 2020

I updated the PR #51 a bit:

  • mentioned that status may be extended with other values than the three standardized ones
  • explained the bit about component-level details being not included in the standard (because internal) and removed to "scope:health" because that's not in the Swagger anymore and internal to an organization

The PR for the Swagger/OpenAPI files is in belgif/openapi-common#2

If OK on WG tomorrow, we can merge.

pvdbosch added a commit that referenced this issue Sep 9, 2020
#13 health check guidelines

Co-authored-by: wisa <[email protected]>
Co-authored-by: wsalembi <[email protected]>
@pvdbosch
Copy link
Contributor Author

pvdbosch commented Sep 9, 2020

both PRs are merged. Will be published in next version of guide; probably by end or this week or beginning of next week

@pvdbosch pvdbosch closed this as completed Sep 9, 2020
@smals-chlo
Copy link
Contributor

Hello @pvdbosch ,
Thank you for publishing the healthcheck guidelines.
However, it was pointed out by Wouter Roose that this link is broken:
common-v1.yaml (https://github.com/belgif/openapi-common/blob/master/src/main/swagger-2.0/common/v1/common-v1.yaml)
Can you please adapt?

@pvdbosch
Copy link
Contributor Author

Hi @smals-chlo, thanks for reporting. The links for all swagger files were broken because the directories "swagger-2.0" were renamed to "swagger" (#27). I've updated and published the rest guide to fix this.

@smals-chlo
Copy link
Contributor

Hi @pvdbosch,
During the Rest group today, we discovered that 500 as response code is not documented in the guidelines for "18. Health".
It is present in the common-v1.yaml.
Can you also add it?

    "500":
      description: The service cannot generate a health response.
      schema:
        $ref: "../../problem/v1/problem-v1.yaml#/definitions/Problem"

@pvdbosch
Copy link
Contributor Author

I think we can generalize in a broader scope than Health: any API operation can always fail unexpectedly and return 500 response. Not being able to generate a health response is just a specific occurrence of this.

And then we can discuss if such unexpected failures should or shouldn't be represented in OpenAPI/Swagger => created #67 on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants