Database connection error handling #268

georg-schwarz · 2020-12-03T08:36:42Z

How will error handling work in the services if connections to the db break?
Should they be able to use a subscribeToError function to perform custom error handling?
Or do we want to handle that synchronously when requesting queries? Should we catch that specific case and throw WebExceptions 500 "Database currently not reachable" or something similar?

Originally posted by @georg-schwarz in jvalue/node-dry-pg#2 (comment)

georg-schwarz · 2020-12-03T08:37:33Z

moved discussion here @sonallux

sonallux · 2020-12-03T10:43:52Z

Anything related to connection losses on idle connections in the connection pool should be resolved with jvalue/node-dry-pg#2. When performing a query, the connection pool will automatically try to establish a new connection if there is no active connection in the pool. Therefore those errors should not concern the services as the node-dry-pg library is handling those errors silently.

The query method returns a Promise which resolves with the query result or rejects with an error if performing the query failed (database error or connection problems). Therefore we must handle those errors synchronously, i.e. where the query is requested.

First, we must distinguish between database schema violation errors (e.g. client errors) and network-related errors. I am going to focus only on the network-related errors for now.
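To make that distinction concrete, here is a minimal sketch of how a rejected query error could be classified. It assumes the error carries a string `code` property, which is the case for node-postgres database errors (a five-character SQLSTATE code) and for Node.js socket errors (e.g. `ECONNREFUSED`). The function name, the `ErrorKind` type and the selection of codes are my own and not part of node-dry-pg.

```typescript
// Sketch only: classify an error from a rejected query Promise.
// Assumption: the error exposes a string `code` (SQLSTATE or Node.js system error code).
type ErrorKind = 'network' | 'client' | 'unknown';

// Node.js system error codes that typically indicate connectivity problems.
const NETWORK_ERROR_CODES = new Set([
  'ECONNREFUSED',
  'ECONNRESET',
  'ETIMEDOUT',
  'EPIPE',
  'ENOTFOUND',
]);

function classifyQueryError(error: unknown): ErrorKind {
  const code = (error as { code?: unknown } | null | undefined)?.code;
  if (typeof code !== 'string') {
    return 'unknown';
  }
  if (NETWORK_ERROR_CODES.has(code)) {
    return 'network';
  }
  // SQLSTATE class 08 is "connection exception".
  if (code.startsWith('08')) {
    return 'network';
  }
  // SQLSTATE class 22 is "data exception", class 23 is "integrity constraint violation".
  if (code.startsWith('22') || code.startsWith('23')) {
    return 'client';
  }
  return 'unknown';
}
```

Anything classified as `unknown` would probably be treated like a server-side failure until we know better.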

Further, we should also distinguish between the origins of the database queries because each of them will need a different error handling. I have currently identified these three origins:

Service startup (table initialization)

Here we do have two options:

wait and retry: We are currently doing this with a limited number of retries. After that, we are following the second option and exiting the microservice.
fail-fast: In this case, one would immediately exit the service and let the container orchestrator (e.g. Kubernetes) handle the error.
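Both options can be expressed with one helper, sketched below under my own assumptions: `initWithRetry`, `maxRetries` and `backoffMs` are hypothetical names, not the ones used in the services. With `maxRetries` set to 0 the first failure is rethrown immediately, which is the fail-fast variant.

```typescript
// Sketch only: run a startup task with a bounded number of retries.
// maxRetries = 0 means fail-fast: the first error is rethrown to the caller.
async function initWithRetry<T>(
  task: () => Promise<T>,
  maxRetries: number,
  backoffMs: number,
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxRetries) {
        // Retries exhausted: let the caller decide how to exit.
        throw error;
      }
      attempt++;
      // Wait before the next attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```

The caller would log the final error and exit with a non-zero code, so the container orchestrator can take over from there.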

The wait and retry option is convenient in development. But if the ODS should ever run in production, I would move to the fail-fast option, because the database initialization is only needed on the very first start of the database. On all other service startups executing the database initialization is unnecessary and can break things if it is not idempotent.

When running in production, database migrations are also a scenario that will arise at some point. For me, database initialization and database migration are actually very similar and ideally should be handled similarly. But as database migrations are a complex topic on their own, they should be handled separately when they are needed.

REST request

For me, the only option is to return a 5XX error. This is already done, as the error of the rejected query Promise just bubbles up until it is caught by the default express error handler, which returns a 500 response (see #247).
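If we ever want something more specific than the default handler, a custom error-handling middleware could look roughly like the sketch below. It uses the four-argument signature of express error-handling middleware, but with a small structural `ResponseLike` type standing in for the express types so the example is self-contained. Answering with 503 for connection errors and the list of error codes are assumptions on my side; the current behaviour is a plain 500.

```typescript
// Sketch only: map database connection errors to a 5XX response.
// ResponseLike stands in for the express Response type (assumption for self-containment).
interface ResponseLike {
  status(code: number): ResponseLike;
  json(body: unknown): ResponseLike;
}

// Node.js system error codes treated as "database not reachable" (my selection).
const CONNECTION_ERROR_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT']);

function databaseErrorHandler(
  err: unknown,
  _req: unknown,
  res: ResponseLike,
  _next: (err?: unknown) => void,
): void {
  const code = (err as { code?: unknown } | null | undefined)?.code;
  if (typeof code === 'string' && CONNECTION_ERROR_CODES.has(code)) {
    // The database is unreachable: signal a temporary server-side problem.
    res.status(503).json({ error: 'Database currently not reachable' });
    return;
  }
  // Everything else stays a generic server error.
  res.status(500).json({ error: 'Internal server error' });
}
```

In an express app this would be registered after all routes via `app.use(databaseErrorHandler)`.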

Async message/event

I think we are currently just logging the error. This is definitely not an appropriate error handling mechanism. In those cases, I would use the feature of rejecting/nacking messages back to the message broker, so they do not get lost. Then the message can either be redelivered or put in a dead-letter queue.
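A rough sketch of what I mean is shown below. `ChannelLike` mirrors the `ack` and `nack(message, allUpTo, requeue)` methods of an amqplib channel, assuming an AMQP broker such as RabbitMQ; `consumeSafely` and the handler are hypothetical names. With `requeue` set to false, the broker discards the message or routes it to a dead-letter exchange if one is configured.

```typescript
// Sketch only: acknowledge a message on success, reject it on failure.
// ChannelLike mirrors the ack/nack methods of an amqplib channel (assumption).
interface ChannelLike<M> {
  ack(message: M): void;
  nack(message: M, allUpTo?: boolean, requeue?: boolean): void;
}

async function consumeSafely<M>(
  channel: ChannelLike<M>,
  message: M,
  handler: (message: M) => Promise<void>,
  requeue = false,
): Promise<void> {
  try {
    await handler(message);
  } catch (error) {
    console.error('Message handling failed, rejecting message:', error);
    // allUpTo = false: only this message is rejected.
    // requeue decides between redelivery and dropping/dead-lettering.
    channel.nack(message, false, requeue);
    return;
  }
  // The handler finished without error, so the broker may forget the message.
  channel.ack(message);
}
```

Requeueing blindly can lead to a redelivery loop while the database is down, so a dead-letter queue is probably the safer default.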

Further points

Here are some further points that can influence the above decisions. Most of them do not affect us now, but I would like to mention them here, as they become important when running the ODS in production with live traffic.

Is the connection loss due to high load on the database?
Is the database replicated? (Retry with another replica)
Are there multiple instances of the microservice running? (always fail-fast and let the client do a retry on another instance)
Are there circuit breakers?
How can the container orchestrator detect unhealthy or broken services and databases?
What does the container orchestrator do with unhealthy or broken services and databases (draining traffic and service restart)?
How does load balancing work (especially automatic traffic draining from unhealthy services)?

georg-schwarz · 2020-12-03T11:26:24Z

I agree with everything you say!

Since we have retries on startup configured via env variables, we can set the retries to 0 on future Kubernetes deployments and let k8s restart the containers on failure.
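Just to illustrate the idea, reading such a setting could look like the sketch below. `CONNECTION_RETRIES` and `readRetryCount` are hypothetical names, not the ones the services actually use. The environment is passed in as a plain record, so in a service one would call it with `process.env`.

```typescript
// Sketch only: read a non-negative retry count from environment variables.
// CONNECTION_RETRIES is a hypothetical variable name.
function readRetryCount(
  env: Record<string, string | undefined>,
  fallback: number,
): number {
  const raw = env['CONNECTION_RETRIES'];
  if (raw === undefined || raw.trim() === '') {
    // Not configured: use the default.
    return fallback;
  }
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`CONNECTION_RETRIES must be a non-negative integer, got "${raw}"`);
  }
  // 0 is a valid value and means fail-fast.
  return parsed;
}
```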

The schema initialization and migrations should be handled differently than they are right now. But that can happen later on, once we have a version deployed.

sonallux mentioned this issue Dec 3, 2020

Make PostgresRepository fault tolerant jvalue/node-dry-pg#2

Merged
