The principle is that errors are logged when they are handled. If a function merely propagates an error to its caller, it doesn't need to log it; the caller will. But if a function consumes an error, it must log it (if the error needs to be logged at all).
For example:
```rust
use std::fs::File;
use std::io::Read;

fn read_motd_file() -> std::io::Result<String> {
    let mut f = File::open("/etc/motd")?;
    let mut result = String::new();
    f.read_to_string(&mut result)?;
    Ok(result)
}
```
Opening or reading the file could fail, but there is no need to log the error here. The function merely propagates the error to the caller, and it is up to the caller to log the error, or to propagate it further if the failure is not expected. If, for example, it is normal that the "/etc/motd" file doesn't exist, the caller can choose to silently ignore the error, or log it as an INFO or DEBUG level message:
```rust
fn get_message_of_the_day() -> String {
    // Get the motd from /etc/motd, or return the default proverb
    match read_motd_file() {
        Ok(motd) => motd,
        Err(err) => {
            // It's normal that /etc/motd doesn't exist, but if we fail to
            // read it for some other reason, that's unexpected. The message
            // of the day isn't very important though, so we just WARN and
            // continue with the default in any case.
            if err.kind() != std::io::ErrorKind::NotFound {
                tracing::warn!("could not read \"/etc/motd\": {err:?}");
            }
            "An old error is always more popular than a new truth. - German proverb"
                .to_string()
        }
    }
}
```
We use the `anyhow` crate widely. It contains many convenient macros like `bail!` and `ensure!` to construct and return errors, and to propagate many kinds of low-level errors, wrapped in `anyhow::Error`.
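For example, a minimal sketch of how these macros are typically used (the function and its checks are made up for illustration):

```rust
use anyhow::{bail, ensure, Context, Result};

// Hypothetical validation function, for illustration only.
fn parse_page_number(input: &str, max_pages: u64) -> Result<u64> {
    // The ? operator propagates the low-level parse error, wrapped in
    // anyhow::Error, with some added context.
    let n: u64 = input
        .parse()
        .with_context(|| format!("invalid page number \"{input}\""))?;
    // ensure! returns an error if the condition does not hold.
    ensure!(n <= max_pages, "page number {n} is out of range");
    if n == 0 {
        // bail! constructs an anyhow::Error and returns it immediately.
        bail!("page numbering starts at 1");
    }
    Ok(n)
}
```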
A downside of `anyhow::Error` is that the caller cannot distinguish between different error cases. Most errors are propagated all the way to the mgmt API handler function, or to the main loop that handles a connection with the compute node, and they are all handled the same way: the error is logged and returned to the client as an HTTP or libpq error.
But in some cases, we need to distinguish between errors and handle them differently. For example, attaching a tenant to the pageserver could fail either because the tenant has already been attached, or because we could not load its metadata from cloud storage. The first case is more or less expected: the console sends the Attach request to the pageserver, and the pageserver completes the operation, but the network connection might be lost before the console receives the response. The console will retry the operation in that case, even though the tenant has already been attached. It is important that the pageserver responds with an "already exists" error (HTTP 409 Conflict) in that case, rather than a generic HTTP 500 Internal Server Error.
If you need to distinguish between different kinds of errors, create a new `Error` type. The `thiserror` crate is useful for that. But in most cases `anyhow::Error` is good enough.
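For example, a hypothetical error type for the attach operation described above might look like this (the variant names and details are illustrative, not our actual API):

```rust
use thiserror::Error;

// Hypothetical error type for the attach operation, for illustration only.
#[derive(Error, Debug)]
enum AttachError {
    // Expected: the console may retry an attach that already succeeded.
    #[error("tenant is already attached")]
    AlreadyAttached,
    // Unexpected: everything else is a generic internal error.
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}
```

The mgmt API handler can then match on `AttachError::AlreadyAttached` and map it to HTTP 409 Conflict, while `Other` becomes a generic HTTP 500.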
Depending on where a panic happens, it can cause the whole pageserver or safekeeper to restart, or just a single tenant. In either case, that is pretty bad and causes an outage. Avoid panics. Never use `unwrap()` or other calls that might panic to verify inputs from the network or from disk.
It is acceptable to use functions that might panic, like `unwrap()`, if it is obvious that they cannot panic. For example, if you have just checked that a variable is not `None`, it is OK to call `unwrap()` on it, but it is still preferable to use `expect("reason")` instead, to explain why the call cannot fail.
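For example (a minimal, self-contained illustration):

```rust
fn print_value(value: Option<u32>) {
    if value.is_some() {
        // We just checked that `value` is Some, so this cannot panic.
        // expect() documents the reason, unlike a bare unwrap().
        let n = value.expect("checked is_some() above");
        println!("{n}");
    }
}
```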
`assert!` and `panic!` are reserved for checking clear invariants and very obvious "can't happen" cases. When in doubt, use anyhow's `ensure!` or `bail!` instead.
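A rough rule of thumb, sketched below with hypothetical checks: `assert!` guards invariants of our own code, while `ensure!` validates anything that arrives from outside:

```rust
use anyhow::{ensure, Result};

// Hypothetical record check, for illustration only.
fn check_record(buf: &[u8], expected_len: usize) -> Result<()> {
    // Internal invariant: `expected_len` was computed by our own code,
    // so a violation here is a bug, and panicking is acceptable.
    assert!(expected_len >= 8, "record header is at least 8 bytes");
    // External input: `buf` came from the network or disk, so a
    // malformed record must become an error, not a panic.
    ensure!(buf.len() == expected_len, "unexpected record length: {}", buf.len());
    Ok(())
}
```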
`tracing::Level` doesn't provide very clear guidelines on what the different levels mean, or when to use which level. Here is how we use them:
ERROR: Errors are not expected to happen during normal operation. Incorrect inputs from a client can cause ERRORs: for example, a client calling a mgmt API that doesn't exist, or a compute node passing an LSN that has already been garbage collected away.

"Normal operation" is not a very precise concept. For example, disk errors are not expected to happen when the system is working, so those count as Errors. However, if a TCP connection to a compute node is lost, that is not considered an Error, because it doesn't affect the pageserver's or safekeeper's operation in any way, and it happens fairly frequently when compute nodes are shut down or killed abruptly because of errors in the compute.

Errors are monitored, and always need human investigation to determine the cause.

Whether something should be logged at ERROR, WARNING or INFO level can depend on the callers and clients. For example, it might be unexpected, and a sign of a serious issue, if the console calls the "timeline_detail" mgmt API for a timeline that doesn't exist; ERROR would be appropriate in that case. But if the console routinely calls the API after deleting a timeline, to check whether the deletion has completed, then it would be totally normal, and an INFO or DEBUG level message would be more appropriate. If a message is logged as an ERROR but in fact happens frequently in production and never requires any action, it should probably be demoted to an INFO level message.

Examples:
- could not open file "foobar"
- invalid tenant id
WARN: Warnings are similar to Errors in that they should not happen when the system is operating normally. The difference is that an Error means the operation failed, whereas a Warning means something unexpected happened but the operation continued anyway. For example, if deleting a file fails because the file didn't exist in the first place, that should be logged as a Warning.

Note: The python regression tests, under `test_regress`, check the pageserver log after each test for any ERROR and WARN lines. If there are any ERRORs or WARNs that have not been explicitly listed in the test as allowed, the test is marked as failed. This is to catch unexpected errors, e.g. in background operations, that don't cause immediate misbehaviour in the tested functionality.

Examples:
- could not remove temporary file "foobar.temp"
- unrecognized file "foobar" in timeline directory
INFO: Info level is used to log useful information when the system is operating normally. It is appropriate e.g. for logging state changes, background operations, and network connections.

Examples:
- "system is shutting down"
- "tenant was created"
- "retrying S3 upload"
DEBUG and TRACE: Debug and Trace level messages are not printed to the log in our normal production configuration, but they could be enabled for a specific server or tenant to aid debugging (although we don't actually have that capability as of this writing).
We use logging "spans" to hold context information about the current operation. Almost every operation happens on a particular tenant and timeline, so we enter a span with the "tenant_id" and "timeline_id" very early when processing an incoming API request, for example. All background operations should also run in a span containing at least those two fields, plus any other parameters or information that could be useful when debugging an error in that operation.
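As a minimal sketch with the `tracing` crate (the handler and the string-typed IDs are simplifications, for illustration only):

```rust
use tracing::{info, info_span, Instrument};

// Hypothetical request handler, for illustration only.
async fn handle_timeline_request(tenant_id: String, timeline_id: String) {
    // Enter the span as early as possible; every log message emitted
    // inside it carries the tenant_id and timeline_id fields.
    async {
        info!("handling request");
        // ... the actual work ...
    }
    .instrument(info_span!("timeline_request", %tenant_id, %timeline_id))
    .await
}
```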
TODO: Spans are not captured in the Error when it is created, but when the error is logged. It would be more useful to capture them at Error creation. We should consider using `tracing_error::SpanTrace` to do that.
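A rough sketch of that idea, assuming `tracing_error::ErrorLayer` is installed in the subscriber (the error type itself is hypothetical):

```rust
use tracing_error::SpanTrace;

// Hypothetical error type that records the span context at creation time,
// instead of relying on whatever spans happen to be active when it is logged.
#[derive(Debug)]
struct ContextualError {
    message: String,
    // The chain of spans that were active when the error was constructed.
    span_trace: SpanTrace,
}

impl ContextualError {
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            span_trace: SpanTrace::capture(),
        }
    }
}
```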
PostgreSQL has a style guide for writing error messages:
https://www.postgresql.org/docs/current/error-style-guide.html
Follow that guide when writing error messages in the PostgreSQL extension. We don't follow it strictly in the pageserver and safekeeper, but the advice in the PostgreSQL style guide is generally good, and you can't go wrong by following it.