[ADR] Universal Error Specification #30

yordis · 2022-11-03T05:57:05Z

Interested?

Every comment, every link to any information about it, and every opinion is appreciated (including the opinionated and strong ones), as small as it could be.

Please participate.

stale · 2023-06-10T01:29:43Z

This pull request has been automatically marked as "Withdrawn". If this pull request is still relevant, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize reviewing it yet. Your contribution is very much appreciated!.

yordis · 2024-07-19T18:49:21Z

How could we tell if a given error is "safe" to expose at the edge to the client vs. being an internal error only?

Safe is decided by the domain-specific level, whatever that means.

yordis · 2024-07-20T06:01:21Z

What about PII tagging of a given metadata?

dallashall

I think that we should consider splitting this spec into two pieces (with some significant overlap):

In-system (to be logged and viewable only to the system and system admins)
Out-system (to be broadcast out to clients)

The main reason I would opt for this separation is to more easily standardize and meet security compliance requirements. Many orgs have already adopted a stance of SHALL NOT expose underlying architecture details to external actors, and this can involve anything from IDs that are "internal-only" to the names of the systems themselves.

It's likely that an initial error thrown in an app could adhere to the universal error specification, but an external-facing error (I assume even for service-to-service) would need to be derived from the initial error and cut down to be secure.

Short OWASP Community article on improper error handling
Latest OWASP draft for errors and exceptions

dallashall · 2024-08-23T11:21:14Z

src/adrs/0129349218/README.md

+- `domain`: The logical grouping to which the "code" belongs. The error domain
+  is typically the registered service name of the tool or product that
+  generates the error. Example: "com.myapp.iam". If the error is
+  generated by some common infrastructure, the error domain must be a
+  globally unique value that identifies the infrastructure.
+- `code`: The code of the error. This is a constant value that identifies
+  the proximate cause of the error. Error codes are unique within a particular
+  domain of errors.
+- `id`: The ID of an instance of the error. It could be a UUID, Nanoid, or any
+  other unique identifier that could be generated without coordination.


Thinking specifically of systems where there can be multiple instances of the same service (and in some systems, there can be temporary drift between instances during rollover)...

Would it make sense to have an optional identifier for the service itself (more specific than the domain)?

For example, if we have two instances of com.myapp.iam, and during rollover (or blue-green), we encounter an error, I think having that additional detail of a unique identifier per instance will be useful. The identifier itself could be anything (ARN for AWS, commit/release SHA for K8s, or anything else that fits the implementation).

Separate question:
Is it intended for domain to include the environment? Or is the assumption that environment is out of scope here?

There are multiple ways to deal with the situation. Ideally, the "internal metadata" would allow you to tag your error with any data you wish to have.

The domain is intended to be as static as possible and define the Business Domain or Infrastructure Domain.

I wouldn't include the environment, but it is up to you; its meaning is based on the system.

It Depends 🤷🏻

The dimensions of the payload, which is my primary focus, kind of cover all of the above. I intend to standardize some of the values in another ADR unless it is something that we can for sure tell will be the same across different businesses.

That makes sense. I think that in my own systems, I likely would include some sort of release identifier/version number in some standard form that is present across all errors (but not in domain, since that should be static, like you said).

After thinking more about environment I think it's a totally separate issue, as I don't think that logs between environments shouldn't be mixed, so no need to include it.

yordis · 2024-08-23T16:25:13Z

@dallashall let me follow up on your notes; deeply appreciate, in the meantime; do not focus on the text as much since I do not want to enforce any formatting of the values but the structure itself for now

yordis · 2024-08-26T00:04:59Z

Hey folks, I do not want to remove all the docs written, but keep in mind that I do want to enforce any formatting of the values itself and be a freeform for the most part; leaving to another ADR to enforce formatting; expecting that each company opt-into different things

yordis · 2024-08-26T00:37:13Z

@dallashall about having In-system and Out-system, let me speak my mind; I am unsure.

I am unsure about the value of having two specs (ignore implementation) since they could have the same metadata.

That said, we must find a way to tag the error as internal; otherwise, the system would map the mistakes to a generic one.

Should we add a dimension that marks the error as internal only? By the way, I came across the need for the metadata level of visibility when considering GDPR.

This is purely from the spec perspective; in code, you could define two classes, InternalError and PublicError, which will enforce such dimensions to be correctly set up.
The good part of having a spec is that such low-level SDKs could be done once and avoid mistakes around the topic.

Again, I do not see the point of having two specs, but I agree with the problem of exposing a given error incorrectly.

The links you shared (which are excellent links, thank you so much) mentioned multiple that I am talking about:

Manage exceptions in a centralized manner to avoid duplicated try/catch blocks in the code
https://owasp.org/www-project-developer-guide/draft/design/web_app_checklist/handle_errors_and_exceptions/

That means that we never expose debug_info to the clients, and if we add the extra internal dimension, we could transform it into a generic INTERNAL error!

Let me know what you think.

yordis · 2024-08-26T00:52:01Z

Added Visibility to the spec

dallashall · 2024-08-30T10:53:28Z

@dallashall about having In-system and Out-system, let me speak my mind; I am unsure.

I am unsure about the value of having two specs (ignore implementation) since they could have the same metadata.

That said, we must find a way to tag the error as internal; otherwise, the system would map the mistakes to a generic one.

Should we add a dimension that marks the error as internal only? By the way, I came across the need for the metadata level of visibility when considering GDPR.

This is purely from the spec perspective; in code, you could define two classes, InternalError and PublicError, which will enforce such dimensions to be correctly set up. The good part of having a spec is that such low-level SDKs could be done once and avoid mistakes around the topic.

Again, I do not see the point of having two specs, but I agree with the problem of exposing a given error incorrectly.

The links you shared (which are excellent links, thank you so much) mentioned multiple that I am talking about:

Manage exceptions in a centralized manner to avoid duplicated try/catch blocks in the code
https://owasp.org/www-project-developer-guide/draft/design/web_app_checklist/handle_errors_and_exceptions/

That means that we never expose debug_info to the clients, and if we add the extra internal dimension, we could transform it into a generic INTERNAL error!

Let me know what you think.

I think the case for having two separate specs is that there are two audiences with very different requirements.

This is highlighted in the discussion on localization:

Should I always include localized_message field?

No, you should only include the localized_message field when the error is
meant to be displayed to the end-user. Most likely the field will be added
at the edge of the system, aka application layer, where you are aware of the
end-user actor and the end-user language.
It is anti-pattern to add the localized_message field at the core of the
system, aka domain layer, where you are not aware of the end-user actor and
the end-user language, unless you are building a system that is about
translations.

In one representation, we have a best practice (include localized messaging for external clients), but also an anti-pattern in the other representation (we should not include localization inside the system).

A similar logic happens around the physical structure of the error for errors that leave the system:

You MUST use the application/universal-error+json mime type
content-type HTTP header when sending the error over the wire as a JSON.

In my mind, this ignores other ways of sending JSON/JSON-like data that are not tied to "normal" http requests (WebSocket messages, SSEs, GSON/gRPC). So rather than specifying this kind of rule in a specification like this, it makes more sense to live on its own in a spec that can handle questions about standards around the structure and protocol-specific rules to follow.

Looking at the section for input/form validation, I have a similar concern, but input validation is much more nuanced. As a common example: for unauthenticated users, I would hope provide error details on structure only (type, length, missing, etc), otherwise sticking with generic errors to minimize different attack strategies by bad actors. Even for authenticated users, a request to a document owned by another account should not differentiate between 404 not 401.

The way I think about it, form/input validation is actually a refinement of handling type errors (or data structure errors) in general, and from what you have so far, the intent is that type errors should:

Show what the expected type was
Show what the received type was
Show where the mismatch(es) occurred (causes, in your example)

I don't think this is actually limited to external/UI interfaces, but is useful in all parts of a system where messages are passed/consumed.

I think this would be easier to review if we started at the top with a more value-driven purpose for the spec. You highlighted a lot of problems that we hope to solve, but I think if have clearer answers to a few questions, it will help guide our spec definition.

Some example questions (based on my reading of the spec so far):

If implemented, what are the main benefits to a common developer?
- Developer does not have to reinvent/learn a new error specification or error handling pattern
- Logs will contain consistent data between parts of a system and event between separate systems
- Developers can understand and debug a system more easily because
  - Logs have information about where in the system an error occurred (Domain)
  - Logs have information about what the error is (Code)
  - Logs show a unique ID for errors (useful in searching/tracing, especially alongside causes)
  - Logs show what the error is (human readable message)
  - Logs give insight into code-level details (stack traces, utilizing cause + detailed type mismatch details, etc)

What I would like to see, as a separate spec for errors that leave a system:

If implemented, what are the main benefits to a common developer?
- Developer reduces risk of sending sensitive data (this would be my prime motive for adopting)
- Errors have reasonable, localized meaning to an end user/system, ideally providing enough information for self correction ("Passwords must contain 12 characters....", "Field 'amount' MUST be formatted as a string of USD, like '$2002.99'")
- Errors do not reveal system details (data-store implementations, messaging patterns, even service providers)
- Errors minimize the risk of OWASP-identified attack patterns (sniffing for not-found vs unauthorized, as an example)
- External errors may reference or be referenced in system level errors via ID (maybe... the idea is that an external error should still be logged in a way to be searchable/connectable to system level errors by developers)

Even though its out of scope here, I've seen several standards on handling PII (personally identifiable information) that specify only logging IDs of objects which would contain the information. I have mixed feelings on that kind of approach, but from a data-deletion standpoint, it makes sense. I've also seen strategies that store encrypted values, and devs check-out the decryption key (which is rotated based on time). Again, out of scope, but putting here for kicks.

yordis · 2024-11-21T18:27:30Z

Note for self: follow up on GraphQL error

yordis · 2024-12-14T20:36:18Z

Leaving my notes here,

There is a strong reason not to create two specs at the top level of the message.

The error message travels across multiple hands (Product Engineer, Platform Engineer, Infra Engineer), and each role has its own needs.

Why avoid top-level forking/branching or forking/branching that is far away from the information that matters?

We would wrap things in a middleware pipeline style, where developers assume what happened after or before. This would lead to every error prune and, in the worst case, insecure situations.

Suppose we keep the discrimination union as close as possible to where it matters based on different roles and perspectives. In that case, developers with such a high level of information must deal with such branching.

The higher the branching, the earlier the branching and the more people will make assumptions that may be, most likely, wrong.

visibility, debug_info, or tagging of information as GDPR, Secret, or whatever, matters for a given layer to make assumptions about that data and adequately understand things. So, keep the branching closer to such data.

Example, stripping GDPR information from logging the error matters because of GDPR, not because it is internal or not

Every role must care about what information is injected and its different uses. Even Product Engineers have to care about GDPR since they know the domain and, therefore, can correctly tag information or distinguish between internal and external.

manuphatak · 2025-01-07T22:23:25Z

Let me build up to a suggestion.

When reviewing errors we could say there's a "macro lens" and a "micro lens".

With a Macro lens, you're looking at counts and rates of occurrences. You'd be using this info to prioritize fixing one error over another.

With a Micro lens, you're looking at individual errors and trying to understand the 5 W's -- who, what, where, when, why, and how? Usually you're trying to understand the 5 W's with the intention of solving the error.

Ok, pursuing the Micro lens...

There's some information that could be included to streamline resolving issues.

An address in code. Something extremely searchable.

This could be a file+line#. Why? In its current form, you could have the same error being dispatched from multiple places in the code base.

Another way to put this: I'd like to be able to search info in the error, and instantly find the line of code that produced it.
Transaction ID

This is crem-de-crem. If your error is attached to some sort of transactionId that connects multiple logs & messages, this is a huge boon in being able to trace where and why the error came up.
Tenant Identifier -- like company id or org id.

Why? So that you can use search filters to isolate activity of a tenant and understand what's happening across the tenant.

manuphatak · 2025-01-07T22:24:01Z

src/adrs/0129349218/README.md

+It is anti-pattern to add the `localized_message` field at the core of the
+system, aka domain layer, where you are not aware of the end-user actor and
+the end-user language, unless you are building a system that is about
+translations.


yordis · 2025-01-08T21:22:18Z

Sorry for the delay, I am having some health issue at the moment,

An address in code. Something extremely searchable.

That is where the stack trace info would help you, imagine a system built around these errors where you could treat the error as any other schema and have a schema registry for it, static stuff

Transaction ID

Somebody else brought up the same conversation, at that moment I couldnt' make up my mind about it, I think I will commit to add metadata to the error itself

causation id: who called the thing. Similar to span id in OTEL
correlation id: what workflow am I part of? Similar to trace id in OTEL

Back then I thought that OTEL should be doing this, but after some years, I realized that OTEL tools have their own guarantees (like dropping things or not, billing concerns) so I think it is prudent to have this as part of the spec.

@dallashall @acco @manuphatak I would love to hear your thoughts on that one.

Tenant Identifier -- like company id or org id.

This is out of this particular spec, but you can add to the spec (follow up ADR) to come up with patterns for a given product or organization where you may want to enforce the tenant id as part of the metadata or something like that.

yordis force-pushed the univeral-error branch 7 times, most recently from 0f79999 to ccbbaa8 Compare November 3, 2022 06:55

yordis force-pushed the univeral-error branch from c3a6b0a to 533073d Compare December 20, 2022 04:18

yordis force-pushed the univeral-error branch 2 times, most recently from 36390ba to 2b63e27 Compare January 2, 2023 21:04

stale bot added the State: Withdrawn The ADR is withdrawn may be taken up by another champion. label Jun 10, 2023

yordis added State: Deferred The ADR has not been acted upon for a significant period of time. and removed State: Withdrawn The ADR is withdrawn may be taken up by another champion. labels Jun 10, 2023

yordis force-pushed the univeral-error branch 4 times, most recently from 62344df to d0ad86b Compare August 21, 2023 23:01

yordis force-pushed the univeral-error branch from d0ad86b to f4941fa Compare December 26, 2023 15:29

yordis force-pushed the univeral-error branch from f4941fa to a2efabb Compare January 2, 2024 18:56

yordis force-pushed the univeral-error branch 7 times, most recently from c74c765 to 7bc6e20 Compare February 3, 2024 11:38

yordis force-pushed the univeral-error branch 3 times, most recently from 42cbf9c to 198484c Compare February 3, 2024 22:29

yordis force-pushed the univeral-error branch 2 times, most recently from 6a4fe40 to 3ebada5 Compare February 23, 2024 18:28

yordis requested a review from acco March 21, 2024 07:19

yordis force-pushed the univeral-error branch from 3ebada5 to 1e37260 Compare April 21, 2024 23:55

yordis force-pushed the univeral-error branch from 1e37260 to 8cf2c83 Compare June 13, 2024 19:24

yordis deleted the branch main August 20, 2024 17:40

yordis closed this Aug 20, 2024

yordis reopened this Aug 22, 2024

yordis changed the base branch from master to main August 22, 2024 17:19

ADR] Universal Error Specification

dffaebb

yordis force-pushed the univeral-error branch from 8cf2c83 to dffaebb Compare August 22, 2024 17:23

fix: wording

54e3e7a

dallashall reviewed Aug 23, 2024

View reviewed changes

chore: add visibility

c7f011f

yordis requested a review from dallashall August 26, 2024 01:36

fix: wording

9101a8e

yordis mentioned this pull request Oct 1, 2024

Restructure AIP-193 aip-dev/google.aip.dev#1426

Merged

manuphatak reviewed Jan 7, 2025

View reviewed changes

yordis requested a review from manuphatak January 8, 2025 21:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ADR] Universal Error Specification #30

[ADR] Universal Error Specification #30

yordis commented Nov 3, 2022 •

edited

Loading

stale bot commented Jun 10, 2023

yordis commented Jul 19, 2024

yordis commented Jul 20, 2024

dallashall left a comment

dallashall Aug 23, 2024

yordis Aug 26, 2024

dallashall Aug 30, 2024

yordis commented Aug 23, 2024

yordis commented Aug 26, 2024

yordis commented Aug 26, 2024

yordis commented Aug 26, 2024

dallashall commented Aug 30, 2024

yordis commented Nov 21, 2024

yordis commented Dec 14, 2024

manuphatak commented Jan 7, 2025

manuphatak Jan 7, 2025

yordis commented Jan 8, 2025

[ADR] Universal Error Specification #30

Are you sure you want to change the base?

[ADR] Universal Error Specification #30

Conversation

yordis commented Nov 3, 2022 • edited Loading

Interested?

stale bot commented Jun 10, 2023

yordis commented Jul 19, 2024

yordis commented Jul 20, 2024

dallashall left a comment

Choose a reason for hiding this comment

dallashall Aug 23, 2024

Choose a reason for hiding this comment

yordis Aug 26, 2024

Choose a reason for hiding this comment

dallashall Aug 30, 2024

Choose a reason for hiding this comment

yordis commented Aug 23, 2024

yordis commented Aug 26, 2024

yordis commented Aug 26, 2024

yordis commented Aug 26, 2024

dallashall commented Aug 30, 2024

yordis commented Nov 21, 2024

yordis commented Dec 14, 2024

manuphatak commented Jan 7, 2025

manuphatak Jan 7, 2025

Choose a reason for hiding this comment

yordis commented Jan 8, 2025

yordis commented Nov 3, 2022 •

edited

Loading