Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADR] Universal Error Specification #30

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft

[ADR] Universal Error Specification #30

wants to merge 4 commits into from

Conversation

yordis
Copy link
Member

@yordis yordis commented Nov 3, 2022

Interested?

Every comment, every link to any information about it, and every opinion is appreciated (including the opinionated and strong ones), as small as it could be.

Please participate.

@yordis yordis force-pushed the univeral-error branch 7 times, most recently from 0f79999 to ccbbaa8 Compare November 3, 2022 06:55
@yordis yordis force-pushed the univeral-error branch 2 times, most recently from 36390ba to 2b63e27 Compare January 2, 2023 21:04
@stale
Copy link

stale bot commented Jun 10, 2023

This pull request has been automatically marked as "Withdrawn". If this pull request is still relevant, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize reviewing it yet. Your contribution is very much appreciated!.

@stale stale bot added the State: Withdrawn The ADR is withdrawn may be taken up by another champion. label Jun 10, 2023
@yordis yordis added State: Deferred The ADR has not been acted upon for a significant period of time. and removed State: Withdrawn The ADR is withdrawn may be taken up by another champion. labels Jun 10, 2023
@yordis yordis force-pushed the univeral-error branch 4 times, most recently from 62344df to d0ad86b Compare August 21, 2023 23:01
@yordis yordis force-pushed the univeral-error branch 7 times, most recently from c74c765 to 7bc6e20 Compare February 3, 2024 11:38
@yordis yordis force-pushed the univeral-error branch 3 times, most recently from 42cbf9c to 198484c Compare February 3, 2024 22:29
@yordis yordis force-pushed the univeral-error branch 2 times, most recently from 6a4fe40 to 3ebada5 Compare February 23, 2024 18:28
@yordis yordis requested a review from acco March 21, 2024 07:19
@yordis
Copy link
Member Author

yordis commented Jul 19, 2024

How could we tell if a given error is "safe" to expose at the edge to the client vs. being an internal error only?

Safe is decided by the domain-specific level, whatever that means.

@yordis
Copy link
Member Author

yordis commented Jul 20, 2024

What about PII tagging of a given metadata?

@yordis yordis deleted the branch main August 20, 2024 17:40
@yordis yordis closed this Aug 20, 2024
@yordis yordis reopened this Aug 22, 2024
@yordis yordis changed the base branch from master to main August 22, 2024 17:19
Copy link

@dallashall dallashall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should consider splitting this spec into two pieces (with some significant overlap):

  • In-system (to be logged and viewable only to the system and system admins)
  • Out-system (to be broadcast out to clients)

The main reason I would opt for this separation is to more easily standardize and meet security compliance requirements. Many orgs have already adopted a stance of SHALL NOT expose underlying architecture details to external actors, and this can involve anything from IDs that are "internal-only" to the names of the systems themselves.

It's likely that an initial error thrown in an app could adhere to the universal error specification, but an external-facing error (I assume even for service-to-service) would need to be derived from the initial error and cut down to be secure.

Short OWASP Community article on improper error handling
Latest OWASP draft for errors and exceptions

Comment on lines 80 to 89
- `domain`: The logical grouping to which the "code" belongs. The error domain
is typically the registered service name of the tool or product that
generates the error. Example: "com.myapp.iam". If the error is
generated by some common infrastructure, the error domain must be a
globally unique value that identifies the infrastructure.
- `code`: The code of the error. This is a constant value that identifies
the proximate cause of the error. Error codes are unique within a particular
domain of errors.
- `id`: The ID of an instance of the error. It could be a UUID, Nanoid, or any
other unique identifier that could be generated without coordination.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking specifically of systems where there can be multiple instances of the same service (and in some systems, there can be temporary drift between instances during rollover)...

Would it make sense to have an optional identifier for the service itself (more specific than the domain)?

For example, if we have two instances of com.myapp.iam, and during rollover (or blue-green), we encounter an error, I think having that additional detail of a unique identifier per instance will be useful. The identifier itself could be anything (ARN for AWS, commit/release SHA for K8s, or anything else that fits the implementation).


Separate question:
Is it intended for domain to include the environment? Or is the assumption that environment is out of scope here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple ways to deal with the situation. Ideally, the "internal metadata" would allow you to tag your error with any data you wish to have.

The domain is intended to be as static as possible and define the Business Domain or Infrastructure Domain.

I wouldn't include the environment, but it is up to you; its meaning is based on the system.

It Depends 🤷🏻

The dimensions of the payload, which is my primary focus, kind of cover all of the above. I intend to standardize some of the values in another ADR unless it is something that we can for sure tell will be the same across different businesses.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I think that in my own systems, I likely would include some sort of release identifier/version number in some standard form that is present across all errors (but not in domain, since that should be static, like you said).

After thinking more about environment I think it's a totally separate issue, as I don't think that logs between environments shouldn't be mixed, so no need to include it.

@yordis
Copy link
Member Author

yordis commented Aug 23, 2024

@dallashall let me follow up on your notes; deeply appreciate, in the meantime; do not focus on the text as much since I do not want to enforce any formatting of the values but the structure itself for now

@yordis
Copy link
Member Author

yordis commented Aug 26, 2024

Hey folks, I do not want to remove all the docs written, but keep in mind that I do want to enforce any formatting of the values itself and be a freeform for the most part; leaving to another ADR to enforce formatting; expecting that each company opt-into different things

@yordis
Copy link
Member Author

yordis commented Aug 26, 2024

@dallashall about having In-system and Out-system, let me speak my mind; I am unsure.

I am unsure about the value of having two specs (ignore implementation) since they could have the same metadata.

That said, we must find a way to tag the error as internal; otherwise, the system would map the mistakes to a generic one.

Should we add a dimension that marks the error as internal only? By the way, I came across the need for the metadata level of visibility when considering GDPR.

This is purely from the spec perspective; in code, you could define two classes, InternalError and PublicError, which will enforce such dimensions to be correctly set up.
The good part of having a spec is that such low-level SDKs could be done once and avoid mistakes around the topic.

Again, I do not see the point of having two specs, but I agree with the problem of exposing a given error incorrectly.


The links you shared (which are excellent links, thank you so much) mentioned multiple that I am talking about:

Manage exceptions in a centralized manner to avoid duplicated try/catch blocks in the code
https://owasp.org/www-project-developer-guide/draft/design/web_app_checklist/handle_errors_and_exceptions/

That means that we never expose debug_info to the clients, and if we add the extra internal dimension, we could transform it into a generic INTERNAL error!

Let me know what you think.

@yordis
Copy link
Member Author

yordis commented Aug 26, 2024

Added Visibility to the spec

@yordis yordis requested a review from dallashall August 26, 2024 01:36
@dallashall
Copy link

@dallashall about having In-system and Out-system, let me speak my mind; I am unsure.

I am unsure about the value of having two specs (ignore implementation) since they could have the same metadata.

That said, we must find a way to tag the error as internal; otherwise, the system would map the mistakes to a generic one.

Should we add a dimension that marks the error as internal only? By the way, I came across the need for the metadata level of visibility when considering GDPR.

This is purely from the spec perspective; in code, you could define two classes, InternalError and PublicError, which will enforce such dimensions to be correctly set up. The good part of having a spec is that such low-level SDKs could be done once and avoid mistakes around the topic.

Again, I do not see the point of having two specs, but I agree with the problem of exposing a given error incorrectly.

The links you shared (which are excellent links, thank you so much) mentioned multiple that I am talking about:

Manage exceptions in a centralized manner to avoid duplicated try/catch blocks in the code
https://owasp.org/www-project-developer-guide/draft/design/web_app_checklist/handle_errors_and_exceptions/

That means that we never expose debug_info to the clients, and if we add the extra internal dimension, we could transform it into a generic INTERNAL error!

Let me know what you think.


I think the case for having two separate specs is that there are two audiences with very different requirements.

This is highlighted in the discussion on localization:

  • Should I always include localized_message field?

No, you should only include the localized_message field when the error is
meant to be displayed to the end-user. Most likely the field will be added
at the edge of the system, aka application layer, where you are aware of the
end-user actor and the end-user language.
It is anti-pattern to add the localized_message field at the core of the
system, aka domain layer, where you are not aware of the end-user actor and
the end-user language, unless you are building a system that is about
translations.

In one representation, we have a best practice (include localized messaging for external clients), but also an anti-pattern in the other representation (we should not include localization inside the system).

A similar logic happens around the physical structure of the error for errors that leave the system:

  • You MUST use the application/universal-error+json mime type
    content-type HTTP header when sending the error over the wire as a JSON.

In my mind, this ignores other ways of sending JSON/JSON-like data that are not tied to "normal" http requests (WebSocket messages, SSEs, GSON/gRPC). So rather than specifying this kind of rule in a specification like this, it makes more sense to live on its own in a spec that can handle questions about standards around the structure and protocol-specific rules to follow.


Looking at the section for input/form validation, I have a similar concern, but input validation is much more nuanced. As a common example: for unauthenticated users, I would hope provide error details on structure only (type, length, missing, etc), otherwise sticking with generic errors to minimize different attack strategies by bad actors. Even for authenticated users, a request to a document owned by another account should not differentiate between 404 not 401.

The way I think about it, form/input validation is actually a refinement of handling type errors (or data structure errors) in general, and from what you have so far, the intent is that type errors should:

  • Show what the expected type was
  • Show what the received type was
  • Show where the mismatch(es) occurred (causes, in your example)

I don't think this is actually limited to external/UI interfaces, but is useful in all parts of a system where messages are passed/consumed.


I think this would be easier to review if we started at the top with a more value-driven purpose for the spec. You highlighted a lot of problems that we hope to solve, but I think if have clearer answers to a few questions, it will help guide our spec definition.

Some example questions (based on my reading of the spec so far):

  • If implemented, what are the main benefits to a common developer?
    • Developer does not have to reinvent/learn a new error specification or error handling pattern
    • Logs will contain consistent data between parts of a system and event between separate systems
    • Developers can understand and debug a system more easily because
      • Logs have information about where in the system an error occurred (Domain)
      • Logs have information about what the error is (Code)
      • Logs show a unique ID for errors (useful in searching/tracing, especially alongside causes)
      • Logs show what the error is (human readable message)
      • Logs give insight into code-level details (stack traces, utilizing cause + detailed type mismatch details, etc)

What I would like to see, as a separate spec for errors that leave a system:

  • If implemented, what are the main benefits to a common developer?
    • Developer reduces risk of sending sensitive data (this would be my prime motive for adopting)
    • Errors have reasonable, localized meaning to an end user/system, ideally providing enough information for self correction ("Passwords must contain 12 characters....", "Field 'amount' MUST be formatted as a string of USD, like '$2002.99'")
    • Errors do not reveal system details (data-store implementations, messaging patterns, even service providers)
    • Errors minimize the risk of OWASP-identified attack patterns (sniffing for not-found vs unauthorized, as an example)
    • External errors may reference or be referenced in system level errors via ID (maybe... the idea is that an external error should still be logged in a way to be searchable/connectable to system level errors by developers)

Even though its out of scope here, I've seen several standards on handling PII (personally identifiable information) that specify only logging IDs of objects which would contain the information. I have mixed feelings on that kind of approach, but from a data-deletion standpoint, it makes sense. I've also seen strategies that store encrypted values, and devs check-out the decryption key (which is rotated based on time). Again, out of scope, but putting here for kicks.

@yordis
Copy link
Member Author

yordis commented Nov 21, 2024

Note for self: follow up on GraphQL error

@yordis
Copy link
Member Author

yordis commented Dec 14, 2024

Leaving my notes here,

There is a strong reason not to create two specs at the top level of the message.

The error message travels across multiple hands (Product Engineer, Platform Engineer, Infra Engineer), and each role has its own needs.

  • Why avoid top-level forking/branching or forking/branching that is far away from the information that matters?

We would wrap things in a middleware pipeline style, where developers assume what happened after or before. This would lead to every error prune and, in the worst case, insecure situations.

Suppose we keep the discrimination union as close as possible to where it matters based on different roles and perspectives. In that case, developers with such a high level of information must deal with such branching.

The higher the branching, the earlier the branching and the more people will make assumptions that may be, most likely, wrong.

visibility, debug_info, or tagging of information as GDPR, Secret, or whatever, matters for a given layer to make assumptions about that data and adequately understand things. So, keep the branching closer to such data.

Example, stripping GDPR information from logging the error matters because of GDPR, not because it is internal or not


Every role must care about what information is injected and its different uses. Even Product Engineers have to care about GDPR since they know the domain and, therefore, can correctly tag information or distinguish between internal and external.

@manuphatak
Copy link
Contributor

Let me build up to a suggestion.

When reviewing errors we could say there's a "macro lens" and a "micro lens".

With a Macro lens, you're looking at counts and rates of occurrences. You'd be using this info to prioritize fixing one error over another.

With a Micro lens, you're looking at individual errors and trying to understand the 5 W's -- who, what, where, when, why, and how? Usually you're trying to understand the 5 W's with the intention of solving the error.


Ok, pursuing the Micro lens...

There's some information that could be included to streamline resolving issues.

  1. An address in code. Something extremely searchable.

    This could be a file+line#. Why? In its current form, you could have the same error being dispatched from multiple places in the code base.

    Another way to put this: I'd like to be able to search info in the error, and instantly find the line of code that produced it.

  2. Transaction ID

    This is crem-de-crem. If your error is attached to some sort of transactionId that connects multiple logs & messages, this is a huge boon in being able to trace where and why the error came up.

  3. Tenant Identifier -- like company id or org id.

    Why? So that you can use search filters to isolate activity of a tenant and understand what's happening across the tenant.

Comment on lines +204 to +207
It is anti-pattern to add the `localized_message` field at the core of the
system, aka domain layer, where you are not aware of the end-user actor and
the end-user language, unless you are building a system that is about
translations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@yordis
Copy link
Member Author

yordis commented Jan 8, 2025

Sorry for the delay, I am having some health issue at the moment,

  1. An address in code. Something extremely searchable.

That is where the stack trace info would help you, imagine a system built around these errors where you could treat the error as any other schema and have a schema registry for it, static stuff

  1. Transaction ID

Somebody else brought up the same conversation, at that moment I couldnt' make up my mind about it, I think I will commit to add metadata to the error itself

  • causation id: who called the thing. Similar to span id in OTEL
  • correlation id: what workflow am I part of? Similar to trace id in OTEL

Back then I thought that OTEL should be doing this, but after some years, I realized that OTEL tools have their own guarantees (like dropping things or not, billing concerns) so I think it is prudent to have this as part of the spec.

@dallashall @acco @manuphatak I would love to hear your thoughts on that one.

  1. Tenant Identifier -- like company id or org id.

This is out of this particular spec, but you can add to the spec (follow up ADR) to come up with patterns for a given product or organization where you may want to enforce the tenant id as part of the metadata or something like that.

@yordis yordis requested a review from manuphatak January 8, 2025 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
State: Deferred The ADR has not been acted upon for a significant period of time.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants