-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADR] Universal Error Specification #30
base: main
Are you sure you want to change the base?
Conversation
0f79999
to
ccbbaa8
Compare
c3a6b0a
to
533073d
Compare
36390ba
to
2b63e27
Compare
This pull request has been automatically marked as "Withdrawn". If this pull request is still relevant, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize reviewing it yet. Your contribution is very much appreciated!. |
62344df
to
d0ad86b
Compare
d0ad86b
to
f4941fa
Compare
c74c765
to
7bc6e20
Compare
42cbf9c
to
198484c
Compare
6a4fe40
to
3ebada5
Compare
How could we tell if a given error is "safe" to expose at the edge to the client vs. being an internal error only? Safe is decided by the domain-specific level, whatever that means. |
What about PII tagging of a given metadata? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should consider splitting this spec into two pieces (with some significant overlap):
- In-system (to be logged and viewable only to the system and system admins)
- Out-system (to be broadcast out to clients)
The main reason I would opt for this separation is to more easily standardize and meet security compliance requirements. Many orgs have already adopted a stance of SHALL NOT expose underlying architecture details to external actors
, and this can involve anything from IDs that are "internal-only" to the names of the systems themselves.
It's likely that an initial error thrown in an app could adhere to the universal error specification, but an external-facing error (I assume even for service-to-service) would need to be derived from the initial error and cut down to be secure.
Short OWASP Community article on improper error handling
Latest OWASP draft for errors and exceptions
src/adrs/0129349218/README.md
Outdated
- `domain`: The logical grouping to which the "code" belongs. The error domain | ||
is typically the registered service name of the tool or product that | ||
generates the error. Example: "com.myapp.iam". If the error is | ||
generated by some common infrastructure, the error domain must be a | ||
globally unique value that identifies the infrastructure. | ||
- `code`: The code of the error. This is a constant value that identifies | ||
the proximate cause of the error. Error codes are unique within a particular | ||
domain of errors. | ||
- `id`: The ID of an instance of the error. It could be a UUID, Nanoid, or any | ||
other unique identifier that could be generated without coordination. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking specifically of systems where there can be multiple instances of the same service (and in some systems, there can be temporary drift between instances during rollover)...
Would it make sense to have an optional identifier for the service itself (more specific than the domain)?
For example, if we have two instances of com.myapp.iam
, and during rollover (or blue-green), we encounter an error, I think having that additional detail of a unique identifier per instance will be useful. The identifier itself could be anything (ARN for AWS, commit/release SHA for K8s, or anything else that fits the implementation).
Separate question:
Is it intended for domain
to include the environment? Or is the assumption that environment is out of scope here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple ways to deal with the situation. Ideally, the "internal metadata" would allow you to tag your error with any data you wish to have.
The domain
is intended to be as static as possible and define the Business Domain or Infrastructure Domain.
I wouldn't include the environment, but it is up to you; its meaning is based on the system.
It Depends 🤷🏻
The dimensions of the payload, which is my primary focus, kind of cover all of the above. I intend to standardize some of the values in another ADR unless it is something that we can for sure tell will be the same across different businesses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I think that in my own systems, I likely would include some sort of release identifier/version number in some standard form that is present across all errors (but not in domain, since that should be static, like you said).
After thinking more about environment
I think it's a totally separate issue, as I don't think that logs between environments shouldn't be mixed, so no need to include it.
@dallashall let me follow up on your notes; deeply appreciate, in the meantime; do not focus on the text as much since I do not want to enforce any formatting of the values but the structure itself for now |
Hey folks, I do not want to remove all the docs written, but keep in mind that I do want to enforce any formatting of the values itself and be a freeform for the most part; leaving to another ADR to enforce formatting; expecting that each company opt-into different things |
@dallashall about having I am unsure about the value of having two That said, we must find a way to tag the error as internal; otherwise, the system would map the mistakes to a generic one. Should we add a dimension that marks the error as internal only? By the way, I came across the need for the metadata level of visibility when considering GDPR. This is purely from the spec perspective; in code, you could define two classes, Again, I do not see the point of having two specs, but I agree with the problem of exposing a given error incorrectly. The links you shared (which are excellent links, thank you so much) mentioned multiple that I am talking about:
That means that we never expose Let me know what you think. |
Added |
I think the case for having two separate specs is that there are two audiences with very different requirements. This is highlighted in the discussion on localization:
In one representation, we have a best practice (include localized messaging for external clients), but also an anti-pattern in the other representation (we should not include localization inside the system). A similar logic happens around the physical structure of the error for errors that leave the system:
In my mind, this ignores other ways of sending JSON/JSON-like data that are not tied to "normal" http requests (WebSocket messages, SSEs, GSON/gRPC). So rather than specifying this kind of rule in a specification like this, it makes more sense to live on its own in a spec that can handle questions about standards around the structure and protocol-specific rules to follow. Looking at the section for input/form validation, I have a similar concern, but input validation is much more nuanced. As a common example: for unauthenticated users, I would hope provide error details on structure only (type, length, missing, etc), otherwise sticking with generic errors to minimize different attack strategies by bad actors. Even for authenticated users, a request to a document owned by another account should not differentiate between 404 not 401. The way I think about it, form/input validation is actually a refinement of handling type errors (or data structure errors) in general, and from what you have so far, the intent is that type errors should:
I don't think this is actually limited to external/UI interfaces, but is useful in all parts of a system where messages are passed/consumed. I think this would be easier to review if we started at the top with a more value-driven purpose for the spec. You highlighted a lot of problems that we hope to solve, but I think if have clearer answers to a few questions, it will help guide our spec definition. Some example questions (based on my reading of the spec so far):
What I would like to see, as a separate spec for errors that leave a system:
Even though its out of scope here, I've seen several standards on handling PII (personally identifiable information) that specify only logging IDs of objects which would contain the information. I have mixed feelings on that kind of approach, but from a data-deletion standpoint, it makes sense. I've also seen strategies that store encrypted values, and devs check-out the decryption key (which is rotated based on time). Again, out of scope, but putting here for kicks. |
Note for self: follow up on GraphQL error |
Leaving my notes here, There is a strong reason not to create two specs at the top level of the message. The error message travels across multiple hands (Product Engineer, Platform Engineer, Infra Engineer), and each role has its own needs.
We would wrap things in a middleware pipeline style, where developers assume what happened after or before. This would lead to every error prune and, in the worst case, insecure situations. Suppose we keep the discrimination union as close as possible to where it matters based on different roles and perspectives. In that case, developers with such a high level of information must deal with such branching. The higher the branching, the earlier the branching and the more people will make assumptions that may be, most likely, wrong.
Example, stripping GDPR information from logging the error matters because of GDPR, not because it is internal or not Every role must care about what information is injected and its different uses. Even Product Engineers have to care about GDPR since they know the domain and, therefore, can correctly tag information or distinguish between internal and external. |
Let me build up to a suggestion. When reviewing errors we could say there's a "macro lens" and a "micro lens". With a Macro lens, you're looking at counts and rates of occurrences. You'd be using this info to prioritize fixing one error over another. With a Micro lens, you're looking at individual errors and trying to understand the 5 W's -- who, what, where, when, why, and how? Usually you're trying to understand the 5 W's with the intention of solving the error. Ok, pursuing the Micro lens... There's some information that could be included to streamline resolving issues.
|
It is anti-pattern to add the `localized_message` field at the core of the | ||
system, aka domain layer, where you are not aware of the end-user actor and | ||
the end-user language, unless you are building a system that is about | ||
translations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Sorry for the delay, I am having some health issue at the moment,
That is where the stack trace info would help you, imagine a system built around these errors where you could treat the error as any other schema and have a schema registry for it, static stuff
Somebody else brought up the same conversation, at that moment I couldnt' make up my mind about it, I think I will commit to add metadata to the error itself
Back then I thought that OTEL should be doing this, but after some years, I realized that OTEL tools have their own guarantees (like dropping things or not, billing concerns) so I think it is prudent to have this as part of the spec. @dallashall @acco @manuphatak I would love to hear your thoughts on that one.
This is out of this particular spec, but you can add to the spec (follow up ADR) to come up with patterns for a given product or organization where you may want to enforce the tenant id as part of the metadata or something like that. |
Interested?
Every comment, every link to any information about it, and every opinion is appreciated (including the opinionated and strong ones), as small as it could be.
Please participate.