feat: adds conceptual framework docs
1 parent f7d4a68 · commit 8397296

Showing 31 changed files with 788 additions and 56 deletions.
2 changes: 1 addition & 1 deletion
packages/reference/environments/production/us-east-2/pf_website/version.yaml

@@ -1 +1 @@
-version: alpha.10
+version: alpha.11
183 changes: 183 additions & 0 deletions
...ges/website/src/app/(web)/docs/framework/framework/downtime-visibility/page.mdx

@@ -0,0 +1,183 @@
# Downtime Visibility

Keeping your systems online is one of the largest responsibilities of platform engineering. Downtime not only means you aren't delivering value to your users at that moment, but it also undermines your organization's public reputation, potentially causing users to reconsider their relationship with your service.

However, preventing downtime doesn't start with deploying a highly available Kubernetes cluster. It starts with creating visibility into when and why downtime occurs so that your organization can improve its resilience engineering practices.
## Defining Downtime

"Is the system down?" is a surprisingly nuanced question, and many organizations struggle to come up with a concrete, repeatable process for answering it.

Obviously, if the servers are offline, the system is down. But what if a bug prevents 25% of users from logging in? What if users can interact with the application, but response times are 10x above normal? Is the system down?

The truth is that there is no universal definition of downtime, but your job is to make this problem concrete for your organization.
We recommend the following exercise:

1. Break your system into top-level application components. For example, if you are a social media site, you might have the following components:

   * Authentication
   * Messaging
   * Feed
   * Profiles
   * Posts

2. For each top-level component, define a set of application flows that **must** work in order for that component to be considered functional. For example, consider authentication. You might come up with the following application flows:

   * Users can enter a username and password, press submit, and land on their home feed within 500ms.
   * Users can click a social login button, provide their Google credentials, and land on their home feed within 1000ms.
   * Users can click the logout button, have their local authentication token removed, and be returned to the login page within 500ms.

3. Devise a standard way to evaluate whether those application flows are functional. There are many potential approaches, which we discuss in the next section.
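The output of steps 1 and 2 can be captured as structured data that your tooling (dashboards, alerting, synthetic tests) can consume. The following is a minimal sketch in TypeScript; the type names and fields are illustrative assumptions, not a prescribed schema:

```ts
// A sketch of capturing application components and their flows as data.
// All names here (AppFlow, maxLatencyMs, etc.) are hypothetical.
interface AppFlow {
  description: string;   // the user-facing behavior that must work
  maxLatencyMs: number;  // latency budget before the flow counts as degraded
}

interface AppComponent {
  name: string;
  flows: AppFlow[];
}

const components: AppComponent[] = [
  {
    name: "Authentication",
    flows: [
      { description: "Log in with username and password", maxLatencyMs: 500 },
      { description: "Log in with Google social login", maxLatencyMs: 1000 },
      { description: "Log out and return to the login page", maxLatencyMs: 500 },
    ],
  },
  // ...Messaging, Feed, Profiles, Posts
];
```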
The key is to evaluate the question of downtime from the perspective of your users, not from the perspective of your infrastructure. Why is this critical?

* The business impact (\$\$\$) of downtime can only be measured in the context of what business value is being disrupted.

* Nobody outside the immediate engineering team knows what it means when service `foo` is failing its health checks.

* A bottom-up approach is error-prone. In other words, there are 1,000s of technical reasons why authentication might be broken. You will never be able to measure every possible failure scenario reliably. Instead, measure the thing you care about directly: "Can users log in?"
## Identifying Downtime

### Continuous Testing

The gold standard for measuring downtime is continuous testing in production. In other words, you should have an automated mechanism that simulates the application flows you want to ensure are functional. These tests should run regularly against your production system, and test failures should alert the team to downtime in that application component.
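For example, a continuous production test for the password-login flow might look like the following sketch. It assumes a Playwright-based test runner and a dedicated synthetic test account; the selectors, URL, and environment variables are hypothetical:

```ts
import { test, expect } from "@playwright/test";

// Runs on a schedule against production; a failure indicates that the
// "log in with username and password" flow is down.
test("synthetic: password login lands on the home feed", async ({ page }) => {
  await page.goto("https://example.com/login");

  // Use a dedicated synthetic account so that this traffic can be
  // filtered out of business metrics.
  await page.fill("#username", process.env.SYNTHETIC_USER!);
  await page.fill("#password", process.env.SYNTHETIC_PASSWORD!);

  const start = Date.now();
  await page.click("button[type=submit]");
  await page.waitForURL(/\/feed/);

  // The flow definition gives this flow a 500ms latency budget.
  expect(Date.now() - start).toBeLessThanOrEqual(500);
});
```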
While this produces an extremely high signal-to-noise ratio, it also presents many challenges:

* Designing the test runners themselves

* Ensuring that test actions and data do not pollute business metrics for actual users

* Keeping the tests updated with application code changes to prevent false positives

For many small organizations, this might be too costly to implement effectively. It is entirely reasonable to decide that this investment is too heavy at the current point in your organization's lifecycle.
### Analyzing Runtime Data

An easier solution is to instrument your application code to produce a signal or event when an application flow fails for unexpected reasons.

For example, if a user attempts to log in but the server returns an HTTP `5xx` error code, the application code would fire an event into your observability platform indicating that "User login with username and password failed." After a certain error threshold, a downtime notification would be triggered.
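As a sketch of what that instrumentation might look like in client-side code (the `reportFlowFailure` helper, the event endpoint, and the payload shape are assumptions for illustration, not a specific vendor's API):

```ts
// Hypothetical helper that forwards events to an observability
// platform; the endpoint and payload shape are illustrative.
async function reportFlowFailure(flow: string, detail: string): Promise<void> {
  await fetch("/internal/observability/events", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ type: "flow_failure", flow, detail }),
  }).catch(() => {
    // Never let reporting failures break the application itself.
  });
}

async function login(username: string, password: string): Promise<Response> {
  const res = await fetch("/api/login", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, password }),
  });

  // Only 5xx responses suggest downtime; a 403 usually just means the
  // user entered the wrong password (see "over-reporting" below).
  if (res.status >= 500) {
    await reportFlowFailure("auth.password_login", `HTTP ${res.status}`);
  }
  return res;
}
```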
While this is an easier technical lift than continuous testing, it suffers from a handful of downsides:

* It requires that real users experience problems before downtime can be detected.

* The instrumentation can be error-prone in both over- and under-reporting actual downtime:

  * Over-reporting: Consider the authentication scenario above. A `403` error might simply indicate that the user entered an incorrect password. However, an engineer might accidentally capture that as a downtime error.

  * Under-reporting: Consider the scenario where the observability code fails to load due to a server error that also impacts other systems. That impact would not be captured unless you explicitly designed around this failure scenario.

  Unfortunately, there are 1,000s of subtle ways systems might break, and it is difficult (if not impossible) to account for them all. In our experience, even the best attempts at explicit reporting miss 10-20% of downtime events.

* The differentiation between what is downtime versus what is simply a bug can become very tricky. For example, consider the popular error-reporting system, [Sentry](https://sentry.io/welcome/), which captures *every* unhandled exception your code throws. The vast majority of exceptions likely do not indicate downtime, so if you use Sentry, you would need to devise a system to automatically identify which exceptions *do* indicate downtime (see the sketch following this list).

* It requires that you have an observability platform that can capture and alert on arbitrary events. Depending on your exact setup, deploying and configuring an observability platform can be a significant amount of work by itself. [^2]
[^2]: That said, you should definitely prioritize deploying observability tools over automated detection of downtime. You can survive relying on user downtime reports... you cannot survive if you do not have the observability tooling required to debug and fix production bugs in a timely manner.
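One possible approach with Sentry's JavaScript SDK is to tag exceptions raised inside critical application flows so that alerting rules can treat them differently from ordinary bug reports. The wrapper and tag name below are conventions invented for this sketch, not Sentry features:

```ts
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

// Exceptions thrown inside a critical flow are tagged so that alerting
// rules can treat them as downtime candidates. The "critical_flow" tag
// is a naming convention invented for this sketch.
async function runCriticalFlow<T>(
  flow: string,
  fn: () => Promise<T>
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    Sentry.captureException(err, { tags: { critical_flow: flow } });
    throw err;
  }
}

// Usage: only exceptions from wrapped flows become downtime candidates;
// unhandled exceptions elsewhere remain ordinary bug reports.
// await runCriticalFlow("auth.password_login", () => submitLogin());
```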
### Service Health Checks

This is the easiest system to implement, and you will already need to provide health checks if you plan to run software on most modern infrastructure solutions such as Kubernetes.

The premise is simple: provide an endpoint that returns a payload that indicates which parts of the service are functional and which are not. Analyze the payload to determine which application components the non-functional service components would likely impact.

For example, if the user authentication database is offline, then it would be safe to say that the Authentication application component is down.
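As a minimal sketch, here is what such an endpoint might look like in a Node service using Express; the payload shape, component names, and dependency probe are illustrative assumptions:

```ts
import express from "express";

const app = express();

// Hypothetical dependency probe; in a real service this would
// actually ping the database (e.g., `SELECT 1`).
async function checkAuthDatabase(): Promise<boolean> {
  try {
    return true;
  } catch {
    return false;
  }
}

app.get("/healthz", async (_req, res) => {
  const authDbUp = await checkAuthDatabase();

  // Map service-level checks onto the application components they
  // would likely impact, so the payload is meaningful to consumers.
  const payload = {
    status: authDbUp ? "healthy" : "unhealthy",
    components: {
      Authentication: authDbUp ? "up" : "down",
    },
  };

  res.status(authDbUp ? 200 : 503).json(payload);
});

app.listen(8080);
```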
While this can be a good starting point (and is certainly better than nothing), it suffers from significant over- and under-reporting issues:

* Over-reporting: In a resiliently designed system, a subset of the infrastructure or services may indeed be failing or degraded, but this could have no impact on actual users due to resiliency safeguards such as failovers and retries.

* Under-reporting: Again, there are 1,000s of ways any application flow might break. For example, the authentication service passing its health checks *by no means indicates that authentication is currently working.* Only the inverse is true: if the service is failing its health checks, then you can assume authentication is *not* working.

Additionally, this approach can only be used on live services and does not work for identifying issues caused solely by problems in client-side code.
### User Reports

It is worth calling out that user reports are **not** an effective solution for identifying downtime in the majority of deployment scenarios.

Generally, 100s of users will experience the downtime event for every user report you actually receive, and this can dramatically delay the identification of a downtime incident.

Moreover, these reports rarely go directly to the engineering team responsible for fixing the downtime, which can cause further confusion and delays.

While you *should* have some mechanism for users to report problems, this process should not be your default method of downtime identification.
## Communicating Downtime

Once downtime occurs, your job is to ensure that the organization has the right visibility into what is failing and why.

While your organization will have its own unique incident response process, you should ensure that you have at minimum the following components in place:

* Assign direct responsibility for each application domain. Know who is responsible for initial triage when a particular application domain is experiencing problems. This usually involves setting up an on-call schedule, as problems can occur at *any* time, not just during working hours.

* Have a means to automatically notify the responsible person(s) when an issue occurs, and a means to escalate the issue if the responsible person(s) do not respond within a certain time frame.

* Provide a status dashboard that lists each application domain along with its current health. Additionally, provide a means to post manual updates to this page as the downtime incident is being addressed by the team. It can sometimes be appropriate to have different dashboards for internal and external stakeholders.
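As one illustration, a hand-rolled status dashboard might serve data shaped like the following; the field names are assumptions, not a standard or vendor schema:

```ts
// Illustrative shape for a status page's data feed.
type DomainHealth = "operational" | "degraded" | "down";

interface StatusUpdate {
  postedAt: string;  // ISO 8601 timestamp
  message: string;   // manual update from the responding team
}

interface DomainStatus {
  domain: string;          // e.g., "Authentication"
  health: DomainHealth;
  owner: string;           // team with direct responsibility
  updates: StatusUpdate[]; // manual updates during an incident
}

const status: DomainStatus[] = [
  {
    domain: "Authentication",
    health: "degraded",
    owner: "identity-team",
    updates: [
      {
        postedAt: "2024-01-01T12:00:00Z",
        message: "Elevated login latency; failover in progress.",
      },
    ],
  },
];
```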
## Preventing Downtime

Finally, it is your responsibility to provide visibility into both how downtime could have been prevented *and* how the incident could have been resolved more efficiently.

This is typically done via a [postmortem process](https://en.wikipedia.org/wiki/Postmortem_documentation). While every organization will approach the process differently, as a platform engineer you should be present in the discussions to ensure that you understand what improvements to the platform could be made to mitigate the risk and impact of future incidents. This might include action items such as better safeguards in your CI/CD pipelines or better observability tooling to aid engineers in tracking down the root cause.

Moreover, you should ensure that everyone leaves the postmortem with a concrete financial impact estimate for the incident (e.g., "the incident caused \$50k in lost revenue"). This estimate should be used to justify the return on investment (ROI) of platform improvements.

While unfortunate, downtime incidents provide an amazing opportunity for stakeholder alignment on how and why the overall platform should be improved. It is your responsibility to rally the team towards the common goal of providing a more reliable system.