feat: adds conceptual framework docs
fullykubed committed Apr 22, 2024
1 parent f7d4a68 commit 8397296
Showing 31 changed files with 788 additions and 56 deletions.
7 changes: 6 additions & 1 deletion cspell.config.yaml
@@ -114,7 +114,12 @@ words: [
"failovers",
"snapshotting",
"TOTP",
"Yubikeys"
"Yubikeys",
"MTTD",
"MTTR",
"kpis",
"featureful",
"triaging"
]
languageSettings:
- languageId: markdown,mdx
2 changes: 1 addition & 1 deletion packages/nix/devenv/lint.sh
@@ -59,7 +59,7 @@ echo >&2 "Finished shell linting!"
echo >&2 "Starting documentation linting..."
(
cd "$DEVENV_ROOT/packages/website"
./node_modules/.bin/remark src -e .mdx -e .md -o -S -r .remarkrc.lint.mjs
./node_modules/.bin/remark src -e .mdx -e .md -o -S -r .remarkrc.mjs
)
echo >&2 "Finished documentation linting!"

7 changes: 0 additions & 7 deletions packages/reference/.podman/config.json

This file was deleted.

@@ -1 +1 @@
version: alpha.10
version: alpha.11
@@ -19,7 +19,8 @@ const remarkConfig = {
['remark-lint-fenced-code-flag', ['error']],
['remark-lint-heading-increment', ['error']],
'remark-mdx',
'remark-gfm'
'remark-gfm',
'remark-math'
]
}

2 changes: 1 addition & 1 deletion packages/website/package.json
@@ -25,7 +25,7 @@
"buffer": "6.0.3",
"clsx": "1.2.1",
"dayjs": "1.11.9",
"iconoir-react": "6.11.0",
"iconoir-react": "7.6.0",
"katex": "^0.16.9",
"next": "14.1.0",
"postcss-flexbugs-fixes": "^5.0.2",
2 changes: 1 addition & 1 deletion packages/website/src/app/(web)/TextSlider.tsx
@@ -14,7 +14,7 @@ export default function TextSlider (props: TextSliderProps) {
useEffect(() => {
const timeout = setTimeout(() => {
setIndex(index === items.length - 1 ? 0 : index + 1)
}, 4000)
}, 2500)
return () => {
clearTimeout(timeout)
}
@@ -207,7 +207,7 @@ We believe this is an inappropriate use case for the following reasons:
* A typical VPN setup will route *all* network traffic from the remote machine to the private network. This introduces problems:
* Organizations incur significant costs for non-work traffic routed through metered networks (e.g., AWS). [^10]
* Users experience degraded network performance as internet bandwidth is shared across all VPN users.
* Users have sacrifice privacy unnecessarily by allowing their organization to view their internet.
* Users sacrifice privacy unnecessarily by allowing their organization to view their internet traffic.

[^10]: For some clients, we have seen VPNs and related charges account for over 25% of their entire AWS bill!

@@ -0,0 +1,183 @@
# Downtime Visibility

Keeping your systems online is one of the largest responsibilities of platform engineering. Downtime not only
means you aren't delivering value to your users at that moment, but it also undermines your organization's
public reputation, potentially causing users to reconsider their relationship with your service.

However, preventing downtime doesn't start with deploying a highly-available Kubernetes cluster. It involves creating
visibility into when and why downtime occurs so that your organization can improve its resilience engineering practices.

## Defining Downtime

"Is the system down?" is a surprisingly nuanced question, and many organizations struggle to come up with a concrete,
repeatable process for answering this question.

Obviously, if the servers are offline, the system is down. But what if a bug prevents 25% of users from logging in? What
if users can interact with the application, but response times are 10x above normal? Is the system down?

The truth is that there is no universal definition of downtime, but your job is to bring concreteness to this problem.

We recommend the following exercise:

1. Break your system into top-level application components. For example, if you are a social media site, you might have the following components:

* Authentication
* Messaging
* Feed
* Profiles
* Posts

2. For each top-level component, define a set of application flows that **must** work in order for that component to be considered
functional. For example, consider authentication. You might come up with the following application flows:

* Users can enter a username and password, press submit, and land on their home feed within 500ms.
* Users can click a social login button, provide their Google credentials, and land on their home feed within 1000ms.
* Users can click the logout button, have their local authentication token removed, and be returned to the login page within 500ms.

3. Devise some standard way for evaluating whether those application flows are functional. There are many potential approaches,
   which we will discuss in the next section. (A sketch of how these components and flows might be modeled in code follows this list.)
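
For illustration, here is one way this breakdown might be captured in code so that it can later drive dashboards or
synthetic tests. This is a minimal sketch; the type names, latency budgets, and component list are assumptions based on
the example above, not part of any particular tool.

```typescript
// Hypothetical model of top-level components and the flows that define "up" for each.
interface ApplicationFlow {
  description: string    // what the user must be able to do
  maxLatencyMs: number   // latency budget for the flow to count as functional
}

interface ApplicationComponent {
  name: string
  flows: ApplicationFlow[]
}

const components: ApplicationComponent[] = [
  {
    name: 'Authentication',
    flows: [
      { description: 'Username/password login lands on the home feed', maxLatencyMs: 500 },
      { description: 'Google social login lands on the home feed', maxLatencyMs: 1000 },
      { description: 'Logout clears the local token and returns to the login page', maxLatencyMs: 500 }
    ]
  },
  { name: 'Messaging', flows: [] },
  { name: 'Feed', flows: [] },
  { name: 'Profiles', flows: [] },
  { name: 'Posts', flows: [] }
]
```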

The key is to evaluate the question of downtime from the perspective of your users, not from the
perspective of your infrastructure. Why is this critical?

* The business impact (\$\$\$) of downtime can only be measured in the context of what business value is being disrupted.

* Nobody outside the immediate engineering team knows what it means when service `foo` is failing its health checks.

* A bottom-up approach is error-prone. In other words, there are 1,000s of technical reasons why authentication might be broken. You
will never be able to measure every possible failure scenario reliably. Instead, measure the thing you care about
directly: "Can users log in?"

## Identifying Downtime

### Continuous Testing

The gold standard for measurement of downtime is continuous testing in production. In other words, you should
have an automated mechanism that simulates the application flows that you want to ensure are functional. These tests should regularly
run against your production system. Test failures should alert the team to downtime in that application component.
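
As a concrete (and simplified) illustration, a synthetic check for the username/password login flow might look like the
sketch below. The endpoint, credentials, and alerting hook are assumptions; a real implementation would typically drive a
headless browser and authenticate with a dedicated synthetic-test account.

```typescript
// Minimal synthetic check for the login flow (Node 18+ / browser fetch assumed).
const LOGIN_URL = 'https://example.com/api/login' // hypothetical endpoint
const LATENCY_BUDGET_MS = 500                     // from the flow definition

async function checkLoginFlow (): Promise<{ ok: boolean, reason?: string }> {
  const start = Date.now()
  try {
    const res = await fetch(LOGIN_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // In practice, load credentials for a dedicated synthetic user from a secret store
      body: JSON.stringify({ username: 'synthetic-test-user', password: 'REDACTED' })
    })
    const elapsedMs = Date.now() - start
    if (!res.ok) return { ok: false, reason: `HTTP ${res.status}` }
    if (elapsedMs > LATENCY_BUDGET_MS) return { ok: false, reason: `${elapsedMs}ms exceeds ${LATENCY_BUDGET_MS}ms budget` }
    return { ok: true }
  } catch (err) {
    return { ok: false, reason: `request failed: ${String(err)}` }
  }
}

// Run on an interval and alert on failure; the alerting call is a placeholder.
setInterval(async () => {
  const result = await checkLoginFlow()
  if (!result.ok) {
    console.error(`[ALERT] Authentication flow failing: ${result.reason}`)
  }
}, 60_000)
```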

While this produces an extremely high signal-to-noise ratio, it also presents many challenges:

* Designing the test runners themselves

* Ensuring that test actions and data do not pollute business metrics for actual users

* Keeping the tests updated with application code changes to prevent false positives

For many small organizations, this might be too costly to implement effectively. It is entirely reasonable
to decide that this investment is too heavy at the current point in your organization's lifecycle.

### Analyzing Runtime Data

An easier solution is to instrument your application code to produce a signal or event when an application flow
fails for unexpected reasons.

For example, if a user attempts to log in but the server returns an HTTP `5xx` error
code, the application code would fire an event into your observability platform indicating that "User login
with username and password failed." After a certain error threshold, a downtime notification would
be triggered.
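
A sketch of what that instrumentation might look like follows. The `reportDowntimeSignal` helper is hypothetical and
stands in for whatever event your observability platform ingests; note that it only fires on `5xx` responses so that
ordinary user errors are not counted as downtime.

```typescript
// Hypothetical helper that forwards a structured event to your observability platform.
async function reportDowntimeSignal (event: { component: string, flow: string, detail: string }) {
  console.error('downtime-signal', JSON.stringify(event)) // replace with a real exporter
}

async function loginWithPassword (username: string, password: string): Promise<Response> {
  const res = await fetch('/api/login', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username, password })
  })

  // Only 5xx responses suggest the flow itself is broken; a 403 usually just
  // means incorrect credentials and should not be reported as downtime.
  if (res.status >= 500) {
    await reportDowntimeSignal({
      component: 'Authentication',
      flow: 'User login with username and password',
      detail: `HTTP ${res.status}`
    })
  }
  return res
}
```

An alert rule in the observability platform would then apply the error threshold before paging anyone, rather than
treating any single event as downtime.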

While this is an easier technical lift than continuous testing, it suffers from a handful of downsides:

* It requires that real users experience problems before downtime can be detected.

* The instrumentation can be error-prone in both over- and under-reporting actual downtime:

* Over-reporting: Consider the authentication scenario above. A `403` error might simply indicate that the user
entered an incorrect password. However, an engineer might accidentally capture that as a downtime error.

* Under-reporting: Consider the scenario where the observability code fails to load due to a server error
that also impacts other systems. That impact would not be captured unless you explicitly designed around
this failure scenario.

Unfortunately, there are 1,000s of subtle ways systems might break, and it
is difficult (if not impossible) to account for them all. In our experience, even the best attempts at explicit reporting
miss 10-20% of downtime events.

* The differentiation between what is downtime versus what is simply a bug can become very tricky. For example, consider
the popular error-reporting system, [Sentry](https://sentry.io/welcome/), which captures *every* unhandled exception
your code throws. The vast majority of exceptions likely do not indicate downtime, so if you use Sentry, you
would need to devise a system to automatically identify which exceptions *do* indicate downtime.

* It requires that you have an observability platform that can capture and alert on arbitrary events. Depending
on your exact setup, deploying and configuring an observability platform can be a significant amount of work by itself. [^2]

[^2]: That said, you should definitely prioritize deploying observability tools over automated detection of downtime. You
can survive relying on user downtime reports... you cannot survive if you do not have the observability tooling required
to debug and fix production bugs in a timely manner.

### Service Health Checks

This is the easiest system to implement, and you will already need to provide health checks if you plan to run
software on most modern infrastructure solutions such as Kubernetes.

The premise is simple: provide an endpoint that returns a payload that indicates which parts of the service
are functional and which are not. Analyze the payload to determine which application components the non-functional
service components would likely impact.

For example, if the user authentication database is offline, then it would be safe
to say that the Authentication application component is down.
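
For example, a per-dependency health-check endpoint might look roughly like the following sketch. The handler shape is
framework-agnostic, and the dependency checks are illustrative placeholders.

```typescript
interface ComponentHealth { name: string, healthy: boolean, detail?: string }

// Placeholder for a cheap connectivity check against the authentication database.
async function pingAuthDatabase (): Promise<void> { /* e.g., issue a trivial query */ }

async function checkAuthDatabase (): Promise<ComponentHealth> {
  try {
    await pingAuthDatabase()
    return { name: 'auth-database', healthy: true }
  } catch (err) {
    return { name: 'auth-database', healthy: false, detail: String(err) }
  }
}

async function healthCheckHandler (): Promise<{ status: number, body: string }> {
  const results = await Promise.all([checkAuthDatabase() /* , other dependency checks */])
  const allHealthy = results.every((r) => r.healthy)
  return {
    status: allHealthy ? 200 : 503, // Kubernetes probes treat non-2xx responses as failures
    body: JSON.stringify({ healthy: allHealthy, components: results })
  }
}
```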

While this can be a good starting point (and is certainly better than nothing), this suffers from significant
over- and under-reporting issues:

* Over-reporting: In a resiliently designed system, a subset of the infrastructure or services may indeed be failing or degraded,
but this could have no impact on actual users due to resiliency safeguards such as failovers and retries.

* Under-reporting: Again, there are 1,000s of ways any application flow might break. For example, the authentication
service returning healthy health checks *by no means indicates that authentication is currently working.* Only the inverse
is true: if the service is returning *unhealthy* health checks, then you could assume authentication is *not* working.

Additionally, this can only be used on live services and does not work for identifying issues caused solely by
problems in client-side code.

### User Reports

It is worth calling out that user reports are **not** an effective solution for identifying downtime in the majority
of deployment scenarios.

Generally, 100s of users will experience the downtime event for every user report that you actually receive, and this
can dramatically delay the identification of a downtime incident.

Moreover, these reports rarely go directly to the engineering team responsible for fixing the downtime, which can
cause further confusion and delays.

While you *should* have some mechanism for users to report problems, this process should not be used as your
default method of downtime identification.

## Communicating Downtime

Once downtime occurs, your job is to ensure that your organization has the right visibility into what
is failing and why.

While your organization will have its own unique incident response process,
you should ensure that you have at minimum the following components in place:

* Assign direct responsibility for each application domain. Know who is responsible for initial
triaging when a particular application domain is experiencing problems. This usually involves setting up an
on-call schedule as problems can occur at *any* time, not just during working hours.

* Have a means to automatically notify the responsible person(s) when an issue occurs, and a means to escalate
  the issue if they do not respond within a certain time frame (a configuration sketch follows this list).

* Provide a status dashboard with each application domain listed along with its current health. Additionally, provide a means
to post manual updates to this page as the downtime incident is being addressed by the team. It can sometimes
be appropriate to have different dashboards for internal and external stakeholders.
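
As a rough illustration, the routing and escalation rules above might be captured in a configuration like the
following. The shape is hypothetical and not tied to any particular paging or status-page product; the team names and
timings are examples only.

```typescript
// Illustrative incident-routing configuration.
const incidentRouting = {
  domains: {
    Authentication: {
      onCallSchedule: 'identity-primary', // who is paged first
      escalation: [
        { after: '5m', notify: 'identity-secondary' },
        { after: '15m', notify: 'engineering-manager' }
      ]
    },
    Feed: {
      onCallSchedule: 'feed-primary',
      escalation: [{ after: '10m', notify: 'platform-on-call' }]
    }
  },
  statusPages: {
    internal: 'https://status.internal.example.com',
    external: 'https://status.example.com'
  }
}
```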

## Preventing Downtime

Finally, it is your responsibility to provide visibility into both how downtime could have been
prevented *and* how the incident could have been resolved more efficiently.

This is typically done via a [postmortem process](https://en.wikipedia.org/wiki/Postmortem_documentation). While every
organization will approach the process differently, as a platform engineer you should be present in the discussions to
ensure that you understand what improvements to the platform could be made to mitigate the risk and impact of future
incidents. This might include action items such as better safeguards in your CI/CD pipelines or better observability tooling to aid
engineers in tracking down the root cause.

Moreover, you should ensure that everyone leaves the postmortem with a concrete financial impact estimate for the incident
(e.g., "the incident caused \$50k in lost revenue"). This should be used to justify the return-on-investment (ROI) of platform improvements.

While unfortunate, downtime incidents provide an amazing opportunity for stakeholder alignment on how and why
the overall platform should be improved. It is your responsibility to rally the team towards the common goal
of providing a more reliable system.