feat: adds conceptual framework docs
1 parent f7d4a68 · commit 8397296

Showing 31 changed files with 788 additions and 56 deletions.
2 changes: 1 addition & 1 deletion
packages/reference/environments/production/us-east-2/pf_website/version.yaml

@@ -1 +1 @@
-version: alpha.10
+version: alpha.11
183 changes: 183 additions & 0 deletions
...ges/website/src/app/(web)/docs/framework/framework/downtime-visibility/page.mdx

@@ -0,0 +1,183 @@
# Downtime Visibility

Keeping your systems online is one of the largest responsibilities of platform engineering. Downtime not only means you aren't delivering value to your users at that moment, but it also undermines your organization's public reputation, potentially causing users to reconsider their relationship with your service.

However, preventing downtime doesn't start with deploying a highly available Kubernetes cluster. It starts with creating visibility into when and why downtime occurs so that your organization can improve its resilience engineering practices.
## Defining Downtime

"Is the system down?" is a surprisingly nuanced question, and many organizations struggle to come up with a concrete, repeatable process for answering it.

Obviously, if the servers are offline, the system is down. But what if a bug prevents 25% of users from logging in? What if users can interact with the application, but response times are 10x above normal? Is the system down?

The truth is that there is no universal definition of downtime, but your job is to make this problem concrete for your organization.
We recommend the following exercise:

1. Break your system into top-level application components. For example, if you are a social media site, you might have the following components:

   * Authentication
   * Messaging
   * Feed
   * Profiles
   * Posts

2. For each top-level component, define a set of application flows that **must** work in order for that component to be considered functional. For example, consider authentication. You might come up with the following application flows:

   * Users can enter a username and password, press submit, and land on their home feed within 500ms.
   * Users can click a social login button, provide their Google credentials, and land on their home feed within 1000ms.
   * Users can click the logout button, have their local authentication token removed, and be returned to the login page within 500ms.

3. Devise a standard way to evaluate whether those application flows are functional. There are many potential approaches, which we discuss in the next section.
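The output of steps 1 and 2 can be captured as structured data that your tooling (dashboards, alerting, synthetic tests) can consume. The following is a minimal sketch in TypeScript; the type names and fields are illustrative assumptions, not a prescribed schema:

```ts
// A sketch of capturing application components and their flows as data.
// All names here (AppFlow, maxLatencyMs, etc.) are hypothetical.
interface AppFlow {
  description: string;   // the user-facing behavior that must work
  maxLatencyMs: number;  // latency budget before the flow counts as degraded
}

interface AppComponent {
  name: string;
  flows: AppFlow[];
}

const components: AppComponent[] = [
  {
    name: "Authentication",
    flows: [
      { description: "Log in with username and password", maxLatencyMs: 500 },
      { description: "Log in with Google social login", maxLatencyMs: 1000 },
      { description: "Log out and return to the login page", maxLatencyMs: 500 },
    ],
  },
  // ...Messaging, Feed, Profiles, Posts
];
```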
The key is to evaluate the question of downtime from the perspective of your users, not from the perspective of your infrastructure. Why is this critical?

* The business impact (\$\$\$) of downtime can only be measured in the context of what business value is being disrupted.

* Nobody outside the immediate engineering team knows what it means when service `foo` is failing its health checks.

* A bottom-up approach is error-prone. In other words, there are 1,000s of technical reasons why authentication might be broken. You will never be able to measure every possible failure scenario reliably. Instead, measure the thing you care about directly: "Can users log in?"
## Identifying Downtime

### Continuous Testing

The gold standard for measuring downtime is continuous testing in production. In other words, you should have an automated mechanism that simulates the application flows you want to ensure are functional. These tests should run regularly against your production system, and test failures should alert the team to downtime in that application component.
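For example, a continuous production test for the password-login flow might look like the following sketch. It assumes a Playwright-based test runner and a dedicated synthetic test account; the selectors, URL, and environment variables are hypothetical:

```ts
import { test, expect } from "@playwright/test";

// Runs on a schedule against production; a failure indicates that the
// "log in with username and password" flow is down.
test("synthetic: password login lands on the home feed", async ({ page }) => {
  await page.goto("https://example.com/login");

  // Use a dedicated synthetic account so that this traffic can be
  // filtered out of business metrics.
  await page.fill("#username", process.env.SYNTHETIC_USER!);
  await page.fill("#password", process.env.SYNTHETIC_PASSWORD!);

  const start = Date.now();
  await page.click("button[type=submit]");
  await page.waitForURL(/\/feed/);

  // The flow definition gives this flow a 500ms latency budget.
  expect(Date.now() - start).toBeLessThanOrEqual(500);
});
```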
While this produces an extremely high signal-to-noise ratio, it also presents many challenges:

* Designing the test runners themselves

* Ensuring that test actions and data do not pollute business metrics for actual users

* Keeping the tests updated with application code changes to prevent false positives

For many small organizations, this might be too costly to implement effectively. It is entirely reasonable to decide that this investment is too heavy at the current point in your organization's lifecycle.
### Analyzing Runtime Data

An easier solution is to instrument your application code to produce a signal or event when an application flow fails for unexpected reasons.

For example, if a user attempts to log in but the server returns an HTTP `5xx` error code, the application code would fire an event into your observability platform indicating that "User login with username and password failed." After a certain error threshold, a downtime notification would be triggered.
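As a sketch of what that instrumentation might look like in client-side code (the `reportFlowFailure` helper, the event endpoint, and the payload shape are assumptions for illustration, not a specific vendor's API):

```ts
// Hypothetical helper that forwards events to an observability
// platform; the endpoint and payload shape are illustrative.
async function reportFlowFailure(flow: string, detail: string): Promise<void> {
  await fetch("/internal/observability/events", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ type: "flow_failure", flow, detail }),
  }).catch(() => {
    // Never let reporting failures break the application itself.
  });
}

async function login(username: string, password: string): Promise<Response> {
  const res = await fetch("/api/login", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, password }),
  });

  // Only 5xx responses suggest downtime; a 403 usually just means the
  // user entered the wrong password (see "over-reporting" below).
  if (res.status >= 500) {
    await reportFlowFailure("auth.password_login", `HTTP ${res.status}`);
  }
  return res;
}
```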
While this is an easier technical lift than continuous testing, it suffers from a handful of downsides:

* It requires that real users experience problems before downtime can be detected.

* The instrumentation can be error-prone in both over- and under-reporting actual downtime:

  * Over-reporting: Consider the authentication scenario above. A `403` error might simply indicate that the user entered an incorrect password. However, an engineer might accidentally capture that as a downtime error.

  * Under-reporting: Consider the scenario where the observability code fails to load due to a server error that also impacts other systems. That impact would not be captured unless you explicitly designed around this failure scenario.

  Unfortunately, there are 1,000s of subtle ways systems might break, and it is difficult (if not impossible) to account for them all. In our experience, even the best attempts at explicit reporting miss 10-20% of downtime events.

* The differentiation between what is downtime versus what is simply a bug can become very tricky. For example, consider the popular error-reporting system, [Sentry](https://sentry.io/welcome/), which captures *every* unhandled exception your code throws. The vast majority of exceptions likely do not indicate downtime, so if you use Sentry, you would need to devise a system to automatically identify which exceptions *do* indicate downtime (see the sketch following this list).

* It requires that you have an observability platform that can capture and alert on arbitrary events. Depending on your exact setup, deploying and configuring an observability platform can be a significant amount of work by itself. [^2]
[^2]: That said, you should definitely prioritize deploying observability tools over automated detection of downtime. You can survive relying on user downtime reports... you cannot survive if you do not have the observability tooling required to debug and fix production bugs in a timely manner.
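One possible approach with Sentry's JavaScript SDK is to tag exceptions raised inside critical application flows so that alerting rules can treat them differently from ordinary bug reports. The wrapper and tag name below are conventions invented for this sketch, not Sentry features:

```ts
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

// Exceptions thrown inside a critical flow are tagged so that alerting
// rules can treat them as downtime candidates. The "critical_flow" tag
// is a naming convention invented for this sketch.
async function runCriticalFlow<T>(
  flow: string,
  fn: () => Promise<T>
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    Sentry.captureException(err, { tags: { critical_flow: flow } });
    throw err;
  }
}

// Usage: only exceptions from wrapped flows become downtime candidates;
// unhandled exceptions elsewhere remain ordinary bug reports.
// await runCriticalFlow("auth.password_login", () => submitLogin());
```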
### Service Health Checks

This is the easiest system to implement, and you will already need to provide health checks if you plan to run software on most modern infrastructure solutions such as Kubernetes.

The premise is simple: provide an endpoint that returns a payload that indicates which parts of the service are functional and which are not. Analyze the payload to determine which application components the non-functional service components would likely impact.

For example, if the user authentication database is offline, then it would be safe to say that the Authentication application component is down.
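As a minimal sketch, here is what such an endpoint might look like in a Node service using Express; the payload shape, component names, and dependency probe are illustrative assumptions:

```ts
import express from "express";

const app = express();

// Hypothetical dependency probe; in a real service this would
// actually ping the database (e.g., `SELECT 1`).
async function checkAuthDatabase(): Promise<boolean> {
  try {
    return true;
  } catch {
    return false;
  }
}

app.get("/healthz", async (_req, res) => {
  const authDbUp = await checkAuthDatabase();

  // Map service-level checks onto the application components they
  // would likely impact, so the payload is meaningful to consumers.
  const payload = {
    status: authDbUp ? "healthy" : "unhealthy",
    components: {
      Authentication: authDbUp ? "up" : "down",
    },
  };

  res.status(authDbUp ? 200 : 503).json(payload);
});

app.listen(8080);
```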
While this can be a good starting point (and is certainly better than nothing), it suffers from significant over- and under-reporting issues:

* Over-reporting: In a resiliently designed system, a subset of the infrastructure or services may indeed be failing or degraded, but this could have no impact on actual users due to resiliency safeguards such as failovers and retries.

* Under-reporting: Again, there are 1,000s of ways any application flow might break. For example, the authentication service passing its health checks *by no means indicates that authentication is currently working.* Only the inverse is true: if the service is failing its health checks, then you can assume authentication is *not* working.

Additionally, this approach can only be used on live services and does not work for identifying issues caused solely by problems in client-side code.
### User Reports

It is worth calling out that user reports are **not** an effective solution for identifying downtime in the majority of deployment scenarios.

Generally, 100s of users will experience the downtime event for every user report you actually receive, and this can dramatically delay the identification of a downtime incident.

Moreover, these reports rarely go directly to the engineering team responsible for fixing the downtime, which can cause further confusion and delays.

While you *should* have some mechanism for users to report problems, this process should not be your default method of downtime identification.
## Communicating Downtime

Once downtime occurs, your job is to ensure that the organization has the right visibility into what is failing and why.

While your organization will have its own unique incident response process, you should ensure that you have at minimum the following components in place:

* Assign direct responsibility for each application domain. Know who is responsible for initial triage when a particular application domain is experiencing problems. This usually involves setting up an on-call schedule, as problems can occur at *any* time, not just during working hours.

* Have a means to automatically notify the responsible person(s) when an issue occurs, and a means to escalate the issue if the responsible person(s) do not respond within a certain time frame.

* Provide a status dashboard that lists each application domain along with its current health. Additionally, provide a means to post manual updates to this page as the downtime incident is being addressed by the team. It can sometimes be appropriate to have different dashboards for internal and external stakeholders.
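As one illustration, a hand-rolled status dashboard might serve data shaped like the following; the field names are assumptions, not a standard or vendor schema:

```ts
// Illustrative shape for a status page's data feed.
type DomainHealth = "operational" | "degraded" | "down";

interface StatusUpdate {
  postedAt: string;  // ISO 8601 timestamp
  message: string;   // manual update from the responding team
}

interface DomainStatus {
  domain: string;          // e.g., "Authentication"
  health: DomainHealth;
  owner: string;           // team with direct responsibility
  updates: StatusUpdate[]; // manual updates during an incident
}

const status: DomainStatus[] = [
  {
    domain: "Authentication",
    health: "degraded",
    owner: "identity-team",
    updates: [
      {
        postedAt: "2024-01-01T12:00:00Z",
        message: "Elevated login latency; failover in progress.",
      },
    ],
  },
];
```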
## Preventing Downtime

Finally, it is your responsibility to provide visibility into both how downtime could have been prevented *and* how the incident could have been resolved more efficiently.

This is typically done via a [postmortem process](https://en.wikipedia.org/wiki/Postmortem_documentation). While every organization will approach the process differently, as a platform engineer you should be present in the discussions to ensure that you understand what improvements to the platform could be made to mitigate the risk and impact of future incidents. This might include action items such as better safeguards in your CI/CD pipelines or better observability tooling to aid engineers in tracking down the root cause.

Moreover, you should ensure that everyone leaves the postmortem with a concrete financial impact estimate for the incident (e.g., "the incident caused \$50k in lost revenue"). This estimate should be used to justify the return on investment (ROI) of platform improvements.

While unfortunate, downtime incidents provide an amazing opportunity for stakeholder alignment on how and why the overall platform should be improved. It is your responsibility to rally the team towards the common goal of providing a more reliable system.