
Commit 2ca8a96

Minor cosmetic text adjustments to the OpenTelemetry article
1 parent 6385367 commit 2ca8a96

File tree: 1 file changed (+18 / -16 lines)

content/posts/otel-from-0-to-100.md

@@ -1,30 +1,32 @@
 ---
 title: OpenTelemetry from 0 to 100
-description: The story of how we adopted OpenTelemetry at NAV
+description: The story of how we adopted OpenTelemetry at NAV (Norway's largest government agency).
 date: 2024-05-27T19:09:09+02:00
 draft: false
 author: Hans Kristian Flaatten
-tags: [observability, tracing, opentelemetry, tempo, grafana]
+tags: [observability, opentelemetry, tracing]
 featuredImage: /blog/images/otel-rappids-and-rivers.png
 ---

-This is the story of how we adopted [OpenTelemetry][otel] at the Norwegian Labour and Welfare Administration (NAV). We will cover the journey from the first steps to the first traces in production. We will also share some of the challenges we faced and how we overcame them.
+This is the (long) story of how we adopted [OpenTelemetry][otel] at the Norwegian Labour and Welfare Administration (NAV). We will cover the journey from the first commits to real traces in production. We will also share some of the challenges we faced along the way and how we overcame them.

 [otel]: https://opentelemetry.io/

-At NAV, we have a microservices architecture with thousands of services running in our Kubernetes clusters. We have been teaching our teams to adopt Prometheus metrics and Grafana, but to a large degree they still rely on digging through application log using Elastic and Kibana.
+At NAV, we have a microservices architecture with thousands of services running in our Kubernetes clusters. We have been telling our teams to adopt Prometheus metrics and Grafana from early on, but to a large degree they still rely on digging through application logs in Kibana.

-Without proper request tracing, it is hard to get a good overview of how requests flow through the system. This makes it hard to troubleshoot errors in complex value chains and optimize slow requests, this was particularly challenging for our teams that have adopted an event-driven architecture with Kafka. It is like trying to navigate a city without a map.
+Without proper request tracing, it is hard to get a good overview of how requests flow through a system. This makes it hard to troubleshoot errors in long and often complex value chains or optimize slow requests. This was a particular challenge for our teams that have adopted an event-driven architecture with Kafka. It is like trying to navigate a city without a map.

-There has been several attempts to shoehorn some form of request tracing using HTTP headers over the years. I have found `Nav-Callid`, `nav-call-id`, `callId`, `X-Correlation-ID`, `x_correlationId`, `correlationId`, `x_correlation-id`, and even `x_korrelasjonsId` (Norwegian for correlationId). There are probably more variations out there that I haven't found yet.
+There have been several attempts to shoehorn some form of request tracing using HTTP headers over the years. I have found `Nav-Callid`, `nav-call-id`, `callId`, `X-Correlation-ID`, `x_correlationId`, `correlationId`, `x_correlation-id`, and even `x_korrelasjonsId` (Norwegian for correlationId). There are probably even more variations out there in the wild, as I only had so much time for digging around.

 ![Standards](https://imgs.xkcd.com/comics/standards.png)

-It seams we are stuck in an endless loop of trying to get everyone to agree on a standard, and then trying to get everyone to implement it correctly. This is where OpenTelemetry comes in. It provides a standard way to define telemetry data from your applications, and it provides libraries for most programming languages to make it easy to implement.
+It seems we are stuck in an endless loop of trying to get everyone to agree on a standard, and then trying to get everyone to implement it correctly... This is where OpenTelemetry comes in! It provides a standard way to define telemetry data from your applications, and it provides libraries and SDKs for all common programming languages (there are even initiatives to make [OpenTelemetry available on mainframes][otel-mainframe-sig]) to make it easy to implement. Maybe there is still hope for us!
+
+[otel-mainframe-sig]: https://openmainframeproject.org/blog/new-opentelemetry-on-mainframe-sig/

 ## The first steps

-Even though OpenTelemetry does a lot of the heavy lifting for you, it is still a complex system with many moving parts. Did you know that OpenTelemetry is the fastest growing project in the Cloud Native Computing Foundation (CNCF)? It has an even steeper adoption curve than Kubernetes had in the early days!
+Even though OpenTelemetry does a lot of the heavy lifting for you, it is still a complex system with many moving parts. Did you know that OpenTelemetry is the fastest growing project in the Cloud Native Computing Foundation (CNCF)? It has an even steeper adoption curve than Kubernetes had back in the early days!

 In order to get started with OpenTelemetry, you need two things:

@@ -44,9 +46,9 @@ You can send OpenTelemetry data directly to Tempo, but the recommended way is to

 ![OpenTelemetry Collector](/blog/images/otel-collector.svg)

-Installing things like this in a Kubernetes cluster is something the NAIS team have done for the better part of ten years, so we had no problem setting up the Collector and connecting it to Tempo in our clusters, we run one Tempo instance for each environment (dev and prod) accessible from a global Grafana instance.
+Installing things like this in a Kubernetes cluster is something the NAIS team have done for the better part of ten years, so we had no problem setting up the Collector and connecting it to Tempo in our clusters. We run one Tempo instance for each environment (dev and prod) accessible from a global Grafana instance.

-The hard part was getting the developers to instrument their applications...
+The hard part would be to get the developers to instrument their applications...

 [nav-traceql]: https://doc.nais.io/reference/observability/tracing/traceql/
 [nav-tempo]: https://docs.nais.io/how-to-guides/observability/tracing/tempo/
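
To make the Collector-to-Tempo setup concrete, here is a minimal sketch of what such a Collector configuration can look like: an OTLP receiver for the applications, a batch processor, and an OTLP exporter pointing at Tempo. The endpoint and exporter name are illustrative assumptions, not the actual NAV configuration.

```yaml
# Minimal OpenTelemetry Collector configuration (illustrative sketch).
# Endpoints and names are assumptions, not the actual NAV setup.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo.example.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```
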
@@ -58,13 +60,13 @@ From the very beginning, we knew that the key to success was to make the develop

 With most of our backend services written in Kotlin and Java, we started by testing the [OpenTelemetry Java Agent][otel-java-agent]. A Java agent is a small piece of software that runs alongside your application and can modify the bytecode as it is loaded into the JVM. This allows it to automatically instrument your application without any changes to the source code.

-To our pleasant surprise, the agent worked out of the box with most of our applications. It was able to correctly correlate incoming and outgoing requests, understand the different frameworks we use, and even capture database queries and async calls to message queues like Kafka. In fact the OpenTelemetry Java agent [supports over 100 different libraries and frameworks][otel-java-agent-support] out of the box!
+To our pleasant surprise, the agent worked out of the box with most of our applications. It was able to correctly correlate incoming and outgoing requests, understand the different frameworks we use, and even capture database queries and async calls to message queues like Kafka, despite the naysayers who claimed this could not work as well as advertised. In fact, the OpenTelemetry Java agent [supports over 100 different libraries and frameworks][otel-java-agent-support] out of the box!

 [otel-java-agent-support]: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md

-Previously we have been able to install such agents in the node, but with the move to Kubernetes, we had to find a way to get the agent onto the container as there are no shared jvm runtime. In the past we have made pre-built Docker images with the agent installed, but this had a high maintenance cost as we had to keep the images up to date with the latest version of the agent across different base images and major versions.
+Previously we would have been able to install such agents on the node, but with all of our applications now running on Kubernetes that was no longer an option. We had to find a way to get the agent into the container, as there is no shared JVM runtime to hook into. In the past we have made pre-built Docker images with agents pre-installed, but this had a high maintenance cost, as we had to keep the images up to date with the latest version of the agent across different base images and major versions. And not all applications use the same base image either.
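
To illustrate the per-container boilerplate described above, here is a rough sketch of what wiring the Java agent into a pod by hand can look like, using the JVM's `JAVA_TOOL_OPTIONS` hook and the standard OpenTelemetry environment variables. The jar path, image and endpoint are assumptions for illustration only.

```yaml
# Illustrative sketch: manually wiring the Java agent into a pod spec.
# Paths, image and endpoint are assumptions, not NAV's configuration.
containers:
  - name: my-app
    image: ghcr.io/example/my-app:latest
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-javaagent:/otel/opentelemetry-javaagent.jar"
      - name: OTEL_SERVICE_NAME
        value: "my-app"
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://otel-collector:4317"
```

This is exactly the kind of repeated boilerplate the Operator, described next, removes.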

-This is where the [OpenTelemetry Operator][otel-operator] comes in. The Operator is a Kubernetes operator that can automatically inject the OpenTelemetry Java Agent (and agents for other programming languages) directly into your application pods. It can also configure the agent to send data to the correct Collector and set up the correct service name and environment for each application since it has access to the Kubernetes API.
+This is where the [OpenTelemetry Operator][otel-operator] comes in. This is a Kubernetes operator that can automatically inject the OpenTelemetry Java Agent (and agents for other programming languages as well) directly into your pod. It can also configure the agent to send data to the correct Collector and set up the correct service name and environment variables for each application since it has access to the Kubernetes API.

 [otel-java-agent]: https://opentelemetry.io/docs/languages/java/automatic/
 [otel-operator]: https://opentelemetry.io/docs/operator/
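
As a sketch of how this Operator-based injection is typically set up (the resource kind and annotation come from the upstream OpenTelemetry Operator documentation; the concrete values are assumptions): an `Instrumentation` resource describes where telemetry should go, and a workload opts in via a pod annotation.

```yaml
# Sketch of Operator-driven auto-instrumentation (values are illustrative).
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
---
# A workload opts in by annotating its pod template:
# metadata:
#   annotations:
#     instrumentation.opentelemetry.io/inject-java: "true"
```

Under the hood the Operator typically injects the agent via an init container and sets the corresponding JVM options, which is essentially the manual step sketched earlier.
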
@@ -73,7 +75,7 @@ This is where the [OpenTelemetry Operator][otel-operator] comes in. The Operator

 In case you are new to NAV, we have an open source application platform called [nais][nais] that provides everything our application teams need to develop, run and operate their applications. Its main components are [naiserator][naiserator] (a Kubernetes Operator) and the [`nais.yaml`][nais-manifest] manifest that defines how an application should run in our Kubernetes clusters.

-It looks something like this:
+A minimal application manifest looks something like this:

 ```yaml
 apiVersion: "nais.io/v1alpha1"
@@ -87,7 +89,7 @@ spec:
 ...
 ```
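
Since the diff only shows the first line of that manifest, here is a rough sketch of what a minimal `nais.yaml` can look like; the field names follow the public nais documentation, and the concrete values are made-up placeholders rather than anything from this commit.

```yaml
# Illustrative sketch of a minimal nais application manifest.
# Field names follow the public nais docs; values are placeholders.
apiVersion: "nais.io/v1alpha1"
kind: "Application"
metadata:
  name: my-app
  namespace: my-team
  labels:
    team: my-team
spec:
  image: ghcr.io/example/my-app:latest
  port: 8080
```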

-This is a very powerful abstraction that have allowed us to add new features to the platform with as little effort on the developer's part as possible. We added a new field to `nais.yaml` called `observability` that allows the developers to enable tracing for their applications with a single line of code:
+This is a very powerful abstraction that has allowed us to add new features to the platform with as little effort on the developer's part as possible. We added a new field to `nais.yaml` called `observability` that allows the developers to enable tracing for their applications with only four lines of yaml configuration:

 ```yaml
 ...
@@ -97,7 +99,7 @@ This is a very powerful abstraction that have allowed us to add new features to
 runtime: "java"
 ```

-When naiserator sees this field, it sets the required OpenTelemetry Operator annotations to get the correct OpenTelemetry configuration and agent according to the runtime. We currently support auto-instrumenting `java`, `nodejs` and `python`. This way, the developers don't have to worry about how to set up tracing in their applications, they just have to enable it in the manifest. This is a huge win for us! :rocket:
+When naiserator sees this field, it sets the required OpenTelemetry Operator annotations to get the correct OpenTelemetry configuration and agent according to the runtime. We currently support auto-instrumenting `java`, `nodejs` and `python`. This way, the developers don't have to worry about how to set up tracing in their applications; they just have to enable it in the manifest. This is a huge win for us! 🎉

 For many of our applications, this was all that was needed to get traces flowing. Developers can still add additional spans and attributes to their traces using the OpenTelemetry SDKs directly, or they can choose to disable auto-instrumentation and instrument their applications manually.
