[![Datalayer](https://assets.datalayer.tech/datalayer-25.svg)](https://datalayer.io)

# Datalayer Observer Helm Chart

Install observability tools for the Datalayer stack.
The tools used:

- OpenTelemetry Collector:
  - As a deployment, to proxy metrics and traces from Datalayer services to Prometheus and Tempo
  - As a daemonset, to parse pod log files and send them to Loki
- Prometheus: to gather metrics
- Tempo: to gather traces
- Loki: to gather logs
- AlertManager: to manage alerts
- Grafana: to visualize and analyze the telemetry

## How to install?

```sh
plane up datalayer-observer
```

If you face issues with the OpenTelemetry operator, they are likely
related to the CRDs being undefined in the cluster. You can install them
manually from `plane/etc/helm/charts/datalayer-observer/charts/crds/crds`.

> [!NOTE]
> Helm should install them the first time, but handling CRDs is
> complex; see https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#install-a-crd-declaration-before-using-the-resource
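
If you need to check or apply the CRDs by hand, a sketch (assuming `kubectl` access to the target cluster; the CRD name shown is the operator's collector CRD):

```
# Sketch: apply the vendored CRDs only when the operator CRDs are missing.
kubectl get crd opentelemetrycollectors.opentelemetry.io >/dev/null 2>&1 \
  || kubectl apply -f plane/etc/helm/charts/datalayer-observer/charts/crds/crds
```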

## What is deployed?

This chart is built on top of multiple subcharts:

- kube-prometheus-stack - a full Prometheus stack activating:
  - AlertManager
  - Grafana
  - Prometheus Operator
  - Prometheus
  - Prometheus Node Exporter
- loki
  - Loki as a single binary
- tempo
  - Tempo as a single binary
- opentelemetry-operator - using the collector-contrib image
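
To illustrate the subchart layout, a hypothetical `values.yaml` fragment could toggle the pieces like this (the key names below are assumptions for illustration, not taken from this chart's actual values file):

```yaml
# Hypothetical values override -- check the subcharts' own values files
# for the real key names before using anything like this.
kube-prometheus-stack:
  grafana:
    enabled: true
  alertmanager:
    enabled: true
loki:
  deploymentMode: SingleBinary   # assumed key matching "single binary" above
tempo:
  enabled: true
```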

In addition to the subchart elements, it creates:

- An OpenTelemetry collector as a singleton instance, to proxy traces and metrics from services and remote kernels to Prometheus and Tempo
- An OpenTelemetry collector as a daemonset, to parse the container log files and proxy them to Loki
- An OpenTelemetry `Instrumentation` to add Python auto-instrumentation to Jupyter Server when a pod is created
- A custom ingress for Grafana, using a configuration similar to the one for Datalayer services
- A `ServiceMonitor` telling Prometheus to fetch the metrics from the OpenTelemetry collector singleton

```mermaid
flowchart LR
  subgraph node1
    subgraph pod1
      ai[Auto instrumentation]-.->rk[remote kernels]
      oc[Operator Companion]
    end
    lc[Log collector]-- parse logs -->rk
    lc-- parse logs -->oc
    ne[Node exporter]
  end
  lc-. send logs .->Loki
  pr-->ne
  subgraph node2
    subgraph pod2
      iam
    end
    lc2[Log collector]-- parse logs -->iam
    ne2[Node exporter]
  end
  lc2-. send logs .->Loki
  pr-->ne2
  otelc[OpenTelemetry Collector]
  iam-- metrics & traces -->otelc
  pr[Prometheus]-- metrics -->otelc
  rk-- metrics & traces -->otelc
  oc-- metrics & traces -->otelc
  otelc-- traces -->Tempo
  Grafana-->Tempo
  Grafana-->Loki
  Grafana-->pr
  style pr stroke:salmon
  style lc stroke:green
  style lc2 stroke:green
  style Loki stroke:green
  style Tempo stroke:salmon
  linkStyle 1,2,3,5,6 stroke:green
```

## Tips and tricks

### Prometheus

Prometheus gets its data source definitions from the `PodMonitor` and
`ServiceMonitor` (recommended) CRs. Third parties that do not support
OpenTelemetry metrics use such monitors and are therefore
not proxied by the OpenTelemetry collector. For now:

- `ServiceMonitor`: used by Grafana, AlertManager, Loki, Tempo, Prometheus, the Prometheus Operator, the Prometheus Node Exporter and the OpenTelemetry Collector singleton.
- To be detected by Prometheus, the `ServiceMonitor` must have the two labels:

```yaml
monitoring.datalayer.io/instance: "observer"
monitoring.datalayer.io/enabled: "true"
```
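
For instance, a minimal `ServiceMonitor` carrying these labels could look like the following sketch (service name, namespace and port are placeholders, not taken from the chart):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service                 # placeholder
  namespace: datalayer-observer    # assumed namespace
  labels:
    monitoring.datalayer.io/instance: "observer"
    monitoring.datalayer.io/enabled: "true"
spec:
  selector:
    matchLabels:
      app: my-service              # placeholder selector
  endpoints:
    - port: metrics                # placeholder port name
      path: /metrics
```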

- Kubernetes metrics are also gathered through service monitors defined in the kube-prometheus-stack.

- `PodMonitor`: used by the Pulsar stack (the default in its Helm chart).
  - A `PodMonitor` can be defined in any namespace.
  - To be detected by Prometheus, the `PodMonitor` must have the label `app=pulsar`. Another app name can be defined in `kube-prometheus-stack.prometheus.prometheusSpec.podMonitorSelector`.
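
A minimal `PodMonitor` matching that selector could look like this sketch (name, pod selector and port are placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pulsar-broker          # placeholder
  labels:
    app: pulsar                # required for Prometheus to pick it up
spec:
  selector:
    matchLabels:
      component: broker        # placeholder pod selector
  podMetricsEndpoints:
    - port: http               # placeholder port name
```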

### Instrumentation

#### Datalayer services

The services based on connexion are instrumented explicitly, using the code
defined in `datalayer_common.instrumentation`: a custom version of the
Python ASGI instrumentation was needed, in particular to push the HTTP route
metadata.

> [!IMPORTANT]
> The logging instrumentor is used; as it calls `basicConfig` by default, the
> service must not call `basicConfig` itself.
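
To illustrate what such explicit ASGI instrumentation does, here is a minimal, hypothetical sketch; the real code lives in `datalayer_common.instrumentation`, and all names below are invented:

```python
import asyncio

def route_capturing_middleware(app, record):
    """Wrap an ASGI app and record the HTTP route of each request."""
    async def wrapper(scope, receive, send):
        if scope["type"] == "http":
            # Routers typically store the matched route in the scope;
            # fall back to the raw path when it is absent.
            record.append(scope.get("route", scope["path"]))
        await app(scope, receive, send)
    return wrapper

async def dummy_app(scope, receive, send):
    # Stand-in for a connexion/ASGI application.
    pass

records = []
app = route_capturing_middleware(dummy_app, records)
asyncio.run(app({"type": "http", "path": "/api/v1/ping"}, None, None))
```

In the real instrumentation the captured route would be attached to spans and metric attributes rather than a list.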

Configuring the metrics and traces targets is done through environment variables:

```sh
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317
```
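
As a sketch of how a service could consume these variables, a hypothetical helper (not part of the chart; the local fallback endpoint is an assumption):

```python
import os

# Assumed local default for running a service outside the cluster.
DEFAULT_OTLP_ENDPOINT = "http://localhost:4317"

def otlp_endpoint(signal: str) -> str:
    """Return the OTLP endpoint for a signal ("METRICS" or "TRACES")."""
    return os.environ.get(
        f"OTEL_EXPORTER_OTLP_{signal}_ENDPOINT", DEFAULT_OTLP_ENDPOINT
    )

# Simulate the in-cluster configuration shown above.
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = (
    "http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
)
traces_target = otlp_endpoint("TRACES")
```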

> [!NOTE]
> Currently the data is sent over gRPC. HTTP is also available, but would
> require changing the instrumentation code, as a different library is used.

#### Jupyter Remote Kernels

The remote kernel pods are auto-instrumented by the OpenTelemetry operator
via an `Instrumentation` CR.

That CR must be defined in the namespace in which the pods will be created, and
the instrumentation only occurs at pod creation.

A pod is selected for instrumentation if it carries specific annotations. In this
case, to instrument Python on a multi-container pod:

```yaml
instrumentation.opentelemetry.io/inject-python: "true"
instrumentation.opentelemetry.io/container-names: "{KERNEL_CONTAINER_NAME}"
```
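
In context, these annotations sit on the pod metadata; a hypothetical kernel pod could look like the following (pod name, container name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: remote-kernel          # placeholder
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
    instrumentation.opentelemetry.io/container-names: "{KERNEL_CONTAINER_NAME}"
spec:
  containers:
    - name: kernel             # placeholder, matches container-names above
      image: python:3.12       # placeholder image
```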

> See https://github.com/open-telemetry/opentelemetry-operator?tab=readme-ov-file#opentelemetry-auto-instrumentation-injection for more information and the available options (to be set through environment variables).

The Python auto-instrumentation uses HTTP to send data to the OpenTelemetry Collector.

## TODO

- [ ] Drop the Prometheus Node Exporter in favor of the OpenTelemetry Collector daemonset
- [ ] Use OpenTelemetry to gather Kubernetes metrics (needed?) - for now the system is monitored by Prometheus directly
- [ ] What about storage?
- [ ] Link traces -> metrics (exemplars) and traces -> logs (tags?)
- [ ] What about accounts and roles?
- [ ] Fix Traefik observability
- [ ] Some logs don't have a trace id in the Datalayer services