Observability #13

Merged
merged 27 commits into from
Jul 27, 2024
7337617
Init observer chart
fcollonval Jul 16, 2024
37c06f4
Fix helm chart paths
fcollonval Jul 16, 2024
28ff74c
Fix grafana ingress
fcollonval Jul 16, 2024
ce404de
Fix monitoring services across namespace
fcollonval Jul 16, 2024
4b9cc08
Finalize monitors selectors and iam serviceMonitor
fcollonval Jul 17, 2024
2c5e27a
Add opentelemetry collector
fcollonval Jul 18, 2024
ba139cf
Remove serviceMonitor from iam
fcollonval Jul 19, 2024
a05c83b
Add servicemonitor for the otel collector
fcollonval Jul 19, 2024
af24123
Integrate back opentelemetry operator subchart
fcollonval Jul 19, 2024
1919cc0
Move opentelemetry crds
fcollonval Jul 19, 2024
ae54081
Add loki and logs collectors
fcollonval Jul 19, 2024
fee705d
Fix mounting log into otel daemonset collectors
fcollonval Jul 22, 2024
e034da1
Fix loki config
fcollonval Jul 22, 2024
f8bbdf9
Add loki as grafana datasource
fcollonval Jul 22, 2024
7c12a6c
Add tempo
fcollonval Jul 23, 2024
2cec933
Remove loki gateway and fix its servicemonitor
fcollonval Jul 23, 2024
446b514
Fix opentelemetry collector servicemonitor
fcollonval Jul 23, 2024
4e49830
Enable default loki dashboards
fcollonval Jul 23, 2024
3c846a3
WIP add traefik observability
fcollonval Jul 23, 2024
b439fbe
Add affinity on observer
fcollonval Jul 25, 2024
f1eb342
Fix service deployment
fcollonval Jul 25, 2024
43cf10a
Fix tempo datasource
fcollonval Jul 25, 2024
4f9f1ec
Add jupyter server auto-instrumentation
fcollonval Jul 25, 2024
f2a3481
Remove unneeded files
fcollonval Jul 25, 2024
bdf7c19
Fix values
fcollonval Jul 25, 2024
9931465
Some customization for the logs collector
fcollonval Jul 25, 2024
6248319
Add docs
fcollonval Jul 26, 2024
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
# charts
.chart-packages/**
.chart-index/**
charts/**/*.tgz
charts/**/*.lock
2 changes: 1 addition & 1 deletion charts/datalayer-iam/Chart.yaml
Expand Up @@ -5,7 +5,7 @@ version: 0.1.1
appVersion: 0.1.1
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-iam
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-iam
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
Expand Down
2 changes: 1 addition & 1 deletion charts/datalayer-iam/README.md
Expand Up @@ -14,7 +14,7 @@ Datalayer IAM

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-iam>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-iam>

## Values

Expand Down
4 changes: 3 additions & 1 deletion charts/datalayer-iam/values.yaml
Expand Up @@ -56,8 +56,10 @@ iam:
DATALAYER_STRIPE_PRODUCT_ID: ""
DATALAYER_STRIPE_WEBHOOK_SECRET: ""
DATALAYER_SUPPORT_EMAIL: ""
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: ""
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: ""

# resources:
# limits:
# memory: "8192Mi"
# cpu: "3000m"
# cpu: "3000m"
2 changes: 1 addition & 1 deletion charts/datalayer-jump/Chart.yaml
Expand Up @@ -5,7 +5,7 @@ version: 0.0.6
appVersion: 0.0.6
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-jump
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jump
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
Expand Down
2 changes: 1 addition & 1 deletion charts/datalayer-jump/README.md
Expand Up @@ -14,7 +14,7 @@ Datalayer Jump

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-jump>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jump>

## Values

Expand Down
2 changes: 1 addition & 1 deletion charts/datalayer-jupyter/Chart.yaml
Expand Up @@ -5,7 +5,7 @@ version: 0.1.1
appVersion: 0.1.1
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-jupyter
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jupyter
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
Expand Down
2 changes: 1 addition & 1 deletion charts/datalayer-jupyter/README.md
Expand Up @@ -14,7 +14,7 @@ Datalayer Jupyter

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-jupyter>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jupyter>

## Values

Expand Down
5 changes: 4 additions & 1 deletion charts/datalayer-jupyter/values.yaml
Expand Up @@ -31,15 +31,18 @@ jupyter:
DATALAYER_JWT_ALGORITHM: ""
DATALAYER_JWT_ISSUER: "https://id.datalayer.run"
DATALAYER_JWT_SECRET: ""
DATALAYER_JWT_CACHE_VALIDATE: "false"
DATALAYER_OPENFGA_AUTHZ_MODEL_ID: ""
DATALAYER_OPENFGA_REST_URL: "http://datalayer-openfga.datalayer-openfga.svc.cluster.local:8080"
DATALAYER_OPENFGA_STORE_ID: ""
DATALAYER_OPERATOR_API_KEY: ""
DATALAYER_RUNTIME_ENV: "prod"
DATALAYER_RUN_HOST: ""
DATALAYER_SOLR_ZK_HOST: "solr-datalayer-solrcloud-zookeeper-headless.datalayer-solr.svc.cluster.local"
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: ""
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: ""

# resources:
# limits:
# memory: "8192Mi"
# cpu: "3000m"
# cpu: "3000m"
2 changes: 1 addition & 1 deletion charts/datalayer-library/Chart.yaml
Expand Up @@ -5,7 +5,7 @@ version: 0.0.6
appVersion: 0.0.6
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-library
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-library
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
Expand Down
2 changes: 1 addition & 1 deletion charts/datalayer-library/README.md
Expand Up @@ -14,7 +14,7 @@ Datalayer Library

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-library>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-library>

## Values

Expand Down
56 changes: 10 additions & 46 deletions charts/datalayer-library/templates/deployment.yaml
Expand Up @@ -39,53 +39,17 @@ spec:
ports:
- containerPort: {{ .Values.library.port }}
protocol: TCP
{{- if or .Values.library.env .Values.library.envValueFrom }}
env:
- name: DATALAYER_RUN_HOST
value: {{ .Values.library.env.DATALAYER_RUN_HOST }}
- name: DATALAYER_CDN_URL
value: {{ .Values.library.env.DATALAYER_CDN_URL }}
- name: DATALAYER_RUNTIME_ENV
value: {{ .Values.library.env.DATALAYER_RUNTIME_ENV }}
- name: DATALAYER_LIBRARY_AUTH_CALLBACK
value: {{ .Values.library.env.DATALAYER_LIBRARY_AUTH_CALLBACK }}
- name: DATALAYER_LIBRARY_UI_REDIRECT
value: {{ .Values.library.env.DATALAYER_LIBRARY_UI_REDIRECT }}
- name: DATALAYER_SOLR_ZK_HOST
value: {{ .Values.library.env.DATALAYER_SOLR_ZK_HOST }}
- name: DATALAYER_SOLR_USERNAME
value: {{ .Values.library.env.DATALAYER_SOLR_USERNAME }}
- name: DATALAYER_SOLR_PASSWORD
value: {{ .Values.library.env.DATALAYER_SOLR_PASSWORD }}
- name: AWS_ACCESS_KEY_ID
value: {{ .Values.library.env.AWS_ACCESS_KEY_ID }}
- name: AWS_SECRET_ACCESS_KEY
value: {{ .Values.library.env.AWS_SECRET_ACCESS_KEY }}
- name: AWS_DEFAULT_REGION
value: {{ .Values.library.env.AWS_DEFAULT_REGION }}
- name: DATALAYER_SMTP_HOST
value: {{ .Values.library.env.DATALAYER_SMTP_HOST }}
- name: DATALAYER_SMTP_PORT
value: {{ .Values.library.env.DATALAYER_SMTP_PORT | quote }}
- name: DATALAYER_SMTP_USERNAME
value: {{ .Values.library.env.DATALAYER_SMTP_USERNAME }}
- name: DATALAYER_SMTP_PASSWORD
value: {{ .Values.library.env.DATALAYER_SMTP_PASSWORD }}
- name: DATALAYER_JWT_CACHE_VALIDATE
value: "false"
- name: DATALAYER_JWT_ISSUER
value: {{ .Values.library.env.DATALAYER_JWT_ISSUER }}
- name: DATALAYER_JWT_SECRET
value: {{ .Values.library.env.DATALAYER_JWT_SECRET }}
- name: DATALAYER_JWT_ALGORITHM
value: {{ .Values.library.env.DATALAYER_JWT_ALGORITHM }}
- name: DATALAYER_AUTHZ_ENGINE
value: {{ .Values.library.env.DATALAYER_AUTHZ_ENGINE }}
- name: DATALAYER_OPENFGA_REST_URL
value: {{ .Values.library.env.DATALAYER_OPENFGA_REST_URL }}
- name: DATALAYER_OPENFGA_STORE_ID
value: {{ .Values.library.env.DATALAYER_OPENFGA_STORE_ID }}
- name: DATALAYER_OPENFGA_AUTHZ_MODEL_ID
value: {{ .Values.library.env.DATALAYER_OPENFGA_AUTHZ_MODEL_ID }}
{{- range $key, $value := .Values.library.envValueFrom }}
- name: {{ $key }}
valueFrom: {{- $value | toYaml | nindent 16 }}
{{- end }}
{{- range $key, $value := .Values.library.env }}
- name: {{ $key }}
value: {{ $value | quote }}
{{- end }}
{{- end }}
readinessProbe:
httpGet:
path: /api/library/version
Expand Down
3 changes: 3 additions & 0 deletions charts/datalayer-library/values.yaml
Expand Up @@ -21,6 +21,7 @@ library:
DATALAYER_JWT_ISSUER: ""
DATALAYER_JWT_SECRET: ""
DATALAYER_JWT_ALGORITHM: ""
DATALAYER_JWT_CACHE_VALIDATE: "false"
DATALAYER_AUTHZ_ENGINE: ""
DATALAYER_OPENFGA_REST_URL: "http://datalayer-openfga.datalayer-openfga.svc.cluster.local:8080"
DATALAYER_OPENFGA_STORE_ID: ""
Expand All @@ -29,3 +30,5 @@ library:
DATALAYER_SMTP_PORT: ""
DATALAYER_SMTP_USERNAME: ""
DATALAYER_SMTP_PASSWORD: ""
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: ""
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: ""
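As a rough sketch of how the two `range` loops in the `deployment.yaml` template above consume values, the fragment below sets one plain variable and one variable read from a Secret. Only the `library.env` and `library.envValueFrom` keys come from the template; the Secret name `library-secrets` and its key `jwt-secret` are hypothetical.

```yaml
# Illustrative values override for the datalayer-library chart.
library:
  env:
    # Rendered as `value: "prod"`; every entry is passed through `quote`.
    DATALAYER_RUNTIME_ENV: "prod"
    # Assumption: setting a default key to null drops it, so the name is
    # not emitted a second time by the `env` loop.
    DATALAYER_JWT_SECRET: null
  envValueFrom:
    # Rendered under `valueFrom:` through `toYaml`.
    DATALAYER_JWT_SECRET:
      secretKeyRef:
        name: library-secrets   # hypothetical Secret name
        key: jwt-secret         # hypothetical key inside that Secret
```

Because the template iterates over both maps, a name present in both would be emitted twice, which is why the sketch nulls the `env` entry before moving the variable to `envValueFrom`.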
33 changes: 33 additions & 0 deletions charts/datalayer-observer/Chart.yaml
@@ -0,0 +1,33 @@
apiVersion: v1
description: Datalayer Observer
name: datalayer-observer
version: 0.1.0
appVersion: 0.1.0
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-observer
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
email: [email protected]
url: https://datalayer.io
dependencies:
- name: crds
version: "0.0.0"
condition: crds.enabled
- name: kube-prometheus-stack
version: 61.x.x
repository: https://prometheus-community.github.io/helm-charts
condition: kube-prometheus-stack.enabled
- name: loki
version: 6.x.x
repository: https://grafana.github.io/helm-charts
condition: loki.enabled
- name: tempo
version: 1.x.x
repository: https://grafana.github.io/helm-charts
condition: tempo.enabled
- name: opentelemetry-operator
version: 0.64.x
repository: https://open-telemetry.github.io/opentelemetry-helm-charts
condition: opentelemetry-operator.enabled
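Each dependency above declares a `condition`, which Helm resolves against the parent chart's values to decide whether the subchart is loaded. A minimal sketch of an override file that toggles them could look like the following; whether the rest of the observer templates tolerate a disabled subchart is not shown in this diff, so treat the chosen combination as an assumption.

```yaml
# Illustrative values override for the datalayer-observer chart.
# Each key matches the `condition` path of one dependency in Chart.yaml.
crds:
  enabled: true
kube-prometheus-stack:
  enabled: true
loki:
  enabled: false      # skip the logs backend in this example
tempo:
  enabled: false      # skip the traces backend in this example
opentelemetry-operator:
  enabled: true
```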
167 changes: 167 additions & 0 deletions charts/datalayer-observer/README.md
@@ -1,3 +1,170 @@
[![Datalayer](https://assets.datalayer.tech/datalayer-25.svg)](https://datalayer.io)

# Datalayer Observer Helm Chart

Installs the observability tools for the Datalayer stack.

The following tools are used:
- OpenTelemetry Collector:
  - As a deployment, to proxy metrics and traces from Datalayer services to Prometheus and Tempo
  - As a daemonset, to parse pod log files and send them to Loki
- Prometheus: to gather metrics
- Tempo: to gather traces
- Loki: to gather logs
- AlertManager: to manage alerts
- Grafana: to visualize and analyze the telemetry

## How to install?

```
plane up datalayer-observer
```

If the installation fails because of the OpenTelemetry operator, the most
likely cause is that its CRDs are not defined in the cluster. You can install
them manually from `plane/etc/helm/charts/datalayer-observer/charts/crds/crds`.

> [!NOTE]
> Helm should install the CRDs on the first installation, but CRD handling is
> complex; see https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#install-a-crd-declaration-before-using-the-resource

## What is deployed?

This chart is built on top of multiple subcharts:
- kube-prometheus-stack: the full Prometheus stack, activating:
  - AlertManager
  - Grafana
  - Prometheus Operator
  - Prometheus
  - Prometheus Node Exporter
- loki
  - Loki as a single binary
- tempo
  - Tempo as a single binary
- opentelemetry-operator, using the collector-contrib image

In addition to the subchart components, it creates:

- An OpenTelemetry collector as a singleton instance, to proxy traces and metrics from services and remote kernels to Prometheus and Tempo
- An OpenTelemetry collector as a daemonset, to parse the container log files and proxy them to Loki
- An OpenTelemetry instrumentation, to add Python auto-instrumentation to Jupyter Server when a pod is created
- A custom ingress for Grafana, using a configuration similar to that of the Datalayer services
- A service monitor, to tell Prometheus to fetch the metrics from the OpenTelemetry collector singleton
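To make the role of the singleton collector more concrete, here is a hedged sketch of an `OpenTelemetryCollector` resource that receives OTLP data, exposes metrics for Prometheus to scrape and forwards traces to Tempo. It is not the manifest shipped by this chart: the resource name, the Tempo service address, the exporter port `8889` and the `v1beta1` API version (which depends on the installed operator version) are all assumptions.

```yaml
apiVersion: opentelemetry.io/v1beta1   # assumption: older operators use v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: datalayer-collector            # assumed name
spec:
  mode: deployment                     # singleton collector, not the log daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}                     # used by the Datalayer services
          http: {}                     # used by the Python auto-instrumentation
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889         # scraped by Prometheus through a ServiceMonitor
      otlp/tempo:
        endpoint: tempo.datalayer-observer.svc.cluster.local:4317   # hypothetical address
        tls:
          insecure: true
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo]
```

The split into a pull-based metrics pipeline and a push-based traces pipeline mirrors the diagram that follows, where Prometheus fetches metrics from the collector while traces are sent on to Tempo.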


```mermaid
flowchart LR
subgraph node1
subgraph pod1
ai[Auto instrumentation]-.->rk[remote kernels]
oc[Operator Companion]
end
lc[Log collector]-- parse logs -->rk
lc-- parse logs -->oc
ne[Node exporter]
end
lc-. send logs .->Loki
pr-->ne
subgraph node2
subgraph pod2
iam
end
lc2[Log collector]-- parse logs -->iam
ne2[Node exporter]
end
lc2-. send logs .->Loki
pr-->ne2
otelc[OpenTelemetry Collector]
iam-- metrics & traces -->otelc
pr[Prometheus]-- metrics -->otelc
rk-- metrics & traces -->otelc
oc-- metrics & traces -->otelc
otelc-- traces -->Tempo
Grafana-->Tempo
Grafana-->Loki
Grafana-->pr

style pr stroke:salmon
style lc stroke:green
style lc2 stroke:green
style Loki stroke:green
style Tempo stroke:salmon
linkStyle 1,2,3,5,6 stroke:green
```

## Tips and tricks

### Prometheus

Prometheus gets its scrape target definitions from the `PodMonitor` and
`ServiceMonitor` (recommended) custom resources. Third-party components that
don't support OpenTelemetry metrics are scraped through such monitors and are
therefore not proxied by the OpenTelemetry collector. For now:
- `ServiceMonitor`: used by Grafana, AlertManager, Loki, Tempo, Prometheus, PrometheusOperator, Prometheus Node Exporter and the OpenTelemetry Collector singleton.
  - To be detected by Prometheus, a ServiceMonitor must carry these two labels:

    ```
    monitoring.datalayer.io/instance: "observer"
    monitoring.datalayer.io/enabled: "true"
    ```

  - Kubernetes metrics are also gathered through service monitors defined in kube-prometheus-stack.

- `PodMonitor`: used by the Pulsar stack (default in its Helm chart).
  - A PodMonitor can be defined in any namespace.
  - To be detected by Prometheus, a PodMonitor must have the label `app=pulsar`. Other app names can be allowed through `kube-prometheus-stack.prometheus.prometheusSpec.podMonitorSelector`.
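As an illustration of the labelling rule above, a `ServiceMonitor` for an additional service might look like the sketch below. The two `monitoring.datalayer.io/*` labels come from this README; the resource name, namespace, selector label and port name are hypothetical and must match your own Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service                     # hypothetical
  namespace: my-namespace              # hypothetical
  labels:
    # Required for the observer Prometheus instance to pick up this monitor.
    monitoring.datalayer.io/instance: "observer"
    monitoring.datalayer.io/enabled: "true"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-service   # hypothetical label on the target Service
  endpoints:
    - port: metrics                    # name of the Service port exposing metrics
      path: /metrics
      interval: 30s
```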

### Instrumentation

#### Datalayer services

The services based on connexion are instrumented explicitly using the code
defined in `datalayer_common.instrumentation`. A custom version of the Python
ASGI instrumentation was needed, in particular to push the HTTP route
metadata.

> [!IMPORTANT]
> The logging instrumentor is used and, by default, it calls `basicConfig`.
> The service must therefore not call `basicConfig` itself.

Configuring the metrics and traces targets is done through environment variables:

```
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317
```

> [!NOTE]
> Currently the data is sent using gRPC. HTTP is also available, but using it
> would require changing the instrumentation code because a different library is needed.
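When a service is deployed through one of the Datalayer charts, the same two variables can be set in the chart values rather than exported by hand. The sketch below uses the `library.env` map that the `datalayer-library` deployment template iterates over; it assumes the collector is reachable at the address given above.

```yaml
# Illustrative values override for the datalayer-library chart.
library:
  env:
    # gRPC OTLP endpoint of the collector singleton (address taken from the export commands above).
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: "http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: "http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
```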

#### Jupyter Remote Kernels

The remote kernel pods are auto-instrumented by the OpenTelemetry operator
via an `Instrumentation` CR.

That CR must be defined in the namespace where the pods will be created, and
the instrumentation occurs only at pod creation.

A pod is selected for instrumentation when it carries specific annotations. In
this case, to instrument Python in a multi-container pod:

```yaml
instrumentation.opentelemetry.io/inject-python: "true"
instrumentation.opentelemetry.io/container-names: "{KERNEL_CONTAINER_NAME}"
```

> See https://github.com/open-telemetry/opentelemetry-operator?tab=readme-ov-file#opentelemetry-auto-instrumentation-injection for more information and available options (to be set through environment variables).

The Python auto-instrumentation uses HTTP to send data to the OpenTelemetry Collector.
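Putting the pieces of this section together, a minimal sketch of the `Instrumentation` CR and of an annotated pod could look as follows. This is not the manifest shipped by the chart: the resource names, namespace, container names and images are hypothetical, and port `4318` is assumed because it is the conventional OTLP/HTTP port.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation         # hypothetical
  namespace: kernels                   # hypothetical; must be the namespace of the pods
spec:
  exporter:
    # Assumed OTLP/HTTP endpoint of the collector singleton.
    endpoint: http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
---
apiVersion: v1
kind: Pod
metadata:
  name: remote-kernel                  # hypothetical
  namespace: kernels                   # same namespace as the Instrumentation CR
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
    # Restrict the injection to the kernel container of the multi-container pod.
    instrumentation.opentelemetry.io/container-names: "kernel"
spec:
  containers:
    - name: kernel                     # hypothetical container name
      image: python:3.11               # hypothetical image
      command: ["python", "-m", "http.server"]
    - name: companion                  # hypothetical second container, left uninstrumented
      image: busybox:1.36
      command: ["sleep", "infinity"]
```

Since the injection happens only at pod creation, pods that already exist need to be recreated after the CR is added or changed.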

## TODO

- [ ] Drop the Prometheus Node Exporter to use the OpenTelemetry Collector Daemonset
- [ ] OpenTelemetry to gather Kubernetes metrics (needed?) - for now the system is monitored by Prometheus directly.
- [ ] What about storage?
- [ ] Link traces -> metrics (exemplar) and traces -> logs (tags?)
- [ ] What about accounts and roles?
- [ ] Fix Traefik observability
- [ ] Some logs don't have trace id in Datalayer services
3 changes: 3 additions & 0 deletions charts/datalayer-observer/charts/crds/Chart.yaml
@@ -0,0 +1,3 @@
apiVersion: v2
name: crds
version: 0.0.0