Merge pull request #13 from datalayer/ft/observability
Observability
echarles authored Jul 27, 2024
2 parents d557768 + 6248319 commit 716129b
Showing 41 changed files with 12,926 additions and 6,205 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
# charts
.chart-packages/**
.chart-index/**
charts/**/*.tgz
charts/**/*.lock
2 changes: 1 addition & 1 deletion charts/datalayer-iam/Chart.yaml
@@ -5,7 +5,7 @@ version: 0.1.1
appVersion: 0.1.1
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-iam
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-iam
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
2 changes: 1 addition & 1 deletion charts/datalayer-iam/README.md
@@ -14,7 +14,7 @@ Datalayer IAM

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-iam>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-iam>

## Values

4 changes: 3 additions & 1 deletion charts/datalayer-iam/values.yaml
@@ -56,8 +56,10 @@ iam:
DATALAYER_STRIPE_PRODUCT_ID: ""
DATALAYER_STRIPE_WEBHOOK_SECRET: ""
DATALAYER_SUPPORT_EMAIL: ""
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: ""
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: ""

# resources:
# limits:
# memory: "8192Mi"
# cpu: "3000m"
# cpu: "3000m"
2 changes: 1 addition & 1 deletion charts/datalayer-jump/Chart.yaml
@@ -5,7 +5,7 @@ version: 0.0.6
appVersion: 0.0.6
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-jump
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jump
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
2 changes: 1 addition & 1 deletion charts/datalayer-jump/README.md
@@ -14,7 +14,7 @@ Datalayer Jump

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-jump>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jump>

## Values

2 changes: 1 addition & 1 deletion charts/datalayer-jupyter/Chart.yaml
@@ -5,7 +5,7 @@ version: 0.1.1
appVersion: 0.1.1
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-jupyter
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jupyter
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
2 changes: 1 addition & 1 deletion charts/datalayer-jupyter/README.md
@@ -14,7 +14,7 @@ Datalayer Jupyter

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-jupyter>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-jupyter>

## Values

3 changes: 3 additions & 0 deletions charts/datalayer-jupyter/values.yaml
@@ -32,6 +32,7 @@ jupyter:
DATALAYER_JWT_ISSUER: "https://id.datalayer.run"
DATALAYER_JWT_CACHE_VALIDATE: "true"
DATALAYER_JWT_SECRET: ""
DATALAYER_JWT_CACHE_VALIDATE: "false"
DATALAYER_OPENFGA_AUTHZ_MODEL_ID: ""
DATALAYER_OPENFGA_REST_URL: "http://datalayer-openfga.datalayer-openfga.svc.cluster.local:8080"
DATALAYER_OPENFGA_STORE_ID: ""
@@ -40,6 +41,8 @@ jupyter:
DATALAYER_RUNTIME_ENV: "prod"
DATALAYER_RUN_HOST: ""
DATALAYER_SOLR_ZK_HOST: "solr-datalayer-solrcloud-zookeeper-headless.datalayer-solr.svc.cluster.local"
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: ""
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: ""

# resources:
# limits:
2 changes: 1 addition & 1 deletion charts/datalayer-library/Chart.yaml
@@ -5,7 +5,7 @@ version: 0.0.6
appVersion: 0.0.6
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/datalayer-library
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-library
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
2 changes: 1 addition & 1 deletion charts/datalayer-library/README.md
@@ -14,7 +14,7 @@ Datalayer Library

## Source Code

* <https://github.com/datalayer/helm-charts/tree/main/datalayer-library>
* <https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-library>

## Values

56 changes: 10 additions & 46 deletions charts/datalayer-library/templates/deployment.yaml
@@ -39,53 +39,17 @@ spec:
ports:
- containerPort: {{ .Values.library.port }}
protocol: TCP
{{- if or .Values.library.env .Values.library.envValueFrom }}
env:
- name: DATALAYER_RUN_HOST
value: {{ .Values.library.env.DATALAYER_RUN_HOST }}
- name: DATALAYER_CDN_URL
value: {{ .Values.library.env.DATALAYER_CDN_URL }}
- name: DATALAYER_RUNTIME_ENV
value: {{ .Values.library.env.DATALAYER_RUNTIME_ENV }}
- name: DATALAYER_LIBRARY_AUTH_CALLBACK
value: {{ .Values.library.env.DATALAYER_LIBRARY_AUTH_CALLBACK }}
- name: DATALAYER_LIBRARY_UI_REDIRECT
value: {{ .Values.library.env.DATALAYER_LIBRARY_UI_REDIRECT }}
- name: DATALAYER_SOLR_ZK_HOST
value: {{ .Values.library.env.DATALAYER_SOLR_ZK_HOST }}
- name: DATALAYER_SOLR_USERNAME
value: {{ .Values.library.env.DATALAYER_SOLR_USERNAME }}
- name: DATALAYER_SOLR_PASSWORD
value: {{ .Values.library.env.DATALAYER_SOLR_PASSWORD }}
- name: AWS_ACCESS_KEY_ID
value: {{ .Values.library.env.AWS_ACCESS_KEY_ID }}
- name: AWS_SECRET_ACCESS_KEY
value: {{ .Values.library.env.AWS_SECRET_ACCESS_KEY }}
- name: AWS_DEFAULT_REGION
value: {{ .Values.library.env.AWS_DEFAULT_REGION }}
- name: DATALAYER_SMTP_HOST
value: {{ .Values.library.env.DATALAYER_SMTP_HOST }}
- name: DATALAYER_SMTP_PORT
value: {{ .Values.library.env.DATALAYER_SMTP_PORT | quote }}
- name: DATALAYER_SMTP_USERNAME
value: {{ .Values.library.env.DATALAYER_SMTP_USERNAME }}
- name: DATALAYER_SMTP_PASSWORD
value: {{ .Values.library.env.DATALAYER_SMTP_PASSWORD }}
- name: DATALAYER_JWT_CACHE_VALIDATE
value: "false"
- name: DATALAYER_JWT_ISSUER
value: {{ .Values.library.env.DATALAYER_JWT_ISSUER }}
- name: DATALAYER_JWT_SECRET
value: {{ .Values.library.env.DATALAYER_JWT_SECRET }}
- name: DATALAYER_JWT_ALGORITHM
value: {{ .Values.library.env.DATALAYER_JWT_ALGORITHM }}
- name: DATALAYER_AUTHZ_ENGINE
value: {{ .Values.library.env.DATALAYER_AUTHZ_ENGINE }}
- name: DATALAYER_OPENFGA_REST_URL
value: {{ .Values.library.env.DATALAYER_OPENFGA_REST_URL }}
- name: DATALAYER_OPENFGA_STORE_ID
value: {{ .Values.library.env.DATALAYER_OPENFGA_STORE_ID }}
- name: DATALAYER_OPENFGA_AUTHZ_MODEL_ID
value: {{ .Values.library.env.DATALAYER_OPENFGA_AUTHZ_MODEL_ID }}
{{- range $key, $value := .Values.library.envValueFrom }}
- name: {{ $key }}
valueFrom: {{- $value | toYaml | nindent 16 }}
{{- end }}
{{- range $key, $value := .Values.library.env }}
- name: {{ $key }}
value: {{ $value | quote }}
{{- end }}
{{- end }}
readinessProbe:
httpGet:
path: /api/library/version
3 changes: 3 additions & 0 deletions charts/datalayer-library/values.yaml
@@ -21,6 +21,7 @@ library:
DATALAYER_JWT_ISSUER: ""
DATALAYER_JWT_SECRET: ""
DATALAYER_JWT_ALGORITHM: ""
DATALAYER_JWT_CACHE_VALIDATE: "false"
DATALAYER_AUTHZ_ENGINE: ""
DATALAYER_OPENFGA_REST_URL: "http://datalayer-openfga.datalayer-openfga.svc.cluster.local:8080"
DATALAYER_OPENFGA_STORE_ID: ""
@@ -29,3 +30,5 @@ library:
DATALAYER_SMTP_PORT: ""
DATALAYER_SMTP_USERNAME: ""
DATALAYER_SMTP_PASSWORD: ""
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: ""
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: ""
33 changes: 33 additions & 0 deletions charts/datalayer-observer/Chart.yaml
@@ -0,0 +1,33 @@
apiVersion: v1
description: Datalayer Observer
name: datalayer-observer
version: 0.1.0
appVersion: 0.1.0
home: https://datalayer.io
sources:
- https://github.com/datalayer/helm-charts/tree/main/charts/datalayer-observer
icon: https://assets.datalayer.tech/datalayer-square.png
maintainers:
- name: Datalayer
email: [email protected]
url: https://datalayer.io
dependencies:
- name: crds
version: "0.0.0"
condition: crds.enabled
- name: kube-prometheus-stack
version: 61.x.x
repository: https://prometheus-community.github.io/helm-charts
condition: kube-prometheus-stack.enabled
- name: loki
version: 6.x.x
repository: https://grafana.github.io/helm-charts
condition: loki.enabled
- name: tempo
version: 1.x.x
repository: https://grafana.github.io/helm-charts
condition: tempo.enabled
- name: opentelemetry-operator
version: 0.64.x
repository: https://open-telemetry.github.io/opentelemetry-helm-charts
condition: opentelemetry-operator.enabled
167 changes: 167 additions & 0 deletions charts/datalayer-observer/README.md
@@ -1,3 +1,170 @@
[![Datalayer](https://assets.datalayer.tech/datalayer-25.svg)](https://datalayer.io)

# Datalayer Observer Helm Chart

Installs the observability tools for the Datalayer stack.

The tools used:
- OpenTelemetry Collector:
  - As a deployment, to proxy metrics and traces from Datalayer services to Prometheus and Tempo
  - As a daemonset, to parse pod log files and send them to Loki
- Prometheus: to gather metrics
- Tempo: to gather traces
- Loki: to gather logs
- AlertManager: to manage alerts
- Grafana: to visualize and analyze the telemetry

## How to install?

```
plane up datalayer-observer
```

If the installation fails because of the OpenTelemetry operator, the likely
cause is that its CRDs are not yet defined in the cluster. You can install them
manually from `plane/etc/helm/charts/datalayer-observer/charts/crds/crds`.

> [!NOTE]
> Helm should install them on the first installation, but CRD handling is a
> complex topic; see https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#install-a-crd-declaration-before-using-the-resource

## What is deployed?

This chart is built on top of multiple subcharts:
- kube-prometheus-stack: the full Prometheus stack, activating:
  - AlertManager
  - Grafana
  - Prometheus Operator
  - Prometheus
  - Prometheus Node Exporter
- loki
  - Loki as a single binary
- tempo
  - Tempo as a single binary
- opentelemetry-operator, using the collector-contrib image
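
Each of these subcharts is declared in `Chart.yaml` with a `condition` field, so it can presumably be switched on or off from the parent chart's values. The fragment below is a minimal sketch of such an override, assuming Helm evaluates those conditions in the usual way; the chosen `true`/`false` values are illustrative and are not taken from the chart's own `values.yaml`.

```yaml
# Hypothetical values override for the datalayer-observer chart.
# The keys mirror the `condition` entries declared in Chart.yaml.
crds:
  enabled: true            # install the CRDs bundled as a local subchart
kube-prometheus-stack:
  enabled: true            # AlertManager, Grafana, Prometheus, Node Exporter
loki:
  enabled: true            # logs backend
tempo:
  enabled: true            # traces backend
opentelemetry-operator:
  enabled: true            # operator managing the collectors and instrumentation
```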

In addition to the subchart elements, it creates:

- An OpenTelemetry collector as a singleton instance to proxy traces and metrics from services and remote kernels to Prometheus and Tempo
- An OpenTelemetry collector as a daemonset to parse the container log files and proxy them to Loki
- An OpenTelemetry instrumentation to add Python auto-instrumentation to Jupyter Server when a pod is created
- A custom ingress for Grafana that uses a configuration similar to that of the Datalayer services
- A service monitor to tell Prometheus to fetch the metrics from the OpenTelemetry collector singleton
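
As an illustration of the singleton collector listed above, the resource could look roughly like the sketch below. This is not the chart's actual manifest. The name `datalayer-collector` is inferred from the service name used elsewhere in this README, the `opentelemetry.io/v1beta1` API version assumes a recent operator release, and the Tempo address and the Prometheus exporter port are hypothetical.

```yaml
# Sketch only: a deployment-mode collector that receives OTLP data,
# forwards traces to Tempo and exposes metrics for Prometheus to scrape.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: datalayer-collector          # assumption: the operator appends "-collector" to the service name
  namespace: datalayer-observer
spec:
  mode: deployment                   # singleton instance, as opposed to `daemonset`
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}                   # used by the Datalayer services
          http: {}                   # used by the Python auto-instrumentation
    exporters:
      otlp/tempo:
        endpoint: tempo.datalayer-observer.svc.cluster.local:4317   # hypothetical address
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889       # hypothetical port scraped through the ServiceMonitor
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
```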


```mermaid
flowchart LR
subgraph node1
subgraph pod1
ai[Auto instrumentation]-.->rk[remote kernels]
oc[Operator Companion]
end
lc[Log collector]-- parse logs -->rk
lc-- parse logs -->oc
ne[Node exporter]
end
lc-. send logs .->Loki
pr-->ne
subgraph node2
subgraph pod2
iam
end
lc2[Log collector]-- parse logs -->iam
ne2[Node exporter]
end
lc2-. send logs .->Loki
pr-->ne2
otelc[OpenTelemetry Collector]
iam-- metrics & traces -->otelc
pr[Prometheus]-- metrics -->otelc
rk-- metrics & traces -->otelc
oc-- metrics & traces -->otelc
otelc-- traces -->Tempo
Grafana-->Tempo
Grafana-->Loki
Grafana-->pr
style pr stroke:salmon
style lc stroke:green
style lc2 stroke:green
style Loki stroke:green
style Tempo stroke:salmon
linkStyle 1,2,3,5,6 stroke:green
```

## Tips and tricks

### Prometheus

Prometheus gets its data source definitions from the `PodMonitor` and
`ServiceMonitor` (recommended) CRs. Third-party components that don't support
OpenTelemetry metrics use such monitors and are therefore
not proxied by the OpenTelemetry collector. For now:
- `ServiceMonitor`: used by Grafana, AlertManager, Loki, Tempo, Prometheus, PrometheusOperator, Prometheus Node Exporter and the OpenTelemetry Collector singleton.
  - To be detected by Prometheus, the ServiceMonitor must have these two labels:

    ```
    monitoring.datalayer.io/instance: "observer"
    monitoring.datalayer.io/enabled: "true"
    ```

  - Kubernetes metrics are also gathered through service monitors defined in the kube-prometheus-stack.

- `PodMonitor`: used by the Pulsar stack (default in the Helm chart).
  - A PodMonitor can be defined in any namespace.
  - To be detected by Prometheus, the PodMonitor must have the label `app=pulsar`. Other app names can be defined in `kube-prometheus-stack.prometheus.prometheusSpec.podMonitorSelector`.
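
To make the label requirements above concrete, here is a minimal sketch of both monitor kinds. Only the labels under `metadata.labels` come from this README; the resource names, namespace, selectors and port names are hypothetical and would need to match the actual Service or pods.

```yaml
# ServiceMonitor carrying the two labels required for detection.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-service              # hypothetical
  namespace: datalayer-observer      # hypothetical
  labels:
    monitoring.datalayer.io/instance: "observer"
    monitoring.datalayer.io/enabled: "true"
spec:
  selector:
    matchLabels:
      app: example-service           # hypothetical label on the target Service
  endpoints:
    - port: metrics                  # hypothetical name of the Service port
      interval: 30s
---
# PodMonitor carrying the `app=pulsar` label expected by the selector.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-pulsar-broker        # hypothetical
  labels:
    app: pulsar
spec:
  selector:
    matchLabels:
      component: broker              # hypothetical label on the target pods
  podMetricsEndpoints:
    - port: http                     # hypothetical name of the container port
```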

### Instrumentation

#### Datalayer services

The services based on connexion are instrumented explicitly using the code
defined in `datalayer_common.instrumentation`. A custom version of the
Python ASGI instrumentation was needed, in particular to push the HTTP route
metadata.

> [!IMPORTANT]
> The logging instrumentor is used and, by default, it calls `basicConfig`,
> so the service must not call it itself.

Configuring the metrics and traces targets is done through environment variables:

```
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317
```
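
When a service is deployed through its Helm chart, the same variables can be set as values. The fragment below is a sketch for the library chart, assuming its `library.env` map is rendered into the container environment (as the `range` loop in its deployment template suggests) and that the collector is reachable at the address shown above.

```yaml
# Hypothetical values override for the datalayer-library chart.
library:
  env:
    # Both endpoints point at the collector singleton over gRPC (port 4317).
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: "http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: "http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
```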

> [!NOTE]
> Currently the data is sent using gRPC. HTTP is also available, but using it
> would require changing the instrumentation code because a different library is needed.

#### Jupyter Remote Kernels

The remote kernel pods are auto-instrumented by the OpenTelemetry operator
via an `Instrumentation` CR.

That CR must be defined in the namespace where the pods will be created, and
the instrumentation occurs only at pod creation.
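
For orientation, an `Instrumentation` resource could look like the sketch below. It is not the chart's actual manifest: the resource name and namespace are hypothetical, and the exporter endpoint assumes the collector singleton listens for OTLP over HTTP on the conventional port 4318.

```yaml
# Sketch only: Instrumentation CR living in the namespace where kernel pods are created.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation       # hypothetical
  namespace: example-kernels         # hypothetical: must be the pods' namespace
spec:
  exporter:
    # Assumption: OTLP/HTTP endpoint of the collector singleton.
    endpoint: http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
```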

A pod is selected for instrumentation if it carries specific annotations. In
this case, to instrument Python on a multi-container pod:


```yaml
instrumentation.opentelemetry.io/inject-python: "true"
instrumentation.opentelemetry.io/container-names: "{KERNEL_CONTAINER_NAME}"
```

> See https://github.com/open-telemetry/opentelemetry-operator?tab=readme-ov-file#opentelemetry-auto-instrumentation-injection for more information and available options (to be set through environment variables).

The Python auto-instrumentation uses HTTP to send data to the OpenTelemetry Collector.
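
To show where the annotations above go, here is a sketch of a multi-container pod in which only one container is instrumented. The pod name, container names and images are hypothetical; only the two annotation keys come from this README.

```yaml
# Sketch only: the annotations sit in the pod metadata, not on the containers.
apiVersion: v1
kind: Pod
metadata:
  name: example-kernel                         # hypothetical
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
    # Restrict injection to the kernel container of this pod.
    instrumentation.opentelemetry.io/container-names: "kernel"
spec:
  containers:
    - name: kernel                             # hypothetical: instrumented container
      image: example/kernel:latest             # hypothetical image
    - name: companion                          # hypothetical: left untouched
      image: example/companion:latest          # hypothetical image
```

Because injection happens only at pod creation, adding the annotations to a running pod should have no effect until the pod is recreated.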

## TODO

- [ ] Drop the Prometheus Node Exporter and use the OpenTelemetry Collector daemonset instead
- [ ] Use OpenTelemetry to gather Kubernetes metrics (needed?); for now the system is monitored by Prometheus directly.
- [ ] What about storage?
- [ ] Link traces -> metrics (exemplars) and traces -> logs (tags?)
- [ ] What about accounts and roles?
- [ ] Fix Traefik observability
- [ ] Some logs don't have a trace ID in Datalayer services
3 changes: 3 additions & 0 deletions charts/datalayer-observer/charts/crds/Chart.yaml
@@ -0,0 +1,3 @@
apiVersion: v2
name: crds
version: 0.0.0