add promql-to-scrape (#74)
* add an agent to promql-to-dd

just because im here

* promql-to-scrape

add basic example for scraping metrics out of the Temporal Cloud observability endpoint
and exposing a /metrics endpoint

* add Dockerfile, examples, and a README

* PR feedback

* add container image to example deployment
TimSimmons authored Nov 17, 2023
1 parent 156add8 commit 1ff2f62
Showing 18 changed files with 938 additions and 0 deletions.
3 changes: 3 additions & 0 deletions cloud/observability/promql-to-dd-go/prometheus/http.go
@@ -45,6 +45,9 @@ func (c *HttpClient) Do(ctx context.Context, req *http.Request) (*http.Response,
	if ctx != nil {
		req = req.WithContext(ctx)
	}

	req.Header.Set("User-Agent", "promql-to-dd")

	resp, err := c.Client.Do(req)
	defer func() {
		if resp != nil {
11 changes: 11 additions & 0 deletions cloud/observability/promql-to-scrape/Dockerfile
@@ -0,0 +1,11 @@
FROM golang:1.21-alpine

WORKDIR /usr/src/app

COPY go.mod go.sum ./
RUN go mod download && go mod verify

COPY . .
RUN go build -v -o /usr/local/bin/promql-to-scrape ./cmd/promql-to-scrape/main.go

ENTRYPOINT ["/usr/local/bin/promql-to-scrape"]
47 changes: 47 additions & 0 deletions cloud/observability/promql-to-scrape/README.md
@@ -0,0 +1,47 @@
# promql-to-scrape

This basic application is an example of how to use the Temporal Cloud observability endpoint to expose a typical Prometheus `/metrics` endpoint.

**This example is provided as-is, without support. It is intended as reference material only.**

## How to Use

Place your client cert and key at `client.crt` and `tls.key`. You also need the account number of a Temporal Cloud account that has the observability endpoint enabled; it replaces `<account>` in the endpoint URL.

```
go mod tidy
go build -o promql-to-scrape cmd/promql-to-scrape/main.go
./promql-to-scrape -client-cert client.crt -client-key tls.key -prom-endpoint https://<account>.tmprl.cloud/prometheus --config-file examples/config.yaml --debug
...
time=2023-11-16T17:43:20.260-06:00 level=DEBUG msg="successful metric retrieval" time=3.529039083s
```

This means you can now hit http://localhost:9001/metrics on your machine and see your metrics.
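
For orientation, here is a minimal sketch of what serving such an endpoint involves, using only the Go standard library. It is not the implementation in this commit (that lives in the `internal` package and may work quite differently); the `sample` type, `renderSample`, `metricsHandler`, and the demo values are hypothetical names and data chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strconv"
	"strings"
)

// sample is one query result: an exported metric name, its labels, and a value.
// The type is hypothetical and exists only for this sketch.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// labelEscaper escapes backslash, double quote, and newline in label values,
// as the Prometheus text exposition format requires.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// renderSample formats one sample as a line of text exposition format.
func renderSample(s sample) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(s.Labels[k])+`"`)
	}
	value := strconv.FormatFloat(s.Value, 'g', -1, 64)
	if len(pairs) == 0 {
		return s.Name + " " + value
	}
	return s.Name + "{" + strings.Join(pairs, ",") + "} " + value
}

// metricsHandler serves a fixed set of samples. A real server would refresh
// the samples periodically from the upstream queries.
func metricsHandler(samples []sample) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		for _, s := range samples {
			fmt.Fprintln(w, renderSample(s))
		}
	})
}

// demoSample is made-up data used only to exercise the handler.
var demoSample = sample{
	Name:   "temporal_cloud_v0_frontend_service_request_count:rate1m",
	Labels: map[string]string{"temporal_namespace": "demo", "operation": "StartWorkflowExecution"},
	Value:  0.25,
}

func main() {
	// Exercise the handler in-process instead of binding a port.
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	metricsHandler([]sample{demoSample}).ServeHTTP(rec, req)
	fmt.Print(rec.Body.String())
}
```

To serve this for real, the handler would be mounted with `http.Handle("/metrics", ...)` and a listener on the `-bind` address.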

### Important Usability Information

**Important:** Scrape this endpoint with a **60s** scrape interval unless you are meaningfully modifying this code. The example queries all use a 1-minute rate window, and the scrape interval should match it.

**Very Important:** The data you see here is delayed by approximately 1 minute (provided you follow the guidance above). Because metrics are aggregated before they are presented to you, this application evaluates each query 60 seconds in the past. Otherwise the aggregation would not yet be complete, and the queries would return no results.
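
The 60-second lookback can be sketched as below. This assumes the observability endpoint speaks the standard Prometheus HTTP API (an instant query at `/api/v1/query` with `query` and `time` parameters); `instantQueryURL`, `queryDelay`, `demoNow`, and the example host are illustrative names, not taken from this repository.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// queryDelay is how far in the past each instant query is evaluated,
// giving upstream aggregation time to complete.
const queryDelay = 60 * time.Second

// demoNow is a fixed timestamp so the example output is reproducible.
var demoNow = time.Unix(1700000000, 0)

// instantQueryURL builds a Prometheus instant-query URL whose evaluation
// time is now minus queryDelay, expressed as a Unix timestamp.
func instantQueryURL(base, query string, now time.Time) string {
	params := url.Values{}
	params.Set("query", query)
	params.Set("time", strconv.FormatInt(now.Add(-queryDelay).Unix(), 10))
	return base + "/api/v1/query?" + params.Encode()
}

func main() {
	// In a real scraper, pass time.Now() instead of demoNow.
	fmt.Println(instantQueryURL("https://example.invalid/prometheus", "up", demoNow))
	// prints https://example.invalid/prometheus/api/v1/query?query=up&time=1699999940
}
```

A sample returned for that timestamp describes the state of the system one minute ago, which is where the delay mentioned above comes from.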

## Deployment

Some example Kubernetes manifests are provided in the `/examples` directory. Filling in your certificates and account should get you going pretty quickly.

## Generating Config

There is a second binary you can build that can help you build a default configuration of queries to scrape and export.

```
go build -o genconfig cmd/genconfig/main.go
./genconfig -client-cert client.crt -client-key tls.key -prom-endpoint https://<account>.tmprl.cloud/prometheus
...
```

This will generate an example config at `config.yaml` that you may use. It looks for all the existing metrics and generates a reasonable query for you to export.
- For counters, a `rate(counter[1m])`
- For gauges, it simply queries for `gauge`
- For histograms, it does a p99 aggregated by `temporal_namespace` and `operation`: `histogram_quantile(0.99, sum(rate(metric[1m])) by (le, operation, temporal_namespace))`

Modify at your own risk. You may, for instance, want a global latency across all namespaces; add queries like that to your config file.
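
The per-type defaults above can be sketched as a small function. The `rule` type, `defaultRule`, and `globalP99` are hypothetical names for this illustration; the global-latency rule and its `_global` suffix are assumptions about what such an extra query could look like, not something the generator produces.

```go
package main

import "fmt"

// metricKind distinguishes the three metric types the generator handles.
type metricKind int

const (
	counter metricKind = iota
	gauge
	histogram
)

// rule pairs an exported metric name with the PromQL query that feeds it,
// matching the metric_name/query keys of the config file.
type rule struct {
	MetricName string
	Query      string
}

// defaultRule applies the per-type defaults listed above.
func defaultRule(kind metricKind, name string) rule {
	switch kind {
	case counter:
		return rule{MetricName: name + ":rate1m", Query: fmt.Sprintf("rate(%s[1m])", name)}
	case histogram:
		return rule{
			MetricName: name + ":histogram_quantile_p99_1m",
			Query:      fmt.Sprintf("histogram_quantile(0.99, sum(rate(%s[1m])) by (le, operation, temporal_namespace))", name),
		}
	default: // gauges are exported as-is
		return rule{MetricName: name, Query: name}
	}
}

// globalP99 is a hypothetical extra rule: p99 latency across all namespaces
// and operations, obtained by aggregating on le only.
func globalP99(name string) rule {
	return rule{
		MetricName: name + ":histogram_quantile_p99_1m_global", // made-up suffix
		Query:      fmt.Sprintf("histogram_quantile(0.99, sum(rate(%s[1m])) by (le))", name),
	}
}

func main() {
	rules := []rule{
		defaultRule(counter, "temporal_cloud_v0_total_action_count"),
		defaultRule(gauge, "temporal_cloud_v0_frontend_service_pending_requests"),
		defaultRule(histogram, "temporal_cloud_v0_service_latency_bucket"),
		globalP99("temporal_cloud_v0_service_latency_bucket"),
	}
	// Print the rules in the same shape as entries in the config file.
	for _, r := range rules {
		fmt.Printf("- metric_name: %s\n  query: %s\n", r.MetricName, r.Query)
	}
}
```

Dropping `operation` and `temporal_namespace` from the `by` clause is what turns the per-namespace quantile into a global one; `le` must stay because `histogram_quantile` needs the bucket boundaries.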
84 changes: 84 additions & 0 deletions cloud/observability/promql-to-scrape/cmd/genconfig/main.go
@@ -0,0 +1,84 @@
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"sort"

	"github.com/temporalio/samples-server/cloud/observability/promql-to-scrape/internal"

	"gopkg.in/yaml.v3"
)

func main() {
	set := flag.NewFlagSet("app", flag.ExitOnError)
	promURL := set.String("prom-endpoint", "", "Required Prometheus API endpoint for the server eg. https://<account>.tmprl.cloud/prometheus")
	serverRootCACert := set.String("server-root-ca-cert", "", "Optional path to root server CA cert")
	clientCert := set.String("client-cert", "", "Required path to client cert")
	clientKey := set.String("client-key", "", "Required path to client key")
	serverName := set.String("server-name", "", "Optional server name to use for verifying the server's certificate")
	insecureSkipVerify := set.Bool("insecure-skip-verify", false, "Skip verification of the server's certificate and host name")

	if err := set.Parse(os.Args[1:]); err != nil {
		log.Fatalf("failed parsing args: %s", err)
	} else if *promURL == "" || *clientCert == "" || *clientKey == "" {
		log.Fatalf("-prom-endpoint, -client-cert and -client-key are required")
	}

	client, err := internal.NewAPIClient(
		internal.APIConfig{
			TargetHost:         *promURL,
			ServerRootCACert:   *serverRootCACert,
			ClientCert:         *clientCert,
			ClientKey:          *clientKey,
			ServerName:         *serverName,
			InsecureSkipVerify: *insecureSkipVerify,
		},
	)
	if err != nil {
		log.Fatalf("Failed to create Prometheus client: %s", err)
	}

	counters, gauges, histograms, err := client.ListMetrics("temporal_cloud_v0")
	if err != nil {
		log.Fatalf("Failed to pull metric names: %s", err)
	}
	fmt.Println(counters)
	fmt.Println(gauges)
	fmt.Println(histograms)

	conf := internal.Config{}

	for _, counter := range counters {
		conf.Metrics = append(conf.Metrics, internal.Metric{
			MetricName: fmt.Sprintf("%s:rate1m", counter),
			Query:      fmt.Sprintf("rate(%s[1m])", counter),
		})
	}
	for _, gauge := range gauges {
		conf.Metrics = append(conf.Metrics, internal.Metric{
			MetricName: gauge,
			Query:      gauge,
		})
	}
	for _, histogram := range histograms {
		conf.Metrics = append(conf.Metrics, internal.Metric{
			MetricName: fmt.Sprintf("%s:histogram_quantile_p99_1m", histogram),
			Query:      fmt.Sprintf("histogram_quantile(0.99, sum(rate(%s[1m])) by (le, operation, temporal_namespace))", histogram),
		})
	}

	sort.Sort(internal.ByMetricName(conf.Metrics))

	yamlData, err := yaml.Marshal(&conf)
	if err != nil {
		log.Fatalf("error marshalling yaml: %v", err)
	}

	err = os.WriteFile("config.yaml", yamlData, 0644)
	if err != nil {
		log.Fatalf("error: %v", err)
	}
}
59 changes: 59 additions & 0 deletions cloud/observability/promql-to-scrape/cmd/promql-to-scrape/main.go
@@ -0,0 +1,59 @@
package main

import (
	"flag"
	"log"
	"os"

	"github.com/temporalio/samples-server/cloud/observability/promql-to-scrape/internal"

	"golang.org/x/exp/slog"
)

func main() {
	set := flag.NewFlagSet("promql-to-scrape", flag.ExitOnError)
	promURL := set.String("prom-endpoint", "", "Required Prometheus API endpoint for the server eg. https://<account>.tmprl.cloud/prometheus")
	configFile := set.String("config-file", "", "Config file for promql-to-scrape")
	serverRootCACert := set.String("server-root-ca-cert", "", "Optional path to root server CA cert")
	clientCert := set.String("client-cert", "", "Required path to client cert")
	clientKey := set.String("client-key", "", "Required path to client key")
	serverName := set.String("server-name", "", "Optional server name to use for verifying the server's certificate")
	insecureSkipVerify := set.Bool("insecure-skip-verify", false, "Skip verification of the server's certificate and host name")
	serverAddr := set.String("bind", "0.0.0.0:9001", "address:port to expose the metrics server on")
	debugLogging := set.Bool("debug", false, "Toggle debug logging")

	if err := set.Parse(os.Args[1:]); err != nil {
		log.Fatalf("failed parsing args: %v", err)
	} else if *clientCert == "" || *clientKey == "" || *configFile == "" || *promURL == "" {
		log.Fatalf("-client-cert, -client-key, -config-file, -prom-endpoint are required")
	}

	logLevel := slog.LevelInfo
	if *debugLogging {
		logLevel = slog.LevelDebug
	}
	h := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: logLevel})
	slog.SetDefault(slog.New(h))

	client, err := internal.NewAPIClient(
		internal.APIConfig{
			TargetHost:         *promURL,
			ServerRootCACert:   *serverRootCACert,
			ClientCert:         *clientCert,
			ClientKey:          *clientKey,
			ServerName:         *serverName,
			InsecureSkipVerify: *insecureSkipVerify,
		},
	)
	if err != nil {
		log.Fatalf("failed to create Prometheus client: %v", err)
	}

	conf, err := internal.LoadConfig(*configFile)
	if err != nil {
		log.Fatalf("failed to load config file: %v", err)
	}

	s := internal.NewPromToScrapeServer(client, conf, *serverAddr)
	s.Start()
}
43 changes: 43 additions & 0 deletions cloud/observability/promql-to-scrape/examples/config.yaml
@@ -0,0 +1,43 @@
metrics:
  - metric_name: temporal_cloud_v0_frontend_service_error_count:rate1m
    query: rate(temporal_cloud_v0_frontend_service_error_count[1m])
  - metric_name: temporal_cloud_v0_frontend_service_pending_requests
    query: temporal_cloud_v0_frontend_service_pending_requests
  - metric_name: temporal_cloud_v0_frontend_service_request_count:rate1m
    query: rate(temporal_cloud_v0_frontend_service_request_count[1m])
  - metric_name: temporal_cloud_v0_poll_success_count:rate1m
    query: rate(temporal_cloud_v0_poll_success_count[1m])
  - metric_name: temporal_cloud_v0_poll_success_sync_count:rate1m
    query: rate(temporal_cloud_v0_poll_success_sync_count[1m])
  - metric_name: temporal_cloud_v0_poll_timeout_count:rate1m
    query: rate(temporal_cloud_v0_poll_timeout_count[1m])
  - metric_name: temporal_cloud_v0_resource_exhausted_error_count:rate1m
    query: rate(temporal_cloud_v0_resource_exhausted_error_count[1m])
  - metric_name: temporal_cloud_v0_schedule_action_success_count:rate1m
    query: rate(temporal_cloud_v0_schedule_action_success_count[1m])
  - metric_name: temporal_cloud_v0_schedule_buffer_overruns_count:rate1m
    query: rate(temporal_cloud_v0_schedule_buffer_overruns_count[1m])
  - metric_name: temporal_cloud_v0_schedule_missed_catchup_window_count:rate1m
    query: rate(temporal_cloud_v0_schedule_missed_catchup_window_count[1m])
  - metric_name: temporal_cloud_v0_service_latency_bucket:histogram_quantile_p99_1m
    query: histogram_quantile(0.99, sum(rate(temporal_cloud_v0_service_latency_bucket[1m])) by (le, operation, temporal_namespace))
  - metric_name: temporal_cloud_v0_service_latency_count:rate1m
    query: rate(temporal_cloud_v0_service_latency_count[1m])
  - metric_name: temporal_cloud_v0_service_latency_sum:rate1m
    query: rate(temporal_cloud_v0_service_latency_sum[1m])
  - metric_name: temporal_cloud_v0_state_transition_count:rate1m
    query: rate(temporal_cloud_v0_state_transition_count[1m])
  - metric_name: temporal_cloud_v0_total_action_count:rate1m
    query: rate(temporal_cloud_v0_total_action_count[1m])
  - metric_name: temporal_cloud_v0_workflow_cancel_count:rate1m
    query: rate(temporal_cloud_v0_workflow_cancel_count[1m])
  - metric_name: temporal_cloud_v0_workflow_continued_as_new_count:rate1m
    query: rate(temporal_cloud_v0_workflow_continued_as_new_count[1m])
  - metric_name: temporal_cloud_v0_workflow_failed_count:rate1m
    query: rate(temporal_cloud_v0_workflow_failed_count[1m])
  - metric_name: temporal_cloud_v0_workflow_success_count:rate1m
    query: rate(temporal_cloud_v0_workflow_success_count[1m])
  - metric_name: temporal_cloud_v0_workflow_terminate_count:rate1m
    query: rate(temporal_cloud_v0_workflow_terminate_count[1m])
  - metric_name: temporal_cloud_v0_workflow_timeout_count:rate1m
    query: rate(temporal_cloud_v0_workflow_timeout_count[1m])
49 changes: 49 additions & 0 deletions cloud/observability/promql-to-scrape/examples/configmap.yaml
@@ -0,0 +1,49 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: promql-to-scrape-config
data:
  config.yaml: |
    metrics:
      - metric_name: temporal_cloud_v0_frontend_service_error_count:rate1m
        query: rate(temporal_cloud_v0_frontend_service_error_count[1m])
      - metric_name: temporal_cloud_v0_frontend_service_pending_requests
        query: temporal_cloud_v0_frontend_service_pending_requests
      - metric_name: temporal_cloud_v0_frontend_service_request_count:rate1m
        query: rate(temporal_cloud_v0_frontend_service_request_count[1m])
      - metric_name: temporal_cloud_v0_poll_success_count:rate1m
        query: rate(temporal_cloud_v0_poll_success_count[1m])
      - metric_name: temporal_cloud_v0_poll_success_sync_count:rate1m
        query: rate(temporal_cloud_v0_poll_success_sync_count[1m])
      - metric_name: temporal_cloud_v0_poll_timeout_count:rate1m
        query: rate(temporal_cloud_v0_poll_timeout_count[1m])
      - metric_name: temporal_cloud_v0_resource_exhausted_error_count:rate1m
        query: rate(temporal_cloud_v0_resource_exhausted_error_count[1m])
      - metric_name: temporal_cloud_v0_schedule_action_success_count:rate1m
        query: rate(temporal_cloud_v0_schedule_action_success_count[1m])
      - metric_name: temporal_cloud_v0_schedule_buffer_overruns_count:rate1m
        query: rate(temporal_cloud_v0_schedule_buffer_overruns_count[1m])
      - metric_name: temporal_cloud_v0_schedule_missed_catchup_window_count:rate1m
        query: rate(temporal_cloud_v0_schedule_missed_catchup_window_count[1m])
      - metric_name: temporal_cloud_v0_service_latency_bucket:histogram_quantile_p99_1m
        query: histogram_quantile(0.99, sum(rate(temporal_cloud_v0_service_latency_bucket[1m])) by (le, operation, temporal_namespace))
      - metric_name: temporal_cloud_v0_service_latency_count:rate1m
        query: rate(temporal_cloud_v0_service_latency_count[1m])
      - metric_name: temporal_cloud_v0_service_latency_sum:rate1m
        query: rate(temporal_cloud_v0_service_latency_sum[1m])
      - metric_name: temporal_cloud_v0_state_transition_count:rate1m
        query: rate(temporal_cloud_v0_state_transition_count[1m])
      - metric_name: temporal_cloud_v0_total_action_count:rate1m
        query: rate(temporal_cloud_v0_total_action_count[1m])
      - metric_name: temporal_cloud_v0_workflow_cancel_count:rate1m
        query: rate(temporal_cloud_v0_workflow_cancel_count[1m])
      - metric_name: temporal_cloud_v0_workflow_continued_as_new_count:rate1m
        query: rate(temporal_cloud_v0_workflow_continued_as_new_count[1m])
      - metric_name: temporal_cloud_v0_workflow_failed_count:rate1m
        query: rate(temporal_cloud_v0_workflow_failed_count[1m])
      - metric_name: temporal_cloud_v0_workflow_success_count:rate1m
        query: rate(temporal_cloud_v0_workflow_success_count[1m])
      - metric_name: temporal_cloud_v0_workflow_terminate_count:rate1m
        query: rate(temporal_cloud_v0_workflow_terminate_count[1m])
      - metric_name: temporal_cloud_v0_workflow_timeout_count:rate1m
        query: rate(temporal_cloud_v0_workflow_timeout_count[1m])
47 changes: 47 additions & 0 deletions cloud/observability/promql-to-scrape/examples/deployment.yaml
@@ -0,0 +1,47 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promql-to-scrape
  labels:
    app: promql-to-scrape
spec:
  replicas: 1
  selector:
    matchLabels:
      app: promql-to-scrape
  template:
    metadata:
      labels:
        app: promql-to-scrape
    spec:
      containers:
        - name: promql-to-scrape
          image: ghcr.io/temporalio/promql-to-scrape:7c0e91a
          args:
            - --client-cert=/var/run/secrets/ca_crt
            - --client-key=/var/run/secrets/ca_key
            - --prom-endpoint=https://<account>.tmprl.cloud/prometheus
            - --config-file=/etc/promql-to-scrape/config.yaml
            - --debug
          ports:
            - containerPort: 9001
          volumeMounts:
            - name: secrets
              mountPath: /var/run/secrets
              readOnly: true
            - name: config-volume
              mountPath: /etc/promql-to-scrape
          resources:
            limits:
              cpu: "100m"
              memory: "256Mi"
      volumes:
        - name: secrets
          secret:
            secretName: promql-to-scrape-secrets
        - name: config-volume
          configMap:
            name: promql-to-scrape-config
            items:
              - key: config.yaml
                path: config.yaml
10 changes: 10 additions & 0 deletions cloud/observability/promql-to-scrape/examples/secret.yaml
@@ -0,0 +1,10 @@
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: promql-to-scrape-secrets
  labels:
    app: promql-to-scrape
data:
  ca_crt: "<cert | base64>"
  ca_key: "<key | base64>"
17 changes: 17 additions & 0 deletions cloud/observability/promql-to-scrape/go.mod
@@ -0,0 +1,17 @@
module github.com/temporalio/samples-server/cloud/observability/promql-to-scrape

go 1.21

require (
	github.com/prometheus/client_golang v1.17.0
	github.com/prometheus/common v0.45.0
	golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa
	gopkg.in/yaml.v3 v3.0.1
)

require (
	github.com/json-iterator/go v1.1.12 // indirect
	github.com/kr/text v0.2.0 // indirect
	github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
	github.com/modern-go/reflect2 v1.0.2 // indirect
)