Migrated metrics to prometheus (#434)
* allegro-internal/flex-roadmap#819 Migrated metrics to prometheus
nastassia-dailidava authored Oct 9, 2024
1 parent 58be4d6 commit f43cedc
Showing 22 changed files with 457 additions and 178 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

Lists all changes with user impact.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
## [0.22.2]
### Changed
- Migrated metrics to prometheus

## [0.22.1]
### Changed
2 changes: 1 addition & 1 deletion build.gradle
@@ -55,7 +55,7 @@ allprojects {
bytebuddy : '1.15.1',
re2j : '1.3',
xxhash : '0.10.1',
dropwizard : '4.2.26'
dropwizard : '4.2.26',
]

dependencyManagement {
56 changes: 21 additions & 35 deletions docs/deployment/observability.md
@@ -5,13 +5,17 @@
Envoy Control uses [SLF4J](https://www.slf4j.org/) with [Logback](https://logback.qos.ch/) for logging.

To override the default settings, point a file via environment variable

```bash
export ENVOY_CONTROL_RUNNER_OPTS="-Dlogging.config=/path/to/logback/logback.xml"
```

and then run the `bin/envoy-control-runner` created from `distZip` task.

`java-control-plane` produces quite a lot of logging on `INFO` level. Consider switching it to `WARN`

```xml

<logger name="io.envoyproxy.controlplane.cache.SimpleCache" level="WARN"/>
<logger name="io.envoyproxy.controlplane.cache.DiscoveryServer" level="WARN"/>
```
@@ -25,55 +29,37 @@ Sample logger configuration is available here.

### Envoy Control

Metric | Description
-----------------------------| -----------------------------------
**services.added** | Counter of added services events
**services.removed** | Counter of removed services events
**services.instanceChanged** | Counter of instance change events
Metric | Description | Labels
----------------------|------------------------------------|--------------------------------
**watch** | Counter of watched services events | status (added/removed/instances-changed/snapshot-changed), watch-type, metric-emitter

Standard [Spring metrics](https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-meter) (JVM, CPU, HTTP server) are also included.
Standard [Spring metrics](https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-meter) (
JVM, CPU, HTTP server) are also included.

### Envoy Control Runner

Envoy Control Runner exposes a set of metrics on standard Spring Actuator's `/actuator/metrics` endpoint.

#### xDS connections

Metric | Description
-----------------------------| --------------------------------------------------------
**grpc.connections.ads** | Number of running gRPC ADS connections
**grpc.connections.cds** | Number of running gRPC CDS connections
**grpc.connections.eds** | Number of running gRPC EDS connections
**grpc.connections.lds** | Number of running gRPC LDS connections
**grpc.connections.rds** | Number of running gRPC RDS connections
**grpc.connections.sds** | Number of running gRPC SDS connections
**grpc.connections.unknown** | Number of running gRPC connections for unknown resource
Metric | Description | Labels
----------------------|----------------------------------------------------|------------------------------------
**connections** | Number of running gRPC connections of a given type | stream-type (cds/xds/lds/rds/sds/unknown), connection-type (grpc)

#### xDS requests

Metric | Description
------------------------------- | --------------------------------------------------------
**grpc.requests.cds** | Counter of received gRPC CDS requests
**grpc.requests.eds** | Counter of received gRPC EDS requests
**grpc.requests.lds** | Counter of received gRPC LDS requests
**grpc.requests.rds** | Counter of received gRPC RDS requests
**grpc.requests.sds** | Counter of received gRPC SDS requests
**grpc.requests.unknown** | Counter of received gRPC requests for unknown resource
**grpc.requests.cds.delta** | Counter of received gRPC delta CDS requests
**grpc.requests.eds.delta** | Counter of received gRPC delta EDS requests
**grpc.requests.lds.delta** | Counter of received gRPC delta LDS requests
**grpc.requests.rds.delta** | Counter of received gRPC delta RDS requests
**grpc.requests.sds.delta** | Counter of received gRPC delta SDS requests
**grpc.requests.unknown.delta** | Counter of received gRPC delta requests for unknown resource
Metric | Description | Labels
-------------------------|---------------------------------------------------|--------------------------------------------------------------
**requests.total** | Counter of received gRPC requests of a given type | stream-type (cds/xds/lds/rds/sds/unknown), connection-type (grpc), discovery-request-type(total/delta)

#### Snapshot

Metric | Description
-------------------------| ----------------------------------
**cache.groupCount** | Number of unique groups in SnapshotCache
Metric | Description | Labels
------------------------|------------------------------------------|--------
**cache.groups.count** | Number of unique groups in SnapshotCache | -

#### Synchronization

Metric | Description
----------------------------------------| -------------------------------------------------
**cross-dc-synchronization.$dc.errors** | Counter of synchronization errors for given DC
Metric | Description | Labels
-------------------------------------------|----------------------------------------------------------------|----------------------------------------------
**errors.total** | Counter of synchronization errors for a given DC and operation | cluster, operation (get-instances/get-state)
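
The following is an illustration and not part of the commit. The tables above replace one metric name per resource type with a single metric name plus labels. A minimal Kotlin sketch of that pattern with Micrometer, assuming the metric and tag names shown in the `requests.total` row; the helper functions `recordRequest` and `countFor` are hypothetical, and the project's own constants (such as `REQUESTS_METRIC`) may hold different values:

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

// Assumption: name taken from the documentation table, not from the project's constants.
const val REQUESTS = "requests.total"

// Hypothetical helper: increments one counter series identified by its tag values.
fun recordRequest(registry: MeterRegistry, streamType: String, requestType: String) {
    registry.counter(
        REQUESTS,
        Tags.of(
            "connection-type", "grpc",
            "stream-type", streamType,
            "discovery-request-type", requestType
        )
    ).increment()
}

// Hypothetical helper: reads back the count of a single tagged series.
// Throws if no counter with these tags has been registered yet.
fun countFor(registry: MeterRegistry, streamType: String, requestType: String): Double =
    registry.get(REQUESTS)
        .tag("stream-type", streamType)
        .tag("discovery-request-type", requestType)
        .counter()
        .count()

fun main() {
    // In-memory registry, used here only so the sketch runs without a Prometheus backend.
    val registry = SimpleMeterRegistry()
    recordRequest(registry, "cds", "total")
    recordRequest(registry, "cds", "total")
    recordRequest(registry, "cds", "delta")
    println(countFor(registry, "cds", "total"))
    println(countFor(registry, "cds", "delta"))
}
```

Each distinct combination of tag values becomes its own time series, so a dashboard can aggregate over `stream-type` or filter on it instead of enumerating metric names.
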
1 change: 1 addition & 0 deletions envoy-control-core/build.gradle
@@ -7,6 +7,7 @@ dependencies {
implementation group: 'org.jetbrains.kotlin', name: 'kotlin-reflect'
api group: 'io.dropwizard.metrics', name: 'metrics-core', version: versions.dropwizard
api group: 'io.micrometer', name: 'micrometer-core'

implementation group: 'com.google.re2j', name: 're2j', version: versions.re2j

api group: 'io.envoyproxy.controlplane', name: 'server', version: versions.java_controlplane
@@ -10,6 +10,7 @@ import io.envoyproxy.controlplane.server.callback.SnapshotCollectingCallback
import io.grpc.Server
import io.grpc.netty.NettyServerBuilder
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.nio.NioServerSocketChannel
@@ -221,10 +222,12 @@ class ControlPlane private constructor(
nioEventLoopExecutor
)
)
.bossEventLoopGroup(NioEventLoopGroup(
properties.server.nioBossEventLoopThreadCount,
nioBossEventLoopExecutor
))
.bossEventLoopGroup(
NioEventLoopGroup(
properties.server.nioBossEventLoopThreadCount,
nioBossEventLoopExecutor
)
)
.channelType(NioServerSocketChannel::class.java)
.executor(grpcServerExecutor)
.keepAliveTime(properties.server.netty.keepAliveTime.toMillis(), TimeUnit.MILLISECONDS)
@@ -410,7 +413,11 @@ class ControlPlane private constructor(
}

private fun meterExecutor(executor: ExecutorService, executorServiceName: String) {
ExecutorServiceMetrics(executor, executorServiceName, executorServiceName, emptySet())
ExecutorServiceMetrics(
executor,
executorServiceName,
Tags.of("executor", executorServiceName)
)
.bindTo(meterRegistry)
}
}
@@ -11,6 +11,8 @@ import io.micrometer.core.instrument.MeterRegistry
import pl.allegro.tech.servicemesh.envoycontrol.EnvoyControlMetrics
import pl.allegro.tech.servicemesh.envoycontrol.logger
import pl.allegro.tech.servicemesh.envoycontrol.utils.measureBuffer
import pl.allegro.tech.servicemesh.envoycontrol.utils.REACTOR_METRIC
import pl.allegro.tech.servicemesh.envoycontrol.utils.WATCH_TYPE_TAG
import reactor.core.publisher.Flux
import reactor.core.publisher.FluxSink
import java.util.function.Consumer
@@ -34,9 +36,14 @@ internal class GroupChangeWatcher(

fun onGroupAdded(): Flux<List<Group>> {
return groupsChanged
.measureBuffer("group-change-watcher-emitted", meterRegistry)
.measureBuffer("group-change-watcher", meterRegistry)
.checkpoint("group-change-watcher-emitted")
.name("group-change-watcher-emitted").metrics()
.name(REACTOR_METRIC)
.tag(WATCH_TYPE_TAG, "group")
.metrics()
.doOnSubscribe {
logger.info("Watching group changes")
}
.doOnCancel {
logger.warn("Cancelling watching group changes")
}
@@ -12,6 +12,7 @@ import java.util.function.Supplier

import io.envoyproxy.controlplane.server.serializer.DefaultProtoResourcesSerializer
import io.micrometer.core.instrument.Timer
import pl.allegro.tech.servicemesh.envoycontrol.utils.PROTOBUF_CACHE_METRIC

internal class CachedProtoResourcesSerializer(
private val meterRegistry: MeterRegistry,
@@ -27,7 +28,7 @@ internal class CachedProtoResourcesSerializer(
}

private val cache: Cache<Message, Any> = createCache("protobuf-cache")
private val timer = createTimer(reportMetrics, meterRegistry, "protobuf-cache.serialize.time")
private val timer = createTimer(reportMetrics, meterRegistry, PROTOBUF_CACHE_METRIC)

private fun <K, V> createCache(cacheName: String): Cache<K, V> {
return if (reportMetrics) {
@@ -5,6 +5,12 @@ import io.envoyproxy.controlplane.server.DiscoveryServerCallbacks
import io.envoyproxy.envoy.service.discovery.v3.DiscoveryRequest as V3DiscoveryRequest
import io.envoyproxy.envoy.service.discovery.v3.DeltaDiscoveryRequest as V3DeltaDiscoveryRequest
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import pl.allegro.tech.servicemesh.envoycontrol.utils.CONNECTION_TYPE_TAG
import pl.allegro.tech.servicemesh.envoycontrol.utils.CONNECTIONS_METRIC
import pl.allegro.tech.servicemesh.envoycontrol.utils.DISCOVERY_REQ_TYPE_TAG
import pl.allegro.tech.servicemesh.envoycontrol.utils.REQUESTS_METRIC
import pl.allegro.tech.servicemesh.envoycontrol.utils.STREAM_TYPE_TAG
import java.util.concurrent.atomic.AtomicInteger

class MetricsDiscoveryServerCallbacks(private val meterRegistry: MeterRegistry) : DiscoveryServerCallbacks {
@@ -34,9 +40,12 @@ class MetricsDiscoveryServerCallbacks(private val meterRegistry: MeterRegistry)
.map { type -> type to AtomicInteger(0) }
.toMap()

meterRegistry.gauge("grpc.all-connections", connections)
connectionsByType.forEach { (type, typeConnections) ->
meterRegistry.gauge("grpc.connections.${type.name.toLowerCase()}", typeConnections)
meterRegistry.gauge(
CONNECTIONS_METRIC,
Tags.of(CONNECTION_TYPE_TAG, "grpc", STREAM_TYPE_TAG, type.name.lowercase()),
typeConnections
)
}
}

@@ -51,15 +60,29 @@ class MetricsDiscoveryServerCallbacks(private val meterRegistry: MeterRegistry)
}

override fun onV3StreamRequest(streamId: Long, request: V3DiscoveryRequest) {
meterRegistry.counter("grpc.requests.${StreamType.fromTypeUrl(request.typeUrl).name.toLowerCase()}")
meterRegistry.counter(
REQUESTS_METRIC,
Tags.of(
CONNECTION_TYPE_TAG, "grpc",
STREAM_TYPE_TAG, StreamType.fromTypeUrl(request.typeUrl).name.lowercase(),
DISCOVERY_REQ_TYPE_TAG, "total"
)
)
.increment()
}

override fun onV3StreamDeltaRequest(
streamId: Long,
request: V3DeltaDiscoveryRequest
) {
meterRegistry.counter("grpc.requests.${StreamType.fromTypeUrl(request.typeUrl).name.toLowerCase()}.delta")
meterRegistry.counter(
REQUESTS_METRIC,
Tags.of(
CONNECTION_TYPE_TAG, "grpc",
STREAM_TYPE_TAG, StreamType.fromTypeUrl(request.typeUrl).name.lowercase(),
DISCOVERY_REQ_TYPE_TAG, "delta"
)
)
.increment()
}

@@ -7,6 +7,7 @@ import io.envoyproxy.envoy.config.listener.v3.Listener
import io.envoyproxy.envoy.config.route.v3.RouteConfiguration
import io.envoyproxy.envoy.extensions.transport_sockets.tls.v3.Secret
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.Timer
import pl.allegro.tech.servicemesh.envoycontrol.groups.AllServicesGroup
import pl.allegro.tech.servicemesh.envoycontrol.groups.CommunicationMode
@@ -24,6 +25,7 @@ import pl.allegro.tech.servicemesh.envoycontrol.snapshot.resource.endpoints.Envo
import pl.allegro.tech.servicemesh.envoycontrol.snapshot.resource.listeners.EnvoyListenersFactory
import pl.allegro.tech.servicemesh.envoycontrol.snapshot.resource.routes.EnvoyEgressRoutesFactory
import pl.allegro.tech.servicemesh.envoycontrol.snapshot.resource.routes.EnvoyIngressRoutesFactory
import pl.allegro.tech.servicemesh.envoycontrol.utils.SNAPSHOT_FACTORY_SECONDS_METRIC
import java.util.SortedMap

class EnvoySnapshotFactory(
@@ -67,7 +69,12 @@ class EnvoySnapshotFactory(
endpoints = endpoints,
properties = properties.outgoingPermissions
)
sample.stop(meterRegistry.timer("snapshot-factory.new-snapshot.time"))
sample.stop(
meterRegistry.timer(
SNAPSHOT_FACTORY_SECONDS_METRIC,
Tags.of("operation", "new-snapshot", "type", "global")
)
)

return snapshot
}
@@ -155,7 +162,12 @@ class EnvoySnapshotFactory(
val groupSample = Timer.start(meterRegistry)

val newSnapshotForGroup = newSnapshotForGroup(group, globalSnapshot)
groupSample.stop(meterRegistry.timer("snapshot-factory.get-snapshot-for-group.time"))
groupSample.stop(
meterRegistry.timer(
SNAPSHOT_FACTORY_SECONDS_METRIC,
Tags.of("operation", "new-snapshot", "type", "group")
)
)
return newSnapshotForGroup
}
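
The following is an illustration and not part of the commit. Both hunks above record into the same timer name and tell the global and per-group cases apart with a `type` tag. A minimal Kotlin sketch of that `Timer.Sample` pattern, assuming a hypothetical metric name `snapshot.factory.seconds` (the actual value of `SNAPSHOT_FACTORY_SECONDS_METRIC` is not shown in this diff) and a hypothetical wrapper `timed`:

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.Timer
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

// Assumption: placeholder name, not the project's real constant value.
const val SNAPSHOT_TIMER = "snapshot.factory.seconds"

// Hypothetical wrapper: times a block and records it under a shared name with a "type" tag.
// Unlike the diff, this sketch also records the sample when the block throws.
fun <T> timed(registry: MeterRegistry, type: String, block: () -> T): T {
    val sample = Timer.start(registry)
    try {
        return block()
    } finally {
        sample.stop(
            registry.timer(SNAPSHOT_TIMER, Tags.of("operation", "new-snapshot", "type", type))
        )
    }
}

fun main() {
    val registry = SimpleMeterRegistry()
    timed(registry, "global") { Thread.sleep(5) }
    timed(registry, "group") { Thread.sleep(5) }
    timed(registry, "group") { Thread.sleep(5) }
    // One timer per distinct tag combination is registered under the shared name.
    registry.find(SNAPSHOT_TIMER).timers().forEach { timer ->
        println("${timer.id.tags} count=${timer.count()}")
    }
}
```

Sharing one name keeps the two measurements comparable in a single query, at the cost of requiring every call site to pass a consistent set of tag keys.
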
