Migrated metrics to prometheus #434

Merged
merged 11 commits into from
Oct 9, 2024
Changes from 4 commits
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

Lists all changes with user impact.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
## [0.22.2]
### Changed
- Migrated metrics to prometheus

## [0.22.1]
### Changed
56 changes: 21 additions & 35 deletions docs/deployment/observability.md
@@ -5,13 +5,17 @@
Envoy Control uses [SLF4J](https://www.slf4j.org/) with [Logback](https://logback.qos.ch/) for logging.

To override the default settings, point a file via environment variable

```bash
export ENVOY_CONTROL_RUNNER_OPTS="-Dlogging.config=/path/to/logback/logback.xml"
```

and then run the `bin/envoy-control-runner` created from `distZip` task.

`java-control-plane` produces quite a lot of logging on `INFO` level. Consider switching it to `WARN`

```xml

<logger name="io.envoyproxy.controlplane.cache.SimpleCache" level="WARN"/>
<logger name="io.envoyproxy.controlplane.cache.DiscoveryServer" level="WARN"/>
```
@@ -25,55 +29,37 @@ Sample logger configuration is available here.

### Envoy Control

Metric | Description
-----------------------------| -----------------------------------
**services.added** | Counter of added services events
**services.removed** | Counter of removed services events
**services.instanceChanged** | Counter of instance change events
Metric | Description | Labels
----------------------|------------------------------------|
**watched-services** | Counter of watched services events | status (added/removed/instances-changed/snapshot-changed)

Standard [Spring metrics](https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-meter) (JVM, CPU, HTTP server) are also included.
Standard [Spring metrics](https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-meter) (
JVM, CPU, HTTP server) are also included.

### Envoy Control Runner

Envoy Control Runner exposes a set of metrics on standard Spring Actuator's `/actuator/metrics` endpoint.

#### xDS connections

Metric | Description
-----------------------------| --------------------------------------------------------
**grpc.connections.ads** | Number of running gRPC ADS connections
**grpc.connections.cds** | Number of running gRPC CDS connections
**grpc.connections.eds** | Number of running gRPC EDS connections
**grpc.connections.lds** | Number of running gRPC LDS connections
**grpc.connections.rds** | Number of running gRPC RDS connections
**grpc.connections.sds** | Number of running gRPC SDS connections
**grpc.connections.unknown** | Number of running gRPC connections for unknown resource
Metric | Description | Labels
----------------------|----------------------------------------------------|------------------------------------
**grpc.connections** | Number of running gRPC connections of a given type | type (cds/xds/lds/rds/sds/unknown)

#### xDS requests

Metric | Description
------------------------------- | --------------------------------------------------------
**grpc.requests.cds** | Counter of received gRPC CDS requests
**grpc.requests.eds** | Counter of received gRPC EDS requests
**grpc.requests.lds** | Counter of received gRPC LDS requests
**grpc.requests.rds** | Counter of received gRPC RDS requests
**grpc.requests.sds** | Counter of received gRPC SDS requests
**grpc.requests.unknown** | Counter of received gRPC requests for unknown resource
**grpc.requests.cds.delta** | Counter of received gRPC delta CDS requests
**grpc.requests.eds.delta** | Counter of received gRPC delta EDS requests
**grpc.requests.lds.delta** | Counter of received gRPC delta LDS requests
**grpc.requests.rds.delta** | Counter of received gRPC delta RDS requests
**grpc.requests.sds.delta** | Counter of received gRPC delta SDS requests
**grpc.requests.unknown.delta** | Counter of received gRPC delta requests for unknown resource
Metric | Description | Labels
-------------------------|---------------------------------------------------|--------------------------------------------------------------
**grpc.requests.count** | Counter of received gRPC requests of a given type | type (cds/xds/lds/rds/sds/unknown), metric-type(total/delta)

#### Snapshot

Metric | Description
-------------------------| ----------------------------------
**cache.groupCount** | Number of unique groups in SnapshotCache
Metric | Description | Labels
------------------------|------------------------------------------|--------
**cache.groups.count** | Number of unique groups in SnapshotCache | -

#### Synchronization

Metric | Description
----------------------------------------| -------------------------------------------------
**cross-dc-synchronization.$dc.errors** | Counter of synchronization errors for given DC
Metric | Description | Labels
-------------------------------------------|----------------------------------------------------------------|----------------------------------------------
**cross-dc-synchronization.errors.total** | Counter of synchronization errors for a given DC and operation | cluster, operation (get-instances/get-state)
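
To illustrate the tagged naming scheme in the tables above, here is a minimal sketch. It assumes `micrometer-core` is on the classpath and uses an in-memory `SimpleMeterRegistry` rather than the registry the application would use. The metric and tag names come from the diff; the helper `recordRequest` and the `main` function are hypothetical and exist only for this example.

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.simple.SimpleMeterRegistry
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: one counter name, with the request type and mode carried as tags.
fun recordRequest(registry: MeterRegistry, type: String, delta: Boolean) {
    val metricType = if (delta) "delta" else "total"
    registry.counter("grpc.requests.count", Tags.of("type", type, "metric-type", metricType))
        .increment()
}

fun main() {
    val registry = SimpleMeterRegistry()

    recordRequest(registry, "cds", delta = false)
    recordRequest(registry, "cds", delta = false)
    recordRequest(registry, "eds", delta = true)

    // Look up a single time series by name plus tags.
    val cdsTotal = registry.get("grpc.requests.count")
        .tag("type", "cds")
        .tag("metric-type", "total")
        .counter()
        .count()
    println("cds/total = $cdsTotal")

    // Aggregate over every tag combination registered under the same name.
    val all = registry.get("grpc.requests.count").counters().sumOf { it.count() }
    println("all requests = $all")

    // Gauges follow the same pattern. The registry keeps only a weak reference
    // to the number, so the caller holds on to the AtomicInteger.
    val cdsConnections = AtomicInteger(0)
    registry.gauge("grpc.connections", Tags.of("type", "cds"), cdsConnections)
    cdsConnections.incrementAndGet()
    val gaugeValue = registry.get("grpc.connections").tag("type", "cds").gauge().value()
    println("cds connections = $gaugeValue")
}
```

With a Prometheus-backed registry, the same tags would surface as labels on a single metric, so per-type or per-mode totals can be computed at query time instead of being encoded in the metric name.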
@@ -10,6 +10,7 @@ import io.envoyproxy.controlplane.server.callback.SnapshotCollectingCallback
import io.grpc.Server
import io.grpc.netty.NettyServerBuilder
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.nio.NioServerSocketChannel
@@ -221,10 +222,12 @@ class ControlPlane private constructor(
nioEventLoopExecutor
)
)
.bossEventLoopGroup(NioEventLoopGroup(
properties.server.nioBossEventLoopThreadCount,
nioBossEventLoopExecutor
))
.bossEventLoopGroup(
NioEventLoopGroup(
properties.server.nioBossEventLoopThreadCount,
nioBossEventLoopExecutor
)
)
.channelType(NioServerSocketChannel::class.java)
.executor(grpcServerExecutor)
.keepAliveTime(properties.server.netty.keepAliveTime.toMillis(), TimeUnit.MILLISECONDS)
@@ -410,7 +413,12 @@ class ControlPlane private constructor(
}

private fun meterExecutor(executor: ExecutorService, executorServiceName: String) {
ExecutorServiceMetrics(executor, executorServiceName, executorServiceName, emptySet())
ExecutorServiceMetrics(
executor,
executorServiceName,
"envoy-control",
Tags.of("executor", executorServiceName)
)
.bindTo(meterRegistry)
}
}
@@ -1,10 +1,12 @@
package pl.allegro.tech.servicemesh.envoycontrol.server.callbacks

import com.google.common.net.InetAddresses.increment
import io.envoyproxy.controlplane.cache.Resources
import io.envoyproxy.controlplane.server.DiscoveryServerCallbacks
import io.envoyproxy.envoy.service.discovery.v3.DiscoveryRequest as V3DiscoveryRequest
import io.envoyproxy.envoy.service.discovery.v3.DeltaDiscoveryRequest as V3DeltaDiscoveryRequest
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import java.util.concurrent.atomic.AtomicInteger

class MetricsDiscoveryServerCallbacks(private val meterRegistry: MeterRegistry) : DiscoveryServerCallbacks {
@@ -34,9 +36,9 @@ class MetricsDiscoveryServerCallbacks(private val meterRegistry: MeterRegistry)
.map { type -> type to AtomicInteger(0) }
.toMap()

meterRegistry.gauge("grpc.all-connections", connections)
meterRegistry.gauge("grpc.connections", Tags.of("type", "all"), connections)
connectionsByType.forEach { (type, typeConnections) ->
meterRegistry.gauge("grpc.connections.${type.name.toLowerCase()}", typeConnections)
meterRegistry.gauge("grpc.connections", Tags.of("type", type.name.lowercase()), typeConnections)
}
}

@@ -51,15 +53,21 @@ class MetricsDiscoveryServerCallbacks(private val meterRegistry: MeterRegistry)
}

override fun onV3StreamRequest(streamId: Long, request: V3DiscoveryRequest) {
meterRegistry.counter("grpc.requests.${StreamType.fromTypeUrl(request.typeUrl).name.toLowerCase()}")
meterRegistry.counter(
"grpc.requests.count",
Tags.of("type", StreamType.fromTypeUrl(request.typeUrl).name.lowercase(), "metric-type", "total")
)
.increment()
}

override fun onV3StreamDeltaRequest(
streamId: Long,
request: V3DeltaDiscoveryRequest
) {
meterRegistry.counter("grpc.requests.${StreamType.fromTypeUrl(request.typeUrl).name.toLowerCase()}.delta")
meterRegistry.counter(
"grpc.requests.count",
Tags.of("type", StreamType.fromTypeUrl(request.typeUrl).name.lowercase(), "metric-type", "delta")
)
.increment()
}

@@ -7,6 +7,7 @@ import io.envoyproxy.envoy.config.listener.v3.Listener
import io.envoyproxy.envoy.config.route.v3.RouteConfiguration
import io.envoyproxy.envoy.extensions.transport_sockets.tls.v3.Secret
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.Timer
import pl.allegro.tech.servicemesh.envoycontrol.groups.AllServicesGroup
import pl.allegro.tech.servicemesh.envoycontrol.groups.CommunicationMode
@@ -67,7 +68,12 @@ class EnvoySnapshotFactory(
endpoints = endpoints,
properties = properties.outgoingPermissions
)
sample.stop(meterRegistry.timer("snapshot-factory.new-snapshot.time"))
sample.stop(
meterRegistry.timer(
"snapshot-factory.seconds",
Review comment (Collaborator): maybe snapshot_factory to be consistent with other metrics?

Reply (Contributor Author): Yeah, I was looking at some of our other metrics, but it looks like this way would be better. I changed to dots everywhere.

Tags.of("operation", "new-snapshot", "type", "global")
)
)

return snapshot
}
@@ -155,7 +161,12 @@ class EnvoySnapshotFactory(
val groupSample = Timer.start(meterRegistry)

val newSnapshotForGroup = newSnapshotForGroup(group, globalSnapshot)
groupSample.stop(meterRegistry.timer("snapshot-factory.get-snapshot-for-group.time"))
groupSample.stop(
meterRegistry.timer(
"snapshot-factory.seconds",
Review comment (Collaborator): same here - snapshot_factory for consistency

Tags.of("operation", "new-snapshot", "type", "group")
)
)
return newSnapshotForGroup
}
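
The timer change in this file can be exercised in isolation. The following is a minimal sketch, assuming `micrometer-core` is available and using `SimpleMeterRegistry` in place of the application's registry. The metric and tag names are taken from the diff; the `timed` helper is hypothetical and, unlike the code in the diff, stops the sample in a `finally` block so that failing calls are timed too.

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.Timer
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

// Hypothetical helper: times a block and records it under one timer name,
// distinguishing global and per-group snapshots by the "type" tag.
fun <T> timed(registry: MeterRegistry, type: String, block: () -> T): T {
    val sample = Timer.start(registry)
    try {
        return block()
    } finally {
        sample.stop(
            registry.timer(
                "snapshot-factory.seconds",
                Tags.of("operation", "new-snapshot", "type", type)
            )
        )
    }
}

fun main() {
    val registry = SimpleMeterRegistry()

    timed(registry, "global") { Thread.sleep(5) }
    timed(registry, "group") { Thread.sleep(1) }
    timed(registry, "group") { Thread.sleep(1) }

    // Each tag combination is a separate timer under the same name.
    val global = registry.get("snapshot-factory.seconds").tag("type", "global").timer()
    val group = registry.get("snapshot-factory.seconds").tag("type", "group").timer()
    println("global recordings: ${global.count()}")
    println("group recordings: ${group.count()}")
}
```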

@@ -3,6 +3,7 @@ package pl.allegro.tech.servicemesh.envoycontrol.snapshot
import io.envoyproxy.controlplane.cache.SnapshotCache
import io.envoyproxy.controlplane.cache.v3.Snapshot
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.Timer
import pl.allegro.tech.servicemesh.envoycontrol.groups.CommunicationMode.ADS
import pl.allegro.tech.servicemesh.envoycontrol.groups.CommunicationMode.XDS
@@ -50,9 +51,12 @@ class SnapshotUpdater(
// step 2: only watches groups. if groups change we use the last services state and update those groups
groups().subscribeOn(globalSnapshotScheduler)
)
.measureBuffer("snapshot-updater-merged", meterRegistry, innerSources = 2)
.measureBuffer("snapshot.updater.count.total", meterRegistry, innerSources = 2)
.checkpoint("snapshot-updater-merged")
.name("snapshot-updater-merged").metrics()
.name("snapshot.updater.count.total")
.tag("status", "merged")
.tag("type", "global")
Review comment (Contributor): What does this type mean?

.metrics()
// step 3: group updates don't provide a snapshot,
// so we piggyback the last updated snapshot state for use
.scan { previous: UpdateResult, newUpdate: UpdateResult ->
@@ -87,28 +91,40 @@ class SnapshotUpdater(
// see GroupChangeWatcher
return onGroupAdded
.publishOn(globalSnapshotScheduler)
.measureBuffer("snapshot-updater-groups-published", meterRegistry)
.measureBuffer("snapshot-updater.count.total", meterRegistry)
.checkpoint("snapshot-updater-groups-published")
.name("snapshot-updater-groups-published").metrics()
.name("snapshot-updater.count.total")
Review comment (Collaborator): snapshot_updater/snapshot.updater - `.` is mapped to `_`, right?

Reply (Contributor Author): `.` is, `-` isn't

.tag("type", "groups")
.tag("status", "published").metrics()
.map { groups ->
UpdateResult(action = Action.SERVICES_GROUP_ADDED, groups = groups)
}
.onErrorResume { e ->
meterRegistry.counter("snapshot-updater.groups.updates.errors").increment()
meterRegistry.counter(
"snapshot-updater.errors.total",
Tags.of("type", "groups")
)
.increment()
logger.error("Unable to process new group", e)
Mono.justOrEmpty(UpdateResult(action = Action.ERROR_PROCESSING_CHANGES))
}
}

internal fun services(states: Flux<MultiClusterState>): Flux<UpdateResult> {
return states
.name("snapshot-updater-services-sampled").metrics()
.onBackpressureLatestMeasured("snapshot-updater-services-sampled", meterRegistry)
.name("snapshot-updater.count.total")
.tag("type", "services")
.tag("status", "sampled")
.metrics()
.onBackpressureLatestMeasured("snapshot-updater.count.total", meterRegistry)
// prefetch = 1, instead of default 256, to avoid processing stale states in case of backpressure
.publishOn(globalSnapshotScheduler, 1)
.measureBuffer("snapshot-updater-services-published", meterRegistry)
.measureBuffer("snapshot-updater.count.total", meterRegistry) // todo
.checkpoint("snapshot-updater-services-published")
.name("snapshot-updater-services-published").metrics()
.name("snapshot-updater.count.total")
.tag("type", "services")
.tag("status", "published")
.metrics()
.createClusterConfigurations()
.map { (states, clusters) ->
var lastXdsSnapshot: GlobalSnapshot? = null
@@ -135,14 +151,19 @@ class SnapshotUpdater(
}
.filter { it != emptyUpdateResult }
.onErrorResume { e ->
meterRegistry.counter("snapshot-updater.services.updates.errors").increment()
meterRegistry.counter(
"snapshot-updater.errors.total",
Tags.of("type", "services")
).increment()
logger.error("Unable to process service changes", e)
Mono.justOrEmpty(UpdateResult(action = Action.ERROR_PROCESSING_CHANGES))
}
}

private fun snapshotTimer(serviceName: String) = if (properties.metrics.cacheSetSnapshot) {
meterRegistry.timer("snapshot-updater.set-snapshot.$serviceName.time")
meterRegistry.timer(
"simple-cache.duration.seconds", Tags.of("service", serviceName, "operation", "set-snapshot")
)
} else {
noopTimer
}
@@ -154,12 +175,15 @@ class SnapshotUpdater(
cache.setSnapshot(group, groupSnapshot)
}
} catch (e: Throwable) {
meterRegistry.counter("snapshot-updater.services.${group.serviceName}.updates.errors").increment()
meterRegistry.counter(
"snapshot-updater.errors.total", Tags.of("service", group.serviceName)
).increment()
logger.error("Unable to create snapshot for group ${group.serviceName}", e)
}
}

private val updateSnapshotForGroupsTimer = meterRegistry.timer("snapshot-updater.update-snapshot-for-groups.time")
private val updateSnapshotForGroupsTimer =
meterRegistry.timer("snapshot-updater.duration.seconds", Tags.of("type", "groups"))

private fun updateSnapshotForGroups(
groups: Collection<Group>,
@@ -174,10 +198,13 @@ class SnapshotUpdater(
} else if (result.xdsSnapshot != null && group.communicationMode == XDS) {
updateSnapshotForGroup(group, result.xdsSnapshot)
} else {
meterRegistry.counter("snapshot-updater.communication-mode.errors").increment()
logger.error("Requested snapshot for ${group.communicationMode.name} mode, but it is not here. " +
"Handling Envoy with not supported communication mode should have been rejected before." +
" Please report this to EC developers.")
meterRegistry.counter("snapshot-updater.errors.total", Tags.of("type", "communication-mode"))
.increment()
logger.error(
"Requested snapshot for ${group.communicationMode.name} mode, but it is not here. " +
"Handling Envoy with not supported communication mode should have been rejected before." +
" Please report this to EC developers."
)
}
}
return results.then(Mono.fromCallable {