OpenWhisk distinguishes between system and user metrics (events).
System metrics typically contain information about system performance; they can be sent to Kamon or written to log files in logmarker format. These metrics are typically used by OpenWhisk providers/operators.
User metrics encompass information about action performance, which is sent to Kafka in the form of events. These metrics are intended to be consumed by OpenWhisk users, but they can also be used for billing or audit purposes. Note that at the moment the events are not directly exposed to users and require an additional Kafka-consumer-based micro-service for data processing (a minimal consumer sketch is included at the end of this document).
Both capabilities can be enabled or disabled separately during deployment via Ansible configuration in the 'group_vars/all' file of an environment.
There are five configuration options available:
- `metrics_log` [true / false (default: true)]
  Enable/disable whether the metric information is written out to the log files in logmarker format.
  Beware: even if set to false, all messages using the log markers are still written out to the log.
- `metrics_kamon` [true / false (default: false)]
  Enable/disable whether metric information is sent to the configured StatsD server.
- `metrics_kamon_tags` [true / false (default: false)]
  Enable/disable whether Kamon tags are used when sending metrics.
  Notice: tags are supported by only some Kamon backends (OpenTSDB, Datadog, InfluxDB).
- `metrics_kamon_statsd_host` [hostname or ip address]
  Hostname or IP address of the StatsD server.
- `metrics_kamon_statsd_port` [port number (default: 8125)]
  Port number of the StatsD server.
Example configuration:
metrics_kamon: true
metrics_kamon_tags: false
metrics_kamon_statsd_host: '192.168.99.100'
metrics_kamon_statsd_port: '8125'
metrics_log: true
The Kamon project provides an integrated Docker image containing StatsD and a connected Grafana dashboard via this GitHub project. This image is helpful for testing the metrics sent via StatsD.
Please follow these instructions to start the Docker image in your local Docker environment.
The Docker image exposes StatsD via the (standard) port 8125 and a Grafana dashboard via port 8080 on your Docker host. The address of your Docker host has to be configured in the `metrics_kamon_statsd_host` configuration property.
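If you just want to verify that metrics actually reach the configured StatsD endpoint, you can also listen for the raw StatsD datagrams directly. The following is a minimal sketch (not part of OpenWhisk) that binds the standard StatsD UDP port 8125 and prints whatever arrives; run it on the host configured as `metrics_kamon_statsd_host` (and make sure no other StatsD server is bound to that port at the same time).

```python
# Minimal StatsD "sniffer" for debugging which metrics OpenWhisk emits.
# Assumption: port 8125 matches metrics_kamon_statsd_port and nothing else
# is already bound to it on this host.
import socket

HOST, PORT = "0.0.0.0", 8125  # standard StatsD UDP port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind((HOST, PORT))
print(f"Listening for StatsD datagrams on {HOST}:{PORT} ...")

while True:
    data, addr = sock.recvfrom(65535)
    # A datagram may contain one or more newline-separated StatsD lines,
    # e.g. "openwhisk.counter.controller_activation_start:1|c".
    for line in data.decode("utf-8", errors="replace").splitlines():
        print(addr[0], line)
```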
All metric names are prefixed with a prefix that you specify and are subject to modification by Graphite, Datadog, or StatsD. For example, if the prefix used is `openwhisk`, then metric names look like `openwhisk.counter.controller_activation_start`. This document assumes that the metric name prefix is `openwhisk`.
Currently OpenWhisk emits the following types of metrics:
Counters record the count of a metric and their names are prefixed with `openwhisk.counter`, for example `openwhisk.counter.controller_activation_start`. A counter just counts and resets to zero upon each flush.
Histograms record the distribution of a given metric and their names are prefixed with `openwhisk.histogram`, for example `openwhisk.histogram.controller_activation_finish`. A histogram metric may result in multiple values at the metric aggregator level. For example, in Datadog the following values are recorded for each histogram metric:
- `my_metric.avg` - average of aggregated values during the flush interval
- `my_metric.count` - count of aggregated values during the flush interval
- `my_metric.median` - median of aggregated values during the flush interval
- `my_metric.95percentile` - 95th percentile of aggregated values during the flush interval
- `my_metric.max` - max of aggregated values during the flush interval
- `my_metric.min` - min of aggregated values during the flush interval
Gauges record the current value of a given metric and their names are prefixed with `openwhisk.gauge`, for example `openwhisk.gauge.loadbalancer_totalHealthyInvoker_counter`. A gauge metric provides the value at a given point in time and keeps reporting the same value until it is incremented or decremented. Gauges are useful for reporting metrics like Kafka queue size or disk size.
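Putting the prefix convention and the three metric types together, the short sketch below (an illustration, not part of OpenWhisk) classifies a StatsD line of the form `name:value|type` by its `openwhisk.counter` / `openwhisk.histogram` / `openwhisk.gauge` name prefix. The sample lines and the type suffixes (`c`, `ms`, `g`) are assumptions for illustration; the exact wire format depends on the Kamon reporter and backend.

```python
# Classify OpenWhisk metric names by the prefix convention described above.
PREFIX = "openwhisk"  # this document assumes the metric name prefix is "openwhisk"

def classify(statsd_line: str) -> str:
    """Return 'counter', 'histogram', 'gauge' or 'unknown' for one StatsD line."""
    name = statsd_line.split(":", 1)[0]
    for kind in ("counter", "histogram", "gauge"):
        if name.startswith(f"{PREFIX}.{kind}."):
            return kind
    return "unknown"

# Sample lines for illustration only; real values depend on your deployment.
samples = [
    "openwhisk.counter.controller_activation_start:1|c",
    "openwhisk.histogram.controller_activation_finish:42|ms",
    "openwhisk.gauge.loadbalancer_totalHealthyInvoker_counter:3|g",
]
for line in samples:
    print(classify(line), "->", line)
```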
Below are some of the important metrics emitted by an OpenWhisk setup.
The metrics below are emitted from within a Controller instance.
- `openwhisk.counter.controller_startup<controller_id>_counter` (counter) - Records the count of controller instance startups. Example: `openwhisk.counter.controller_startup0_counter`
- `openwhisk.counter.controller_blockingActivationDatabaseRetrieval_counter` (counter) - Records the count of activations the controller has retrieved from the activation store during blocking invocations.
The following metrics record stats around activation handling within the Controller.
- Normal actions
  - `openwhisk.counter.controller_activation_start` (counter) - Records the count of non-blocking activations started.
  - `openwhisk.histogram.controller_activation_finish` (histogram) - Records the overall time taken for a non-blocking activation to be submitted to the load balancer.
- Blocking actions
  - `openwhisk.counter.controller_blockingActivation_start` (counter) - Records the count of blocking activations started.
  - `openwhisk.histogram.controller_blockingActivation_finish` (histogram) - Records the time taken for a blocking activation to finish or time out.
Aggregate metrics for in-flight activations:
- `openwhisk.gauge.loadbalancer<controllerId>_activationsInflight_counter` (gauge) - Records the number of activations currently being worked on by a given controller. As a gauge, it reports the in-flight activation count at a given point in time.
- `openwhisk.gauge.loadbalancer<controllerId>_memory<invokerType>Inflight_counter` (gauge) - Records the amount of memory in use for in-flight activations. This is not actual runtime memory but the memory specified in the per-action limits. invokerType defines whether it is a managed or a blackbox invoker.
The metrics below reflect the current memory capacity:
- `openwhisk.histogram.loadbalancer_totalCapacity<invokerType>_counter` (histogram) - Current memory capacity for all usable managed and blackbox invokers, i.e. the total user memory in the shard managed by the controller. invokerType defines whether it is a managed or a blackbox invoker.
The metrics below are captured within the load balancer:
- `openwhisk.counter.loadbalancer_activations_counter` (counter) - Records the count of activations sent to Kafka.
- `openwhisk.counter.controller_kafka_start` (counter) - Records the count of activations sent to Kafka.
- `openwhisk.counter.controller_kafka_error` (counter) - Records the count of activations which encountered a failure while being submitted to Kafka.
- `openwhisk.histogram.controller_kafka_finish` (histogram) - Records the time taken when an activation was successfully submitted to Kafka.
- `openwhisk.histogram.controller_kafka_error` (histogram) - Records the time taken when an activation submission to Kafka resulted in failure.
- `openwhisk.counter.controller_loadbalancer_start` (counter) - Records the count of activations submitted to the load balancer.
- `openwhisk.histogram.controller_loadbalancer_finish` (histogram) - Records the time taken to submit an activation to the load balancer.
The metrics below reflect invoker state as recorded within the load balancer's invoker monitoring. For all of them, invokerType defines whether it is a managed or a blackbox invoker.
- `openwhisk.gauge.loadbalancer_totalHealthyInvoker<invokerType>_counter` (gauge) - Records the count of invokers considered healthy based on health pings.
- `openwhisk.gauge.loadbalancer_totalUnresponsiveInvoker<invokerType>_counter` (gauge) - Records the count of invokers considered unresponsive: health pings arrive fine but the invokers do not respond with active-acks in the given time.
- `openwhisk.gauge.loadbalancer_totalOfflineInvoker<invokerType>_counter` (gauge) - Records the count of invokers considered offline: no health pings arrive from the invokers.
- `openwhisk.gauge.loadbalancer_totalUnhealthyInvoker<invokerType>_counter` (gauge) - Records the count of invokers considered unhealthy: health pings arrive fine but the invokers report system errors.
The metrics below provide information about completion ack processing in the load balancers. Depending on the configuration setting `metrics_kamon_tags` (see above), either a base metric with tags or a set of metrics without tags will be emitted.
- Base metric `openwhisk.counter.loadbalancer_completionAck_counter`: count of processed regular or forced completion acks.
  - Tag `controller_id`: the controller's id.
  - Tag `type`: the exact type of completion ack.
    - Type `regular`: a regular completion ack sent by an invoker and received in time. Does not include completion acks for healthcheck actions.
    - Type `forced`: no completion ack was received in time and the timeout forced the completion ack to close.
    - Type `healthcheck`: a regular completion ack for healthcheck actions sent by an invoker and received in time.
    - Type `regularAfterForced`: a regular completion ack sent by an invoker and not received in time; the completion ack was already forced.
    - Type `forcedAfterRegular`: a timeout tries to force a completion ack that has already been closed by a regular completion ack. This is a race condition that can occur if the regular completion ack is received near the timeout.
- If `metrics_kamon_tags` is set to false, a set of metrics without tags is emitted instead, constructed using the following scheme: `openwhisk.counter.loadbalancer<controller_id>_completionAck_<type>_counter`, for example `openwhisk.counter.loadbalancer0_completionAck_regular_counter`.
The metrics below are emitted from within an Invoker instance.
- `openwhisk.counter.invoker_activationInit_start` (counter) - Count of container initializations done.
- `openwhisk.histogram.invoker_activationInit_finish` (histogram) - Time taken for successful container initializations.
- `openwhisk.histogram.invoker_activationInit_error` (histogram) - Time taken for failed container initializations. The count of this histogram gives insight into the number of failed initializations.
- `openwhisk.counter.invoker_activationRun_start` (counter) - Count of action executions performed.
- `openwhisk.histogram.invoker_activationRun_finish` (histogram) - Time taken for successful action executions.
- `openwhisk.histogram.invoker_activationRun_error` (histogram) - Time taken for failed action executions. The count of this histogram gives insight into the number of failed executions.
- `openwhisk.counter.invoker_containerStart.cold_counter` (counter) - Count of cold starts.
- `openwhisk.counter.invoker_containerStart.recreated_counter` (counter) - Count of times a container is recreated.
- `openwhisk.counter.invoker_containerStart.warm_counter` (counter) - Count of times a warm container is used.
- `openwhisk.counter.invoker_collectLogs_start` (counter) - Count of times logs were collected.
- `openwhisk.counter.invoker_collectLogs_error` (counter) - Count of failed log collections.
- `openwhisk.histogram.invoker_collectLogs_error` (histogram) - Time taken for failed log collections.
- `openwhisk.histogram.invoker_collectLogs_finish` (histogram) - Time taken for successful log collections.
- `openwhisk.counter.invoker_activation_start` (counter) - Count of activations handled.
The following metrics capture stats around various Docker command executions.
- pause
openwhisk.counter.invoker_docker.pause_start
openwhisk.counter.invoker_docker.pause_error
openwhisk.counter.invoker_docker.pause_timeout
openwhisk.histogram.invoker_docker.pause_finish
openwhisk.histogram.invoker_docker.pause_error
- ps
openwhisk.counter.invoker_docker.ps_start
openwhisk.counter.invoker_docker.ps_error
openwhisk.counter.invoker_docker.ps_timeout
openwhisk.histogram.invoker_docker.ps_finish
openwhisk.histogram.invoker_docker.ps_error
- pull
openwhisk.counter.invoker_docker.pull_start
openwhisk.counter.invoker_docker.pull_error
openwhisk.counter.invoker_docker.pull_timeout
openwhisk.histogram.invoker_docker.pull_finish
openwhisk.histogram.invoker_docker.pull_error
- rm
openwhisk.counter.invoker_docker.rm_start
openwhisk.counter.invoker_docker.rm_error
openwhisk.counter.invoker_docker.rm_timeout
openwhisk.histogram.invoker_docker.rm_finish
openwhisk.histogram.invoker_docker.rm_error
- run
openwhisk.counter.invoker_docker.run_start
openwhisk.counter.invoker_docker.run_error
openwhisk.counter.invoker_docker.run_timeout
openwhisk.histogram.invoker_docker.run_finish
openwhisk.histogram.invoker_docker.run_error
- unpause
openwhisk.counter.invoker_docker.unpause_start
openwhisk.counter.invoker_docker.unpause_error
openwhisk.counter.invoker_docker.unpause_timeout
openwhisk.histogram.invoker_docker.unpause_finish
openwhisk.histogram.invoker_docker.unpause_error
The metrics below are emitted per Kafka topic.
- `openwhisk.histogram.kafka_<topic name>.delay_start` - Time delay between when a message was pushed to Kafka and when it is read by a consumer. This metric is recorded for every message read.
- `openwhisk.gauge.kafka_<topic name>_counter` - Records the queue size of the topic. By default this metric is emitted every 60 seconds.
Metrics per topic:
- `cacheInvalidation` - Emitted per controller while reading the cache invalidation messages.
  - `openwhisk.histogram.kafka_cacheInvalidation.delay_start`
  - `openwhisk.histogram.kafka_cacheInvalidation_counter.count`
- `health` - Emitted per controller while reading the invoker health pings.
  - `openwhisk.histogram.kafka_health.delay_start`
  - `openwhisk.histogram.kafka_health_counter`
- `completed<controllerId>` - Topic to receive completed activations. This is emitted per controller for its own topic. For example, for controller id 0 the metric names would be:
  - `openwhisk.histogram.kafka_completed0.delay_start`
  - `openwhisk.histogram.kafka_completed0_counter`
- `invoker<invokerId>` - Topic to receive activations to complete. This is emitted per invoker for its own topic. For example, for invoker id 0 the metric names would be:
  - `openwhisk.histogram.kafka_invoker0_counter`
  - `openwhisk.histogram.kafka_invoker0.delay_start`
- `openwhisk.counter.database_cacheHit_counter` - Count of cache hits.
- `openwhisk.counter.database_cacheMiss_counter` - Count of cache misses.
The metrics below are emitted for database related operations and follow this pattern:
- `openwhisk.counter.database_<operation type>_start` - Count of database operations of the given type. Example: `openwhisk.counter.database_getDocument_start`
- `openwhisk.counter.database_<operation type>_error` - Count of database operations of the given type which resulted in an error. Example: `openwhisk.counter.database_getDocument_error`
- `openwhisk.histogram.database_<operation type>_finish` - Time taken for successful completion of the given database operation. Example: `openwhisk.histogram.database_getDocument_finish`
- `openwhisk.histogram.database_<operation type>_error` - Time taken for failed completion of the given database operation. Example: `openwhisk.histogram.database_getDocument_error`
Operation Types
deleteDocument
getDocument
queryView
saveDocument
saveDocumentBulk
When the database used is CosmosDB, metrics related to CosmosDB Resource Units (RU) are also emitted.
If Kamon tags are enabled, the metric name is `openwhisk.counter.cosmosdb_ru_used` with the following tags:
- `mode` - `read` or `write`
- `collection` - name of the collection, for example `activations`, `whisks` and `subjects`
- `action` - type of operation performed, for example `get`, `put`, `del`, `query` and `count`
If Kamon tags are not enabled, the metric name is of the form `openwhisk.counter.cosmosdb.ru.<collection>.<action>`, for example `openwhisk.counter.cosmosdb.ru.whisks.get`.
User metrics are enabled by default and can be explicitly disabled by setting the following property in one of the Ansible configuration files:
user_events: false
Activation is an event that occurs after each activation. It includes the following execution metadata:
waitTime - internal system hold time
initTime - time it took to initialize an action, e.g. docker init
statusCode - status code of the invocation: 0 - success, 1 - application error, 2 - action developer error, 3 - internal OpenWhisk error
duration - actual time the action code was running
kind - action flavor, e.g. Node.js
conductor - true for conductor backed actions
memory - maximum memory allowed for action container
causedBy - contains the "causedBy" annotation (can be "sequence" or nothing at the moment)
size - size (in bytes) of the invocation response
userDefinedStatusCode - the `statusCode` set in the result response (if not set, this field will not be present)
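As a quick illustration of how this metadata can be used, the sketch below derives a few values from a single Activation event body. It is only a sketch under the assumptions stated in the comments (in particular that a non-zero initTime indicates a cold start and that the three durations share the same unit); adjust it to your own interpretation of the fields.

```python
# Illustrative only: derive a few values from one Activation event body.
# Assumptions: initTime > 0 means the action container had to be initialized
# (cold start), and waitTime/initTime/duration are reported in the same unit.
activation_body = {
    "statusCode": 0,
    "duration": 3,
    "waitTime": 12,
    "initTime": 0,
    "kind": "nodejs:6",
    "memory": 256,
}

cold_start = activation_body["initTime"] > 0
total_time = activation_body["waitTime"] + activation_body["initTime"] + activation_body["duration"]
success = activation_body["statusCode"] == 0

print(f"cold start: {cold_start}, total time: ~{total_time}, success: {success}")
```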
Metric is any user-specific event produced by the system and at the moment includes the following information:
ConcurrentRateLimit - a user has exceeded its limit for concurrent invocations.
TimedRateLimit - the user has reached its per minute limit for the number of invocations.
ConcurrentInvocations - the number of in flight invocations per user.
Example events that could be consumed from Kafka. Activation:
{
"body": {
"statusCode": 0,
"duration": 3,
"name": "whisk.system/invokerHealthTestAction0",
"waitTime": 583915671,
"conductor": false,
"kind": "nodejs:6",
"initTime": 0,
"memory": 256,
"size": 463,
"causedBy": false
},
"eventType": "Activation",
"source": "invoker0",
"subject": "whisk.system",
"timestamp": 1524476122676,
"userId": "d0888ad5-5a92-435e-888a-d55a92935e54",
"namespace": "whisk.system"
}
Metric:
{
"body": {
"metricName": "ConcurrentInvocations",
"metricValue": 1
},
"eventType": "Metric",
"source": "controller0",
"subject": "guest",
"timestamp": 1524476104419,
"userId": "23bc46b1-71f6-4ed5-8c54-816aa4f8c502",
"namespace": "guest"
}
All user metrics can be consumed and published to various services such as Prometheus, Datadog, etc. via Kamon by using the user-events service.
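If the user-events service does not fit your needs, a small custom consumer can read the events directly from Kafka. The sketch below is only an illustration, not part of OpenWhisk: it assumes the kafka-python client, a broker reachable at localhost:9093, and that user events are published to a topic named events; the broker address and topic name depend on your deployment.

```python
# Minimal sketch of a Kafka consumer for OpenWhisk user events (illustrative only).
# Assumptions: kafka-python is installed, and the broker address and the topic
# name ("events") match your deployment's configuration.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                              # assumed user-events topic name
    bootstrap_servers=["localhost:9093"],  # adjust to your Kafka broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

for record in consumer:
    event = record.value
    body = event.get("body", {})
    if event.get("eventType") == "Activation":
        print(f"{event.get('namespace')}/{body.get('name')}: "
              f"duration={body.get('duration')} statusCode={body.get('statusCode')}")
    elif event.get("eventType") == "Metric":
        print(f"{event.get('namespace')}: {body.get('metricName')}={body.get('metricValue')}")
```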