Vector config for better metrics collection performance #10881
-
Hello. I'm testing Vector as a replacement for promtail and ran into performance problems with a straightforward configuration. Logs are collected from nginx in JSON format and sent to Loki at roughly 1500 lines per second. A sample log line:

```json
{
  "msec": "1642071602.041",
  "bytes_sent": "14372",
  "request_uri": "/some-url",
  "http_host": "some.host.example",
  "server_name": "some.server",
  "server_port": "443",
  "request_time": "0.001",
  "request_method": "GET",
  "cdn_upstream": "someupstream",
  "upstream_addr": "127.0.0.1:8081",
  "upstream_status": "200",
  "upstream_cache_status": "EXPIRED"
}
```
My promtail config:

```yaml
server:
  http_listen_port: 30200
  grpc_listen_address: 127.0.0.1
  grpc_listen_port: 30201

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: var_log
    pipeline_stages:
      - json:
          expressions:
            timestamp: msec
            request_time: request_time
            bytes_sent: bytes_sent
            status: status
            upstream_status: upstream_status
            upstream_addr: upstream_addr
            cdn_upstream: cdn_upstream
            upstream_cache_status: upstream_cache_status
      - regex:
          source: timestamp
          expression: "(?P<timestamp>[0-9]+)"
      - timestamp:
          source: timestamp
          format: Unix
          action_on_failure: fudge
      - labels:
          cdn_upstream:
          status:
          upstream_addr:
          upstream_status:
          upstream_cache_status:
      - metrics:
          nginx_bytes_sent:
            type: Counter
            description: "total bytes sent"
            prefix: promtail_custom_
            max_idle_duration: 10m
            source: bytes_sent
            config:
              action: add
          nginx_request_time:
            type: Histogram
            description: "request time ms"
            prefix: promtail_custom_
            max_idle_duration: 10m
            source: request_time
            config:
              buckets: [0.050, 0.100, 0.500, 0.800, 1.0, 1.5, 2.0, 5.0]
          nginx_request:
            type: Counter
            description: "request"
            prefix: promtail_custom_
            max_idle_duration: 10m
            config:
              match_all: true
              action: inc
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx_access_log
          host: myhost
          agent: promtail
          __path__: /var/log/nginx/*_json.log
```

My straightforward Vector config:

```toml
[sources.var_json_log]
type = "file"
read_from = "end"
include = [ "/var/log/nginx/*_json.log" ]
[transforms.parser]
type = "remap"
inputs = [ "var_json_log", "data_json_log" ]
source = """
. |= parse_json!(string!(.message))
del(.message)
"""
[transforms.meter]
type = "log_to_metric"
inputs = [ "parser" ]
[[transforms.meter.metrics]]
field = "bytes_sent"
name = "bytes_sent"
type = "counter"
increment_by_value = true
namespace = "vector_custom_nginx"
[transforms.meter.metrics.tags]
cdn_upstream = "{{ cdn_upstream }}"
status = "{{ status }}"
upstream_addr = "{{ upstream_addr }}"
upstream_cache_status = "{{ upstream_cache_status }}"
upstream_status = "{{ upstream_status }}"
[[transforms.meter.metrics]]
field = "request_time"
name = "request_time"
type = "histogram"
increment_by_value = true
namespace = "vector_custom_nginx"
[transforms.meter.metrics.tags]
cdn_upstream = "{{ cdn_upstream }}"
status = "{{ status }}"
upstream_addr = "{{ upstream_addr }}"
upstream_cache_status = "{{ upstream_cache_status }}"
upstream_status = "{{ upstream_status }}"
[[transforms.meter.metrics]]
field = "msec"
name = "request"
type = "counter"
namespace = "vector_custom_nginx"
[transforms.meter.metrics.tags]
cdn_upstream = "{{ cdn_upstream }}"
status = "{{ status }}"
upstream_addr = "{{ upstream_addr }}"
upstream_cache_status = "{{ upstream_cache_status }}"
upstream_status = "{{ upstream_status }}"
[sinks.loki]
type = "loki"
inputs = [ "parser" ]
endpoint = "http://loki:3100"
encoding = "json"
[sinks.loki.labels]
cdn_upstream = "{{ cdn_upstream }}"
status = "{{ status }}"
upstream_addr = "{{ upstream_addr }}"
upstream_cache_status = "{{ upstream_cache_status }}"
upstream_status = "{{ upstream_status }}"
[sinks.prometheus]
type = "prometheus_exporter"
inputs = [ "meter" ]
address = "0.0.0.0:30201"
buckets = [0.050, 0.100, 0.500, 0.800, 1.0, 1.5, 2.0, 5.0]
```

Because of the performance problems with the straightforward configuration above, I ended up with the following workaround: events are bucketed into 10-second windows with a reduce transform, and the request_time distribution is built in a lua transform:

```toml
[transforms.parser]
type = "remap"
inputs = [ "var_json_log", "data_json_log" ]
drop_on_abort = true
drop_on_error = true
reroute_dropped = true
source = """
# Bucket events into 10-second windows so each reduce group covers at most 10 s.
._let_me_flush = floor(to_float(to_unix_timestamp!(.timestamp)) / 10 ?? 0)
. |= object!(parse_json!(string!(.message)))
del(.message)
# Helper fields consumed by the log_to_metric transform below.
._i = 1
._bytes_sent = to_int(.bytes_sent) ?? 0
._request_time = to_float(.request_time) ?? 0.0
"""
[transforms.reducer]
type = "reduce"
inputs = ["parser"]
expire_after_ms = 8000
group_by = [
"_let_me_flush",
"cdn_upstream",
"file",
"status",
"upstream_addr",
"upstream_cache_status",
"upstream_status"
]
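# Collect every request_time in the window into an array; the lua transform
# below turns that array into a distribution metric.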
[transforms.reducer.merge_strategies]
_request_time = "array"
[transforms.histogram]
type = "lua"
version = "2"
inputs = [ "reducer" ]
[transforms.histogram.hooks]
process = """
function (event, emit)
  -- Count how many times each request_time value occurred in this window.
  local freq = {}
  for _, v in ipairs(event.log._request_time) do
    freq[v] = (freq[v] or 0) + 1
  end
  -- Turn the frequency table into parallel values/sample_rates arrays.
  local sample_rates = {}
  local values = {}
  for k, v in pairs(freq) do
    table.insert(sample_rates, v)
    table.insert(values, k + 0.0)
  end
  -- Emit one incremental distribution metric per reduced group.
  emit({
    metric = {
      name = "request_time",
      namespace = "vector_custom_nginx",
      tags = {
        cdn_upstream = event.log.cdn_upstream,
        file = event.log.file,
        status = event.log.status,
        upstream_addr = event.log.upstream_addr,
        upstream_cache_status = event.log.upstream_cache_status,
        upstream_status = event.log.upstream_status
      },
      timestamp = event.log.timestamp,
      kind = "incremental",
      distribution = {
        sample_rates = sample_rates,
        values = values,
        statistic = "histogram"
      }
    }
  })
end
"""
[transforms.meter]
type = "log_to_metric"
inputs = [ "reducer" ]
[[transforms.meter.metrics]]
field = "_bytes_sent"
name = "bytes_sent"
type = "counter"
increment_by_value = true
namespace = "vector_custom_nginx"
[transforms.meter.metrics.tags]
cdn_upstream = "{{ cdn_upstream }}"
file = "{{ file }}"
status = "{{ status }}"
upstream_addr = "{{ upstream_addr }}"
upstream_cache_status = "{{ upstream_cache_status }}"
upstream_status = "{{ upstream_status }}"
[[transforms.meter.metrics]]
field = "_i"
name = "request"
type = "counter"
increment_by_value = true
namespace = "vector_custom_nginx"
[transforms.meter.metrics.tags]
cdn_upstream = "{{ cdn_upstream }}"
file = "{{ file }}"
status = "{{ status }}"
upstream_addr = "{{ upstream_addr }}"
upstream_cache_status = "{{ upstream_cache_status }}"
upstream_status = "{{ upstream_status }}"
```
Replies: 3 comments
-
Any news here?
-
In answer to the questions: the solution you have here is a pretty good one. You could potentially do the whole thing in the lua transform, since that transform could maintain the histogram in global memory and only emit once every 10 seconds, but that could get complex.

However, the main question is why this is actually having a significant impact on performance. The processing required to reduce the events and generate the histogram shouldn't be less than what the prometheus exporter already does.

It is possible you are running into issue #10635, which has now been fixed. If so, it would be worth trying your original config with the latest version of Vector: either v0.19.2 or v0.20, both due to be released later today.
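Roughly, such a transform could look like the sketch below. Everything in it is illustrative rather than taken from the configs above: the `aggregate` component name, the reduced tag set, and the 10-second interval are assumptions, and only counters are shown; a histogram could be accumulated in the same module-level state and emitted as a distribution in the same flush handler.

```toml
# Hypothetical sketch: accumulate per-tag counters in Lua state and emit
# aggregated metrics from a timer instead of one metric per log event.
[transforms.aggregate]
type = "lua"
version = "2"
inputs = [ "parser" ]
source = """
  -- Module-level state shared by the process hook and the timer handler.
  counts = {}

  function on_event(event, emit)
    local key = tostring(event.log.status) .. "|" .. tostring(event.log.upstream_addr)
    local entry = counts[key] or {
      requests = 0, bytes = 0,
      status = event.log.status, upstream_addr = event.log.upstream_addr
    }
    entry.requests = entry.requests + 1
    entry.bytes = entry.bytes + (tonumber(event.log.bytes_sent) or 0)
    counts[key] = entry
    -- Nothing is emitted here; the timer below flushes the aggregated metrics.
  end

  function flush(emit)
    for _, entry in pairs(counts) do
      local tags = { status = entry.status, upstream_addr = entry.upstream_addr }
      emit({
        metric = {
          name = "request",
          namespace = "vector_custom_nginx",
          timestamp = os.date("!*t"),
          kind = "incremental",
          tags = tags,
          counter = { value = entry.requests }
        }
      })
      emit({
        metric = {
          name = "bytes_sent",
          namespace = "vector_custom_nginx",
          timestamp = os.date("!*t"),
          kind = "incremental",
          tags = tags,
          counter = { value = entry.bytes }
        }
      })
    end
    counts = {}
  end
"""

[transforms.aggregate.hooks]
process = "on_event"

[[transforms.aggregate.timers]]
handler = "flush"
interval_seconds = 10
```

The prometheus_exporter sink would then take `aggregate` as its input instead of `meter`.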
-
Could be solved with this PR (still open): #8749