Logs are being lost from a service running on Kubernetes. #1545

Open
slopezxrd opened this issue Nov 27, 2024 · 3 comments

slopezxrd commented Nov 27, 2024

Describe the bug

Our environment runs in a private Kubernetes cluster, and Fluentd is used to monitor the logs.
The Fluentd configuration reads the log files generated by Kubernetes in /var/logs/containers/.
There are several microservices, and a different regex is applied to each one to extract the information into fields;
the Kubernetes metadata is added with @type kubernetes_metadata. Finally, these logs are written to two ELK clusters and to a
local file. In case of failure there is a secondary output into which the failed records are inserted.

One of the microservices has high traffic and generates a large volume of logs.
Every so often, we observe that traces are lost both in the ELKs and in the files.

It is suspected that this has to do with the fast rotation of the log files. At some point, a rotated file is finished and,
instead of continuing with the next rotated file, Fluentd continues with the current log file.
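
For reference, a possible mitigation, offered only as a sketch and not verified against this workload: in_tail has parameters aimed at exactly this rotation race, namely follow_inodes, rotate_wait and pos_file_compaction_interval. The values below are illustrative:

    <source>
      @type tail
      path /var/logs/containers/**app-ms**.log
      pos_file /var/log/fluentd-containers.log.pos
      # track files by inode so a rotated file is not mistaken for the new one
      follow_inodes true
      # keep reading a rotated file for a while before switching away from it
      rotate_wait 30
      # periodically drop stale entries from the position file
      pos_file_compaction_interval 72h
      tag kubernetes.*
      read_from_head true
      <parse>
        @type none
      </parse>
    </source>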

To Reproduce

Fast rotation of the log files.

Expected behavior

Log processing keeps up with the rotation of the files, so that this lag does not occur.
- If the lag does occur temporarily, all traces are still inserted, even if they arrive late.

Your Environment

fluent/fluentd-kubernetes-daemonset:v1.16.2-debian-elasticsearch7-1.0 as the main image.
The following images have also been tested:
 - v1.16.2-debian-elasticsearch7-1.0
 - v1.16.2-debian-elasticsearch7-1.1
 - v1.16.2-debian-elasticsearch8-1.0
 - v1.16.2-debian-elasticsearch8-1.1
 - v1.16.3-debian-elasticsearch8-1.0
 - v1.16.3-debian-elasticsearch8-2.0
 - v1.16.3-debian-elasticsearch8-2.1
 - v1.16.5-debian-elasticsearch7-1.3
 - v1.16.5-debian-elasticsearch8-1.3
 - v1.17-debian-elasticsearch8-1

Your Configuration

    <source>
      @type tail
      @id in_tail_container_logs
      @label @KUBERNETES
      path /var/logs/containers/**app-ms**.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      time_key time
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_type string
          time_format %Y-%m-%dT%H:%M:%S.%N%z
          keep_time_key true
        </pattern>
        <pattern>
          format regexp
          expression /^(?<log>.*)$/
          keep_time_key false
        </pattern>
      </parse>
      emit_unmatched_lines true
    </source>

    <label @KUBERNETES>
      <match kubernetes.var.log.containers.fluentd**>
        @type relabel
        @label @FLUENT_LOG
      </match>

      <match kubernetes.var.log.containers.**_kube-system_**>
        @type null
        @id ignore_kube_system_logs
      </match>

      <match fluent.**>
        # this tells fluentd to not output its log on stdout
        @type null
      </match>

      <filter kubernetes.**>
        @type concat
        key log
        multiline_start_regexp /^\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3}/
        continuous_line_regexp /^(?!\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3})/
        separator ""
        flush_interval 5
        timeout_label @LOGS
      </filter>
      <match **>
        @type relabel
        @label @LOGS
      </match>
    </label>

    <label @LOGS>
      <filter kubernetes.**app-ms**>
        @type parser
        key_name log
        <parse>
          @type regexp
          expression ^(?<log>((?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}.\d{3})\s+(?<level>[^\s]+)?\s+(?<pid>\d+).*?\[\s*(?<thread>.*)\]\s+(?<class>.*)\s+:\s+(?<message>((?<idTraza>[\w-]+)\s-\s)?((?<operacion>[\w-]+)\.?\s-\s)?((?<caja>\w+)\.?\s-)?(.*?codigo":"(?<codigo>\w+?)")?(.*?texto":"(?<texto>[^"]+)")?(.*?txStatus":"(?<txStatus>[^"]+)")?(.*tiempo:\s(?<ms>\d+)\sms)?.*))?(\s|.)*)
        </parse>
      </filter>

      <filter kubernetes.**>
        @type kubernetes_metadata
        merge_json_log true
        preserve_json_log true
      </filter>

      <filter kubernetes.**>
        @type grep
        <regexp>
            key $.kubernetes.labels.fluentd
            pattern true
        </regexp>
      </filter>

       <match **>
        @type relabel
        @label @OUTPUT
       </match>
    </label>

  04_outputs.conf: |-
    <label @OUTPUT>
      <match **app-ms**>
        @type copy
        <store>
          @include elastic.conf_common
          data_stream_name log-pods-pro.app-ms
          <secondary>
            @type secondary_file
            directory /var/log/trazas/error/app
            basename error.app.logs
          </secondary>
        </store>
        <store>
          @include new_elastic.conf_common
          data_stream_name log-pods-pro.app-ms
          <secondary>
            @type secondary_file
            directory /var/log/trazas/error/app
            basename error.app.logs
          </secondary>
        </store>
        <store>
          @include file_save.conf_common
          path /var/log/trazas/file/app
        </store>
      </match>
    </label>
  elastic.conf_common: |-
    @type elasticsearch_data_stream
    host "url-elastic.es"
    port 443
    path ""
    request_timeout 20s
    scheme https
    user fluentd_pro
    password *************
    data_stream_name log-pods-pro.app-ms
    ssl_verify false
    include_tag_key true
    reload_connections true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reconnect_on_error true
    reload_on_failure false
    flush_interval 5s
    num_threads 2
    logstash_format false
    utc_index false
    time_key time
    time_key_exclude_timestamp true
    include_timestamp true
    <buffer>
      flush_thread_count 2
      flush_interval 5s
      retry_max_times 1
    </buffer>
  new_elastic.conf_common: |-
    @type elasticsearch_data_stream
    host "new_url-elastic.es"
    port 443
    path ""
    request_timeout 20s
    scheme https
    user fluentd_pro
    password *************
    data_stream_name log-pods-pro.app-ms
    ssl_verify false
    include_tag_key true
    reload_connections true
    sniffer_class_name Fluent::Plugin::ElasticsearchSimpleSniffer
    reconnect_on_error true
    reload_on_failure false
    flush_interval 5s
    num_threads 2
    logstash_format false
    utc_index false
    time_key time
    time_key_exclude_timestamp true
    include_timestamp true
    <buffer>
      flush_thread_count 2
      flush_interval 5s
      retry_max_times 1
    </buffer>
  file_save.conf_common: |-
    @type file
    compress gzip
    <format>
      localtime true
    </format>
    <buffer time>
      timekey_wait 1m
      timekey 10m
      timekey_use_utc false
    </buffer>
    <inject>
      time_format %Y%m%dT%H%M%S%z
      localtime false
    </inject>

Your Error Log

No error log.

Additional context

Tests performed, without success:

Only inserting into the local file
Only inserting into the ELKs
Managing the log file and its rotation manually
Modifying the Fluentd configuration (the source, and the buffer in the output); a buffer sketch follows this list
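
As an aside (an observation, not a confirmed cause): with retry_max_times 1 in the output buffers, a chunk whose single retry fails is handed to the secondary_file output or discarded, so a temporary Elasticsearch slowdown can also show up as missing documents. A sketch of a more tolerant buffer, with hypothetical values:

    <buffer>
      @type file
      path /var/log/fluentd-buffers/app-ms
      flush_thread_count 2
      flush_interval 5s
      # retry for longer before giving a chunk up to the secondary output
      retry_type exponential_backoff
      retry_max_times 10
      retry_max_interval 30s
      # block the input instead of raising a buffer overflow error when full
      overflow_action block
    </buffer>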

Observations:
When there is a large volume of logs, the insertion into ELK visibly falls behind,
until it stops writing and then resumes with entries from a few minutes later.
The files rotate every few minutes (3-4 minutes).
In one of the tests performed, the position file was found to contain two entries with the same file identifier:
one of the newly created files had been given the same identifier as a file that had already been deleted (but was still
kept in the position file).
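
For illustration only, with hypothetical file names and values: an in_tail position file keeps one line per watched path with the byte offset and the inode in hexadecimal, so two entries sharing the same identifier would look like this:

    /var/logs/containers/app-ms-abc123.log  0000000000a1b2c3  000000000004cdc3
    /var/logs/containers/app-ms-def456.log  0000000000000000  000000000004cdc3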

@Watson1978

I guess this is related to fluent/fluentd#4693 ...

@Watson1978 Watson1978 self-assigned this Dec 4, 2024
@Watson1978 Watson1978 moved this from Triage to Work-In-Progress in Fluentd Kanban Dec 4, 2024

jicowan commented Dec 12, 2024

@slopezxrd I've experienced a similar issue that you can read about in fluent/fluentd#4693. I have temporarily worked around the issue by increasing the maximum log file size (the default is 10MB) on the kubelet. Increasing the size reduces the frequency at which logs are rotated. I suspect the issue is related to inode tracking when logs are rotated quickly and often. I've noticed that when logs go missing there's usually a warning in the fluentd logs that says Skip update_watcher because watcher has been already updated by other inotify event. I've tried using different types of buffers, e.g. file, hybrid, and memory; increasing the memory and CPU; using multiple workers; etc. None of them seem to have an impact. At significantly high volumes (>1000 log lines per second per container) fluentd has trouble reading and tracking logs when they are rotated (I have had up to 10 of these containers running on 1 instance).
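
For reference, the kubelet settings involved are containerLogMaxSize and containerLogMaxFiles in the KubeletConfiguration; the values below are only an example:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # rotate container logs at 100 MiB instead of the 10 MiB default
    containerLogMaxSize: 100Mi
    # keep up to 5 rotated files per container (the default)
    containerLogMaxFiles: 5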

@slopezxrd
Author

@jicowan
I changed my Java program to write its logs to files not controlled by Kubernetes, where I can determine the rotation and size of the log files.
When I pointed Fluentd at these logs instead of the Kubernetes logs, configured not to rotate at all, I still lost traces in my ELK and in the other file output.

Increasing the size of the file is not a solution, because it only postpones the error to another moment; the problem is that Fluentd reads more slowly than the logs are written to the files, as I have seen in other tests, and it skips some files in the middle.

Increasing the size to reduce the number of files is a good option, but in Kubernetes, where you have multiple instances running at once, you have multiple logs and rotations.

While I don't have another solution, I have increased the size to mitigate the impact in my environment.

Thank you so much
