
[kubernetes] Collector cannot verify tls: CERTIFICATE_VERIFY_FAILED #288

@poswald

Description


When deploying to an IBM Cloud cluster (Kubernetes 1.9), the Datadog collector's kubernetes check does not work. This has been reported to Datadog support as ticket #129722.

It is still not clear to me whether this is an issue with IBM or with Datadog.

There is a workaround: after disabling TLS verification (`kubelet_tls_verify` in `/etc/dd-agent/conf.d/kubernetes.yaml`), the check connects fine. However, we cannot run this way in production. The bearer token path `/var/run/secrets/kubernetes.io/serviceaccount/token` appears to be populated correctly:

root@dd-agent-mbshp:/# ls -al /var/run/secrets/kubernetes.io/serviceaccount/token
lrwxrwxrwx 1 root root 12 Feb 20 03:01 /var/run/secrets/kubernetes.io/serviceaccount/token -> ..data/token
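
For reference, the workaround looks roughly like this. This is a sketch of `/etc/dd-agent/conf.d/kubernetes.yaml`, not our exact file; the port value is the standard kubelet read-only default, shown for illustration:

```yaml
# /etc/dd-agent/conf.d/kubernetes.yaml (sketch)
init_config:

instances:
  - port: 10255
    # Workaround only: skip verification of the kubelet's TLS certificate.
    # Not acceptable for production.
    kubelet_tls_verify: false
```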

**Output of the info page:**

# /etc/init.d/datadog-agent info
2018-02-20 04:02:23,039 | DEBUG | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
2018-02-20 04:02:23,345 | DEBUG | dd.collector | utils.cloud_metadata(cloud_metadata.py:77) | Collecting GCE Metadata failed HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /computeMetadata/v1/?recursive=true (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x2b39fa0afa50>, 'Connection to 169.254.169.254 timed out. (connect timeout=0.3)'))
2018-02-20 04:02:23,348 | DEBUG | dd.collector | docker.auth.auth(auth.py:227) | Trying paths: ['/root/.docker/config.json', '/root/.dockercfg']
2018-02-20 04:02:23,348 | DEBUG | dd.collector | docker.auth.auth(auth.py:234) | No config file found
====================
Collector (v 5.22.0)
====================

  Status date: 2018-02-20 04:02:21 (2s ago)
  Pid: 38
  Platform: Linux-4.4.0-109-generic-x86_64-with-debian-8.10
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log

  Clocks
  ======

    NTP offset: -0.0047 s
    System UTC time: 2018-02-20 04:02:23.861592

  Paths
  =====

    conf.d: /etc/dd-agent/conf.d
    checks.d: Not found

  Hostnames
  =========

    socket-hostname: dd-agent-mbshp
    hostname: kube-tok02-crbf29e27a18ff4db58ff3873f3c748f61-w1.cloud.ibm
    socket-fqdn: dd-agent-mbshp

  Checks
  ======

    ntp (1.0.0)
    -----------
      - Collected 0 metrics, 0 events & 0 service checks

    disk (1.1.0)
    ------------
      - instance #0 [OK]
      - Collected 52 metrics, 0 events & 0 service checks

    network (1.4.0)
    ---------------
      - instance #0 [OK]
      - Collected 50 metrics, 0 events & 0 service checks

    docker_daemon (1.8.0)
    ---------------------
      - instance #0 [OK]
      - Collected 216 metrics, 0 events & 1 service check

    kubernetes (1.5.0)
    ------------------
      - initialize check class [ERROR]: Exception('Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.',)

  Emitters
  ========

    - http_emitter [OK]

2018-02-20 04:02:26,458 | DEBUG | dd.dogstatsd | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
====================
Dogstatsd (v 5.22.0)
====================

  Status date: 2018-02-20 04:02:20 (6s ago)
  Pid: 27
  Platform: Linux-4.4.0-109-generic-x86_64-with-debian-8.10
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log

  Flush count: 358
  Packet Count: 1980
  Packets per second: 2.0
  Metric count: 20
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.22.0)
====================

  Status date: 2018-02-20 04:02:30 (0s ago)
  Pid: 20
  Platform: Linux-4.4.0-109-generic-x86_64-with-debian-8.10
  Python Version: 2.7.14, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log

  Queue Size: 611 bytes
  Queue Length: 1
  Flush Count: 1110
  Transactions received: 883
  Transactions flushed: 882
  Transactions rejected: 0
  API Key Status: API Key is valid


======================
Trace Agent (v 5.22.0)
======================

  Pid: 18
  Uptime: 3642 seconds
  Mem alloc: 958344 bytes

  Hostname: dd-agent-mbshp
  Receiver: 0.0.0.0:8126
  API Endpoint: https://trace.agent.datadoghq.com

  --- Receiver stats (1 min) ---


  --- Writer stats (1 min) ---

  Traces: 0 payloads, 0 traces, 0 bytes
  Stats: 0 payloads, 0 stats buckets, 0 bytes
  Services: 0 payloads, 0 services, 0 bytes

**Output of the collector log:**

root@dd-agent-mbshp:/# head -100 /var/log/datadog/collector.log
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.cloud_metadata(cloud_metadata.py:77) | Collecting GCE Metadata failed HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /computeMetadata/v1/?recursive=true (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x2ab4b4a3aad0>, 'Connection to 169.254.169.254 timed out. (connect timeout=0.3)'))
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | docker.auth.auth(auth.py:227) | Trying paths: ['/root/.docker/config.json', '/root/.dockercfg']
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | docker.auth.auth(auth.py:234) | No config file found
2018-02-20 03:02:15 UTC | INFO | dd.collector | utils.pidfile(pidfile.py:35) | Pid file is: /opt/datadog-agent/run/dd-agent.pid
2018-02-20 03:02:15 UTC | INFO | dd.collector | collector(agent.py:559) | Agent version 5.22.0
2018-02-20 03:02:15 UTC | INFO | dd.collector | daemon(daemon.py:234) | Starting
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | checks.check_status(check_status.py:163) | Persisting status to /opt/datadog-agent/run/CollectorStatus.pickle
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
2018-02-20 03:02:15 UTC | DEBUG | dd.collector | utils.subprocess_output(subprocess_output.py:54) | Popen(['grep', 'model name', '/host/proc/cpuinfo'], stderr = <open file '<fdopen>', mode 'w+b' at 0x2ab4b4a09780>, stdout = <open file '<fdopen>', mode 'w+b' at 0x2ab4b4a096f0>) called
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | collector(kubeutil.py:264) | Couldn't query kubelet over HTTP, assuming it's not in no_auth mode.
2018-02-20 03:02:16 UTC | WARNING | dd.collector | collector(kubeutil.py:273) | Couldn't query kubelet over HTTP, assuming it's not in no_auth mode.
2018-02-20 03:02:16 UTC | ERROR | dd.collector | collector(kubeutil.py:209) | Failed to initialize kubelet connection. Will retry 0 time(s). Error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)
2018-02-20 03:02:16 UTC | INFO | dd.collector | config(config.py:998) | no bundled checks.d path (checks provided as wheels): /opt/datadog-agent/agent/checks.d
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1012) | No sdk integrations path found
2018-02-20 03:02:16 UTC | ERROR | dd.collector | config(config.py:1076) | Unable to initialize check kubernetes
Traceback (most recent call last):
  File "/opt/datadog-agent/agent/config.py", line 1060, in _initialize_check
    agentConfig=agentConfig, instances=instances)
  File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubernetes/kubernetes.py", line 106, in __init__
    raise Exception('Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.')
Exception: Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded kubernetes
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded docker_daemon
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded ntp
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded disk
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded network
2018-02-20 03:02:16 UTC | DEBUG | dd.collector | config(config.py:1177) | Loaded agent_metrics
2018-02-20 03:02:16 UTC | INFO | dd.collector | config(config.py:973) | Fetching service discovery check configurations.
2018-02-20 03:02:16 UTC | ERROR | dd.collector | utils.service_discovery.sd_docker_backend(sd_docker_backend.py:123) | kubelet client not initialized, cannot retrieve pod list.
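
To narrow down whether the cluster CA actually signs the kubelet's serving certificate, a manual check from inside the agent pod can help. This is a diagnostic sketch; the kubelet port 10250 and the mounted CA path are standard Kubernetes defaults, and the node IP placeholder is an assumption, not something confirmed from this cluster:

```shell
# Run inside the dd-agent pod, using the pod's service-account credentials.
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat "$SA/token")

# Node IP where the kubelet listens, e.g. injected via the Downward API
# (status.hostIP). Placeholder value; substitute your node's address.
KUBELET_HOST=<node-ip>

# If this fails with "certificate verify failed", the kubelet's serving
# cert is not signed by the CA in $SA/ca.crt (common when the kubelet
# uses a self-signed cert), which would match the agent's error.
curl --cacert "$SA/ca.crt" \
     -H "Authorization: Bearer $TOKEN" \
     "https://$KUBELET_HOST:10250/healthz"

# For comparison, -k skips verification, mirroring kubelet_tls_verify: false.
curl -k -H "Authorization: Bearer $TOKEN" "https://$KUBELET_HOST:10250/healthz"
```

If the first command fails while the second succeeds, the problem is the trust chain between the cluster CA and the kubelet certificate rather than the token.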

Steps to reproduce the issue:

  1. Create a new cluster
  2. Deploy the dd-agent.yml as documented

Describe the results you received:

Infrastructure host info in Datadog console reports:

Datadog's kubernetes integration is reporting:
Instance #initialization [ERROR]: Exception('Unable to initialize Kubelet client. Try setting the host parameter. The Kubernetes check failed permanently.',)
