Handle thermal_zone errors gracefully #2980

Open
scotts-tp opened this issue Mar 29, 2024 · 4 comments

@scotts-tp

Host operating system:

Linux 5.10.104-tegra #18 SMP PREEMPT aarch64 aarch64 aarch64 GNU/Linux

node_exporter version:

1.7.0

node_exporter command line flags:

--path.rootfs=/host

node_exporter log output

...
caller=collector.go:169 level=error msg="collector failed" name=thermal_zone duration_seconds=0.01870677 err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"
caller=collector.go:169 level=error msg="collector failed" name=thermal_zone duration_seconds=0.001411717 err="read /sys/class/thermal/thermal_zone10/temp: invalid argument"
...

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

Running node_exporter in a Docker container on a custom embedded device.

What did you expect to see?

Disabled thermal zones either ignored or optionally filtered out.

What did you see instead?

The entire thermal_zone collector fails for all thermal_zones.

When a thermal zone is disabled, which can be determined via /sys/class/thermal/thermal_zone10/mode, it would be nice for node_exporter to handle it gracefully, whether natively or via a flag, or to allow specific files/devices to be filtered out manually instead of an entire class of devices.

My temporary workaround has been to use the Pushgateway with a curl container in my docker compose file, like so:

  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: unless-stopped
    networks:
      - metrics
  curl_thermals:
    image: curlimages/curl
    container_name: curl_thermals
    command: '/bin/sh /pushgateway-thermal-zones.sh'
    pid: host
    restart: unless-stopped
    volumes:
      - /:/host:ro,rslave
      - ./pushgateway-thermal-zones.sh:/pushgateway-thermal-zones.sh:ro,rslave
    networks:
      - metrics

With this pushgateway-thermal-zones.sh script:

while true
do 
    output="# TYPE thermal_zone gauge\n# HELP thermal_zone Thermal zone temperatures in Celsius\n"

    # Loop through each thermal zone directory in /host/sys/class/thermal
    for zone in /host/sys/class/thermal/thermal_zone*; do
        # Check if the thermal zone is enabled by reading the mode file
        mode=$(cat "${zone}/mode")
        if [ "${mode}" = "enabled" ]; then
            zone_number=$(basename "${zone}" | sed 's/thermal_zone//')
            zone_type=$(cat "${zone}/type")
            zone_temp=$(cat "${zone}/temp")
            zone_temp_scaled=$(echo "scale=2; ${zone_temp} / 1000.0" | bc)

            # Append the details to the output variable
            output="${output}thermal_zone{zone=\"${zone_number}\", type=\"${zone_type}\"} ${zone_temp_scaled}\n"
        fi
    done

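    # Quote the variable so whitespace is preserved; printf '%b' expands the \n escapes portably
    printf '%b' "${output}" | curl -s --data-binary @- http://pushgateway:9091/metrics/job/thermal_zones/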
    sleep 3
done
@Kylea650

Kylea650 commented Apr 10, 2024

The error seems to be coming from here: https://github.com/prometheus/procfs/blob/69fc8f61debb3bd7efca3a9a1c295d4012022830/sysfs/class_thermal.go#L73 / https://github.com/prometheus/procfs/blob/69fc8f61debb3bd7efca3a9a1c295d4012022830/sysfs/class_thermal.go#L52. Maybe there should be a check there for whether the error is os.ErrInvalid, and then either return an empty ClassThermalZonesStat{} or ignore it. Another option could be to check the mode for ‘disabled’ first in parseClassThermalZone() and return early.

Not sure how to achieve this directly from node_exporter.

@discordianfish
Member

@Kylea650 Checking mode for disabled sounds like a good option. If anyone wants to submit a PR to sysfs, feel free to ping me there.

@Kylea650

Kylea650 commented Apr 15, 2024

@discordianfish Happy to raise a new issue mentioning this one and PR over in sysfs this week. Cheers!

@parthlaw

Is this issue still open?
