Unable to start NVML integration #2382

Open
Julia-elsammak opened this issue May 15, 2024 · 6 comments · May be fixed by #2535

Comments

@Julia-elsammak

Julia-elsammak commented May 15, 2024

Output of the info page

When installing the NVML integration, I get the following error:

Loading Errors

nvml
----
  Core Check Loader:
    Check nvml not found in Catalog

  JMX Check Loader:
    check is not a jmx check, or unable to determine if it's so

  Python Check Loader:
    unable to import module 'nvml': No module named 'nvml'

Looking at the debug logs

2024-05-11 18:18:54 CST | CORE | DEBUG | (pkg/collector/python/loader.go:158 in Load) | Unable to load python module - datadog_checks.nvml: unable to import module 'datadog_checks.nvml': Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/nvml/__init__.py", line 5, in <module>
    from .nvml import NvmlCheck
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/nvml/nvml.py", line 16, in <module>
    from .api_pb2 import ListPodResourcesRequest
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/nvml/api_pb2.py", line 25, in <module>
    _LISTPODRESOURCESREQUEST = _descriptor.Descriptor(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/google/protobuf/descriptor.py", line 296, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

To fix this issue:

  • Use the NVIDIA DCGM Exporter:
    This is the recommended approach, as the integration is owned and supported by Datadog. Its documentation includes an example configuration that covers the same functionality as the NVML integration.
    NVIDIA DCGM Exporter: https://docs.datadoghq.com/integrations/dcgm/?tab=hostdocker#overview
  • Update the nvml check. The google/protobuf library isn't installed by the nvml check itself; it is packaged with the Datadog Agent, so the check needs to be updated to resolve this issue. Reference: the nvml manifest.json on GitHub.
  • Downgrade the Agent to v7.50.3. The likely reason this started now is that Agent v7.51.0 upgraded the embedded Python from 3.9 to 3.11, which would also have updated bundled libraries such as google/protobuf.
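
To see which side of that change a given environment is on, a small check of the installed protobuf version can help. This is a minimal sketch, not part of the integration: the function names are made up for illustration, and the 3.20 cutoff is an assumption taken from the error message above, which suggests downgrading protobuf to 3.20.x or lower. It would need to be run with the Agent's embedded Python interpreter to inspect the right site-packages.

```python
from importlib import metadata
from typing import Optional


def rejects_old_generated_code(protobuf_version: str) -> bool:
    """Return True if this protobuf version is newer than 3.20.x.

    Assumption: the cutoff comes from the error message, which lists
    "downgrade the protobuf package to 3.20.x or lower" as a workaround.
    """
    parts = protobuf_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) > (3, 20)


def installed_protobuf_version() -> Optional[str]:
    """Return the installed protobuf version, or None if it is not installed."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


version = installed_protobuf_version()
if version is None:
    print("protobuf is not installed in this environment")
elif rejects_old_generated_code(version):
    print(f"protobuf {version}: old-style _pb2.py files are likely to be rejected")
else:
    print(f"protobuf {version}: old-style _pb2.py files should still import")
```

Note that this only reports the version; with the pure-Python implementation selected through PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION, old generated files may still load on newer versions, as the error message itself points out.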
@basilnsage

Tagging @cep21 and @cswatt who have worked on this before, if you'd be so kind as to have a look please.

@cep21
Contributor

cep21 commented May 18, 2024

All of those fixes seem reasonable. As Datadog is officially supporting the NVIDIA DCGM Exporter now, I've deprecated the nvml plugin internally. It may be best to mark it as deprecated here as well. Someone could also modify the plugin to refuse to install on newer Datadog Agent versions, but I won't have time to contribute this.

@tmart-ops

datadog-agent updates have broken this integration for me as well. I've been able to use the DCGM exporter but it requires running the DCGM exporter container which is less than ideal if it's a machine that doesn't run Docker.

@maxgio92

maxgio92 commented Nov 11, 2024

While it's not an optimal workaround, I've made the check work using the pure-Python Protobuf implementation:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python agent check nvml
...


  Running Checks
  ==============

    nvml (1.0.9)
    ------------
      Instance ID: nvml:b6f35e1900952b0b [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/nvml.yaml
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms
      Last Execution Date : 2024-11-11 12:28:56 UTC (1731328136000)
      Last Successful Execution Date : 2024-11-11 12:28:56 UTC (1731328136000)


  Metadata
  ========
    config.hash: nvml:b6f35e1900952b0b
    config.provider: file
Check has run only once, if some metrics are missing you can try again with --check-rate to see any other metric if available.
This check type has 1 instances. If you're looking for a different check instance, try filtering on a specific one using the --instance-filter flag or set --discovery-min-instances to a higher value

This means that it would need to be applied at the Agent level for all checks, I guess. I'm not aware of a way to use the non-C++ implementation only for this check.
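
For reference, the variable only has to be present in the environment of the process that imports protobuf. Here is a minimal sketch of scoping it to a single command from Python. The helper names are hypothetical, and it assumes the `agent` binary is on the PATH, as in the command above. It does not make the setting per-check inside a running Agent.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional

ENV_VAR = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"


def pure_python_env(base: Mapping[str, str]) -> dict:
    """Return a copy of `base` with the pure-Python protobuf backend selected."""
    env = dict(base)
    env[ENV_VAR] = "python"
    return env


def run_nvml_check() -> Optional[int]:
    """Run `agent check nvml` with the override and return its exit code.

    Returns None when no `agent` binary is found on the PATH.
    """
    agent = shutil.which("agent")
    if agent is None:
        return None
    result = subprocess.run([agent, "check", "nvml"], env=pure_python_env(os.environ))
    return result.returncode
```

Calling `run_nvml_check()` leaves the caller's own environment untouched, since the override lives only in the copied mapping passed to the child process.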

@maxgio92

maxgio92 commented Nov 11, 2024

Trying to solve the issue at the root, I think we can release a new patch version of nvml that regenerates the Python protobuf code, with something like:

$ protoc --proto_path=nvml/datadog_checks/nvml --python_out=nvml/datadog_checks/nvml nvml/datadog_checks/nvml/api.proto
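
After regenerating, a quick sanity check can tell whether the file still uses the old style. This is only a heuristic sketch based on the traceback earlier in this thread, where the failing line in api_pb2.py is a direct `_descriptor.Descriptor(` call. The function names are hypothetical, and the check is a plain text search, not a parse of the file.

```python
import re
from pathlib import Path

# Pattern taken from the failing line in the traceback:
#   _LISTPODRESOURCESREQUEST = _descriptor.Descriptor(
DIRECT_DESCRIPTOR_CALL = re.compile(r"=\s*_descriptor\.Descriptor\(")


def looks_outdated(pb2_source: str) -> bool:
    """Heuristic: True if the generated module builds descriptors directly."""
    return DIRECT_DESCRIPTOR_CALL.search(pb2_source) is not None


def check_file(path: str) -> bool:
    """Apply the heuristic to a generated _pb2.py file on disk."""
    return looks_outdated(Path(path).read_text(encoding="utf-8"))
```

For example, `check_file("nvml/datadog_checks/nvml/api_pb2.py")` should return False after a successful regeneration, assuming that is where the regenerated file ends up.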

@maxgio92 maxgio92 linked a pull request Nov 12, 2024 that will close this issue
@maxgio92

JFI I've opened #2535, tested against Datadog Agent v7.59.0.
