Prometheus exporter for SMART and OCP C0 Log Page #2189

Open
jmhands opened this issue Jan 21, 2024 · 16 comments

@jmhands

jmhands commented Jan 21, 2024

Is there a roadmap for native integration of a Prometheus exporter? I saw some changes coming to the JSON output, and it would be good to align any exporters on a specific format. My suggestion would be to track drives by "sn", with "mn" and "fw" as additional info, all taken from sudo nvme id-ctrl /dev/nvme1n1 -o json, and then have a standard option for exporting statistics to Prometheus from sudo nvme smart-log /dev/nvme1n1 -o json and the OCP log page C0.

A small issue is that the C0 log page isn't available in older releases, but together with the normal smart-log it should be the most helpful log, since "Physical media units written" makes it possible to calculate WAF for workloads. Other data in the OCP log will be useful for tracking fleet health across many NVMe SSDs.

It works with the latest AppImage provided:
sudo ./nvme-cli-latest-x86_64.AppImage ocp smart-add-log /dev/nvme0n1 -o json
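
A minimal sketch of the suggestion above, assuming the JSON from id-ctrl and smart-log has already been parsed into dictionaries. The key names in the sample data ("fr" for the firmware revision, "temperature", and so on), the nvme metric prefix and the helper names are assumptions for illustration and should be checked against real nvme-cli output:

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(id_ctrl, smart_log, prefix="nvme"):
    """Render one info metric plus one sample per numeric smart-log field.

    Every sample is keyed by the drive serial number; model and firmware
    only appear on the info metric so they are not repeated on every series.
    """
    sn = escape_label(str(id_ctrl.get("sn", "")).strip())
    mn = escape_label(str(id_ctrl.get("mn", "")).strip())
    fw = escape_label(str(id_ctrl.get("fr", "")).strip())  # assumed key name
    lines = [f'{prefix}_info{{sn="{sn}",mn="{mn}",fw="{fw}"}} 1']
    for key, value in sorted(smart_log.items()):
        # Skip nested objects and strings; bool is excluded explicitly
        # because it is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f'{prefix}_smart_{key}{{sn="{sn}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical sample data, not captured from a real drive.
sample_id_ctrl = {"sn": "S123ABC   ", "mn": "ExampleDrive 1TB", "fr": "1.0.0"}
sample_smart_log = {"temperature": 310, "percent_used": 3, "media_errors": 0}
print(render_metrics(sample_id_ctrl, sample_smart_log), end="")
```

Keeping the model and firmware on a separate info metric means a firmware update changes one series rather than every series for that drive.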

@igaw
Collaborator

igaw commented Jan 22, 2024

smart-log uses log page 0x02 as defined in the NVMe base spec. ocp smart-add-log reports the content of the vendor-specific log page 0xc0.

If I understand you correctly, you would like to have a command which fuses the output of both log pages. I don't think we should mingle the existing commands.

Though first we need to figure out if we should solve this at the level of nvme-cli (I suppose this is your question about a roadmap).

Anyway, couldn't this be something on top of nvme-cli which does the right thing? Technically, we could solve this as yet another plugin.

Thoughts?

@arthurshau @keithbusch @hreinecke

@jmhands
Author

jmhands commented Jan 22, 2024

I wrote a sample exporter with ChatGPT (not very good, but it works):
https://github.com/jmhands/nvme_exporter/tree/main
There are some third-party attempts, but if one came from nvme-cli directly I could make some standard Grafana dashboards. The use case I have in mind is graphing WAF over time for various workloads and monitoring the health of an SSD fleet.

@igaw
Collaborator

igaw commented Jan 23, 2024

Thanks for the Python code; it helps to understand what you really need as input for the integration.

I think we should first define a schema. Could you post a complete JSON-formatted one here?

@jmhands
Author

jmhands commented Jan 23, 2024

Sure. For the POC I made pretty much everything that wasn't static a gauge, but we may want some of the items in the SMART log as counters, such as errors or things that reset to zero after a reboot. I think it would be fairly straightforward: use exactly the same names that nvme-cli and NVM Express use for the NVMe SMART log. For the OCP log page, you can see the spec here: https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-5-pdf (section 4.8.6 for the C0 log page). Most SSD vendors now support this on data center SSDs, and the nvme-cli ocp plugin already parses it just fine. For SSDs that don't support it, just export the smart-log. Some of the other exporters tried getting SSD info from nvme list, but it is better to get the model, serial, and firmware from nvme id-ctrl, or you could get them from libnvme directly with the identify command.
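
As a rough illustration of the gauge-versus-counter split described above, the sketch below maps a few smart-log fields to a metric type and emits the matching # TYPE line. Which fields count as counters is the open question here, so the table, the field names and the nvme_smart prefix are placeholders rather than a proposed schema:

```python
# Hypothetical classification; treat this table as a placeholder.
METRIC_TYPES = {
    "temperature": "gauge",
    "avail_spare": "gauge",
    "percent_used": "gauge",
    "media_errors": "counter",
    "power_cycles": "counter",
    "data_units_written": "counter",
}


def render_typed(smart_log, sn, prefix="nvme_smart"):
    """Emit '# TYPE' metadata followed by the sample for each known field."""
    lines = []
    for key, mtype in METRIC_TYPES.items():
        if key not in smart_log:
            continue  # the drive or the nvme-cli version did not report it
        # Common Prometheus convention: counters carry a _total suffix.
        name = f"{prefix}_{key}" + ("_total" if mtype == "counter" else "")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f'{name}{{sn="{sn}"}} {smart_log[key]}')
    return "\n".join(lines) + "\n"


print(render_typed({"temperature": 310, "media_errors": 0}, sn="S123ABC"), end="")
```

Fields missing from a drive's output are simply skipped, which matches the idea of exporting only the smart-log when the OCP page is not supported.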

@dobbi84

dobbi84 commented Feb 6, 2024

There are several GitHub projects out there based on parsing nvme-cli command output, and I have been using the Node textfile exporter for some time. I have recently updated it with parallel execution and included a Grafana dashboard. This approach has its own limitations.

I think the following would avoid yet another exporter that is aligned with neither the NVMe specifications nor Prometheus:

  • NVMe CLI has a golang library for reporting the SMART information
  • The NVMe Prometheus exporter is developed under the Node exporter project, for example added as an nvme-smart collector.

To calculate things like WAF, which do not directly have an opcode, it could be possible to use Prometheus recording rules.
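
The arithmetic that such a rule would encode can be sketched in Python. This assumes that data_units_written from smart-log counts units of 1000 × 512 bytes, as the NVMe base spec defines, and that the OCP "Physical media units written" value is in bytes; the function and argument names are hypothetical:

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes (NVMe SMART log definition).
BYTES_PER_DATA_UNIT = 512 * 1000


def waf(host_units_start, host_units_end, media_bytes_start, media_bytes_end):
    """Write amplification over an interval: media bytes / host bytes.

    Returns None when the host wrote nothing, since the ratio is undefined.
    """
    host_bytes = (host_units_end - host_units_start) * BYTES_PER_DATA_UNIT
    media_bytes = media_bytes_end - media_bytes_start
    if host_bytes <= 0:
        return None
    return media_bytes / host_bytes


# Hypothetical readings taken at the start and end of a workload.
print(waf(1_000, 3_000, 5_000_000_000, 7_048_000_000))
```

Working on deltas rather than lifetime totals gives the WAF of a specific workload window, which is what a graph over time needs.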

@jmhands
Author

jmhands commented Feb 10, 2024

> There are several GitHub projects out there based on parsing nvme-cli command output, and I have been using the Node textfile exporter for some time. I have recently updated it with parallel execution and included a Grafana dashboard. This approach has its own limitations.
>
> I think the following would avoid yet another exporter that is aligned with neither the NVMe specifications nor Prometheus:
>
>   • NVMe CLI has a golang library for reporting the SMART information
>   • The NVMe Prometheus exporter is developed under the Node exporter project, for example added as an nvme-smart collector.
>
> To calculate things like WAF, which do not directly have an opcode, it could be possible to use Prometheus recording rules.

Node exporter is great and would be a good place to add this. It would be nice to have the flexibility to enable more log pages (like the OCP one I mentioned) for predictive failure and health monitoring in large deployments, but even the base NVMe smart-log in node exporter would be awesome. I found that the smart exporter for smartmontools doesn't handle SAS properly, for example, which is why I want to do this right from nvme-cli instead of smartmontools.

@dswarbrick

> The NVMe Prometheus exporter is developed under the Node exporter project, for example added as an nvme-smart collector.

Just to avoid any confusion, the existing nvme collector in node_exporter exposes information gleaned from sysfs, i.e., that which is exposed by the kernel.

It is against node_exporter policy to call external binaries. The textfile collector mechanism is exempt from this, since the apps which generate the textfile metrics are not called by node_exporter itself.
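
A minimal sketch of the writer side of that mechanism: a separate script produces the metrics and publishes them as a .prom file for the textfile collector to pick up. Writing to a temporary file and renaming it is a common way to keep a reader from seeing a half-written file; the function name and file name are hypothetical:

```python
import os
import tempfile


def write_textfile(directory, name, content):
    """Atomically publish `content` as <directory>/<name>.prom.

    The data is written to a temporary file in the same directory and then
    renamed, so a concurrent reader never sees a half-written file.
    """
    final_path = os.path.join(directory, f"{name}.prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=f".{name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, final_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:  # stand-in directory
        print(write_textfile(demo_dir, "nvme", 'nvme_smart_temperature{sn="S123ABC"} 310\n'))
```

In a real setup the directory would be whatever path the textfile collector is configured to read, and the script would be run periodically, for example from cron or a systemd timer.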

@dobbi84

dobbi84 commented Feb 15, 2024

> It is against node_exporter policy to call external binaries. The textfile collector mechanism is exempt from this, since the apps which generate the textfile metrics are not called by node_exporter itself.

I suspected as much. The exporter should probably have some sort of alignment with the nvme-cli release; I wonder whether it should be an exporter or a script for the textfile extension.

@igaw
Collaborator

igaw commented Aug 1, 2024

I understand there is no need to add a new output command, as the node exporter is not going to call any binaries anyway. So good to close? Or am I missing something?

@dobbi84

dobbi84 commented Aug 1, 2024

> I understand there is no need to add a new output command, as the node exporter is not going to call any binaries anyway. So good to close? Or am I missing something?

I think that @jmhands had a wish list in his comment #2189 (comment)

@igaw
Collaborator

igaw commented Aug 1, 2024

I am confused. I understood that the POC in the comment parses the output of nvme list, nvme smart-log and nvme ocp smart-add-log, and that the wish was for a single command which provides all the information the logger needs. But the node exporter project doesn't want to depend on a binary. So there is 'no need' for a super smart-log command.

Or is there still general interest in such a command for non-node-exporter setups? If so, I'd say we should define the expected output. I don't know what is needed here, so I need this input first. I think it's possible to implement it without too much hassle (yeah, I know, famous last words...).

@dobbi84

dobbi84 commented Aug 1, 2024

> I am confused. I understood that the POC in the comment parses the output of nvme list, nvme smart-log and nvme ocp smart-add-log, and that the wish was for a single command which provides all the information the logger needs. But the node exporter project doesn't want to depend on a binary. So there is 'no need' for a super smart-log command.
>
> Or is there still general interest in such a command for non-node-exporter setups? If so, I'd say we should define the expected output. I don't know what is needed here, so I need this input first. I think it's possible to implement it without too much hassle (yeah, I know, famous last words...).

We can still parse the output through another exporter like the one created by @jmhands.

@igaw
Collaborator

igaw commented Aug 2, 2024

After a bit of pondering, I would like to avoid mixing the output of different commands. id-ctrl issues the corresponding NVMe command, and that is what is printed out. The same is true for smart-log. These are the low-level commands from the spec and I'd like to keep them this way.

If you want a single command which gives you all the necessary information in one go, I am fine with introducing a new command for this (plugin?). And this means we are back to my original question: do we need this, and what would the output look like?

@sbates130272
Contributor

@jmhands @igaw where did we land on this? @jmhands are you still considering working up a PR for this? Or did you find an existing solution out there that works, which we could maybe reference in the documentation and close this PR?

@igaw
Collaborator

igaw commented Nov 8, 2024

As I said in my last comment, I don't like the idea of changing the existing low-level commands, as they match the spec.

I don't mind introducing a user-friendly version which provides the summary. As I don't know what you expect, I'd like to hear what the expected output is. Can you come up with something?

@sbates130272
Contributor

sbates130272 commented Nov 8, 2024

@igaw I did some digging and I don't think anything is needed from nvme-cli. I think what we want to do is update the nvme textfile collector to issue the command that obtains the OCP log page and format the result as needed. I see that on Ubuntu 22.04 the nvme collector is now on by default. I am not sure if the same is true for other modern distros.

Let me do some more digging. @jmhands do you have anything to add here? Note that the nvme textfile collector in node exporter does parse the smart-log log page and display its output for scraping.
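
One piece such a collector update would need is turning the human-readable OCP field titles into valid metric names. A rough sketch, assuming the ocp plugin's JSON has been parsed into a dictionary; the nvme_ocp prefix, the helper names and the sample fields are hypothetical:

```python
import re


def metric_name(field, prefix="nvme_ocp"):
    """Turn a human-readable field title into a valid Prometheus metric name."""
    slug = re.sub(r"[^a-zA-Z0-9]+", "_", field).strip("_").lower()
    return f"{prefix}_{slug}"


def flatten_numeric(log_page, sn):
    """Yield one sample line per top-level numeric field of the parsed log page."""
    for field, value in log_page.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # nested or textual fields need field-specific handling
        yield f'{metric_name(field)}{{sn="{sn}"}} {value}'


# Hypothetical excerpt; the real field titles come from the ocp plugin output.
sample = {"Physical media units written": 2048000000, "Example text field": "ignored"}
for line in flatten_numeric(sample, sn="S123ABC"):
    print(line)
```

Fields that the plugin reports as nested objects or strings are skipped here and would need their own conversion.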
