Support a choice between delta from the first call versus previous call in variorum_get_energy_json #575

Open
tpatki opened this issue Aug 3, 2024 · 6 comments

@tpatki
Member

tpatki commented Aug 3, 2024

Update the API to support nested calls in general, especially in Caliper-like tools. This might be useful for the Kokkos-update as well.

Merge #559 and #563 first, and then add a new flag to the API.

Current suggestion:
variorum_get_energy_json(char** s) will be updated to variorum_get_energy_json(char** s, bool prev_delta).
Setting prev_delta to true returns the energy accumulated since the previous call to variorum_get_energy_json from the application's or tool's context. Setting it to false returns the energy accumulated since the first call to variorum_get_energy_json.
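
A minimal sketch of the two modes, for illustration only. The `energy_ctx` struct and `energy_delta` function are hypothetical names, not Variorum code, and the sketch assumes the first call returns 0 and that readings arrive as a monotonically increasing microjoule value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-caller bookkeeping; not Variorum's actual data structure. */
typedef struct {
    bool     initialized; /* has the first call happened yet? */
    uint64_t first_uj;    /* reading taken at the first call */
    uint64_t prev_uj;     /* reading taken at the most recent call */
} energy_ctx;

/* Returns microjoules relative to the previous call (prev_delta == true)
 * or relative to the first call (prev_delta == false).
 * Assumption: the first call only records a baseline and returns 0. */
uint64_t energy_delta(energy_ctx *ctx, uint64_t now_uj, bool prev_delta)
{
    assert(ctx != NULL);
    if (!ctx->initialized) {
        ctx->initialized = true;
        ctx->first_uj = now_uj;
        ctx->prev_uj = now_uj;
        return 0;
    }
    uint64_t base = prev_delta ? ctx->prev_uj : ctx->first_uj;
    ctx->prev_uj = now_uj; /* every call becomes the new "previous" call */
    return now_uj - base;
}
```

With readings of 1000, 1500, 1800 and 2000 uJ, calling with prev_delta set to true, true, false, true would yield 0, 500, 800 and 200 uJ. Note that in this sketch every call updates the "previous" baseline, regardless of which mode it was made in; whether nested callers should share that baseline is one of the design questions for the real API.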

  • @dbo: Creating this issue to track our discussion.
  • @masterleinad: Tagging you so you are aware of this upcoming change, I might need your help with testing on Intel GPUs as we don't have access to them at our end.
  • @tjeter @rountree Keeping you in the loop with this discussion as it is relevant to some of your research.

I will work on an initial WIP PR as soon as I can, hoping to get this merged in by end of August. Happy to take any feedback and suggestions on this.

@tpatki tpatki self-assigned this Aug 3, 2024
@rountree
Collaborator

rountree commented Aug 3, 2024

@tpatki How does prev_delta=false handle counter rollover? Are we guaranteed to have a thread that's sampling in the background often enough to detect that?

@tpatki
Member Author

tpatki commented Aug 4, 2024

Hi @rountree That's a great question, and it will vary by the underlying architecture. See details below.

(I am hoping these notes will also help @dbo understand the challenges at our end and why supporting this will take some time.)

  • On IBM systems, there are no MSRs/counters, so there is no issue of rollover. The hardware does not report energy directly, so we are already sampling instantaneous power using the OPAL file system interface every 250ms at the moment. On a related note, my recent experiments with Caliper tell me we should sample faster than that; IBM recommends 100ms and up. Will create an issue and update this sampling rate soon.

  • On GPUs (NVIDIA, AMD and Intel), energy values are reported directly from the underlying APIs (NVML, RSMI, APMI). See here as an example. These report energy values since the GPU driver was last loaded, typically. GPU vendors do not expose the counters directly to us, so we don't have to deal with overflows ourselves and can rely on their APIs. Currently, Variorum v0.8 does not support GPU energy reporting -- mostly due to lack of time/resources at our end. We have some WIP PRs on this, see Add GPU Energy APIs #559 and Add Intel GPU Energy APIs #563.

  • On AMD CPUs (Milan and up), AMD provides ESMI library and the amd_energy kernel module (they also support msr-safe 👍 ). Here too, we do not interact with MSRs directly and rely on AMD's open source ESMI APIs for the processor, e.g. esmi_socket_energy_get, which are part of the port for Variorum that AMD contributed.

  • That leaves us with our most complex scenario on Intel CPUs. Here, we are directly using low-level MSR interfaces. We already have the infrastructure for MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS in Variorum that we utilize to print_energy (and the same for JSON APIs). This is where I believe we will need to add a new thread, like we do with IBM, and sample often enough, say every 50ms. I could be wrong though, and we may be able to do this without adding explicit sampling in a separate thread. This support will be the greatest lift at our end but should be do-able.

  • I haven't looked at ARM as we haven't added support for print/get energy APIs there yet.
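
To illustrate the IBM case, where energy has to be derived from sampled power, here is a rough sketch of the accumulation step. The function name is hypothetical and the rectangle rule (each sample held constant for one interval) is an assumption; the actual port may integrate differently.

```c
#include <assert.h>
#include <stddef.h>

/* Rectangle-rule integration: each power sample (watts) is assumed to hold
 * for one full sampling interval (seconds). Returns energy in joules. */
double energy_from_power_samples(const double *watts, size_t n, double interval_s)
{
    assert(watts != NULL || n == 0);
    double joules = 0.0;
    for (size_t i = 0; i < n; i++) {
        joules += watts[i] * interval_s;
    }
    return joules;
}
```

For example, four samples of 100, 200, 100 and 200 W taken 250ms apart add up to 150 J. A shorter interval shrinks the error from power changes that happen between samples, which is why the sampling rate matters here.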

Let me know if I answered that in enough detail, happy to have a meeting next week to discuss.

@tpatki
Member Author

tpatki commented Aug 13, 2024

@slabasan @rountree
Checking in if you had more questions or feedback here. If not, I can take a stab at a PR so we can test this out in Caliper (and also add better support for our Kokkos users).

@rountree
Collaborator

@tpatki

  • IBM samples instantaneous power, no rollover.
  • On GPUs we use the vendor API to get energy, but this doesn't isolate us from rollovers. We can tell the user that we're just passing along whatever value the device gives us, but if we have to do better for Intel, we might as well make that a general solution.
  • AMD same thing.

So yes, I'd prefer to have the general case be sampling occasionally unless the vendor documentation we have makes rollover a once-per-decade thing. But I'm not implementing this, so it's just a preference.
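
As a sketch of what a general sampling-based approach could look like: a periodic sampler folds each raw 32-bit reading into a 64-bit running total, so the total survives rollovers as long as at most one wrap occurs between samples. The `energy_accum` type and `accum_sample` function are hypothetical and not part of Variorum; the 32-bit counter width is an assumption that would need checking per device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator state, updated by a periodic sampler. */
typedef struct {
    bool     primed;       /* false until the first sample sets a baseline */
    uint32_t last_raw;     /* most recent raw counter value */
    uint64_t total_counts; /* counts accumulated since the first sample */
} energy_accum;

/* Fold one raw 32-bit reading into the 64-bit total. Unsigned subtraction
 * modulo 2^32 gives the right delta across a single wraparound. */
void accum_sample(energy_accum *a, uint32_t raw)
{
    assert(a != NULL);
    if (a->primed) {
        a->total_counts += (uint32_t)(raw - a->last_raw);
    } else {
        a->primed = true;
    }
    a->last_raw = raw;
}
```

If two or more wraps happen between samples, the lost range cannot be recovered, so the sampling period has to stay well below the shortest expected wrap time.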

@tpatki
Member Author

tpatki commented Aug 14, 2024

Thanks @rountree.

Given that we have limited resources for Variorum at the moment, at least for the first cut at this, I am going to lean toward telling the user that we are passing along data that the vendor libraries are providing us (ESMI, RSMI, NVML, etc) and trusting that these vendor APIs take care of rollovers.

On some architectures (e.g. all GPUs), the low-level registers are not accessible at all, and we have no choice but to trust that APIs such as nvmlDeviceGetTotalEnergyConsumption will do the right thing -- which I believe they do. GPU APIs report energy values based on when the driver was last loaded -- see here, so they would have to take care of any rollovers (although we've not explicitly tested this).

Intel CPUs are the only exception to this situation, where we read directly from the MSR_PKG_ENERGY_STATUS or MSR_DRAM_ENERGY_STATUS in our code. These have 32 bits of energy data and are updated every millisecond, resulting in a wraparound every few minutes.
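
For a rough feel of the wraparound interval, the estimate can be sketched as below. The helper name is hypothetical. It assumes the energy unit is 1/2^ESU joules, where ESU is the Energy Status Units field read from MSR_RAPL_POWER_UNIT; the values 16 and 200 W used in the note after the code are illustrative, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32-bit energy counter wraps at a constant power draw.
 * esu_bits: exponent of the energy unit, i.e. one count = 1 / 2^esu_bits J. */
double wrap_seconds(unsigned esu_bits, double watts)
{
    assert(esu_bits < 32 && watts > 0.0);
    double joules_per_count = 1.0 / (double)(UINT64_C(1) << esu_bits);
    double full_range_joules = 4294967296.0 * joules_per_count; /* 2^32 counts */
    return full_range_joules / watts;
}
```

With esu_bits = 16 the full range is 65536 J, so a constant 200 W draw wraps after about 328 seconds, i.e. roughly five and a half minutes. Higher power or a finer energy unit shortens that interval.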

Looking at our port, I realized that we are already taking care of wraparounds for these registers in the Intel port when we calculate deltas, as we need to do this for reporting power on these systems too. Take a look here.
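
A wrap-aware delta of this kind can be sketched as follows. This is an illustration with hypothetical function names, not a copy of the Intel port's code, and it assumes a 32-bit counter with an energy unit of 1/2^esu_bits joules.

```c
#include <assert.h>
#include <stdint.h>

/* Delta between two raw 32-bit readings, allowing for one wraparound. */
uint64_t delta_with_wrap(uint32_t prev, uint32_t curr)
{
    if (curr >= prev) {
        return (uint64_t)curr - prev;
    }
    /* Counter wrapped: distance from prev to 2^32, plus curr. */
    return (UINT64_C(1) << 32) - prev + curr;
}

/* Convert raw counts to joules, given the energy unit exponent. */
double raw_to_joules(uint64_t counts, unsigned esu_bits)
{
    assert(esu_bits < 32);
    return (double)counts / (double)(UINT64_C(1) << esu_bits);
}
```

This handles a single wrap between two reads. If the caller waits longer than one full wrap period between calls, the delta is silently too small, which is the case that background sampling would guard against.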

My understanding is that we will be reporting the correct values for energy with the current Intel port if we choose to do deltas (no sampling will be needed if I am understanding the code correctly, but I haven't refreshed my memory on this port enough yet). I will have to test this explicitly when I start working on this PR. I believe @slabasan has tested these wraparounds before, she may be able to comment as well.

TLDR: Let's try to get a first cut at this while trusting the vendor APIs (and our Intel port). Let's document this well and explain to the users this decision. And let's leave an issue open to test for rollovers on each architecture, so we can fix these if we run into them or if any users run into them.

@rountree
Collaborator

@tpatki Sounds good.
