-
-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU monitoring (percentage and time) in Darwin/macOS #1606
base: main
Are you sure you want to change the base?
Conversation
darwin/DarwinMachine.c
Outdated
@@ -125,3 +125,17 @@ bool Machine_isCPUonline(const Machine* host, unsigned int id) { | |||
|
|||
return true; | |||
} | |||
|
|||
double Machine_getGpuUsage(const Machine* super, double values[]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to add the two APIs for all the platforms to make the code compile. It seems to me a good solution, because it works even if someone adds the &GPUMeter_class
reference in the Platform.c
for this machine, so it's quite robust.
The The display function should be independent of the platform; only retrieving the values for the meter ( |
I declared three of the (formerly) static variable in extern const MeterClass GPUMeter_class;
extern struct EngineData GPUMeter_engineData[4];
extern unsigned long long int totalGPUTimeDiff; Because now they must be shared between |
@BenBE Thank you for your explanation. I was trying to understand the meaning of the attributes structure by looking at other examples of meters. In some of them it was quite clear that these attributes refer to different kind of values displayed by the graph, e.g. this one static const int FileDescriptorMeter_attributes[] = {
FILE_DESCRIPTOR_USED,
FILE_DESCRIPTOR_MAX
}; In the case of the GPUMeter, I was mislead by the word "ENGINE", though |
Sorry, for zigzagging things here, but as noted in my reply to @Explorer09, a more natural solution is actually to put these with the The |
linux/Platform.c
Outdated
|
||
assert(ARRAYSIZE(GPUMeter_engineData) + 1 == ARRAYSIZE(GPUMeter_attributes)); | ||
|
||
totalGPUTimeDiff = saturatingSub(this->curGpuTime, this->prevGpuTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indent messed up. (There is a tab character here; please change it to spaces.)
linux/Platform.c
Outdated
@@ -146,6 +146,7 @@ static enum { BAT_PROC, BAT_SYS, BAT_ERR } Platform_Battery_method = BAT_PROC; | |||
static time_t Platform_Battery_cacheTime; | |||
static double Platform_Battery_cachePercent = NAN; | |||
static ACPresence Platform_Battery_cacheIsOnAC; | |||
static unsigned long long int prevResidueTime, curResidueTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When moving these two variables to platform code, I suggest renaming these to prevGPUResidueTime
and curGPUResidueTime
to reduce ambiguity.
By the way, curGPUResidueTime
may be converted into a local (function scope) variable, but that's a separate commit.
linux/Platform.c
Outdated
@@ -1104,3 +1105,33 @@ void Platform_done(void) { | |||
LibSensors_cleanup(); | |||
#endif | |||
} | |||
|
|||
|
|||
double Platform_getGpuUsage(Meter* meter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double Platform_getGpuUsage(Meter* meter) { | |
double Platform_setGPUValues(Meter* meter) { |
While I admit it's strange to use the "set" word on the function name like this, this naming style matches other functions' naming in Platform.c.
In a more ideal codebase we would have named it Platform_updateGPUValues
, but please stick with the current style for now (unless there is a PR to rename all of these functions together).
Also note the all-caps on the word GPU
.
linux/Platform.c
Outdated
prevResidueTime = curResidueTime; | ||
curResidueTime = this->curGpuTime; | ||
|
||
for (gpuEngineData = this->gpuEngineData, i = 0; gpuEngineData && i < ARRAYSIZE(GPUMeter_engineData); gpuEngineData = gpuEngineData->next, i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (gpuEngineData = this->gpuEngineData, i = 0; gpuEngineData && i < ARRAYSIZE(GPUMeter_engineData); gpuEngineData = gpuEngineData->next, i++) { | |
for (gpuEngineData = lhost->gpuEngineData, i = 0; gpuEngineData && i < ARRAYSIZE(GPUMeter_engineData); gpuEngineData = gpuEngineData->next, i++) { |
Avoid the name this
for non-methods.
linux/Platform.c
Outdated
const Machine* super = meter->host; | ||
const LinuxMachine* this = (const LinuxMachine*) super; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const Machine* super = meter->host; | |
const LinuxMachine* this = (const LinuxMachine*) super; | |
const Machine* host = meter->host; | |
const LinuxMachine* lhost = (const LinuxMachine*) host; |
Avoid the names this
and super
for non-methods.
@BenBE Ok, I am putting the double Machine_setGPUValues(Machine* this, double *values, size_t values_size); That way, although we do not have the meter structure, we still have the machine and the (bounded) values array. |
This function should be dedicated to updating the |
I did some refactoring for the code structure (WIP), that might better illustrate, where things should go. Code can be found at https://github.com/BenBE/htop/tree/%231597 Basically:
The refactoring isn't done yet, but most parts should already be visible. |
@BenBE Thank you for sharing your refactoring. But at this point I have to say that probably I didn't get the purpose of this function in the first place. Because I put most of my efforts in trying to update the meters' values from within the Machine module (that's the main reason why I was passing the array here, because I couldn't pass the Meter* directly). It's not a problem for me, at all. It has been an opportunity to learn something about htop's metrics. In this regard I was thinking about: what if I start a new branch writing some dev documentation about what we have done here? It could be useful for others. Anyway, I'll build the macOS version on top of that. |
@@ -105,6 +108,11 @@ Machine* Machine_new(UsersTable* usersTable, uid_t userId) { | |||
openzfs_sysctl_init(&this->zfs); | |||
openzfs_sysctl_updateArcStats(&this->zfs); | |||
|
|||
this->GPUService = IOServiceGetMatchingService(kIOMainPortDefault, IOServiceMatching("IOGPU")); | |||
if(!this->GPUService) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A space
if(!this->GPUService) { | |
if (!this->GPUService) { |
darwin/DarwinMachine.c
Outdated
|
||
deviceUtil = CFDictionaryGetValue(perfStats, CFSTR("Device Utilization %")); | ||
if (!deviceUtil) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a CFRelease
call after this return
statement. Would this early return cause a resource leak? Please help us check it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my bad. I just used the code of the standalone script I wrote to reply to this issue, in which I obviously did not care to deallocate any resource. I fixed it in any early exit.
I tested for memory leaks on both platforms (the old Macs with x86 and those with Apple Silicon). I report the output of the tool leaks for both below
Process: htop [1162]
Path: /Users/USER/Documents/*/htop
Load Address: 0x1022b4000
Identifier: htop
Version: 0
Code Type: ARM64
Platform: macOS
Parent Process: leaks [1161]
Date/Time: 2025-02-16 22:27:28.742 +0100
Launch Time: 2025-02-16 22:27:06.245 +0100
OS Version: macOS 15.1.1 (24B91)
Report Version: 7
Analysis Tool: /usr/bin/leaks
Physical footprint: 8818K
Physical footprint (peak): 8866K
Idle exit: untracked
----
leaks Report Version: 4.0, multi-line stacks
Process 1162: 1260 nodes malloced for 1209 KB
Process 1162: 0 leaks for 0 total leaked bytes.
Process: htop [93318]
Path: /Users/USER/Documents/*/htop
Load Address: 0x10ccdc000
Identifier: htop
Version: 0
Code Type: X86-64
Platform: macOS
Parent Process: leaks [93317]
Date/Time: 2025-02-16 10:58:04.530 pm +0100
Launch Time: 2025-02-16 10:57:45.224 pm +0100
OS Version: macOS 12.7.6 (21H1320)
Report Version: 7
Analysis Tool: /usr/bin/leaks
Physical footprint: 5168K
Physical footprint (peak): 5176K
----
leaks Report Version: 4.0, multi-line stacks
Process 93318: 742 nodes malloced for 920 KB
Process 93318: 0 leaks for 0 total leaked bytes.
darwin/DarwinMachine.c
Outdated
CFDictionaryRef perfStats; | ||
CFNumberRef deviceUtil; | ||
int device = 0, ret; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the total GPU time is not retrievable from this machine (I mean macOS), please set it to a dummy value that we can tell it's N/A.
I suggest (unsigned long long)-1
as a placeholder here.
totalGPUTimeDiff = (unsigned long long)-1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a static function that initialize the two GPU fields in the Machine structure to -1, that way they are always set even in case of early exit.
static void DarwinMachine_setDeafultGPUValues(Machine* super) {
super->totalGPUUsage = (double)-1;
super->totalGPUTimeDiff = (unsigned long long int)-1;
}
This function is called only here, but I thought it would make the code more readable by clarifying the fact that -1 is the default value of these two parameters, and not just a magic number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be initialized as part of DarwinMachine_new
. Or given these are part of the Machine
strcture as part of Machine_new
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please inline the
DarwinMachine_setDeafultGPUValues
call intoMachine_new
. Don't make a separate function for that because the function is useless and doesn't align with the code styles of other objects in htop. - For floating point value, we tend to use
NAN
rather than-1
for "unknown" or "not available" value. Even though-1
could work currently, I suggest changing it toNAN
for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Explorer09 Ok, I'll initialize them directly in Machine_new. For the NAN part, should I change the Machine_updateGpuUsage
function of UsupportedMachine to make it return NAN as well?
double Machine_updateGpuUsage(Machine* super ATTR_UNUSED) {
return NAN;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll initialize them directly in Machine_new. For the NAN part, should I change the
Machine_updateGpuUsage
function of UsupportedMachine to make it return NAN as well?
When Machine_new
has initialized it to NAN, there should be no updates to the value for platforms that don't support GPU monitoring.
darwin/DarwinMachine.c
Outdated
@@ -79,7 +79,7 @@ static void DarwinMachine_getVMStats(DarwinMachine* this) { | |||
|
|||
static void DarwinMachine_setDeafultGPUValues(Machine* super) { | |||
super->totalGPUUsage = (double)-1; | |||
super->totalGPUTimeDiff = (unsigned long long int)-1; | |||
super->totalGPUTimeDiff = (unsigned long long)-1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the int
from unsigned long long int
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make it shorter. Should I put the int back again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aestriplex The point is unnecessary line changes that complicated our code review. The commit history is messy currently. You should squash commits so that one commit does one logical change, which can ease other people reviewing your code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aestriplex Clarification. There should be two commits for this PR:
- Refactor linux/GPUMeter into platform independent GPUMeter.
- Add GPUMeter code for Darwin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase your commits to squash the code fixes into your commits. The final history in your PR should contain the necessary changes, not every single step you took. Instead, individual commits should detail, what set of changes you'd need to make (step by step) to achieve your implementation. Thus three commits (refactoring the GPU meter to be platform-independent, extra changes for Linux, additions for Darwin) should mostly cover things quite nicely.
@BenBE @Explorer09 Now There are just two commit in this branch:
Sorry I didn't do this sooner, but I was afraid that by squashing commits and force pushing, the comment references would be lost. This has happened to me in the past using other systems (e.g. BitBucket). |
@BenBE I notice that in Machine_scan of LinuxMachine, I am passing |
I squashed it in |
@@ -22,6 +22,9 @@ Machine* Machine_new(UsersTable* usersTable, uid_t userId) { | |||
super->existingCPUs = 1; | |||
super->activeCPUs = 1; | |||
|
|||
super->totalGPUUsage = NAN; | |||
super->totalGPUTimeDiff = (unsigned long long)-1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added those two field here. I avoided to put them in Machine_scan too, because they do not change over time
When you push commits into GitHub, GitHub's database would keep permanent copies of those commits even when they're never referenced again in later branches. GitHub can even detect whether the commit being discussed is "outdated" and indicate as such in the UI. Therefore, no need to worry about references. |
darwin/DarwinMachine.c
Outdated
goto cleanup; | ||
} | ||
|
||
int device = NAN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integer cannot hold a NaN value. Does this not produce a compiler warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was clearly a mistake, as I misinterpreted the type of device var. Surprisingly enough, the compiler (clang) on macOS does not produce any warning for that. I tried to initialise another int variable to NAN an compiling it with gcc (running on Ubuntu) and I get this warning:
GPUMeter.c:97:18: warning: overflow in conversion from 'float' to 'int' changes value from '+QNaNf' to '0' [-Woverflow]
97 | int written = NAN;
This behaviour of clang is quite curious. I also tried to activate the option -Woverflow
, but nothing happened
This PR follows from the discussion in #1597. As said there, I split my changes in different commit, starting from the refactoring of
linux/GPUMeter.c
andlinux/GPUMeter.h
.@Explorer09 @BenBE I open it as a draft because of the GPUMeter refactoring thing. The code compiles and runs on linux and macOS (both x86 and Silicon), but I am going to express some perplexities (as inline comments) for what concerns the unified
GPUMeter.c
in the root directory.I should have pushed this one on Friday, but I've had a quite busy weekend, sorry. Anyway, this week I should be more available.