Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU monitoring (percentage and time) in Darwin/macOS #1606

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aestriplex
Copy link
Contributor

This PR follows from the discussion in #1597. As said there, I split my changes in different commit, starting from the refactoring of linux/GPUMeter.c and linux/GPUMeter.h.
@Explorer09 @BenBE I open it as a draft because of the GPUMeter refactoring thing. The code compiles and runs on linux and macOS (both x86 and Silicon), but I am going to express some perplexities (as inline comments) for what concerns the unified GPUMeter.c in the root directory.

I should have pushed this one on Friday, but I've had a quite busy weekend, sorry. Anyway, this week I should be more available.

@@ -125,3 +125,17 @@ bool Machine_isCPUonline(const Machine* host, unsigned int id) {

return true;
}

double Machine_getGpuUsage(const Machine* super, double values[]) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to add the two APIs for all the platforms to make the code compile. It seems to me a good solution, because it works even if someone adds the &GPUMeter_class reference in the Platform.c for this machine, so it's quite robust.

@BenBE
Copy link
Member

BenBE commented Feb 10, 2025

The GPUMeter_attributes is the list of attributes that a certain meter wants to display and is linked to the colors to use for these values. This can/should be platform-independent and will contain only 1 value usually (the usage). You could have more attributes, if there are distinct usage patterns that can be distinguished, like low priority, user space load, kernel space load, soft/hard IRQs, … are for the normal CPU meter.

The display function should be independent of the platform; only retrieving the values for the meter (*_updateValues) should call into platform-specific code as needed.

@aestriplex
Copy link
Contributor Author

I declared three of the (formerly) static variable in GPUMetric.c as extern in GPUMetric.h

extern const MeterClass GPUMeter_class;
extern struct EngineData GPUMeter_engineData[4];
extern unsigned long long int totalGPUTimeDiff;

Because now they must be shared between /GPUMetric.c and <platform>/Platform.c.

@aestriplex
Copy link
Contributor Author

@BenBE Thank you for your explanation. I was trying to understand the meaning of the attributes structure by looking at other examples of meters. In some of them it was quite clear that these attributes refer to different kind of values displayed by the graph, e.g. this one

static const int FileDescriptorMeter_attributes[] = {
   FILE_DESCRIPTOR_USED,
   FILE_DESCRIPTOR_MAX
};

In the case of the GPUMeter, I was mislead by the word "ENGINE", though

@BenBE
Copy link
Member

BenBE commented Feb 10, 2025

I declared three of the (formerly) static variable in GPUMetric.c as extern in GPUMetric.h

extern const MeterClass GPUMeter_class;
extern struct EngineData GPUMeter_engineData[4];
extern unsigned long long int totalGPUTimeDiff;

Because now they must be shared between /GPUMetric.c and <platform>/Platform.c.

Sorry, for zigzagging things here, but as noted in my reply to @Explorer09, a more natural solution is actually to put these with the <Platform>Machine structure, similar to how information for CPUs are handled.

The GPUMeter_class should go platform-independent in GPUMeter.c, The GPUMeter_engineData[4] should go into the <Platform>Machine structure, and totalGPUTimeDiff would go to the Machine structure. Have a look how similar fields for CPU information are placed inside these structures for comparison. If you are unsure, feel free to ask (if you would like to avoid refactoring things over and over again).

linux/Platform.c Outdated

assert(ARRAYSIZE(GPUMeter_engineData) + 1 == ARRAYSIZE(GPUMeter_attributes));

totalGPUTimeDiff = saturatingSub(this->curGpuTime, this->prevGpuTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent messed up. (There is a tab character here; please change it to spaces.)

linux/Platform.c Outdated
@@ -146,6 +146,7 @@ static enum { BAT_PROC, BAT_SYS, BAT_ERR } Platform_Battery_method = BAT_PROC;
static time_t Platform_Battery_cacheTime;
static double Platform_Battery_cachePercent = NAN;
static ACPresence Platform_Battery_cacheIsOnAC;
static unsigned long long int prevResidueTime, curResidueTime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When moving these two variables to platform code, I suggest renaming these to prevGPUResidueTime and curGPUResidueTime to reduce ambiguity.

By the way, curGPUResidueTime may be converted into a local (function scope) variable, but that's a separate commit.

linux/Platform.c Outdated
@@ -1104,3 +1105,33 @@ void Platform_done(void) {
LibSensors_cleanup();
#endif
}


double Platform_getGpuUsage(Meter* meter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
double Platform_getGpuUsage(Meter* meter) {
double Platform_setGPUValues(Meter* meter) {

While I admit it's strange to use the "set" word on the function name like this, this naming style matches other functions' naming in Platform.c.

In a more ideal codebase we would have named it Platform_updateGPUValues, but please stick with the current style for now (unless there is a PR to rename all of these functions together).

Also note the all-caps on the word GPU.

linux/Platform.c Outdated
prevResidueTime = curResidueTime;
curResidueTime = this->curGpuTime;

for (gpuEngineData = this->gpuEngineData, i = 0; gpuEngineData && i < ARRAYSIZE(GPUMeter_engineData); gpuEngineData = gpuEngineData->next, i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (gpuEngineData = this->gpuEngineData, i = 0; gpuEngineData && i < ARRAYSIZE(GPUMeter_engineData); gpuEngineData = gpuEngineData->next, i++) {
for (gpuEngineData = lhost->gpuEngineData, i = 0; gpuEngineData && i < ARRAYSIZE(GPUMeter_engineData); gpuEngineData = gpuEngineData->next, i++) {

Avoid the name this for non-methods.

linux/Platform.c Outdated
Comment on lines 1113 to 1114
const Machine* super = meter->host;
const LinuxMachine* this = (const LinuxMachine*) super;
Copy link
Contributor

@Explorer09 Explorer09 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const Machine* super = meter->host;
const LinuxMachine* this = (const LinuxMachine*) super;
const Machine* host = meter->host;
const LinuxMachine* lhost = (const LinuxMachine*) host;

Avoid the names this and super for non-methods.

@aestriplex
Copy link
Contributor Author

The GPUMeter_class should go platform-independent in GPUMeter.c, The GPUMeter_engineData[4] should go into the <Platform>Machine structure, and totalGPUTimeDiff would go to the Machine structure. Have a look how similar fields for CPU information are placed inside these structures for comparison. If you are unsure, feel free to ask (if you would like to avoid refactoring things over and over again).

@BenBE Ok, I am putting the Machine_setGPUValues (with the name modified as indicate here) in <Platform>Machine. I cannot use a pointer to a Meter structure in the signature though, because I have to declare the prototype in Machine.h, and the Meter structure depends on the Machine structure. So I am thinking of this

double Machine_setGPUValues(Machine* this, double *values, size_t values_size);

That way, although we do not have the meter structure, we still have the machine and the (bounded) values array.

@Explorer09
Copy link
Contributor

The GPUMeter_class should go platform-independent in GPUMeter.c, The GPUMeter_engineData[4] should go into the <Platform>Machine structure, and totalGPUTimeDiff would go to the Machine structure. Have a look how similar fields for CPU information are placed inside these structures for comparison. If you are unsure, feel free to ask (if you would like to avoid refactoring things over and over again).

@BenBE Ok, I am putting the Machine_setGPUValues (with the name modified as indicate here) in <Platform>Machine. I cannot use a pointer to a Meter structure in the signature though, because I have to declare the prototype in Machine.h, and the Meter structure depends on the Machine structure. So I am thinking of this

double Machine_setGPUValues(Machine* this, double *values, size_t values_size);

That way, although we do not have the meter structure, we still have the machine and the (bounded) values array.

This function should be dedicated to updating the Meter instance. Without a Meter instance, this function would be useless. The direct output of double *values makes no sense.

@BenBE
Copy link
Member

BenBE commented Feb 12, 2025

I did some refactoring for the code structure (WIP), that might better illustrate, where things should go.

Code can be found at https://github.com/BenBE/htop/tree/%231597

Basically:

  • GPU meter code in GPUMeter.c
  • GPU metric fetching in <Platform>Machine.c
  • GPU display code in GPUMeter.c
  • Platform-independent statistics (e.g. total GPU time) to Machine.h
  • Platform-dependent statistics (e.g. EngineData) to <Platform>Machine.h

The refactoring isn't done yet, but most parts should already be visible.

@aestriplex
Copy link
Contributor Author

@BenBE Thank you for sharing your refactoring. But at this point I have to say that probably I didn't get the purpose of this function in the first place. Because I put most of my efforts in trying to update the meters' values from within the Machine module (that's the main reason why I was passing the array here, because I couldn't pass the Meter* directly). It's not a problem for me, at all. It has been an opportunity to learn something about htop's metrics. In this regard I was thinking about: what if I start a new branch writing some dev documentation about what we have done here? It could be useful for others.

Anyway, I'll build the macOS version on top of that.

@aestriplex aestriplex marked this pull request as ready for review February 15, 2025 02:04
@@ -105,6 +108,11 @@ Machine* Machine_new(UsersTable* usersTable, uid_t userId) {
openzfs_sysctl_init(&this->zfs);
openzfs_sysctl_updateArcStats(&this->zfs);

this->GPUService = IOServiceGetMatchingService(kIOMainPortDefault, IOServiceMatching("IOGPU"));
if(!this->GPUService) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A space

Suggested change
if(!this->GPUService) {
if (!this->GPUService) {


deviceUtil = CFDictionaryGetValue(perfStats, CFSTR("Device Utilization %"));
if (!deviceUtil) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a CFRelease call after this return statement. Would this early return cause a resource leak? Please help us check it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my bad. I just used the code of the standalone script I wrote to reply to this issue, in which I obviously did not care to deallocate any resource. I fixed it in any early exit.
I tested for memory leaks on both platforms (the old Macs with x86 and those with Apple Silicon). I report the output of the tool leaks for both below

Process:         htop [1162]
Path:            /Users/USER/Documents/*/htop
Load Address:    0x1022b4000
Identifier:      htop
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [1161]

Date/Time:       2025-02-16 22:27:28.742 +0100
Launch Time:     2025-02-16 22:27:06.245 +0100
OS Version:      macOS 15.1.1 (24B91)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         8818K
Physical footprint (peak):  8866K
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 1162: 1260 nodes malloced for 1209 KB
Process 1162: 0 leaks for 0 total leaked bytes.
Process:         htop [93318]
Path:            /Users/USER/Documents/*/htop
Load Address:    0x10ccdc000
Identifier:      htop
Version:         0
Code Type:       X86-64
Platform:        macOS
Parent Process:  leaks [93317]

Date/Time:       2025-02-16 10:58:04.530 pm +0100
Launch Time:     2025-02-16 10:57:45.224 pm +0100
OS Version:      macOS 12.7.6 (21H1320)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         5168K
Physical footprint (peak):  5176K
----

leaks Report Version: 4.0, multi-line stacks
Process 93318: 742 nodes malloced for 920 KB
Process 93318: 0 leaks for 0 total leaked bytes.

CFDictionaryRef perfStats;
CFNumberRef deviceUtil;
int device = 0, ret;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the total GPU time is not retrievable from this machine (I mean macOS), please set it to a dummy value that we can tell it's N/A.
I suggest (unsigned long long)-1 as a placeholder here.

   totalGPUTimeDiff = (unsigned long long)-1;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a static function that initialize the two GPU fields in the Machine structure to -1, that way they are always set even in case of early exit.

static void DarwinMachine_setDeafultGPUValues(Machine* super) {
   super->totalGPUUsage = (double)-1;
   super->totalGPUTimeDiff = (unsigned long long int)-1;
}

This function is called only here, but I thought it would make the code more readable by clarifying the fact that -1 is the default value of these two parameters, and not just a magic number

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be initialized as part of DarwinMachine_new. Or given these are part of the Machine strcture as part of Machine_new instead.

Copy link
Contributor

@Explorer09 Explorer09 Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please inline the DarwinMachine_setDeafultGPUValues call into Machine_new. Don't make a separate function for that because the function is useless and doesn't align with the code styles of other objects in htop.
  2. For floating point value, we tend to use NAN rather than -1 for "unknown" or "not available" value. Even though -1 could work currently, I suggest changing it to NAN for consistency.

Copy link
Contributor Author

@aestriplex aestriplex Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Explorer09 Ok, I'll initialize them directly in Machine_new. For the NAN part, should I change the Machine_updateGpuUsage function of UsupportedMachine to make it return NAN as well?

double Machine_updateGpuUsage(Machine* super ATTR_UNUSED) {
   return NAN;
}

Copy link
Contributor

@Explorer09 Explorer09 Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll initialize them directly in Machine_new. For the NAN part, should I change the Machine_updateGpuUsage function of UsupportedMachine to make it return NAN as well?

When Machine_new has initialized it to NAN, there should be no updates to the value for platforms that don't support GPU monitoring.

@@ -79,7 +79,7 @@ static void DarwinMachine_getVMStats(DarwinMachine* this) {

static void DarwinMachine_setDeafultGPUValues(Machine* super) {
super->totalGPUUsage = (double)-1;
super->totalGPUTimeDiff = (unsigned long long int)-1;
super->totalGPUTimeDiff = (unsigned long long)-1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the int from unsigned long long int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it shorter. Should I put the int back again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aestriplex The point is unnecessary line changes that complicated our code review. The commit history is messy currently. You should squash commits so that one commit does one logical change, which can ease other people reviewing your code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aestriplex Clarification. There should be two commits for this PR:

  1. Refactor linux/GPUMeter into platform independent GPUMeter.
  2. Add GPUMeter code for Darwin.

Copy link
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase your commits to squash the code fixes into your commits. The final history in your PR should contain the necessary changes, not every single step you took. Instead, individual commits should detail, what set of changes you'd need to make (step by step) to achieve your implementation. Thus three commits (refactoring the GPU meter to be platform-independent, extra changes for Linux, additions for Darwin) should mostly cover things quite nicely.

@aestriplex
Copy link
Contributor Author

@BenBE @Explorer09 Now There are just two commit in this branch:

  1. Refactor linux/GPUMeter into platform independent GPUMeter.
  2. Add GPUMeter code for Darwin.

Sorry I didn't do this sooner, but I was afraid that by squashing commits and force pushing, the comment references would be lost. This has happened to me in the past using other systems (e.g. BitBucket).

@aestriplex
Copy link
Contributor Author

@BenBE I notice that in Machine_scan of LinuxMachine, I am passing Machine_scanGPUUsage(this); instead of Machine_scanGPUUsage(super);, I fix it and squash it in Refactor linux/GPUMeter into platform independent GPUMeter

@aestriplex
Copy link
Contributor Author

I squashed it in Add GPUMeter code for Darwin, because the call in Machine_scan was made there

@@ -22,6 +22,9 @@ Machine* Machine_new(UsersTable* usersTable, uid_t userId) {
super->existingCPUs = 1;
super->activeCPUs = 1;

super->totalGPUUsage = NAN;
super->totalGPUTimeDiff = (unsigned long long)-1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added those two field here. I avoided to put them in Machine_scan too, because they do not change over time

@Explorer09
Copy link
Contributor

Now There are just two commit in this branch:

  1. Refactor linux/GPUMeter into platform independent GPUMeter.
  2. Add GPUMeter code for Darwin.

Sorry I didn't do this sooner, but I was afraid that by squashing commits and force pushing, the comment references would be lost. This has happened to me in the past using other systems (e.g. BitBucket).

When you push commits into GitHub, GitHub's database would keep permanent copies of those commits even when they're never referenced again in later branches.

GitHub can even detect whether the commit being discussed is "outdated" and indicate as such in the UI. Therefore, no need to worry about references.

goto cleanup;
}

int device = NAN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer cannot hold a NaN value. Does this not produce a compiler warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was clearly a mistake, as I misinterpreted the type of device var. Surprisingly enough, the compiler (clang) on macOS does not produce any warning for that. I tried to initialise another int variable to NAN an compiling it with gcc (running on Ubuntu) and I get this warning:

GPUMeter.c:97:18: warning: overflow in conversion from 'float' to 'int' changes value from '+QNaNf' to '0' [-Woverflow]
    97 |    int written = NAN;

This behaviour of clang is quite curious. I also tried to activate the option -Woverflow, but nothing happened

@BenBE BenBE added enhancement Extension or improvement to existing feature Linux 🐧 Linux related issues MacOS 🍏 MacOS / Darwin related issues labels Feb 24, 2025
@BenBE BenBE added this to the 3.4.1 milestone Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Extension or improvement to existing feature Linux 🐧 Linux related issues MacOS 🍏 MacOS / Darwin related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants