Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: Recent changes to 6.0.1-112 make it ABI incompatible to 6.0.0 #152

Open
bertwesarg opened this issue Jan 20, 2024 · 11 comments
Open

Comments

@bertwesarg
Copy link

bertwesarg commented Jan 20, 2024

Problem Description

6.0.1-112 includes this change to the rocm_smi.h header:

@@ -1091,6 +1099,19 @@
   uint16_t current_vclk0s[RSMI_MAX_NUM_CLKS];
   uint16_t current_dclk0s[RSMI_MAX_NUM_CLKS];
 
+  /*
+   * v1.5 additions
+   */
+  // JPEG activity percent (encode/decode)
+  uint16_t jpeg_activity[RSMI_MAX_NUM_JPEG_ENGS];
+
+  // PCIE NAK sent accumulated count
+  uint32_t pcie_nak_sent_count_acc;
+
+  // PCIE NAK received accumulated count
+  uint32_t pcie_nak_rcvd_count_acc;
+
+
   /// \endcond
 } rsmi_gpu_metrics_t;
 

but this change of the size of type rsmi_gpu_metrics_t breaks the ABI to 6.0.0. And currently both librocm_smi.so libs have the same interface version 6.

@oliveiradan
Copy link
Contributor

Problem Description

6.0.1-112 includes this change to the rocm_smi.h header:

@@ -1091,6 +1099,19 @@
   uint16_t current_vclk0s[RSMI_MAX_NUM_CLKS];
   uint16_t current_dclk0s[RSMI_MAX_NUM_CLKS];
 
+  /*
+   * v1.5 additions
+   */
+  // JPEG activity percent (encode/decode)
+  uint16_t jpeg_activity[RSMI_MAX_NUM_JPEG_ENGS];
+
+  // PCIE NAK sent accumulated count
+  uint32_t pcie_nak_sent_count_acc;
+
+  // PCIE NAK received accumulated count
+  uint32_t pcie_nak_rcvd_count_acc;
+
+
   /// \endcond
 } rsmi_gpu_metrics_t;
 

but this change of the size of type rsmi_gpu_metrics_t breaks the ABI to 6.0.0. And currently both librocm_smi.so libs have the same interface version 6.

Are you referring specifically to the fact the rsmi_gpu_metrics_t size changed? Or that the ABI versioning still needs to be updated to reflect the change? Like 6.0.1, for example?

@bertwesarg
Copy link
Author

Are you referring specifically to the fact the rsmi_gpu_metrics_t size changed?

Correct.

Like 6.0.1, for example?

This is a particular bad example in this case, as it does not tell the linker/runtime loader, that librocm_smi64.so.6.0.60001 is not compatible to librocm_smi64.so.6.0.60000. They both still have the same SONAME which is librocm_smi64.so.6, and binaries/libraries only record the SONAME in there dependencies (DT_NEEDED)

@oliveiradan
Copy link
Contributor

oliveiradan commented Jan 26, 2024

Are you referring specifically to the fact the rsmi_gpu_metrics_t size changed?

Correct.

Like 6.0.1, for example?

This is a particular bad example in this case, as it does not tell the linker/runtime loader, that librocm_smi64.so.6.0.60001 is not compatible to librocm_smi64.so.6.0.60000. They both still have the same SONAME which is librocm_smi64.so.6, and binaries/libraries only record the SONAME in there dependencies (DT_NEEDED)

That's a great point. We expect the rsmi_gpu_metric_t type size to change when new elements are added (which is happening more often due to requirements), but we need to change the SONAME to reflect the change and not break the existing ABI. As it follows the ROCm versioning, let me check on that so we can address it accordingly.

Question: Did it break your application? or are you just saying it should have a different version to identify the change?
Thank you for bringing that up!

@bertwesarg
Copy link
Author

I'm just reading the code. There are at least two possible ways I know to prevent changing the soname every time a struct changes its size:

  1. Use symbol versions for functions taking the struct as argument
  2. Let the caller pass the size of the struct to the functions

I think its too late for option 2., but 1. should still be able to add, but it wont avoid the increment in this bugfix

@ppanchad-amd
Copy link

@bertwesarg @oliveiradan Internal ticket has been created to fix this issue. Thanks!

@sohaibnd
Copy link

sohaibnd commented Feb 5, 2025

Hi @bertwesarg, sorry for the late follow up. Note that rocm-smi is going to be deprecated soon in favor of amd-smi.

Currently, amd-smi follows a versioning pattern of YEAR.MAJOR.MINOR.RELEASE where "Major version should be changed for every header change (adding/deleting APIs, changing names, fields of structures, etc.)" (see here). So the MAJOR version will be updated (but not necessarily the YEAR) in scenarios like the one you have described. As a workaround, at the start of your program, you can get the full version of libamd_smi.so loaded using amdsmi_get_lib_version and check the YEAR and MAJOR version against the version you compiled for.

Let me know if you have any follow-up questions.

@bertwesarg
Copy link
Author

You are using YEAR in the SONAME of libamd_smi.so, i.e., the app is linking against libamd_smi.so.24. MAJOR does not play a role anymore. If you make the YEAR bump this year, but there is no ABI incompatible change (i.e., changing or removing an interface) the app from last year will break without a reason, it should still work with libamd_smi.so.25. Checking at runtime the ABI version is too late.

@sohaibnd
Copy link

sohaibnd commented Feb 6, 2025

You are using YEAR in the SONAME of libamd_smi.so, i.e., the app is linking against libamd_smi.so.24. MAJOR does not play a role anymore. If you make the YEAR bump this year, but there is no ABI incompatible change (i.e., changing or removing an interface) the app from last year will break without a reason, it should still work with libamd_smi.so.25.

The YEAR is only updated alongside a MAJOR update (not for a MINOR update).

Checking at runtime the ABI version is too late.

Why is that? Unless you have multiple versions of ROCm installed it won't work anyways.

@sohaibnd
Copy link

sohaibnd commented Feb 11, 2025

Checking at runtime the ABI version is too late.

Why is that? Unless you have multiple versions of ROCm installed it won't work anyways.

@bertwesarg waiting on a response

@bertwesarg
Copy link
Author

You are using YEAR in the SONAME of libamd_smi.so, i.e., the app is linking against libamd_smi.so.24. MAJOR does not play a role anymore. If you make the YEAR bump this year, but there is no ABI incompatible change (i.e., changing or removing an interface) the app from last year will break without a reason, it should still work with libamd_smi.so.25.

The YEAR is only updated alongside a MAJOR update (not for a MINOR update).

But that was not mentioned in your first message: "So the MAJOR version will be updated (but not necessarily the YEAR)". if YEAR and MAJOR are interlinked why having both? So you could try use both YEAR and MAJOR in the SONAME. I saw such libraries already in the wild.

Checking at runtime the ABI version is too late.

Why is that? Unless you have multiple versions of ROCm installed it won't work anyways.

checking the ABI at runtime is not mandatory, you cannot force the users to do it and abort appropriately. And documentation does not help either. The SONAME will be obeyed by the loader, no checking from the user needed

you don't need multiple version installed at the same time, a version upgrade is all you need

@sohaibnd
Copy link

sohaibnd commented Feb 13, 2025

But that was not mentioned in your first message: "So the MAJOR version will be updated (but not necessarily the YEAR)". if YEAR and MAJOR are interlinked why having both?

Sorry if this wasn't clear. I meant to say that:

  1. MAJOR will always be updated if there is an API change
  2. YEAR is only updated alongside a MAJOR update and if the actual year has changed since the last MAJOR update

checking the ABI at runtime is not mandatory, you cannot force the users to do it and abort appropriately. And documentation does not help either. The SONAME will be obeyed by the loader, no checking from the user needed

you don't need multiple version installed at the same time, a version upgrade is all you need

Right, which is why I said that it is a workaround. If someone is using the AMD SMI C++ library as part of their own tool or something (note that there is already the cli tool for normal users) and they are worried about undefined behavior from a version upgrade that changes the library's API, they can check the version at runtime using amdsmi_get_lib_version and exit safely if the YEAR+MAJOR does not match. Even if the SONAME was properly updated as you say, it would just lead to the dynamic linker not being able to find the library when you run the program. Either way, the program needs to be recompiled with the newer version of the library to work.

So you could try use both YEAR and MAJOR in the SONAME. I saw such libraries already in the wild.

That is possible. Anyways amd smi may undergo a change in the versioning scheme matching what you're asking for, I will update this issue once there is more info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants