
[rocm6.3_internal_testing] [ROCm] [SWDEV-472265] Return correct AMDSMI socket_power metric #1458

Draft · wants to merge 57 commits into base: rocm6.3_internal_testing
Conversation


@jataylo jataylo commented Jul 10, 2024

Copy of: #1457

Cherry pick of https://github.com/pytorch/pytorch/pull/130331/files

Extends the change in pytorch#127729

Depending on the gcnArch, the API that returns socket power differs. This change handles both cases for now, until the behavior is consolidated in the future.
MI200
Current socket power: N/A
Average socket power: 94

MI300
Current socket power: 717
Average socket power: N/A

Fixes https://ontrack-internal.amd.com/browse/SWDEV-472265
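The fallback described above (prefer the average reading, fall back to the current reading when the former is "N/A") can be sketched as a small helper. This is a minimal illustration, not the PR's actual diff: the helper name `pick_socket_power` is hypothetical, and it assumes amdsmi's power-info query returns a dict with `average_socket_power` and `current_socket_power` keys, where unavailable metrics are reported as the string "N/A".

```python
def pick_socket_power(power_info):
    """Return the average socket power if available, else the current reading.

    `power_info` is assumed to be a dict like the one returned by
    amdsmi's power-info query, e.g.:
      MI200: {"average_socket_power": 94,    "current_socket_power": "N/A"}
      MI300: {"average_socket_power": "N/A", "current_socket_power": 717}
    """
    avg = power_info.get("average_socket_power", "N/A")
    if avg != "N/A":
        return avg
    return power_info.get("current_socket_power", "N/A")


# MI200-style reading: only the average is populated
print(pick_socket_power({"average_socket_power": 94, "current_socket_power": "N/A"}))   # 94
# MI300-style reading: only the current value is populated
print(pick_socket_power({"average_socket_power": "N/A", "current_socket_power": 717}))  # 717
```

Dispatching on the returned values rather than on gcnArch keeps the caller working on both generations without a hardware lookup table.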

rraminen and others added 30 commits June 17, 2024 13:07
* changes to build Centos stream 9 images

* Added scripts for centos and centos stream images

* Added an extra line

* Add ninja installation

* Optimized code

* Fixes

* Add comment

* Optimized code

* Added AMDGPU mapping for ROCm 5.2 and invalid-url for rocm_baseurl

Co-authored-by: Jithun Nair <[email protected]>
- Rocblas API support is requested
- SWDEV-383635 & sub task - SWDEV-390218
* Add hip_basic tensorpipe support to PyTorch

* Enabling hip_basic for Tensorpipe for PyTorch

* removing upstream tensorpipe module

* Adding ROCm specific tensorpipe submodule

* tensorpipe submodule updated

* Update the hip invalid device string

* Added ignore for tensorpipe git submodule

* Moved include of tensorpipe_cuda.h to hipify

* Updates based on review comments

* Defining the variable __HIP_PLATFORM_AMD__

* Enabling the UTs

Co-authored-by: Ronak Malik <[email protected]>
- Fortran package installation moved after gcc
- Update libtinfo search code in cmake
- Install libstdc++.so
Reversed the condition as required
- Add missing common_utils.sh
- Update the install vision part
- Move to amdgpu rhel 9.3 builds
- Update to pick python from conda path
- Add a missing package
- Add ROCM_PATH and magma
- Updated repo radeon path
This also fixes a problem in gesvd driver when UV is not needed.
- build_environment was hard-coded to the upstream value when the
  branch was created, since the dev/QA env build_environment
  value can vary
* Fix the parsing of /etc/os-release

The old code parses OS_DISTRO as 'PRETTY_Ubuntu' on Ubuntu and thus
never links to libtinfo correctly.
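For illustration, the `PRETTY_Ubuntu` symptom suggests the old parser split on the wrong delimiter instead of treating each line as a `KEY=VALUE` pair. A minimal sketch of the correct parsing (helper name and sample text are illustrative, not the PR's actual code):

```python
def parse_os_release(text):
    """Parse /etc/os-release-style content into a dict of KEY -> value."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a KEY=VALUE pair
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the FIRST '=' only
        info[key] = value.strip().strip('"')
    return info


sample = 'NAME="Ubuntu"\nPRETTY_NAME="Ubuntu 22.04 LTS"\nID=ubuntu\n'
print(parse_os_release(sample)["ID"])           # ubuntu
print(parse_os_release(sample)["PRETTY_NAME"])  # Ubuntu 22.04 LTS
```

Splitting on the first `=` and stripping quotes yields the distro name cleanly, so downstream logic (such as the libtinfo link check) can match on it.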

* Configurable CMAKE_PREFIX_PATH in CI script.
- This is done as per QA request, needs to be reverted and
  not required to be cherry-picked into later releases.
* Moved NAVI check to the test file

* Revised NAVI check as a function
* Running a Triton kernel on ROCm reports only one GB/s metric

* Update test_kernel_benchmark.py
* Initial implementation of PyTorch ut parsing script

* Extracted path variables

* Use nested dict to save results

* Fixes typo

* Cleanup

* Fixes several issues

* Minor name change

* Update run_pytorch_unit_tests.py

* Added file banners

* Supported running from API

* Added more help info

* Consistent naming

* Format help text

---------

Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
alugorey and others added 23 commits June 20, 2024 15:56
- PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set in builder repo
- Remove the PYTORCH_EXTRA_INSTALL_REQUIREMENTS step from this file
- Causing regression - SWDEV-463083
* Fix SWDEV-459623. The Rank of logsumexp Tensor must be 3.

This tensor was intended for internal use only but is apparently exposed to UTs.

* Fix for mGPU.

The stream should be selected after picking the current device according
to the input tensor.
* Add formal FP8 check in common_cuda.py

* Enable inductor/test_valid_cast

* Support for test_eager_fallback

* allow fnuz types on amax test

* Finalize passing tests vs failing

* Fix fnuz constants in _to_fp8_saturated
* Enable batchnorm NHWC for MIOpen

* cleanup

* test to compare NHWC MIOpen batchnorm with CPU

* fix 'use_miopen' condition for nhwc miopen

* fix includes

* use native nhwc batchnorm to verify miopen

* remove extra spaces

* remove empty lines

* set PYTORCH_MIOPEN_SUGGEST_NHWC=1 for all test_nn.py test
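For context on what NHWC (channels-last) means in the batchnorm commits above, the difference between the two layouts is purely one of memory strides over the same logical (N, C, H, W) shape. A minimal, library-free sketch (helper name and shape are illustrative):

```python
def contiguous_strides(shape, order):
    """Compute element strides for a 4-D (N, C, H, W) shape laid out
    contiguously in the given memory order, e.g. "NCHW" or "NHWC"."""
    dims = {"N": 0, "C": 1, "H": 2, "W": 3}
    strides = [0] * 4
    step = 1
    # Walk memory order from innermost (fastest-varying) dim outward
    for d in reversed(order):
        i = dims[d]
        strides[i] = step
        step *= shape[i]
    return tuple(strides)


shape = (2, 3, 4, 5)  # N, C, H, W
print(contiguous_strides(shape, "NCHW"))  # (60, 20, 5, 1)
print(contiguous_strides(shape, "NHWC"))  # (60, 1, 15, 3)
```

In NHWC the channel stride is 1, which is the layout MIOpen's NHWC batchnorm path expects; the strides printed for "NHWC" match what PyTorch's channels_last memory format produces for the same shape.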
…1433)

* Print consolidated log file for pytorch uts

* Update run_entire_tests subprocess call as well

* lint

* Add ERROR string
* Initial commit to port intra_node_comm to ROCm

(cherry picked from commit 48d1c33)

* gpt-fast running now with intra-node comm

(cherry picked from commit 618c54e)

---------

Co-authored-by: Prachi Gupta <[email protected]>
IFU for rocm6.3_internal_testing
@jithunnair-amd jithunnair-amd changed the title [ROCm] [SWDEV-472265] Return correct AMDSMI socket_power metric [rocm6.3_internal_testing] [ROCm] [SWDEV-472265] Return correct AMDSMI socket_power metric Jul 10, 2024
@jithunnair-amd jithunnair-amd marked this pull request as draft July 10, 2024 15:16
@jithunnair-amd (Collaborator) commented:

For rocm6.3_internal_testing, I'd rather we first resolve the discussion with the amdsmi team to see if they can handle the situation on their end, so we can always refer to average_socket_power on the PyTorch side, since that's what the PyTorch API definition expects.

@pruthvistony pruthvistony force-pushed the rocm6.3_internal_testing branch 2 times, most recently from 9ae24a7 to 12b4a67 on August 12, 2024 05:22