rocm6.4 IFU CP 09122024 #1596

dnikolaev-amd · 2024-09-17T19:45:16Z

rocm6.4_internal_testing

* changes to build Centos stream 9 images
* Added scripts for centos and centos stream images
* Added an extra line
* Add ninja installation
* Optimized code
* Fixes
* Add comment
* Optimized code
* Added AMDGPU mapping for ROCm 5.2 and invalid-url for rocm_baseurl

Co-authored-by: Jithun Nair <[email protected]>

- Rocblas API support is requested - SWDEV-383635 & sub task - SWDEV-390218

* Add hip_basic tensorpipe support to PyTorch
* Enabling hip_basic for Tensorpipe for PyTorch
* removing upstream tensorpipe module
* Adding ROCm specific tensorpipe submodule
* tensorpipe submodule updated
* Update the hip invalid device string
* Added ignore for tensorpipe git submodule
* Moved include of tensorpipe_cuda.h to hipify
* Updates based on review comments
* Defining the variable __HIP_PLATFORM_AMD__
* Enabling the UTs

Co-authored-by: Ronak Malik <[email protected]>

- Fortran package installation moved after gcc
- Update libtinfo search code in cmake1
- Install libstdc++.so

To resolve https://ontrack-internal.amd.com/browse/SWDEV-403530 and https://ontrack-internal.amd.com/browse/SWDEV-419837. For more context check upstream issue pytorch#111834

Reversed the condition as required

- Add missing common_utils.sh
- Update the install vision part
- Move to amdgpu rhel 9.3 builds
- Update to pick python from conda path
- Add a missing package
- Add ROCM_PATH and magma
- Updated repo radeon path

This also fixes a problem in gesvd driver when UV is not needed.

- build_environment is hard-coded to the value from upstream at the time the branch was created, since the dev/QA ENV build_environment value can vary

* Fix the parsing of /etc/os-release. The old code parses OS_DISTRO as 'PRETTY_Ubuntu' on Ubuntu and thus never links to libtinfo correctly.
* Configurable CMAKE_PREFIX_PATH in CI script.
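The os-release fix can be illustrated with a minimal Python sketch. It assumes the bug came from a loose match on `NAME` that also picked up the `PRETTY_NAME` line; the actual CI script is not shown in this PR page, so `parse_os_release` and the sample text are hypothetical.

```python
def parse_os_release(text):
    """Parse os-release style KEY=VALUE text into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip("\"'")
    return fields


# Hypothetical sample content, not copied from a real image.
SAMPLE = '''PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
ID=ubuntu
'''

info = parse_os_release(SAMPLE)
os_distro = info["NAME"]  # exact key lookup, so PRETTY_NAME cannot be picked up
print(os_distro)
```

Looking the key up exactly, rather than pattern-matching on `NAME`, is what keeps `PRETTY_NAME` from leaking into the distro string.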

- This is done as per QA request. It needs to be reverted and does not need to be cherry-picked into later releases.

* Moved NAVI check to the test file
* Revised NAVI check as a function

* Running triton kernel on ROCM only has one GB/s metric reported
* Update test_kernel_benchmark.py

(cherry picked from commit 9848db1)

* Initial implementation of PyTorch ut parsing script
* Extracted path variables
* Use nested dict to save results
* Fixes typo
* Cleanup
* Fixes several issues
* Minor name change
* Update run_pytorch_unit_tests.py
* Added file banners
* Supported running from API
* Added more help info
* Consistent naming
* Format help text

---------

Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>

…pired (#1399)

* Skip certificate check only for CentOS7 since certificate expired
* Naming

- PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set in builder repo
- Remove the PYTORCH_EXTRA_INSTALL_REQUIREMENTS step from this file

- Causing regression - SWDEV-463083

* Fix SWDEV-459623. The Rank of logsumexp Tensor must be 3. This tensor was considered for internal use only but apparently exposed to UTs.
* Fix for mGPU. The stream should be selected after picking the current device according to input tensor.

* Add formal FP8 check in common_cuda.py
* Enable inductor/test_valid_cast
* Support for test_eager_fallback
* allow fnuz types on amax test
* Finalize passing tests vs failing
* Fix fnuz constants in _to_fp8_saturated

* Enable batchnorm NHWC for MIOpen
* cleanup
* test to compare NHWC MIOpen batchnorm with CPU
* fix 'use_miopen' condition for nhwc miopen
* fix includes
* use native nhwc batchnorm to verify miopen
* remove extra spaces
* remove empty lines
* set PYTORCH_MIOPEN_SUGGEST_NHWC=1 for all test_nn.py test

…1433)

* Print consolidated log file for pytorch uts
* Update run_entire_tests subprocess call as well
* lint
* Add ERROR string

* Initial commit to port intra_node_comm to ROCm (cherry picked from commit 48d1c33)
* gpt-fast running now with intra-node comm (cherry picked from commit 618c54e)

---------

Co-authored-by: Prachi Gupta <[email protected]>

Co-authored-by: Jithun Nair <[email protected]>

* Check that >1 GPUs are visible when running TEST_CONFIG=distributed
* Add EXECUTION_TIME to file-level and aggregate statistics
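A minimal sketch of the two helper-script enhancements, assuming the GPU count and per-file timings are already available as plain Python values. The function names `check_gpu_count` and `aggregate_execution_time`, and the sample timings, are hypothetical and not taken from the actual helper scripts.

```python
def check_gpu_count(test_config, visible_gpus):
    """Fail early when a distributed run sees fewer than two GPUs."""
    if test_config == "distributed" and visible_gpus < 2:
        raise RuntimeError(
            f"TEST_CONFIG=distributed needs more than 1 visible GPU, found {visible_gpus}"
        )


def aggregate_execution_time(file_stats):
    """Sum per-file EXECUTION_TIME values into one aggregate figure."""
    return sum(stats["EXECUTION_TIME"] for stats in file_stats.values())


# Hypothetical per-file results, in seconds.
file_stats = {
    "test_nn": {"EXECUTION_TIME": 120.5},
    "test_ops": {"EXECUTION_TIME": 300.0},
}

check_gpu_count("default", visible_gpus=1)      # non-distributed runs are not restricted
check_gpu_count("distributed", visible_gpus=2)  # passes: more than one GPU visible
print(aggregate_execution_time(file_stats))
```

Raising before any test is launched avoids spending a full run on a configuration that cannot produce meaningful distributed results.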

Fixes inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda

* Fail earlier for distributed-on-1-GPU scenario
* print cmd in consolidated log with prettier formatting
* python->python3

Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264

---------

Co-authored-by: blorange-amd <[email protected]>

Relates to https://ontrack-internal.amd.com/browse/SWDEV-461590

… installation (#1557)

This PR pins sympy==1.12.1 in the .ci/docker/requirements-ci.txt file. It also skips the pytorch-nightly installation in docker images.

Installing pytorch-nightly is needed to prefetch the mobilenet_v2 and v3 models for some tests (this came from 85bd6bc). Models are downloaded on first use to the folder /root/.cache/torch/hub.

However, the pytorch-nightly installation also overrides the .ci/docker/requirements-ci.txt settings and upgrades some Python packages (sympy from 1.12.0 to 1.13.0), which causes several 'dynamic_shapes' tests to fail.

Skipping the model prefetch affects these tests without causing any errors (but **internet access required**):

- python test/mobile/model_test/gen_test_model.py mobilenet_v2
- python test/quantization/eager/test_numeric_suite_eager.py -k test_mobilenet_v3

Issue ROCm/frameworks-internal#8772

If issues arise, these models can also be prefetched after building pytorch and before testing.

(cherry picked from commit b92b34d)

Fixes #ISSUE_NUMBER
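One way to guard against the pin being silently overridden is to read it back from the requirements text. The sketch below assumes a simple `name==version` line format; `pinned_version` and the sample requirements text are hypothetical and do not reproduce the real requirements-ci.txt.

```python
import re


def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for a package, or None if not pinned."""
    pattern = re.compile(
        r"^\s*" + re.escape(package) + r"\s*==\s*([^\s;#]+)", re.IGNORECASE
    )
    for line in requirements_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


# Hypothetical requirements content for illustration only.
SAMPLE_REQUIREMENTS = """\
# CI requirements (sample)
sympy==1.12.1
somepackage>=2.0
"""

print(pinned_version(SAMPLE_REQUIREMENTS, "sympy"))
```

A check like this could run after image build to confirm that a later install step has not replaced the pinned version.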

New tests introduced for testing NHWC and NCHW batchnorm on MIOpen:

- test_batchnorm_nhwc_miopen_cuda_float32
- test_batchnorm_nchw_miopen_cuda_float32

These tests verify weight and bias gradients, running_mean and running_var. We can add other dtypes later.

How to run: `MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32`

There is a difference in running_variance for NHWC batchnorm fp32 between MIOpen and native:

```
MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32
...
self.assertEqual(mod.running_var, ref_mod.running_var)
AssertionError: Tensor-likes are not close!

Mismatched elements: 8 / 8 (100.0%)
Greatest absolute difference: 0.05455732345581055 at index (5,) (up to 1e-05 allowed)
Greatest relative difference: 0.030772637575864792 at index (5,) (up to 1.3e-06 allowed)
```
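The "allowed" values in the log (1e-05 absolute, 1.3e-06 relative) match the default float32 tolerances of `torch.testing.assert_close`, which accepts a pair when `|actual - expected| <= atol + rtol * |expected|`. A pure-Python sketch of that criterion follows; the reference value 1.0 is hypothetical, and only the absolute difference is taken from the log.

```python
def is_close(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Closeness criterion: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical reference value; the offset is the greatest absolute
# difference reported in the log above.
expected = 1.0
actual = expected + 0.05455732345581055

print(is_close(actual, expected))          # the reported gap is far outside tolerance
print(is_close(expected + 1e-6, expected)) # a gap of 1e-6 is within tolerance
```

With a gap of roughly 5e-2 against a budget near 1e-5, the mismatch is several orders of magnitude beyond rounding noise.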

Fixes SWDEV-472397

Cherry pick pytorch#133235 Fixes SWDEV-473498

Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071

rraminen and others added 30 commits September 13, 2024 10:31

Updated to latest conda for CentOS stream 9

05d6126

Temporarily skip test_conv3d_64bit_indexing

dd31176

- Rocblas API support is requested - SWDEV-383635 & sub task - SWDEV-390218

Updates to build on Jammy

8fffb23

- Fortran package installation moved after gcc
- Update libtinfo search code in cmake1
- Install libstdc++.so

Fix lstsq related regressions (part of SWDEV-392820)

cbd0b44

[UB22.04] Updates to support latest scipy

9923275

Build required version of libpng for CentOS7

b0697b9

Update tensorpipe submodule to support ROCm 6.0

4fd6e13

Set ROCM_PATH in env for centOS docker container

f2de668

Updated condition for libstdc++ for Jammy

e83f564

Skip ddp apply_optim_in_bwd tests for gloo (#1302)

b56588b

To resolve https://ontrack-internal.amd.com/browse/SWDEV-403530 and https://ontrack-internal.amd.com/browse/SWDEV-419837. For more context check upstream issue pytorch#111834

Changes to support docker v23

e59bfe3

Reversed the condition as required

[CS9] Updates to CentOS stream 9 build (#1326)

281e2bf

- Add missing common_utils.sh
- Update the install vision part
- Move to amdgpu rhel 9.3 builds
- Update to pick python from conda path
- Add a missing package
- Add ROCM_PATH and magma
- Updated repo radeon path

Update to hipify mapping

eea29cd

Correcting usage of USE_ROCM

e5067c2

Enable gesvda for ROCM >= 6.1 (#1339)

2be2a79

This also fixes a problem in gesvd driver when UV is not needed.

Increase lifespan of test-times files

8f6c7af

- build_environment is hard-coded to the value from upstream at the time the branch was created, since the dev/QA ENV build_environment value can vary

Fixes CI build script (#1350)

f1f2b4e

* Fix the parsing of /etc/os-release. The old code parses OS_DISTRO as 'PRETTY_Ubuntu' on Ubuntu and thus never links to libtinfo correctly.
* Configurable CMAKE_PREFIX_PATH in CI script.

[NO CP] Temporary dumping of test exec log to stderr

d98149c

- This is done as per QA request. It needs to be reverted and does not need to be cherry-picked into later releases.

Add skipIfRocmArch decorator for Navi skips (#1356)

8a4d1e2

Converted NAVI check as a function (#1364)

5b77292

* Moved NAVI check to the test file
* Revised NAVI check as a function

Triton build conditionalized on ROCM_VERSION

4c93554

Remove ROCmloops specific test

7da900e

Bad import in test_torchinductor and skip torchvision related UT (#1374)

4aba300

skip test_inductor_freezing failing UTs (#1375)

5580969

Skip test_mm_triton_kernel_benchmark (#1376)

183802e

* Running triton kernel on ROCM only has one GB/s metric reported
* Update test_kernel_benchmark.py

temporarily ignore certificate check for Miniconda

90c132a

(cherry picked from commit 9848db1)

[HIP] Returned error string update

f47dca8

alugorey and others added 28 commits September 13, 2024 14:22

Reformat test_float8_basics for current rocm support (#1415)

0ae8e99

Enable e5m2 x e4m3 test in test_float8_scale (#1419)

ed694e4

[release/2.1] Skip certificate check for CentOS7 since certificate ex…

6876373

…pired (#1399)

* Skip certificate check only for CentOS7 since certificate expired
* Naming

skip vmapvjpvjp_linalg_householder_product_cuda_float32 (#1420)

7dade27

Include the ROCm version in triton version

6e12b31

Change Torch extra install requirement

d4d80ee

- PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set in builder repo
- Remove the PYTORCH_EXTRA_INSTALL_REQUIREMENTS step from this file

Remove the installation of rocm-llvm-dev package

e6ff669

- Causing regression - SWDEV-463083

Fix SWDEV-459623 (#1428)

da0e1b4

* Fix SWDEV-459623. The Rank of logsumexp Tensor must be 3. This tensor was considered for internal use only but apparently exposed to UTs.
* Fix for mGPU. The stream should be selected after picking the current device according to input tensor.

Enable fp8 inductor unit tests (#1421)

4b8aea1

* Add formal FP8 check in common_cuda.py
* Enable inductor/test_valid_cast
* Support for test_eager_fallback
* allow fnuz types on amax test
* Finalize passing tests vs failing
* Fix fnuz constants in _to_fp8_saturated

[HIP] Few more updates to the returned error string

4c85c6c

skipIfRocm needs msg parameter

d10d2fa

[NO CP] Updated changes to skip few UTs

93c7b7f

Print consolidated log file for pytorch unit test automation scripts (#…

bf3a2cd

…1433)

* Print consolidated log file for pytorch uts
* Update run_entire_tests subprocess call as well
* lint
* Add ERROR string

Sync updates from hipify_torch. (#1168)

7f7d24b

Co-authored-by: Jithun Nair <[email protected]>

fix install_centos() function

d3201b0

[SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491)

a82ac7b

* Check that >1 GPUs are visible when running TEST_CONFIG=distributed
* Add EXECUTION_TIME to file-level and aggregate statistics

Added functions imports (#1521)

2ec0172

Fixes inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda

[Navi] [Inductor] Unskip Navi inductor UTs (#1514)

e3ebe30

Relates to https://ontrack-internal.amd.com/browse/SWDEV-461590

Imported skipIfRocm in certain test suites (#1577)

115944d

Fixes SWDEV-472397

[SWDEV-473498] Pin sympy for >=python3.9 (#1576)

ac86642

Cherry pick pytorch#133235 Fixes SWDEV-473498

Several issues fix of QA helper script (#1564)

9833f2d

Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071

rocm6.4 related_commits

524bef2

rocm6.4 test-times

ccdc413

dnikolaev-amd requested a review from pruthvistony September 17, 2024 19:45

fix sympy version in requirements-ci.txt

ae08f9f
