rocm6.4 IFU CP 09122024 #1596

Closed
wants to merge 60 commits into from

Commits on Sep 13, 2024

  1. [SOW MS3] Centos stream9 PyTorch image support (#1090)

    * changes to build Centos stream 9 images
    
    * Added scripts for centos and centos stream images
    
    * Added an extra line
    
    * Add ninja installation
    
    * Optimized code
    
    * Fixes
    
    * Add comment
    
    * Optimized code
    
    * Added AMDGPU mapping for ROCm 5.2 and invalid-url for rocm_baseurl
    
    Co-authored-by: Jithun Nair <[email protected]>
    2 people authored and dnikolaev-amd committed Sep 13, 2024
    98cc4e1
  2. 05d6126
  3. Temporarily skip test_conv3d_64bit_indexing

    - Rocblas API support is requested
    - SWDEV-383635 & sub task - SWDEV-390218
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    dd31176
  4. Enable tensorpipe with hip_basic backend (#1135)

    * Add hip_basic tensorpipe support to PyTorch
    
    * Enabling hip_basic for Tensorpipe for pyTorch
    
    * removing upstream tensorpipe module
    
    * Adding ROCm specific tensorpipe submodule
    
    * tensorpipe submodule updated
    
    * Update the hip invalid device string
    
    * Added ignore for tensorpipe git submodule
    
    * Moved include of tensorpipe_cuda.h to hipify
    
    * Updates based on review comments
    
    * Defining the variable __HIP_PLATFORM_AMD__
    
    * Enabling the UTs
    
    Co-authored-by: Ronak Malik <[email protected]>
    2 people authored and dnikolaev-amd committed Sep 13, 2024
    e96aba2
  5. Updates to build on Jammy

    - Fortran package installation moved after gcc
    - Update libtinfo search code in cmake1
    - Install libstdc++.so
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    8fffb23
  6. cbd0b44
  7. 9923275
  8. b0697b9
  9. 4fd6e13
  10. f2de668
  11. e83f564
  12. b56588b
  13. Changes to support docker v23

    Reversed the condition as required
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    e59bfe3
  14. [CS9] Updates to CentOS stream 9 build (#1326)

    - Add missing common_utils.sh
    - Update the install vision part
    - Move to amdgpu rhel 9.3 builds
    - Update to pick python from conda path
    - Add a missing package
    - Add ROCM_PATH and magma
    - Updated repo radeon path
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    281e2bf
  15. Update to hipify mapping

    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    eea29cd
  16. Correcting usage of USE_ROCM

    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    e5067c2
  17. Enable gesvda for ROCM >= 6.1 (#1339)

    This also fixes a problem in gesvd driver when UV is not needed.
    xinyazhang authored and dnikolaev-amd committed Sep 13, 2024
    2be2a79
  18. Increase lifespan of test-times files

    - build_environment is hard-coded to the upstream value from when the
      branch was created, since the build_environment value in the dev/QA
      environment can vary
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    8f6c7af
  19. Fixes CI build script (#1350)

    * Fix the parsing of /etc/os-release
    
    The old code parses OS_DISTRO as 'PRETTY_Ubuntu' on Ubuntu and thus
    never links to libtinfo correctly.
    
    * Configurable CMAKE_PREFIX_PATH in CI script.
    xinyazhang authored and dnikolaev-amd committed Sep 13, 2024
    f1f2b4e
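The /etc/os-release fix described in commit 19 can be illustrated with a minimal Python sketch. It assumes the old failure came from matching `NAME` loosely enough to also hit the `PRETTY_NAME` line; the actual CI script is not reproduced here, and `parse_os_release` and `SAMPLE` are hypothetical names used only for illustration.

```python
def parse_os_release(text):
    """Parse KEY=value lines of an os-release style file into a dict."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


# Hypothetical sample content, shaped like an Ubuntu /etc/os-release.
SAMPLE = """PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
ID=ubuntu
"""

fields = parse_os_release(SAMPLE)
# Looking up the exact key keeps NAME and PRETTY_NAME apart.
print(fields["NAME"])  # prints Ubuntu
```

Keying on the whole field name, rather than on a substring match, is what keeps the distro name from being contaminated by the `PRETTY_` prefix.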
  20. [NO CP] Temporary dumping of test exec log to stderr

    - This is done per QA request; it needs to be reverted and
      does not need to be cherry-picked into later releases.
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    d98149c
  21. 8a4d1e2
  22. Converted NAVI check as a function (#1364)

    * Moved NAVI check to the test file
    
    * Revised NAVI check as a function
    BLOrange-AMD authored and dnikolaev-amd committed Sep 13, 2024
    5b77292
  23. 4c93554
  24. Remove ROCmloops specific test

    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    7da900e
  25. 4aba300
  26. 5580969
  27. Skip test_mm_triton_kernel_benchmark (#1376)

    * Running triton kernel on ROCM only has one GB/s metric reported
    
    * Update test_kernel_benchmark.py
    pragupta authored and dnikolaev-amd committed Sep 13, 2024
    183802e
  28. temporarily ignore certificate check for Miniconda

    (cherry picked from commit 9848db1)
    yanyao-wang authored and dnikolaev-amd committed Sep 13, 2024
    90c132a
  29. Implementation of PyTorch ut parsing script - QA helper function (#1386)

    * Initial implementation of PyTorch ut parsing script
    
    * Extracted path variables
    
    * Use nested dict to save results
    
    * Fixes typo
    
    * Cleanup
    
    * Fixes several issues
    
    * Minor name change
    
    * Update run_pytorch_unit_tests.py
    
    * Added file banners
    
    * Supported running from API
    
    * Added more help info
    
    * Consistent naming
    
    * Format help text
    
    ---------
    
    Co-authored-by: Jithun Nair <[email protected]>
    Co-authored-by: Jithun Nair <[email protected]>
    3 people authored and dnikolaev-amd committed Sep 13, 2024
    0d89328
  30. f47dca8
  31. PR #1255 to rocm6.2 release

    ramcherukuri authored and dnikolaev-amd committed Sep 13, 2024
    e27ff6e
  32. 0ae8e99
  33. ed694e4
  34. [release/2.1] Skip certificate check for CentOS7 since certificate expired (#1399)
    
    * Skip certificate check only for CentOS7 since certificate expired
    
    * Naming
    jithunnair-amd authored and dnikolaev-amd committed Sep 13, 2024
    6876373
  35. 7dade27
  36. 6e12b31
  37. Change Torch extra install requirement

    - PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set in builder repo
    - Remove the PYTORCH_EXTRA_INSTALL_REQUIREMENTS step from this file
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    d4d80ee
  38. Remove the installation of rocm-llvm-dev package

    - Causing regression - SWDEV-463083
    pruthvistony authored and dnikolaev-amd committed Sep 13, 2024
    e6ff669
  39. Fix SWDEV-459623 (#1428)

    * Fix SWDEV-459623. The Rank of logsumexp Tensor must be 3.
    
    This tensor was considered for internal use only but apparently exposed to UTs.
    
    * Fix for mGPU.
    
    The stream should be selected after picking the current device according
    to input tensor.
    xinyazhang authored and dnikolaev-amd committed Sep 13, 2024
    da0e1b4

Commits on Sep 16, 2024

  1. Enable fp8 inductor unit tests (#1421)

    * Add formal FP8 check in common_cuda.py
    
    * Enable inductor/test_valid_cast
    
    * Support for test_eager_fallback
    
    * allow fnuz types on amax test
    
    * Finalize passing tests vs failing
    
    * Fix fnuz constants in _to_fp8_saturated
    alugorey authored and dnikolaev-amd committed Sep 16, 2024
    4b8aea1
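The "fnuz constants" item in the FP8 commit above concerns saturating casts, where a value is clamped to the largest finite magnitude of the target type before conversion. The sketch below shows only that clamping step in plain Python, under the assumption that this is what the saturating helper does; the `FP8_MAX` table and the `saturate` function are hypothetical names, and no rounding to an actual 8-bit format is performed.

```python
# Largest finite magnitude of each FP8 format. The fnuz variant of e4m3
# has a smaller range (240) than the fn variant (448), so reusing the fn
# constant for a fnuz type would clamp to the wrong limit.
FP8_MAX = {
    "float8_e4m3fn": 448.0,
    "float8_e5m2": 57344.0,
    "float8_e4m3fnuz": 240.0,
    "float8_e5m2fnuz": 57344.0,
}


def saturate(value, dtype_name):
    """Clamp value into [-max, max] for the named FP8 format."""
    limit = FP8_MAX[dtype_name]
    return max(-limit, min(limit, value))


print(saturate(300.0, "float8_e4m3fn"))    # prints 300.0
print(saturate(300.0, "float8_e4m3fnuz"))  # prints 240.0
```

The same input stays in range for the fn type but saturates for the fnuz type, which is why the constants have to match the dtype under test.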
  2. Enable NHWC batchnorm for miopen (#1400)

    * Enable batchnorm NHWC for MIOpen
    
    * cleanup
    
    * test to compare NHWC MIOpen batchnorm with CPU
    
    * fix 'use_miopen' condition for nhwc miopen
    
    * fix includes
    
    * use native nhwc batchnorm to verify miopen
    
    * remove extra spaces
    
    * remove empty lines
    
    * set PYTORCH_MIOPEN_SUGGEST_NHWC=1 for all test_nn.py test
    dnikolaev-amd committed Sep 16, 2024
    4c94122
  3. 4c85c6c
  4. d10d2fa
  5. 93c7b7f
  6. Print consolidated log file for pytorch unit test automation scripts (#1433)
    
    * Print consolidated log file for pytorch uts
    
    * Update run_entire_tests subprocess call as well
    
    * lint
    
    * Add ERROR string
    jithunnair-amd authored and dnikolaev-amd committed Sep 16, 2024
    bf3a2cd
  7. [ROCm] Intra-node all reduce initial implementation (#1435)

    * Initial commit to port intra_node_comm to ROCm
    
    (cherry picked from commit 48d1c33)
    
    * gpt-fast running now with intra-node comm
    
    (cherry picked from commit 618c54e)
    
    ---------
    
    Co-authored-by: Prachi Gupta <[email protected]>
    2 people authored and dnikolaev-amd committed Sep 16, 2024
    a5641fa
  8. Sync updates from hipify_torch. (#1168)

    Co-authored-by: Jithun Nair <[email protected]>
    2 people authored and dnikolaev-amd committed Sep 16, 2024
    7f7d24b
  9. d3201b0
  10. [SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491)

    * Check that >1 GPUs are visible when running TEST_CONFIG=distributed
    
    * Add EXECUTION_TIME to file-level and aggregate statistics
    jithunnair-amd authored and dnikolaev-amd committed Sep 16, 2024
    a82ac7b
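The two enhancements listed in commit 10 (a GPU-count guard for TEST_CONFIG=distributed and an aggregate EXECUTION_TIME) can be sketched as follows. This is an illustration only: `check_distributed_gpus`, `aggregate_execution_time`, and the shape of the statistics dict are assumptions, not the helper scripts' actual interface, and the GPU count is passed in rather than queried from the device runtime.

```python
def check_distributed_gpus(test_config, visible_gpus):
    """Fail early when a distributed run cannot see more than one GPU."""
    if test_config == "distributed" and visible_gpus < 2:
        raise RuntimeError(
            "TEST_CONFIG=distributed needs more than one visible GPU, "
            f"found {visible_gpus}"
        )


def aggregate_execution_time(per_file_stats):
    """Sum the file-level EXECUTION_TIME values into one aggregate."""
    return sum(stats["EXECUTION_TIME"] for stats in per_file_stats.values())


# Hypothetical file-level statistics, in seconds.
stats = {
    "test_nn": {"EXECUTION_TIME": 12.5},
    "test_ops": {"EXECUTION_TIME": 7.5},
}
check_distributed_gpus("distributed", visible_gpus=2)  # passes silently
print(aggregate_execution_time(stats))  # prints 20.0
```

Running the guard before any test is launched is what lets a single-GPU machine fail immediately instead of partway through the distributed suite.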
  11. Added functions imports (#1521)

    Fixes
    inductor.test_torchinductor_dynamic_shapes::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda
    BLOrange-AMD authored and dnikolaev-amd committed Sep 16, 2024
    2ec0172
  12. PyTorch unit test helper scripts enhancements (#1517)

    * Fail earlier for distributed-on-1-GPU scenario
    * print cmd in consolidated log with prettier formatting
    * python->python3
    
    Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264
    
    ---------
    
    Co-authored-by: blorange-amd <[email protected]>
    2 people authored and dnikolaev-amd committed Sep 16, 2024
    8c1fa06
  13. e3ebe30
  14. [rocm6.3_internal_testing] pin sympy==1.12.1 and skip pytorch-nightly installation (#1557)
    
    This PR pins sympy==1.12.1 in the .ci/docker/requirements-ci.txt file.
    It also skips the pytorch-nightly installation in docker images.
    
    The pytorch-nightly installation is needed to prefetch the mobilenet_v2 and
    v3 models for some tests.
    It came from
    
    85bd6bc
    Models are downloaded on first use to the folder /root/.cache/torch/hub.
    However, the pytorch-nightly installation also overrides the
    .ci/docker/requirements-ci.txt settings and upgrades some Python
    packages (sympy from 1.12.0 to 1.13.0), which causes several
    'dynamic_shapes' tests to fail.
    Skipping the model prefetch affects these tests, which still run without
    errors (but **internet access required**):
    
    - python test/mobile/model_test/gen_test_model.py mobilenet_v2
    - python test/quantization/eager/test_numeric_suite_eager.py -k test_mobilenet_v3
    
    Issue ROCm/frameworks-internal#8772
    
    If issues arise, these models can also be prefetched after
    building PyTorch and before testing
    
    (cherry picked from commit b92b34d)
    
    Fixes #ISSUE_NUMBER
    dnikolaev-amd committed Sep 16, 2024
    5b9a211
  15. Add test_batchnorm_nhwc_miopen_cuda_float32 (#1561)

    New tests introduced for testing NHWC and NCHW batchnorm on MIOpen:
    
    - test_batchnorm_nhwc_miopen_cuda_float32
    - test_batchnorm_nchw_miopen_cuda_float32
    
    These tests verify weight and bias gradients, running_mean, and
    running_var.
    Other dtypes can be added later.
    
    How to run:
    `MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32`
    
    There is a difference in running_var for NHWC batchnorm fp32
    between MIOpen and native:
    ```
    MIOPEN_ENABLE_LOGGING_CMD=1 python -u test/test_nn.py -v -k test_batchnorm_nhwc_miopen_cuda_float32
    ...
    self.assertEqual(mod.running_var, ref_mod.running_var)
    AssertionError: Tensor-likes are not close!
    Mismatched elements: 8 / 8 (100.0%)
    Greatest absolute difference: 0.05455732345581055 at index (5,) (up to 1e-05 allowed)
    Greatest relative difference: 0.030772637575864792 at index (5,) (up to 1.3e-06 allowed)
    ```
    dnikolaev-amd committed Sep 16, 2024
    5783557
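The "up to 1e-05 allowed" and "up to 1.3e-06 allowed" figures in the assertion log above are the absolute and relative tolerances of the closeness check. A minimal sketch of that check in plain Python is given below, using the rule documented for `torch.testing.assert_close`, |actual - expected| <= atol + rtol * |expected|. The function name `is_close` and the expected value of 1.0 are assumptions for illustration; only the tolerances and the absolute difference are taken from the log.

```python
def is_close(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Elementwise closeness rule: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Greatest absolute difference reported in the log above.
reported_abs_diff = 0.05455732345581055

# With a hypothetical expected value of 1.0 the allowed deviation is
# about 1.13e-05, so a difference of roughly 5.5e-02 fails the check.
print(is_close(1.0 + reported_abs_diff, 1.0))  # prints False
print(is_close(1.000001, 1.0))                 # prints True
```

The reported difference is several thousand times larger than the allowance, which points to a genuine mismatch between the two code paths rather than ordinary floating-point rounding noise.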
  16. Imported skipIfRocm in certain test suites (#1577)

    Fixes SWDEV-472397
    BLOrange-AMD authored and dnikolaev-amd committed Sep 16, 2024
    115944d
  17. [SWDEV-473498] Pin sympy for >=python3.9 (#1576)

    Cherry pick pytorch#133235
    
    Fixes SWDEV-473498
    jataylo authored and dnikolaev-amd committed Sep 16, 2024
    ac86642
  18. 9833f2d
  19. rocm6.4 related_commits

    dnikolaev-amd committed Sep 16, 2024
    524bef2
  20. rocm6.4 test-times

    dnikolaev-amd committed Sep 16, 2024
    ccdc413

Commits on Sep 17, 2024

  1. ae08f9f