
[release/2.6] Cherry-picks from release/2.5 #1910

Merged
merged 21 commits into release/2.6 from release2.6_cherry_picks_from_release2.5 on Feb 20, 2025

Conversation


@jithunnair-amd jithunnair-amd commented Feb 18, 2025

Methodology

  • Use the following command to identify additional commits from release/2.5 that need to be cherry-picked to release/2.6 (a parsing sketch follows this list): git cherry rocm/release/2.6 rocm/release/2.5 7d26c2b35f0b6de21877ebba3603b1fd889d793f -v > git_cherry_from_2.5_to_2.6_limit_7d26c2b35f0b6de21877ebba3603b1fd889d793f
    where
    --- rocm/release/2.6 was at 88b9764
    --- rocm/release/2.5 was at 4b51542

git_cherry_from_2.5_to_2.6_limit_7d26c2b35f0b6de21877ebba3603b1fd889d793f.txt

  • Identify commits that haven't been upstreamed into release/2.6 and definitely need to be cherry-picked as-is, and mark them in this sheet

---- Commits that gave nontrivial conflicts during cherry-pick have been marked as "Need newer version"
---- Commits that were either consolidated or modified considerably have been marked as "Ignored; merged new version"
---- Commits that need a newer version committed (in case the commit doesn't apply cleanly) have been marked as "Ignored; will merge new version"
---- Commits we're not sure are really required have been marked as "Cherry-pick only if needed" (to be cherry-picked if build/test issues arise)

  • Also cherry-pick commits from upstream that customers have requested, e.g. the CK flash-attention backend PR.

  • Use "Rebase and Merge" option when merging PR to ensure individual commits show up in target branch

Testing

@jithunnair-amd jithunnair-amd marked this pull request as draft February 18, 2025 09:35

rocm-repo-management-api bot commented Feb 18, 2025

Jenkins build for 7cace34598895a7e65f75aa8b0707d8d0e0604ad commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during base docker image building:

#32 11.02 The following packages have unmet dependencies:
#32 11.09  rocm-dev : Depends: rocm-cmake (= 0.14.0.60302-66~22.04) but 5.0.0-1 is to be installed
#32 11.09             Depends: rocm-device-libs (= 1.0.0.60302-66~22.04) but 5.0.0-1 is to be installed
#32 11.09  rocm-utils : Depends: rocminfo (= 1.0.0.60302-66~22.04) but 5.0.0-1 is to be installed
#32 11.09               Depends: rocm-cmake (= 0.14.0.60302-66~22.04) but 5.0.0-1 is to be installed
#32 11.09 E: Unable to correct problems, you have held broken packages.
#32 ERROR: process "/bin/sh -c bash ./install_rocm.sh" did not complete successfully: exit code: 100
------
 > [stage-0 24/52] RUN bash ./install_rocm.sh:
11.02 distribution that some required packages have not yet been created
11.02 or been moved out of Incoming.


rocm-repo-management-api bot commented Feb 18, 2025

Jenkins build for c154590e311a4942bfb24963181238fcb38be55b commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during base docker image building:

#31 15.35 The following packages have unmet dependencies:
#31 15.43  rocm-dev : Depends: rocm-cmake (= 0.14.0.60302-66~22.04) but 5.0.0-1 is to be installed
#31 15.43             Depends: rocm-device-libs (= 1.0.0.60302-66~22.04) but 5.0.0-1 is to be installed
#31 15.43  rocm-utils : Depends: rocminfo (= 1.0.0.60302-66~22.04) but 5.0.0-1 is to be installed
#31 15.43               Depends: rocm-cmake (= 0.14.0.60302-66~22.04) but 5.0.0-1 is to be installed
#31 15.43 E: Unable to correct problems, you have held broken packages.
#31 ERROR: process "/bin/sh -c bash ./install_rocm.sh" did not complete successfully: exit code: 100
------
 > [stage-0 24/48] RUN bash ./install_rocm.sh:
15.35 distribution that some required packages have not yet been created
15.35 or been moved out of Incoming.

BLOrange-AMD and others added 16 commits February 19, 2025 07:19
(cherry picked from commit 743dfda)
(cherry picked from commit ba7c253)
This PR skips pytorch-nightly installation in docker images

Installation of pytorch-nightly is needed to prefetch mobilenet_v2 and v3 models for some tests. Came from 85bd6bc.

Models are downloaded on first use to the folder /root/.cache/torch/hub. But pytorch-nightly installation also overrides the .ci/docker/requirements-ci.txt settings and upgrades some Python packages (sympy from 1.12.0 to 1.13.0), which causes several 'dynamic_shapes' tests to fail. Skipping model prefetching affects these tests without any errors (but **internet access is required**):

- python test/mobile/model_test/gen_test_model.py mobilenet_v2
- python test/quantization/eager/test_numeric_suite_eager.py -k
test_mobilenet_v3

Issue ROCm/frameworks-internal#8772

Also, in case of any issues, these models can be prefetched after building PyTorch and before testing; a sketch follows below.

(cherry picked from commit b92b34d)
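
A minimal sketch (not the exact CI script) of prefetching the models after the PyTorch build and before testing. It assumes a torchvision hub recent enough to accept `weights="DEFAULT"`, needs internet access, and caches under /root/.cache/torch/hub when run as root:

    import torch

    # Download once so the tests above can run without hitting the network.
    for model_name in ("mobilenet_v2", "mobilenet_v3_small", "mobilenet_v3_large"):
        torch.hub.load("pytorch/vision", model_name, weights="DEFAULT")
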
(cherry picked from commit d5608f3)
(cherry picked from commit aaa3134)
This PR adds tlparse==0.3.7 to requirements-ci.txt.

Fixes dynamo/test_structured_trace.py.
Error message: FileNotFoundError: [Errno 2] No such file or directory:
'tlparse'

Fixes: https://ontrack-internal.amd.com/browse/SWDEV-480494
(cherry picked from commit 70fdaed)

(cherry picked from commit e0f6b99)
(cherry picked from commit 02220a5)
Improve performance for smaller shapes that use block radix sort by decreasing item_per_thread to 8.
This increases the thread block size, leading to higher occupancy.

Co-author: @amd-sushetty

---------

Co-authored-by: Pruthvi Madugundu <[email protected]>
(cherry picked from commit 1024f36)
(cherry picked from commit 08c0749)
…orm and disable BF16 batchnorm with MIOpen for ROCm less than 6.4

======================================================================================================================================

[release/2.6] Enable bf16 with fp32 weights for MIOpen batchnorm

This PR enables:
* using the MIOpen OCL_mix backend for bf16 batchnorm with fp32 weights (via torch autocast). This was required and tested for a customer workload using NCHW (which is the only memory_layout enabled).
* logging for MIOpen batchnorm using the `PYTORCH_MIOPEN_EXTRA_LOGGING` env var.

TODO in a separate PR: implement PyTorch unit tests for this bf16/fp16-inputs + fp32-weights case.

(cherry picked from commit abbfe77)
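
A minimal sketch of the enabled path, assuming a ROCm build of PyTorch: the conv emits bf16 under autocast, so the batchnorm sees bf16 inputs while its weights stay fp32 (NCHW, the only memory_layout enabled). Setting the `PYTORCH_MIOPEN_EXTRA_LOGGING` env var enables the new logging.

    import torch

    conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
    bn = torch.nn.BatchNorm2d(8).cuda()              # weights/bias remain fp32
    x = torch.randn(2, 3, 32, 32, device="cuda")     # NCHW
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y = bn(conv(x))  # bf16 input + fp32 weights -> the MIOpen OCL_mix case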

[release/2.5] Disable BF16 batchnorm with MIOpen for ROCm less than 6.4 (#1858)

PR to disable BF16 batchnorm on ROCm less than 6.4.
Fixes "Solver Not Found":
https://ontrack-internal.amd.com/browse/SWDEV-502652

(cherry picked from commit 60cb68e)
…_rcpf(x) instead of 1.f/x (#1800)

Cherry-pick of #1688

Co-authored-by: Michael Halkenhäuser <[email protected]>
Co-authored-by: Hashem Hashemi <[email protected]>
(cherry picked from commit f8544af)
This PR is a release/2.5-based version of
#1809

Copied description by @hj-wei from
#1809

> Hi all, I manually generated nvcc to bypass NVIDIA component checks (Megatron-LM); see
https://github.com/NVIDIA/Megatron-LM/blob/2da43ef4c1b9e76f03b7567360cf7390e877f1b6/megatron/legacy/fused_kernels/__init__.py#L57

> But this can lead to incorrect CUDA_HOME configurations, which can cause initialization anomalies in downstream libraries like DeepSpeed.

(cherry picked from commit e814ee8)
…m for 3D shapes (pytorch#143137) (#1843)

Cherry-pick of #1839

Co-authored-by: Jerry Mannil <[email protected]>
Co-authored-by: Doru Bercea <[email protected]>
(cherry picked from commit f929e0d)
…loat16 and Half. (#1844)

Cherry-pick of #1638

Co-authored-by: carlobertolli <[email protected]>
Co-authored-by: Jerry Mannil <[email protected]>
(cherry picked from commit 33911de)
…1847)

Navi passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that is required only for MI300.
Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.

(cherry picked from commit d4d0b07)
(cherry picked from commit ff48a82)
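
A hedged sketch of the fixed detection on a ROCm build of PyTorch (the 32 MB non-MI300 fallback below is illustrative, not from the PR):

    import torch

    gcn_arch = torch.cuda.get_device_properties(0).gcnArchName  # e.g. "gfx942:sramecc+:xnack-"
    is_mi300 = "gfx94" in gcn_arch   # matches MI300, not Navi
    default_workspace_size = (128 if is_mi300 else 32) * 1024 * 1024  # bytes
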
…h cudnn and missing miopen version.h

=========================================================================================================

[ROCm] miopen benchmark behavior now better aligns with cudnn (#1851)

The default benchmark setting is now false. With the new MIOpen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (the same behavior as the prior default), and that result is used. When benchmark is enabled, MIOpen performs an exhaustive search and updates any DBs. MIOpen immediate mode is still available and is used when deterministic is true and benchmark is false.

(cherry picked from commit 80f18e8)
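
On ROCm these modes are driven by the usual cudnn backend flags; a short sketch of the three behaviors described above:

    import torch

    torch.backends.cudnn.benchmark = False     # new default: quick search on a find miss
    torch.backends.cudnn.benchmark = True      # exhaustive search; updates the find DBs
    # Immediate mode: deterministic=True together with benchmark=False
    torch.backends.cudnn.deterministic = True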

missing miopen version.h (#1866)

follow up to #1851

(cherry picked from commit 47074cd)
…orch#144865) (#1869)

Fixes pytorch#144855

Follows the approach in pytorch#141923 of using int64 types to increase the INT_MAX limits.

Pull Request resolved: pytorch#144865
Approved by: https://github.com/eqy

(cherry picked from commit 082fab0)
(cherry picked from commit 5d01868)
Tune 3D tensor sums when not using the fastest dimension.

(cherry picked from commit 8b75274)
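
For reference, a tiny example of the tuned case, a 3D sum along a non-fastest (outer) dimension; the shape is illustrative:

    import torch

    x = torch.randn(64, 128, 256, device="cuda")
    y = x.sum(dim=0)  # reduces along the slowest-moving dimension
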
Fixes SWDEV-501618

(cherry picked from commit 8b59eea)
…pport

==========================================================================

Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (pytorch#137443)

We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.

Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.

Pull Request resolved: pytorch#137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <[email protected]>
(cherry picked from commit bc57635)

Bump AOTriton to 0.8.2b (#1853)

Fixes SWDEV-508774

(cherry picked from commit 4bed249)

Enable head_dim == 512 with AOTriton 0.8.1

(cherry picked from commit 6edd36f)

Add unit tests for head dimension 512

(cherry picked from commit 85290fa)
…nings

==========================================================================

[reland][attempt2][AMD] Turn on TF32 for aten::mm (pytorch#144145)

Summary:
pytorch#143549 was reverted due to some internal/OSS tooling issue. Relanding.

hipblaslt supports TF32, so adding the support.
Original PR: pytorch#139869

Test Plan: CI

Differential Revision: D67785496

Pull Request resolved: pytorch#144145
Approved by: https://github.com/jianyuh

(cherry picked from commit 3d3a079)
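
A minimal sketch of opting in to the TF32 matmul path (served by hipblaslt on ROCm):

    import torch

    torch.backends.cuda.matmul.allow_tf32 = True  # opt in; off by default
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b  # now eligible for TF32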

[AMD] De-noise tf32 warnings (pytorch#144797)

Summary: This is way too noisy, especially during unit tests. So just log once.

Test Plan: OSS CI. Tested on a unit test and now I only see one line (hard to notice :) ).

Differential Revision: D68167633

Pull Request resolved: pytorch#144797
Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu

(cherry picked from commit 6ba53a5)
@jithunnair-amd jithunnair-amd force-pushed the release2.6_cherry_picks_from_release2.5 branch from c154590 to f502a54 Compare February 19, 2025 07:31

rocm-repo-management-api bot commented Feb 19, 2025

Jenkins build for 889a5f50c0a26abbb90c012a7e48ff4968a3e1bb commit finished as FAILURE
Links: Blue Ocean view / Build artifacts



rocm-repo-management-api bot commented Feb 19, 2025

Jenkins build for 71958c1ccd4e46f4b007cbdeb0feb8a155e67fd5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@pruthvistony

Internal builds are good with this PR branch.

Thanks @jithunnair-amd


rocm-repo-management-api bot commented Feb 19, 2025

Jenkins build for d75edcc115356f6ab3769825f92c9e2b202e4929 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


jithunnair-amd and others added 5 commits February 20, 2025 05:55
Replace pytorch#138947 for re-import.

Replaces #1592

This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton being used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of @tridao's hard work; he is the co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: pytorch#143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
(cherry picked from commit 0a94bb4)
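
A minimal usage sketch, assuming PyTorch was built with USE_CK_FLASH_ATTENTION=1 (and USE_FLASH_ATTENTION enabled) on ROCm; the shapes are illustrative:

    import torch
    import torch.nn.functional as F

    # Force the CK backend; "aotriton" or "default" selects the incumbent path.
    torch.backends.cuda.preferred_rocm_fa_library("ck")

    q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    out = F.scaled_dot_product_attention(q, k, v)  # CK path when flash attention is selected
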
Reversed the condition as required

(cherry picked from commit 6efd9b4)
(cherry picked from commit d8002d0)
…4.04) and CentOS

=====================================================================================

Updates to build on Jammy

- Fortran package installation moved after gcc
- Update libtinfo search code in cmake
- Install libstdc++.so

(cherry picked from commit 6e39ade)
(cherry picked from commit f6ad627)

Updated condition for libstdc++ for Jammy

(cherry picked from commit f32cab4)
(cherry picked from commit bb7fd30)

Set ROCM_PATH ENV in Centos docker container

(cherry picked from commit b774eaa)
(cherry picked from commit da86387)

[release/2.5] Changes to support UB 24.04 build (#1816)

Fixes SWDEV-505665

Changes applied from #1816

Successful PyTorch build:
http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.5-ub24-py3.12-ci/17/

---------

Co-authored-by: pramenku <[email protected]>
Co-authored-by: Nichols A. Romero <[email protected]>
(cherry picked from commit f638998)
@jithunnair-amd jithunnair-amd force-pushed the release2.6_cherry_picks_from_release2.5 branch from 889a5f5 to 71958c1 Compare February 20, 2025 06:08
@rocm-repo-management-api

Jenkins build for 71958c1ccd4e46f4b007cbdeb0feb8a155e67fd5 commit is in progress
Links: Blue Ocean view / Build artifacts

@jithunnair-amd jithunnair-amd marked this pull request as ready for review February 20, 2025 13:13
@jithunnair-amd

@pruthvistony I think the internal CI gave us a good A/B analysis of why the patch in install_rocm.sh to pin the repo is needed:

echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
            | sudo tee /etc/apt/preferences.d/rocm-pin-600

Without this patch, ROCm installation failed: http://ml-ci-internal.amd.com:8080/blue/organizations/jenkins/pytorch%2Fpytorch-ci-pipeline/detail/PR-1910/2/pipeline/264/

------
 > [stage-0 24/48] RUN bash ./install_rocm.sh:
15.35 distribution that some required packages have not yet been created
15.35 or been moved out of Incoming.
15.35 The following information may help to resolve the situation:
15.35 
15.35 The following packages have unmet dependencies:
15.43  rocm-dev : Depends: rocm-cmake (= 0.14.0.60302-66~22.04) but 5.0.0-1 is to be installed
15.43             Depends: rocm-device-libs (= 1.0.0.60302-66~22.04) but 5.0.0-1 is to be installed
15.43  rocm-utils : Depends: rocminfo (= 1.0.0.60302-66~22.04) but 5.0.0-1 is to be installed
15.43               Depends: rocm-cmake (= 0.14.0.60302-66~22.04) but 5.0.0-1 is to be installed
15.43 E: Unable to correct problems, you have held broken packages.
------

With this patch, ROCm installation succeeded: http://ml-ci-internal.amd.com:8080/blue/organizations/jenkins/pytorch%2Fpytorch-ci-pipeline/detail/PR-1910/3/pipeline/264/

#32 80.04 Get:131 http://repo.radeon.com/rocm/apt/6.3.2 jammy/main amd64 rocm-cmake amd64 0.14.0.60302-66~22.04 [24.7 kB]
#32 75.13 Get:121 http://repo.radeon.com/rocm/apt/6.3.2 jammy/main amd64 rocm-device-libs amd64 1.0.0.60302-66~22.04 [720 kB]
#32 23.39 Get:58 http://repo.radeon.com/rocm/apt/6.3.2 jammy/main amd64 rocminfo amd64 1.0.0.60302-66~22.04 [27.5 kB]

We still don't know why this patch doesn't seem to be required upstream, but we should upstream it anyway for consistency with our ROCm fork.

@jithunnair-amd jithunnair-amd merged commit 70f3007 into release/2.6 Feb 20, 2025
3 of 6 checks passed
@jithunnair-amd jithunnair-amd deleted the release2.6_cherry_picks_from_release2.5 branch February 20, 2025 13:38
@johnnynunez

@jithunnair-amd Just out of curiosity, do you know if flex attention is working well on RDNA3?
