[release/2.6] Cherry-picks from release/2.5 #1910
Conversation
Jenkins build for 7cace34598895a7e65f75aa8b0707d8d0e0604ad commit finished as FAILURE. Detected error during base docker image building.
Jenkins build for c154590e311a4942bfb24963181238fcb38be55b commit finished as FAILURE. Detected error during base docker image building.
This PR skips pytorch-nightly installation in docker images. Installation of pytorch-nightly was needed to prefetch the mobilenet_v2 and v3 models for some tests (came from 85bd6bc); the models are downloaded on first use to the folder /root/.cache/torch/hub. However, pytorch-nightly installation also overrides the .ci/docker/requirements-ci.txt settings and upgrades some Python packages (sympy from 1.12.0 to 1.13.0), which causes several 'dynamic_shapes' tests to fail. Skipping the model prefetch leaves these tests passing, but **internet access is required**:
- python test/mobile/model_test/gen_test_model.py mobilenet_v2
- python test/quantization/eager/test_numeric_suite_eager.py -k test_mobilenet_v3

Issue ROCm/frameworks-internal#8772

Also, in case of issues, these models can be prefetched after the PyTorch build and before testing, as sketched below. (cherry picked from commit b92b34d) (cherry picked from commit d5608f3) (cherry picked from commit aaa3134)
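A minimal prefetch sketch (not the CI's actual code), assuming the torchvision hub entry points; "mobilenet_v3_large" is an assumed variant name, and internet access is required:

```python
# Prefetch the models so the tests above find them in /root/.cache/torch/hub.
import torch

for name in ("mobilenet_v2", "mobilenet_v3_large"):  # v3 variant name is assumed
    torch.hub.load("pytorch/vision", name, pretrained=True)
```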
This PR adds tlparse==0.3.7 to requirements-ci.txt to fix dynamo/test_structured_trace.py. Error message: FileNotFoundError: [Errno 2] No such file or directory: 'tlparse'. Fixes: https://ontrack-internal.amd.com/browse/SWDEV-480494 (cherry picked from commit 70fdaed) (cherry picked from commit e0f6b99) (cherry picked from commit 02220a5)
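The fix amounts to a single pin in `.ci/docker/requirements-ci.txt`:

```
tlparse==0.3.7
```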
Improve performance for smaller shapes that use block radix sort by decreasing item_per_thread to 8. This increases the thread block size, leading to higher occupancy; see the arithmetic sketch below. Co-author: @amd-sushetty --------- Co-authored-by: Pruthvi Madugundu <[email protected]> (cherry picked from commit 1024f36) (cherry picked from commit 08c0749)
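Illustrative arithmetic only (the tile size is an assumed example, not taken from the PR): for a fixed number of elements per block, fewer items per thread means more threads per block, hence higher occupancy for small shapes.

```python
# For a fixed tile, block size is inversely proportional to items per thread.
tile_elems = 2048  # assumed tile size, for illustration
for items_per_thread in (16, 8):
    block_threads = tile_elems // items_per_thread
    print(f"items_per_thread={items_per_thread} -> block size {block_threads}")
# items_per_thread=16 -> block size 128
# items_per_thread=8 -> block size 256
```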
[release/2.6] Enable bf16 with fp32 weights for MIOpen batchnorm and disable BF16 batchnorm with MIOpen for ROCm less than 6.4

This PR enables:
* using the MIOpen OCL_mix backend for bf16 batchnorm with fp32 weights (using torch autocast). This was required and tested for a customer workload using NCHW (which is the only memory_layout enabled).
* logging for MIOpen batchnorm using the `PYTORCH_MIOPEN_EXTRA_LOGGING` env var.

TODO in a separate PR: implement PyTorch unit tests for this bf16/fp16 inputs + fp32 weights case. (cherry picked from commit abbfe77)

[release/2.5] Disable BF16 batchnorm with MIOpen for ROCm less than 6.4 (#1858): disables BF16 batchnorm on ROCm less than 6.4. Fixes "Solver Not Found" https://ontrack-internal.amd.com/browse/SWDEV-502652 (cherry picked from commit 60cb68e)
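A minimal sketch of the enabled path, assuming a ROCm build with MIOpen (bf16 activations, fp32 batchnorm weights, NCHW layout):

```python
import os
import torch
import torch.nn as nn

os.environ["PYTORCH_MIOPEN_EXTRA_LOGGING"] = "1"  # extra MIOpen batchnorm logging

bn = nn.BatchNorm2d(64).cuda()                 # weights remain fp32
x = torch.randn(8, 64, 32, 32, device="cuda")  # NCHW, the only enabled layout
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = bn(x)  # bf16 input with fp32 weights, handled by MIOpen OCL_mix
```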
…_rcpf(x) instead of 1.f/x (#1800) Cherry-pick of #1688 Co-authored-by: Michael Halkenhäuser <[email protected]> Co-authored-by: Hashem Hashemi <[email protected]> (cherry picked from commit f8544af)
This PR is a release/2.5-based version of #1809. Copied description by @hj-wei from #1809:
> Hi all, I manually generate nvcc to bypass NVIDIA component checks (Megatron-LM), see https://github.com/NVIDIA/Megatron-LM/blob/2da43ef4c1b9e76f03b7567360cf7390e877f1b6/megatron/legacy/fused_kernels/__init__.py#L57, but it can lead to incorrect CUDA_HOME configurations. This can cause initialization anomalies in downstream libraries like DeepSpeed.

(cherry picked from commit e814ee8)
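A hypothetical illustration (not any library's actual code) of why a hand-made nvcc stub misleads downstream tools: they commonly infer CUDA_HOME from whichever nvcc is first on PATH.

```python
import os
import shutil

cuda_home = os.environ.get("CUDA_HOME")
if cuda_home is None:
    nvcc = shutil.which("nvcc")  # may find the stub generated to bypass the check
    if nvcc is not None:
        # <prefix>/bin/nvcc -> <prefix>; wrong if nvcc is a stand-alone stub
        cuda_home = os.path.dirname(os.path.dirname(nvcc))
print("inferred CUDA_HOME:", cuda_home)
```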
…m for 3D shapes (pytorch#143137) (#1843) Cherry-pick of #1839 Co-authored-by: Jerry Mannil <[email protected]> Co-authored-by: Doru Bercea <[email protected]> (cherry picked from commit f929e0d)
…loat16 and Half. (#1844) Cherry-pick of #1638 Co-authored-by: carlobertolli <[email protected]> Co-authored-by: Jerry Mannil <[email protected]> (cherry picked from commit 33911de)
…1847) Navi passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and thus uses `default_workspace_size=128MB`, but that size is required only for MI300. Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300. (cherry picked from commit d4d0b07) (cherry picked from commit ff48a82)
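A sketch of the corrected check, assuming a ROCm build where the arch name is available via device properties (the non-MI300 default below is an assumed value):

```python
import torch

gcn_arch = torch.cuda.get_device_properties(0).gcnArchName
if "gfx94" in gcn_arch:                         # MI300 family only
    default_workspace_size = 128 * 1024 * 1024  # 128MB
else:
    default_workspace_size = 32 * 1024 * 1024   # assumed smaller default
```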
[ROCm] miopen benchmark behavior now better aligns with cudnn (#1851)

The default benchmark setting is now false. Under the new miopen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (the same behavior as the prior default), and that result is used. When benchmark is enabled, an exhaustive search is performed and any DBs are updated. miopen immediate mode is still available and is used when deterministic is true and benchmark is false. (cherry picked from commit 80f18e8)

Missing miopen version.h (#1866): follow-up to #1851. (cherry picked from commit 47074cd)
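The knobs involved, which on ROCm steer MIOpen through the cudnn-named flags (the values below merely restate the behaviors described above):

```python
import torch

torch.backends.cudnn.benchmark = False     # new default: quick search on a find miss
# torch.backends.cudnn.benchmark = True    # exhaustive search; updates the find DBs
torch.backends.cudnn.deterministic = True  # with benchmark=False, immediate mode is used
```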
…orch#144865) (#1869) Fixes pytorch#144855. Follows the approach in pytorch#141923, using int64 types to increase the INT_MAX limits. Pull Request resolved: pytorch#144865 Approved by: https://github.com/eqy (cherry picked from commit 082fab0) (cherry picked from commit 5d01868)
Tune 3D tensor sums when not using the fastest dimension. (cherry picked from commit 8b75274)
Fixes SWDEV-501618 (cherry picked from commit 8b59eea)
Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (pytorch#137443)

We no longer need `install_aotriton.sh` and `aotriton_version.txt`, since `aotriton.cmake` now installs the best binary release package as the default option when building PyTorch. This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible when building PyTorch *outside* a CI docker image.

With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var. Also, with the changes in this PR, the cmake build process handles copying the aotriton .so and images directory from `torch/lib` to the installation path.

Pull Request resolved: pytorch#137443 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily Co-authored-by: Jithun Nair <[email protected]> (cherry picked from commit bc57635)

Bump AOTriton to 0.8.2b (#1853): Fixes SWDEV-508774 (cherry picked from commit 4bed249)
Enable head_dim == 512 with AOTriton 0.8.1 (cherry picked from commit 6edd36f)
Add unit tests for head dimension 512 (cherry picked from commit 85290fa)
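A sketch of the three resulting build modes (the env vars are named in the PR; the build command and the install prefix are assumptions):

```bash
# 1) Default: aotriton.cmake auto-detects and installs a compatible binary release
python setup.py develop

# 2) Pick up a pre-installed AOTriton from the environment
AOTRITON_INSTALLED_PREFIX=/opt/aotriton python setup.py develop

# 3) Force building AOTriton from source
AOTRITON_INSTALL_FROM_SOURCE=1 python setup.py develop
```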
[reland][attempt2][AMD] Turn on TF32 for aten::mm (pytorch#144145)

Summary: pytorch#143549 was reverted due to an internal/oss tooling issue; relanding. hipblaslt supports TF32, so adding the support. Original PR: pytorch#139869. Test Plan: CI. Differential Revision: D67785496. Pull Request resolved: pytorch#144145 Approved by: https://github.com/jianyuh (cherry picked from commit 3d3a079)

[AMD] De-noise tf32 warnings (pytorch#144797)

Summary: The warning is way too noisy, especially during unit tests, so just log once. Test Plan: OSS CI. Tested on a unit test; now only one line appears (hard to notice :) ). Differential Revision: D68167633. Pull Request resolved: pytorch#144797 Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu (cherry picked from commit 6ba53a5)
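Minimal sketch, assuming a ROCm build with hipblaslt, of opting into the TF32 matmul path described above:

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 for aten::mm
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # may execute as TF32; the warning now logs only once per process
```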
(force-pushed from c154590 to f502a54)
Jenkins build for 889a5f50c0a26abbb90c012a7e48ff4968a3e1bb commit finished as FAILURE
Jenkins build for 71958c1ccd4e46f4b007cbdeb0feb8a155e67fd5 commit finished as FAILURE
Internal builds are good with this PR branch. Thanks @jithunnair-amd
Jenkins build for d75edcc115356f6ab3769825f92c9e2b202e4929 commit finished as FAILURE
Replace pytorch#138947 for re-import. Replaces #1592.

This PR contains the initial implementation of SDPA with the composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option results in aotriton being used as the backend.

In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc.). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work; he is the co-author. NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when building PyTorch.

Pull Request resolved: pytorch#143695 Approved by: https://github.com/malfet Co-authored-by: Andy Lugo <[email protected]> Co-authored-by: Jithun Nair <[email protected]> (cherry picked from commit 0a94bb4)
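A sketch of the runtime selection described above, assuming a build with USE_CK_FLASH_ATTENTION=1 (tensor shapes are arbitrary examples):

```python
import torch
import torch.nn.functional as F

torch.backends.cuda.preferred_rocm_fa_library("ck")  # or "aotriton" / "default"

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)  # flash path now runs through CK
```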
Updates to build on Jammy (Ubuntu 24.04) and CentOS:
- Fortran package installation moved after gcc
- Updated the libtinfo search code in cmake
- Install libstdc++.so
(cherry picked from commit 6e39ade) (cherry picked from commit f6ad627)
- Updated the libstdc++ condition for Jammy (cherry picked from commit f32cab4) (cherry picked from commit bb7fd30)
- Set the ROCM_PATH env var in the CentOS docker container (cherry picked from commit b774eaa) (cherry picked from commit da86387)

[release/2.5] Changes to support UB 24.04 build (#1816): Fixes SWDEV-505665. Changes applied from #1816. Successful PyTorch build: http://rocm-ci.amd.com/job/mainline-framework-pytorch-2.5-ub24-py3.12-ci/17/ --------- Co-authored-by: pramenku <[email protected]> Co-authored-by: Nichols A. Romero <[email protected]> (cherry picked from commit f638998)
(force-pushed from 889a5f5 to 71958c1)
Jenkins build for 71958c1ccd4e46f4b007cbdeb0feb8a155e67fd5 commit is in progress
@pruthvistony I think the internal CI gave us a good A/B analysis of why the patch is needed:
Without this patch, ROCm installation failed: http://ml-ci-internal.amd.com:8080/blue/organizations/jenkins/pytorch%2Fpytorch-ci-pipeline/detail/PR-1910/2/pipeline/264/
With this patch, ROCm installation succeeded: http://ml-ci-internal.amd.com:8080/blue/organizations/jenkins/pytorch%2Fpytorch-ci-pipeline/detail/PR-1910/3/pipeline/264/
Still don't know why this patch doesn't seem to be required upstream, but we should still upstream it for consistency with our ROCm fork.
@jithunnair-amd Just out of curiosity, do you know if flex attention is working well on RDNA3?
Methodology
git cherry rocm/release/2.6 rocm/release/2.5 7d26c2b35f0b6de21877ebba3603b1fd889d793f -v > git_cherry_from_2.5_to_2.6_limit_7d26c2b35f0b6de21877ebba3603b1fd889d793f
where
--- rocm/release/2.6 was at 88b9764
--- rocm/release/2.5 was at 4b51542
git_cherry_from_2.5_to_2.6_limit_7d26c2b35f0b6de21877ebba3603b1fd889d793f.txt
---- Commits that gave nontrivial conflicts during cherry-pick have been marked as "Need newer version"
---- Commits that were either consolidated or modified considerably have been marked as "Ignored; merged new version"
---- Commits that need a newer version to be committed (in case commit doesn't cleanly apply) have been marked as "Ignored; will merge new version"
---- Commits that we're not sure are really required have been marked as "Cherry-pick only if needed" (to be cherry-picked if build/test issues arise)
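For reference, this is how the `git cherry` output in the file above reads:

```
git cherry rocm/release/2.6 rocm/release/2.5 <limit> -v
# + <sha> <subject>   change NOT present in release/2.6 -> cherry-pick candidate
# - <sha> <subject>   equivalent change already in release/2.6 -> skip
```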
Also cherry-pick commits from upstream that customers have requested, e.g. the CK Flash-attention backend PR.
Use "Rebase and Merge" option when merging PR to ensure individual commits show up in target branch
Testing