[regression][GPU]: 'func.func' op uses 81920 bytes of shared memory; exceeded the limit of 65536 bytes post 6ff00a8a008d06b604d4ca4e0ae6e601ae810b4f #19511
Comments
Which 100 models? Please be specific in bug reports.
Here's the list of test failures: https://github.com/nod-ai/e2eshark-reports/blob/main/2024-12-17/ci_reports_onnx/rocm/combined-reports/yesterday_comparison.md#317-regressions-found. Lots are in the "vit" family (vision transformers).
Good news: with the existing tests in https://github.com/iree-org/iree-test-suites/tree/main/onnx_models, I can reproduce a regression in a few models like mobilenet and resnet50 using:
Repro using resnet from iree-org/iree-test-suites#65:

```shell
# Setup
cd onnx_models
.\.venv\Scripts\activate.bat
pip install --upgrade -r requirements-iree.txt

# Run test that passes
pytest --log-cli-level=info -rA --durations=0 -k resnet --test-config-file=./configs/onnx_models_gpu_rocm_rdna3.json

# Switch to broken release, run again to see failure
pip install --find-links https://iree.dev/pip-release-links.html iree-base-compiler==3.1.0rc20241217
pytest --log-cli-level=info -rA --durations=0 -k resnet --test-config-file=./configs/onnx_models_gpu_rocm_rdna3.json
```

Once that PR and #19524 land we'll have that coverage on all IREE PRs.
I also noticed that this error spews a lot of logs (multiple thousands of lines): iree/compiler/src/iree/compiler/Codegen/Common/GPU/GPUCheckResourceUsage.cpp, lines 85 to 87 at fb4d094.
Maybe we don't need the entire context? |
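One way to act on that suggestion, sketched here in Python purely as an illustration (this is not IREE's C++ diagnostic code, and `summarize_op` is a hypothetical helper), is to cap how much of the offending op's textual dump gets attached to the error:

```python
def summarize_op(op_text: str, max_lines: int = 5) -> str:
    """Truncate a multi-line op dump, keeping only the first few lines.

    Attaching the full dump of a large func.func to a diagnostic is what
    produces thousands of log lines; a capped summary keeps the error legible.
    """
    lines = op_text.splitlines()
    if len(lines) <= max_lines:
        return op_text
    elided = len(lines) - max_lines
    return "\n".join(lines[:max_lines]) + f"\n... ({elided} more lines elided)"
```

A short dump passes through unchanged, while a thousand-line dump collapses to five lines plus an elision marker.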
This issue has been fixed somewhere in the commit range iree-3.1.0rc20241218...iree-3.1.0rc20241219, possibly by #19508. Still working on getting those tests landed and enabled on presubmits.
What happened?
We have 100+ models failing on GPU that were previously passing numerics; the failures started after 6ff00a8.
command:
model.torch_onnx.mlir.txt
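For scale, the numbers in the error message put the failing dispatch 25% over budget (a small worked check, using only the figures quoted in the error):

```python
# Figures from the error message: reported usage vs. the shared memory limit.
SHARED_MEMORY_LIMIT = 65536  # bytes (64 KiB), the limit named in the error
REPORTED_USAGE = 81920       # bytes (80 KiB) used by the failing func.func

print(REPORTED_USAGE - SHARED_MEMORY_LIMIT)  # 16384 bytes (16 KiB) over
print(REPORTED_USAGE / SHARED_MEMORY_LIMIT)  # 1.25, i.e. 25% over the limit
```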
Steps to reproduce your issue
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response