[infra] Run parameterized ONNX model tests across CPU, Vulkan, and HIP. #19524
base: main
Conversation
Force-pushed from f7ae544 to fea2a82.
Progress on #6. See how this is used downstream in iree-org/iree#19524.

## Overview

This replaces hardcoded flags like

```python
iree_compile_flags = [
    "--iree-hal-target-backends=llvm-cpu",
    "--iree-llvmcpu-target-cpu=host",
]
iree_run_module_flags = [
    "--device=local-task",
]
```

and inlined marks like

```python
@pytest.mark.xfail(raises=IreeCompileException)
def test_foo():
```

with a JSON config file passed to the test runner via the `--test-config-file` option or the `IREE_TEST_CONFIG_FILE` environment variable. During test case collection, each test case name is looked up in the config file to determine its expected outcome, one of `["skip" (special option), "pass", "fail-import", "fail-compile", "fail-run"]`. By default, all tests are skipped. (A rough sketch of this lookup follows after this description.)

This design allows out-of-tree testing to be performed using explicit test lists (encoded in a file, unlike the [`-k` option](https://docs.pytest.org/en/latest/example/markers.html#using-k-expr-to-select-tests-based-on-their-name)), custom flags, and custom test expectations.

## Design details

Compare this implementation with these others:

* https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops also uses config files, but with separate lists for `skip_compile_tests`, `skip_run_tests`, `expected_compile_failures`, and `expected_run_failures`. All tests are run by default.
* https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/run.py uses `--device=`, `--backend=`, `--target-chip=`, and `--test-filter=` arguments. Arbitrary flags are not supported, and test expectations are not supported either, so there is no way to directly signal whether tests are unexpectedly passing or failing. A utility script can be used to diff the results of two test reports: https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py.
* https://github.com/iree-org/iree-test-suites/blob/main/sharktank_models/llama3.1/test_llama.py parameterizes test cases using `@pytest.fixture([params=[...]])` with `pytest.mark.target_hip` and other custom marks. This is more standard pytest and supports fluent ways to express other test configurations, but it makes annotating large numbers of tests pretty verbose and doesn't allow for out-of-tree configuration.

I'm imagining a few usage styles:

* Nightly testing in this repository, running all test cases and tracking the current test results in a checked-in config file.
  * We could also go with an approach like https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py to diff test results, but this encodes the test results in the config files rather than in external reports. I see pros and cons to both approaches.
* Presubmit testing in https://github.com/iree-org/iree, running a subset of test cases that pass, ensuring that they do not start failing. We could also run with XFAIL to get early signal for tests that start to pass.
  * If we don't run with XFAIL then we don't need the generalized `tests_and_expected_outcomes`; we could just limit testing to only models that are passing.
* Developer testing with arbitrary flags.

## Follow-up tasks

- [ ] Add job matrix to workflow (needs runners in this repo with GPUs)
- [ ] Add an easy way to update the list of XFAILs (maybe switch to https://github.com/gsnedders/pytest-expect and use its `--update-xfail`?)
- [ ] Triage some of the failures (e.g. can adjust tolerances on Vulkan)
- [ ] Adjust file downloading / caching behavior to avoid redownloading and using significant bandwidth when used together with persistent self-hosted runners or GitHub Actions caches
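To make the config lookup above concrete, here is a rough sketch of what a config file and the collection-time lookup could look like. The key names `iree_compile_flags`, `iree_run_module_flags`, and `tests_and_expected_outcomes` and the outcome values come from the description above, but everything else (the test case names, the fallback dict, the helper names, and the conftest wiring) is a hypothetical illustration rather than the runner's actual implementation. The real runner also accepts the `--test-config-file` option, which would additionally need a `pytest_addoption` hook; this sketch only shows the environment variable path for brevity.

```python
# Hypothetical conftest.py sketch (not the actual runner implementation):
# apply expected outcomes from a JSON config file at collection time.
import json
import os

import pytest

# Illustrative config contents. The CPU flags match the hardcoded values shown
# above; the test case names and outcomes here are placeholders.
_FALLBACK_CONFIG = {
    "iree_compile_flags": [
        "--iree-hal-target-backends=llvm-cpu",
        "--iree-llvmcpu-target-cpu=host",
    ],
    "iree_run_module_flags": ["--device=local-task"],
    "tests_and_expected_outcomes": {
        "test_alexnet": "pass",           # hypothetical test case name
        "test_resnet50": "fail-compile",  # expected to fail during compilation
    },
}


def _load_config():
    """Read the JSON config named by IREE_TEST_CONFIG_FILE, if set."""
    path = os.getenv("IREE_TEST_CONFIG_FILE")
    if not path:
        return _FALLBACK_CONFIG
    with open(path) as f:
        return json.load(f)


def pytest_collection_modifyitems(config, items):
    """Mark each collected test based on its expected outcome."""
    outcomes = _load_config().get("tests_and_expected_outcomes", {})
    for item in items:
        outcome = outcomes.get(item.name, "skip")  # default: skip everything
        if outcome == "skip":
            item.add_marker(pytest.mark.skip(reason="not listed in test config"))
        elif outcome in ("fail-import", "fail-compile", "fail-run"):
            item.add_marker(pytest.mark.xfail(reason=f"expected {outcome}"))
        # outcome == "pass": run normally; a failure is an unexpected regression.
```

A CI job could then point the runner at a backend-specific config, e.g. `IREE_TEST_CONFIG_FILE=configs/models_cpu_llvm_task.json pytest ...` (the file name here is made up for illustration).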
@@ -396,6 +396,8 @@ not supported by Bazel rules at this point.

## External test suites
This page is published at https://iree.dev/developers/general/testing-guide/#external-test-suites. Generally I'm trying to put enough information there so that:
- developers working in just this iree-org/iree repository can understand what the different tests are and how to handle newly failing or passing tests
- developers are aware of out of tree test suites
- each test suite is put in context
Along these lines, I would like to promote more of the test suite work going on (in both iree-test-suites and SHARK-TestSuite) up to the level of overall IREE ecosystem dashboards and release notes. For example, each stable release could highlight the test result delta and average performance delta since the previous release.
This switches the ONNX model compile->run correctness tests from running only on CPU to also running on GPUs via the Vulkan and HIP APIs (an illustrative sketch of per-backend flags follows this description). We could also run on CUDA with #18814 and Metal with #18817.
These new tests will help guard against regressions in full models, at least when using default flags. I'm planning to add models from other frameworks (such as LiteRT models) in future PRs.
As these tests will run on every pull request and commit, I'm starting the test list with all tests that are passing on our current set of runners, with no (strict or loose) XFAILs. The full set of tests will be run nightly in https://github.com/iree-org/iree-test-suites using nightly IREE releases... once we have runners with GPUs available in that repository.
See also iree-org/iree-test-suites#65 and iree-org/iree-test-suites#6.
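To give a sense of what the per-backend configs might contain, here is an illustrative sketch in the same shape as the config keys described above. Only the CPU flags are taken from the earlier description; the Vulkan and HIP flag values, the HIP target architecture, and the config names are assumptions and may not match the checked-in configs or the current IREE flag spelling.

```python
# Illustrative per-backend flag groups for the parameterized ONNX model tests.
# Only the CPU values come from the description above; the Vulkan and HIP
# flags (and the config names) are assumptions, not the actual configs.
BACKEND_CONFIGS = {
    "cpu_llvm_task": {
        "iree_compile_flags": [
            "--iree-hal-target-backends=llvm-cpu",
            "--iree-llvmcpu-target-cpu=host",
        ],
        "iree_run_module_flags": ["--device=local-task"],
    },
    "gpu_vulkan": {
        "iree_compile_flags": ["--iree-hal-target-backends=vulkan-spirv"],
        "iree_run_module_flags": ["--device=vulkan"],
    },
    "gpu_hip": {
        "iree_compile_flags": [
            "--iree-hal-target-backends=rocm",
            "--iree-hip-target=gfx1100",  # assumed; set to match the runner's GPU
        ],
        "iree_run_module_flags": ["--device=hip"],
    },
}
```

Each group would live in its own JSON config file and be selected per CI job via `--test-config-file` (or `IREE_TEST_CONFIG_FILE`).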
Sample logs
I have not done much triage on the test failures, but it does seem like Vulkan pass rates are substantially lower than CPU and ROCm. Test reports, including logs for all failures, are currently published as artifacts on actions runs in iree-test-suites, such as https://github.com/iree-org/iree-test-suites/actions/runs/12794322266. We could also archive test reports somewhere like https://github.com/nod-ai/e2eshark-reports and/or host the test reports on a website like https://nod-ai.github.io/shark-ai/llm/sglang/index.html?sort=result.
- CPU: https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117085?pr=19524#step:8:395
- ROCm: https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117629?pr=19524#step:8:344
- Vulkan: https://github.com/iree-org/iree/actions/runs/12797886622/job/35681118044?pr=19524#step:8:216
TODO: use the `persistent-cache` label so that we don't burn through bandwidth.

ci-exactly: build_packages, test_onnx