
[infra] Run parameterized ONNX model tests across CPU, Vulkan, and HIP. #19524

Draft · wants to merge 8 commits into main from testing-onnx-models-parameterize

Conversation

@ScottTodd (Member) commented on Dec 18, 2024

This switches the ONNX model compile->run correctness tests from running only on CPU to also running on GPU via the Vulkan and HIP APIs. We could additionally run on CUDA with #18814 and on Metal with #18817.

These new tests will help guard against regressions to full models, at least when using default flags. I'm planning to add models from other frameworks (such as LiteRT models) in future PRs.

As these tests will run on every pull request and commit, I'm starting the test list with all tests that are passing on our current set of runners, with no (strict or loose) XFAILs. The full set of tests will be run nightly in https://github.com/iree-org/iree-test-suites using nightly IREE releases... once we have runners with GPUs available in that repository.

See also iree-org/iree-test-suites#65 and iree-org/iree-test-suites#6.

Sample logs

I have not done much triage on the test failures yet, but Vulkan pass rates do seem substantially lower than CPU and ROCm. Test reports, including logs for all failures, are currently published as artifacts on GitHub Actions runs in iree-test-suites, such as https://github.com/iree-org/iree-test-suites/actions/runs/12794322266. We could also archive test reports somewhere like https://github.com/nod-ai/e2eshark-reports and/or host them on a website like https://nod-ai.github.io/shark-ai/llm/sglang/index.html?sort=result.

CPU

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117085?pr=19524#step:8:395

```
============================== slowest durations ===============================
39.46s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[vgg/model/vgg19-7.onnx]
13.39s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
13.25s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
12.48s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
11.93s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
11.49s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
11.28s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[densenet-121/model/densenet-12.onnx]
11.26s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
9.14s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v2/model/inception-v2-9.onnx]
7.73s call     tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/age_googlenet.onnx]
7.61s call     tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/gender_googlenet.onnx]
7.57s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[efficientnet-lite4/model/efficientnet-lite4-11.onnx]
7.27s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
4.86s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
4.61s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-v2-12.onnx]
4.58s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-9.onnx]
3.08s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[squeezenet/model/squeezenet1.0-9.onnx]
2.02s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
1.90s call     tests/model_zoo/validated/vision/super_resolution_models_test.py::test_models[sub_pixel_cnn_2016/model/super-resolution-10.onnx]
================== 19 passed, 18 skipped in 184.96s (0:03:04) ==================
```

ROCm

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117629?pr=19524#step:8:344

```
============================== slowest durations ===============================
9.40s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[densenet-121/model/densenet-12.onnx]
9.15s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
9.05s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
8.73s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
7.95s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v2/model/inception-v2-9.onnx]
7.94s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
7.81s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
7.13s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
6.95s call     tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/age_googlenet.onnx]
5.15s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[efficientnet-lite4/model/efficientnet-lite4-11.onnx]
4.52s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/googlenet/model/googlenet-12.onnx]
3.55s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
3.12s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-v2-12.onnx]
2.57s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
2.48s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-9.onnx]
2.21s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx]
1.36s call     tests/model_zoo/validated/vision/super_resolution_models_test.py::test_models[sub_pixel_cnn_2016/model/super-resolution-10.onnx]
0.95s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
============ 17 passed, 19 skipped, 1 xfailed in 100.10s (0:01:40) =============
```

Vulkan

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681118044?pr=19524#step:8:216

```
============================== slowest durations ===============================
13.10s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
12.97s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
12.40s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
12.22s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
9.07s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
8.09s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
6.04s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
2.93s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx]
1.86s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
0.90s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
============= 9 passed, 27 skipped, 1 xfailed in 79.62s (0:01:19) ==============
```

TODO

  • Implement caching in the test suite and run CPU tests on runners with the persistent-cache label so that we don't burn through bandwidth

ci-exactly: build_packages, test_onnx

ScottTodd added the infrastructure (Relating to build systems, CI, or testing) and integrations/onnx (ONNX integration work) labels on Dec 18, 2024
ScottTodd force-pushed the testing-onnx-models-parameterize branch from f7ae544 to fea2a82 on December 18, 2024 23:57
ScottTodd added a commit to iree-org/iree-test-suites that referenced this pull request Jan 15, 2025
Progress on #6. See how this is used downstream in iree-org/iree#19524.

## Overview

This replaces hardcoded flags like
```python
iree_compile_flags = [
    "--iree-hal-target-backends=llvm-cpu",
    "--iree-llvmcpu-target-cpu=host",
]
iree_run_module_flags = [
    "--device=local-task",
]
```
and inlined marks like
```python
@pytest.mark.xfail(raises=IreeCompileException)
def test_foo():
    ...
```
with a JSON config file passed to the test runner via the
`--test-config-file` option or the `IREE_TEST_CONFIG_FILE` environment
variable.

During test case collection, each test case name is looked up in the config file to determine its expected outcome: one of `skip` (a special option), `pass`, `fail-import`, `fail-compile`, or `fail-run`. By default, all tests are skipped. This design allows out-of-tree testing to be performed using explicit test lists (encoded in a file, unlike the [`-k` option](https://docs.pytest.org/en/latest/example/markers.html#using-k-expr-to-select-tests-based-on-their-name)), custom flags, and custom test expectations.
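
To make this concrete, here is a hypothetical sketch of such a config file. The flag lists and the `tests_and_expected_outcomes` name come from this description; the exact schema, the `default` key, and the sample entry are assumptions:

```json
{
  "iree_compile_flags": [
    "--iree-hal-target-backends=llvm-cpu",
    "--iree-llvmcpu-target-cpu=host"
  ],
  "iree_run_module_flags": [
    "--device=local-task"
  ],
  "tests_and_expected_outcomes": {
    "default": "skip",
    "tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]": "pass"
  }
}
```

Running with `--test-config-file=<path>` (or `IREE_TEST_CONFIG_FILE`) would then skip everything except the explicitly listed tests.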

## Design details

Compare this implementation with these others:

* https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops also uses config files, but with separate lists for `skip_compile_tests`, `skip_run_tests`, `expected_compile_failures`, and `expected_run_failures`. All tests are run by default.
* https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/run.py uses `--device=`, `--backend=`, `--target-chip=`, and `--test-filter=` arguments. Arbitrary flags are not supported, and neither are test expectations, so there is no way to directly signal whether tests are unexpectedly passing or failing. A utility script can be used to diff the results of two test reports: https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py.
* https://github.com/iree-org/iree-test-suites/blob/main/sharktank_models/llama3.1/test_llama.py parameterizes test cases using `@pytest.fixture(params=[...])` with `pytest.mark.target_hip` and other custom marks. This is more standard pytest and supports fluent ways to express other test configurations, but it makes annotating large numbers of tests pretty verbose and doesn't allow for out-of-tree configuration (see the sketch after this list).
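
For reference, here is a minimal sketch of that fixture-parameterization style; the mark names other than `target_hip`, the fixture name, and the test body are illustrative assumptions:

```python
import pytest

@pytest.fixture(
    params=[
        # Each parameter carries a custom mark, so runs can select targets
        # with e.g. `pytest -m target_hip`. Custom marks should be registered
        # in pytest.ini/pyproject.toml to avoid warnings.
        pytest.param("hip", marks=pytest.mark.target_hip),
        pytest.param("local-task", marks=pytest.mark.target_cpu),
    ]
)
def device(request):
    return request.param

def test_model(device):
    # A real test would compile a model and run it on `device`;
    # here we only show the parameterization mechanics.
    assert device in ("hip", "local-task")
```

This keeps configuration inside pytest itself, which is exactly the verbosity / out-of-tree trade-off noted above.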

I'm imagining a few usage styles:

* Nightly testing in this repository, running all test cases and tracking the current test results in a checked-in config file.
  * We could also go with an approach like https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py to diff test results, but this encodes the test results in the config files rather than in external reports. I see pros and cons to both approaches.
* Presubmit testing in https://github.com/iree-org/iree, running a subset of test cases that pass, ensuring that they do not start failing. We could also run with XFAIL to get early signal for tests that start to pass.
  * If we don't run with XFAIL then we don't need the generalized `tests_and_expected_outcomes`; we could just limit testing to only models that are passing.
* Developer testing with arbitrary flags.

## Follow-up tasks

- [ ] Add job matrix to workflow (needs runners in this repo with GPUs)
- [ ] Add an easy way to update the list of XFAILs (maybe switch to https://github.com/gsnedders/pytest-expect and use its `--update-xfail`?)
- [ ] Triage some of the failures (e.g. we can adjust tolerances on Vulkan)
- [ ] Adjust file downloading / caching behavior to avoid redownloading and using significant bandwidth when used together with persistent self-hosted runners or GitHub Actions caches
ScottTodd requested a review from zjgarvey on January 15, 2025 23:47
@@ -396,6 +396,8 @@ not supported by Bazel rules at this point.

## External test suites
ScottTodd (Member Author) commented:

This page is published at https://iree.dev/developers/general/testing-guide/#external-test-suites. Generally I'm trying to put enough information there so that:

  • developers working in just this iree-org/iree repository can understand what the different tests are and how to handle newly failing or passing tests
  • developers are aware of out-of-tree test suites
  • each test suite is put in context

Along these lines, I would like to promote more of the test suite work going on (in both iree-test-suites and SHARK-TestSuite) up to the level of overall IREE ecosystem dashboards and release notes. For example, each stable release could highlight the test result delta and average performance delta since the previous release.
