
[infra] Run parameterized ONNX model tests across CPU, Vulkan, and HIP. #19524

Draft · wants to merge 8 commits into main from testing-onnx-models-parameterize

Conversation

@ScottTodd (Member) commented on Dec 18, 2024

This switches the ONNX model compile->run correctness tests from running only on CPU to also running on GPU via the Vulkan and HIP APIs. We could additionally run on CUDA with #18814 and on Metal with #18817.

These new tests will help guard against regressions to full models, at least when using default flags. I'm planning to add models from other frameworks (such as LiteRT models) in future PRs.

As these tests will run on every pull request and commit, I'm starting the test list with all tests that are passing on our current set of runners, with no (strict or loose) XFAILs. The full set of tests will be run nightly in https://github.com/iree-org/iree-test-suites using nightly IREE releases... once we have runners with GPUs available in that repository.

See also iree-org/iree-test-suites#65 and iree-org/iree-test-suites#6.

Sample logs

I have not done much triage on the test failures yet, but Vulkan pass rates do seem substantially lower than CPU and ROCm. Test reports, including logs for all failures, are currently published as artifacts on GitHub Actions runs in iree-test-suites, such as https://github.com/iree-org/iree-test-suites/actions/runs/12794322266. We could also archive test reports somewhere like https://github.com/nod-ai/e2eshark-reports and/or host them on a website like https://nod-ai.github.io/shark-ai/llm/sglang/index.html?sort=result.

CPU

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117085?pr=19524#step:8:395

```
============================== slowest durations ===============================
39.46s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[vgg/model/vgg19-7.onnx]
13.39s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
13.25s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
12.48s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
11.93s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
11.49s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
11.28s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[densenet-121/model/densenet-12.onnx]
11.26s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
9.14s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v2/model/inception-v2-9.onnx]
7.73s call     tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/age_googlenet.onnx]
7.61s call     tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/gender_googlenet.onnx]
7.57s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[efficientnet-lite4/model/efficientnet-lite4-11.onnx]
7.27s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
4.86s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
4.61s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-v2-12.onnx]
4.58s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-9.onnx]
3.08s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[squeezenet/model/squeezenet1.0-9.onnx]
2.02s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
1.90s call     tests/model_zoo/validated/vision/super_resolution_models_test.py::test_models[sub_pixel_cnn_2016/model/super-resolution-10.onnx]
================== 19 passed, 18 skipped in 184.96s (0:03:04) ==================
```

ROCm

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681117629?pr=19524#step:8:344

```
============================== slowest durations ===============================
9.40s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[densenet-121/model/densenet-12.onnx]
9.15s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
9.05s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
8.73s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
7.95s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/inception_v2/model/inception-v2-9.onnx]
7.94s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
7.81s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
7.13s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
6.95s call     tests/model_zoo/validated/vision/body_analysis_models_test.py::test_models[age_gender/models/age_googlenet.onnx]
5.15s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[efficientnet-lite4/model/efficientnet-lite4-11.onnx]
4.52s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[inception_and_googlenet/googlenet/model/googlenet-12.onnx]
3.55s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
3.12s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-v2-12.onnx]
2.57s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
2.48s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[shufflenet/model/shufflenet-9.onnx]
2.21s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx]
1.36s call     tests/model_zoo/validated/vision/super_resolution_models_test.py::test_models[sub_pixel_cnn_2016/model/super-resolution-10.onnx]
0.95s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
============ 17 passed, 19 skipped, 1 xfailed in 100.10s (0:01:40) =============
```

Vulkan

https://github.com/iree-org/iree/actions/runs/12797886622/job/35681118044?pr=19524#step:8:216

```
============================== slowest durations ===============================
13.10s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[alexnet/model/bvlcalexnet-12.onnx]
12.97s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[caffenet/model/caffenet-12.onnx]
12.40s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[rcnn_ilsvrc13/model/rcnn-ilsvrc13-9.onnx]
12.22s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[yolov2-coco/model/yolov2-coco-9.onnx]
9.07s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v1-12.onnx]
8.09s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[resnet/model/resnet50-v2-7.onnx]
6.04s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[tiny-yolov2/model/tinyyolov2-8.onnx]
2.93s call     tests/model_zoo/validated/vision/object_detection_segmentation_models_test.py::test_models[ssd-mobilenetv1/model/ssd_mobilenet_v1_12.onnx]
1.86s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mobilenet/model/mobilenetv2-12.onnx]
0.90s call     tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]
============= 9 passed, 27 skipped, 1 xfailed in 79.62s (0:01:19) ==============
```

TODO

  • Implement caching in the test suite and run CPU tests on runners with the persistent-cache label so that we don't burn through bandwidth

ci-exactly: build_packages, test_onnx

ScottTodd added the infrastructure (Relating to build systems, CI, or testing) and integrations/onnx (ONNX integration work) labels on Dec 18, 2024
ScottTodd force-pushed the testing-onnx-models-parameterize branch from f7ae544 to fea2a82 on December 18, 2024 23:57
ScottTodd added a commit to iree-org/iree-test-suites that referenced this pull request Jan 15, 2025
Progress on #6. See how this is used downstream in iree-org/iree#19524.

## Overview

This replaces hardcoded flags like
```python
iree_compile_flags = [
    "--iree-hal-target-backends=llvm-cpu",
    "--iree-llvmcpu-target-cpu=host",
]
iree_run_module_flags = [
    "--device=local-task",
]
```
and inlined marks like
```python
@pytest.mark.xfail(raises=IreeCompileException)
def test_foo():
    ...
```
with a JSON config file passed to the test runner via the
`--test-config-file` option or the `IREE_TEST_CONFIG_FILE` environment
variable.

During test case collection, each test case name is looked up in the config file to determine its expected outcome: one of `skip` (a special option), `pass`, `fail-import`, `fail-compile`, or `fail-run`. By default, all tests are skipped. This design allows out-of-tree testing to be performed using explicit test lists (encoded in a file, unlike the [`-k` option](https://docs.pytest.org/en/latest/example/markers.html#using-k-expr-to-select-tests-based-on-their-name)), custom flags, and custom test expectations.
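
To make this concrete, here is a hypothetical sketch of such a config file. The flag lists and the `tests_and_expected_outcomes` name come from this description; the exact schema, the `default` key, and the sample entry are assumptions:

```json
{
  "iree_compile_flags": [
    "--iree-hal-target-backends=llvm-cpu",
    "--iree-llvmcpu-target-cpu=host"
  ],
  "iree_run_module_flags": [
    "--device=local-task"
  ],
  "tests_and_expected_outcomes": {
    "default": "skip",
    "tests/model_zoo/validated/vision/classification_models_test.py::test_models[mnist/model/mnist-12.onnx]": "pass"
  }
}
```

Running with `--test-config-file=<path>` (or `IREE_TEST_CONFIG_FILE`) would then skip everything except the explicitly listed tests.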

## Design details

Compare this implementation with these others:

* https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops also uses config files, but with separate lists for `skip_compile_tests`, `skip_run_tests`, `expected_compile_failures`, and `expected_run_failures`. All tests are run by default.
* https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/run.py uses `--device=`, `--backend=`, `--target-chip=`, and `--test-filter=` arguments. Arbitrary flags are not supported, and neither are test expectations, so there is no way to directly signal whether tests are unexpectedly passing or failing. A utility script can be used to diff the results of two test reports: https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py.
* https://github.com/iree-org/iree-test-suites/blob/main/sharktank_models/llama3.1/test_llama.py parameterizes test cases using `@pytest.fixture(params=[...])` with `pytest.mark.target_hip` and other custom marks. This is more standard pytest and supports fluent ways to express other test configurations, but it makes annotating large numbers of tests pretty verbose and doesn't allow for out-of-tree configuration (see the sketch after this list).
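
For reference, here is a minimal sketch of that fixture-parameterization style; the mark names other than `target_hip`, the fixture name, and the test body are illustrative assumptions:

```python
import pytest

@pytest.fixture(
    params=[
        # Each parameter carries a custom mark, so runs can select targets
        # with e.g. `pytest -m target_hip`. Custom marks should be registered
        # in pytest.ini/pyproject.toml to avoid warnings.
        pytest.param("hip", marks=pytest.mark.target_hip),
        pytest.param("local-task", marks=pytest.mark.target_cpu),
    ]
)
def device(request):
    return request.param

def test_model(device):
    # A real test would compile a model and run it on `device`;
    # here we only show the parameterization mechanics.
    assert device in ("hip", "local-task")
```

This keeps configuration inside pytest itself, which is exactly the verbosity / out-of-tree trade-off noted above.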

I'm imagining a few usage styles:

* Nightly testing in this repository, running all test cases and tracking the current test results in a checked-in config file.
  * We could also go with an approach like https://github.com/nod-ai/SHARK-TestSuite/blob/main/alt_e2eshark/utils/check_regressions.py to diff test results, but this encodes the test results in the config files rather than in external reports. I see pros and cons to both approaches.
* Presubmit testing in https://github.com/iree-org/iree, running a subset of test cases that pass, ensuring that they do not start failing. We could also run with XFAIL to get early signal for tests that start to pass.
  * If we don't run with XFAIL then we don't need the generalized `tests_and_expected_outcomes`; we could just limit testing to only models that are passing.
* Developer testing with arbitrary flags.

## Follow-up tasks

- [ ] Add job matrix to workflow (needs runners in this repo with GPUs)
- [ ] Add an easy way to update the list of XFAILs (maybe switch to https://github.com/gsnedders/pytest-expect and use its `--update-xfail`?)
- [ ] Triage some of the failures (e.g. we can adjust tolerances on Vulkan)
- [ ] Adjust file downloading / caching behavior to avoid redownloading and using significant bandwidth when used together with persistent self-hosted runners or GitHub Actions caches
ScottTodd requested a review from zjgarvey on January 15, 2025 23:47
@@ -396,6 +396,8 @@ not supported by Bazel rules at this point.

## External test suites
ScottTodd (Member Author) commented:

This page is published at https://iree.dev/developers/general/testing-guide/#external-test-suites. Generally I'm trying to put enough information there so that:

  • developers working in just this iree-org/iree repository can understand what the different tests are and how to handle newly failing or passing tests
  • developers are aware of out-of-tree test suites
  • each test suite is put in context

Along these lines, I would like to promote more of the test suite work going on (in both iree-test-suites and SHARK-TestSuite) up to the level of overall IREE ecosystem dashboards and release notes. For example, each stable release could highlight the test result delta and average performance delta since the previous release.
