RFC: Specialize for non-mixed-dtype in elementwise_util #9388

swolchok · 2025-03-19T00:32:28Z

Mixed dtype should be uncommon. Here is how we can specialize for the common case. Prepares us to tackle #9241 .

Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles.

To check performance, I edited examples/models/toy_model/model.py so that MulModule used inputs of size 3000, 2000 instead of 3, 2. I exported it with python3 -m examples.portable.scripts.export --model_name mul and saved the resulting mul.pte. Then I built in release mode with optimized kernels on, but with mul.out removed from kernels/optimized/optimized.yaml, so that we would use the optimized_portable_kernels build of kernels/portable/op_mul.cpp. Finally, I ran 3 trials on my M1 Macbook Pro using cmake-out/executor_runner --model_path mul3kby2k.pte --num_executions 1000 --cpu_threads 2. Resulting times for 1000 iterations in ms:

Previous diff: 8295, 8187, 8139
This diff: 2953, 2806, 2861

(For comparison, the actual optimized mul kernel took around 1000 ms to run 1000 iterations, and #9432 later in the stack arrived at similar numbers.)

[ghstack-poisoned]

swolchok · 2025-03-19T00:32:29Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-03-19T00:32:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9388

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 5 Pending

As of commit b0fc7f9 with merge base a1dd017 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Mixed dtype should be uncommon. Here is how we can specialize for the common case. Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles. ghstack-source-id: 079c8b9 ghstack-comment-id: 2735017566 Pull Request resolved: #9388

[ghstack-poisoned]

Mixed dtype should be uncommon. Here is how we can specialize for the common case. Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles. ghstack-source-id: ae7c289 ghstack-comment-id: 2735017566 Pull Request resolved: #9388

[ghstack-poisoned]

…emplate arguments (#9841) This is necessary to take advantage of the previous PR #9388, which creates dtype-specialized implementations for the non-mixed dtype case. The size cost of this approach looking at `__TEXT` of size_test_all_optimized_ops built with bash test/build_optimized_size_test.sh is 104212 bytes, a 2.2% increase.

…out_dtypes in template arguments This is necessary to take advantage of #9388, which creates dtype-specialized implementations for the non-mixed dtype case. Measured the size cost of this approach with test/build_optimized_size_test.sh . It does cost us some size: ``` Before: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:51 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1310720 81920 0 4295458816 4296851456 1001cc000 cmake-out/test/size_test_all_ops 4358144 98304 0 4296212480 4300668928 100570000 cmake-out/test/size_test_all_optimized_ops After: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:57 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1889792 Apr 25 12:57 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5799704 Apr 25 12:57 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1376256 81920 0 4295491584 4296949760 1001e4000 cmake-out/test/size_test_all_ops 4423680 98304 0 4296327168 4300849152 10059c000 cmake-out/test/size_test_all_optimized_ops ``` However, on an absolute basis, size is still below where we are at two PRs ago, which was: ``` ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:24 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1474560 81920 0 4295655424 4297211904 100224000 cmake-out/test/size_test_all_ops 4489216 98304 0 4296359936 4300947456 1005b4000 cmake-out/test/size_test_all_optimized_ops ``` ghstack-source-id: 9c6751f ghstack-comment-id: 2831329157 Pull-Request-resolved: #10491

…out_dtypes in template arguments This is necessary to take advantage of #9388, which creates dtype-specialized implementations for the non-mixed dtype case. Measured the size cost of this approach with test/build_optimized_size_test.sh . It does cost us some size: ``` Before: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:51 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1310720 81920 0 4295458816 4296851456 1001cc000 cmake-out/test/size_test_all_ops 4358144 98304 0 4296212480 4300668928 100570000 cmake-out/test/size_test_all_optimized_ops After: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:57 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1889792 Apr 25 12:57 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5799704 Apr 25 12:57 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1376256 81920 0 4295491584 4296949760 1001e4000 cmake-out/test/size_test_all_ops 4423680 98304 0 4296327168 4300849152 10059c000 cmake-out/test/size_test_all_optimized_ops ``` However, on an absolute basis, size is still below where we are at two PRs ago, which was: ``` ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:24 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1474560 81920 0 4295655424 4297211904 100224000 cmake-out/test/size_test_all_ops 4489216 98304 0 4296359936 4300947456 1005b4000 cmake-out/test/size_test_all_optimized_ops ``` ghstack-source-id: 58e2b05 ghstack-comment-id: 2831329157 Pull-Request-resolved: #10491

…out_dtypes in template arguments This is necessary to take advantage of #9388, which creates dtype-specialized implementations for the non-mixed dtype case. Measured the size cost of this approach with test/build_optimized_size_test.sh . It does cost us some size: ``` Before: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:51 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1310720 81920 0 4295458816 4296851456 1001cc000 cmake-out/test/size_test_all_ops 4358144 98304 0 4296212480 4300668928 100570000 cmake-out/test/size_test_all_optimized_ops After: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:57 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1889792 Apr 25 12:57 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5799704 Apr 25 12:57 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1376256 81920 0 4295491584 4296949760 1001e4000 cmake-out/test/size_test_all_ops 4423680 98304 0 4296327168 4300849152 10059c000 cmake-out/test/size_test_all_optimized_ops ``` However, on an absolute basis, size is still below where we are at two PRs ago, which was: ``` ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:24 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1474560 81920 0 4295655424 4297211904 100224000 cmake-out/test/size_test_all_ops 4489216 98304 0 4296359936 4300947456 1005b4000 cmake-out/test/size_test_all_optimized_ops ``` ghstack-source-id: 1eb6943 ghstack-comment-id: 2831329157 Pull-Request-resolved: #10491

…out_dtypes in template arguments This is necessary to take advantage of #9388, which creates dtype-specialized implementations for the non-mixed dtype case. Measured the size cost of this approach with test/build_optimized_size_test.sh . It does cost us some size: ``` Before: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:51 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1310720 81920 0 4295458816 4296851456 1001cc000 cmake-out/test/size_test_all_ops 4358144 98304 0 4296212480 4300668928 100570000 cmake-out/test/size_test_all_optimized_ops After: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:57 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1889792 Apr 25 12:57 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5799704 Apr 25 12:57 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1376256 81920 0 4295491584 4296949760 1001e4000 cmake-out/test/size_test_all_ops 4423680 98304 0 4296327168 4300849152 10059c000 cmake-out/test/size_test_all_optimized_ops ``` However, on an absolute basis, size is still below where we are at two PRs ago, which was: ``` ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:24 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1474560 81920 0 4295655424 4297211904 100224000 cmake-out/test/size_test_all_ops 4489216 98304 0 4296359936 4300947456 1005b4000 cmake-out/test/size_test_all_optimized_ops ``` ghstack-source-id: 897d4c9 ghstack-comment-id: 2831329157 Pull-Request-resolved: #10491

…out_dtypes in template arguments (#10491) This is necessary to take advantage of #9388, which creates dtype-specialized implementations for the non-mixed dtype case. Measured the size cost of this approach with test/build_optimized_size_test.sh . It does cost us some size: ``` Before: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:51 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1796928 Apr 25 12:51 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5605176 Apr 25 12:51 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1310720 81920 0 4295458816 4296851456 1001cc000 cmake-out/test/size_test_all_ops 4358144 98304 0 4296212480 4300668928 100570000 cmake-out/test/size_test_all_optimized_ops After: ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:57 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 1889792 Apr 25 12:57 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5799704 Apr 25 12:57 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1376256 81920 0 4295491584 4296949760 1001e4000 cmake-out/test/size_test_all_ops 4423680 98304 0 4296327168 4300849152 10059c000 cmake-out/test/size_test_all_optimized_ops ``` However, on an absolute basis, size is still below where we are at two PRs ago, which was: ``` ExecuTorch with no ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 153928 Apr 25 12:24 cmake-out/test/size_test ExecuTorch with portable ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 2150960 Apr 25 12:24 cmake-out/test/size_test_all_ops ExecuTorch with optimized ops binary size, unstripped: -rwxr-xr-x 1 swolchok staff 5887368 Apr 25 12:24 cmake-out/test/size_test_all_optimized_ops (.venv) swolchok@swolchok-mac ~/src/executorch> size cmake-out/test/size_test* __TEXT __DATA __OBJC others dec hex 81920 81920 0 4295049216 4295213056 10003c000 cmake-out/test/size_test 1474560 81920 0 4295655424 4297211904 100224000 cmake-out/test/size_test_all_ops 4489216 98304 0 4296359936 4300947456 1005b4000 cmake-out/test/size_test_all_optimized_ops ```

Mixed dtype should be uncommon. Here is how we can specialize for the common case. Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles. ghstack-source-id: 496defd ghstack-comment-id: 2735017566 Pull Request resolved: pytorch/executorch#9388

Mixed dtype should be uncommon. Here is how we can specialize for the common case. Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles. ghstack-source-id: c2537b7 ghstack-comment-id: 2735017566 Pull Request resolved: pytorch/executorch#9388

Mixed dtype should be uncommon. Here is how we can specialize for the common case. Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles. ghstack-source-id: a094279 ghstack-comment-id: 2735017566 Pull Request resolved: pytorch/executorch#9388

swolchok added 5 commits March 18, 2025 17:32

Update

31a49e0

[ghstack-poisoned]

Update

9fcd885

[ghstack-poisoned]

Update

29d6de9

[ghstack-poisoned]

Update

79b908c

[ghstack-poisoned]

Update

fd62a07

[ghstack-poisoned]

swolchok requested a review from manuelcandales as a code owner March 19, 2025 00:32

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2025

swolchok requested a review from kimishpatel March 19, 2025 00:32

swolchok added 6 commits March 19, 2025 08:57

Update

854c991

[ghstack-poisoned]

Update

def7ed4

[ghstack-poisoned]

Update

40c1b1b

[ghstack-poisoned]

Update

7c78357

[ghstack-poisoned]

Update

7ba269a

[ghstack-poisoned]

Update

edd45fb

[ghstack-poisoned]

swolchok mentioned this pull request Mar 19, 2025

Vectorize optimized_portable_ops versions of portable ops? #9241

Open

swolchok added 7 commits March 19, 2025 17:36

Update

b9c545f

[ghstack-poisoned]

Update

3091007

[ghstack-poisoned]

Update

4a00cac

[ghstack-poisoned]

Update

21b81bf

[ghstack-poisoned]

Update

4c4add0

[ghstack-poisoned]

Update

8782a90

[ghstack-poisoned]

Update

75f8970

[ghstack-poisoned]

This was referenced Apr 2, 2025

set INTERFACE_LINK_LIBARIES for extension_threadpool #9826

Closed

Add build_optimized_size_test.sh #9827

Closed

Save some size in dtype_util when dtype selective build is not in use #9829

Closed

swolchok added 6 commits April 2, 2025 10:09

Update

20f3046

[ghstack-poisoned]

Update

3aa266d

[ghstack-poisoned]

Update

3c88a56

[ghstack-poisoned]

Update

153735d

[ghstack-poisoned]

Update

cac4293

[ghstack-poisoned]

Update

85451ea

[ghstack-poisoned]

Base automatically changed from gh/swolchok/382/head to main April 22, 2025 23:17

swolchok requested a review from larryliu0820 as a code owner April 22, 2025 23:17

swolchok added the topic: not user facing label Apr 22, 2025

Update

b0fc7f9

[ghstack-poisoned]

swolchok merged commit 208a341 into main Apr 23, 2025
84 checks passed

swolchok deleted the gh/swolchok/383/head branch April 23, 2025 01:05

swolchok mentioned this pull request Apr 25, 2025

Reapply #9841: Migrate elementwise_util callers to the variants with out_dtypes in template arguments #10491

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RFC: Specialize for non-mixed-dtype in elementwise_util #9388

RFC: Specialize for non-mixed-dtype in elementwise_util #9388

Uh oh!

swolchok commented Mar 19, 2025 •

edited

Loading

Uh oh!

swolchok commented Mar 19, 2025 •

edited

Loading

Uh oh!

pytorch-bot bot commented Mar 19, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

RFC: Specialize for non-mixed-dtype in elementwise_util #9388

RFC: Specialize for non-mixed-dtype in elementwise_util #9388

Uh oh!

Conversation

swolchok commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

swolchok commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9388

⏳ No Failures, 5 Pending

Uh oh!

Uh oh!

Uh oh!

swolchok commented Mar 19, 2025 •

edited

Loading

swolchok commented Mar 19, 2025 •

edited

Loading

pytorch-bot bot commented Mar 19, 2025 •

edited

Loading