benchmark of fbgemm op - keyed_tensor_regroup (pytorch#2159)
Summary:
Pull Request resolved: pytorch#2159

# context
* added a **fn-level** benchmark for the `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0 ms to 1.3 ms (35% improvement) without hurting the GPU runtime or memory usage

# conclusion
* CPU runtime **drops by about 1/3**, from 1.8 ms to 1.1 ms
* GPU runtime **drops by about 2/3**, from 7.0 ms to 2.0 ms
* GPU memory **drops by 1/3**, from 1.5 K to 1.0 K
* **we should migrate to the new op** unless an unknown concern or blocker surfaces

# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[[email protected] /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic {F1752502508}
* current prod {F1752503546}
* permute_multi_embedding (2 Ops) {F1752503160}
* KT.regroup (1 Op) {F1752504258}
* regroupAsDict (Module) {F1752504964}
* metrics

|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bound, allows duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|7.1 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2-incompatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOWS** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|

Differential Revision: D58907223
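As a quick sanity check on the "1/3" and "2/3" claims, the relative reductions can be recomputed from the metrics table. The sketch below is illustrative only: the numbers are copied from the table, the `[prod]` row is assumed to be the baseline, and the helper names (`relative_reduction`, `reductions_vs_baseline`) are hypothetical and not part of fbgemm or torchrec.

```python
# Metrics copied from the table in this commit message:
# (CPU runtime in ms, GPU runtime in ms, GPU memory in K).
METRICS = {
    "[fallback] pytorch generic": (3.9, 3.2, 1.0),
    "[prod] _fbgemm_permute_pooled_embs": (1.9, 7.1, 1.5),
    "[hybrid python/cu] keyed_tensor_regroup": (1.5, 2.0, 1.0),
    "[pure c++/cu] permute_multi_embedding": (1.0, 2.0, 1.0),
}

# Assumption: the current prod op is the baseline every other row is compared to.
BASELINE = "[prod] _fbgemm_permute_pooled_embs"


def relative_reduction(old, new):
    """Fraction by which `new` is smaller than `old` (negative means a regression)."""
    return (old - new) / old


def reductions_vs_baseline(metrics, baseline):
    """Map each non-baseline operator to its (CPU, GPU, memory) reductions."""
    base = metrics[baseline]
    return {
        name: tuple(relative_reduction(b, v) for b, v in zip(base, values))
        for name, values in metrics.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (cpu, gpu, mem) in reductions_vs_baseline(METRICS, BASELINE).items():
        print(f"{name}: CPU {cpu:.0%}, GPU {gpu:.0%}, memory {mem:.0%} reduction")
```

The table values differ slightly from the figures quoted in the conclusion bullets (for example 1.9 ms vs 1.8 ms for the prod CPU runtime), so the computed fractions are approximate rather than exact matches for "1/3" and "2/3".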