Merge #2349

Closed
wants to merge 16 commits into from

Conversation

PaulZhang12
Contributor

No description provided.

PaulZhang12 and others added 16 commits August 21, 2024 09:26
Summary: Pull Request resolved: #2302

Reviewed By: joshuadeng

Differential Revision: D61312123

Pulled By: PaulZhang12

fbshipit-source-id: 22be2161d4cfcfe9ebe23309388f3bda769f6fdc
Summary:
Pull Request resolved: #2330

So:

(1) We did not have consistent handling of dtypes between quant and sharded quant modules.
(2) Users correctly identified that on the CPU path we want indices and offsets to have the same dtype (to save templates). After talking to sryap, we want to simply make them consistent and leave them as int32 where possible.
(3) CUDA is always int32. This is likely to become an issue for ZCH models pushing beyond 2B-row shards for inference; for now we just need to keep shards within that size.
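
As a rough sketch of point (2), assuming a hypothetical helper (this is not the actual TorchRec code), the policy looks like:
```
import torch
from typing import Tuple

# Hypothetical helper: keep indices and offsets in one consistent dtype,
# staying in int32 when both already are int32 (point (2) above).
def align_index_dtypes(
    indices: torch.Tensor, offsets: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    if indices.dtype == torch.int32 and offsets.dtype == torch.int32:
        return indices, offsets  # already consistent, stay in int32
    # otherwise fall back to a common wider dtype
    return indices.to(torch.int64), offsets.to(torch.int64)
```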

Reviewed By: sryap

Differential Revision: D61446242

fbshipit-source-id: 779d22822d4ac7656675d9ac3887b5ac3ecbfa39
Summary:
Pull Request resolved: #2329

TSIA

Reviewed By: yuhuishi-convect, joshuadeng

Differential Revision: D61613723

fbshipit-source-id: 343c9631dd714cfd0d64fdac3c3210afc2d744f5
Summary:
Pull Request resolved: #2310

Add new pipeline for CompiledAutograd development.

Reviewed By: dstaay-fb, xmfan, yf225

Differential Revision: D61403499

fbshipit-source-id: 7bf0720e0c1078815315278fffd79c2d7470882f
Summary:
Pull Request resolved: #2331

The initial prod strategy is to use an unsharded MCH module in front of a Q / SQ EC. This generally seems OK, but the biggest issues are:

[2] The uneven sharding flag was not respected for the RW sequence GPU case; easy fix. cc: gnahzg
[3] The get_propogate device handling is a bit unfortunate/messy and will be cleaned up in a follow-up task, but an edge case was found with respect to the CPU path. cc: gnahzg

Reviewed By: gnahzg

Differential Revision: D61572421

fbshipit-source-id: 412ed7947a7cde1991518d6a979db3bc7832fc68
Summary:
Pull Request resolved: #2324

Add missing pipeline_preproc and custom_model_fwd args.

Reviewed By: chrisxcai

Differential Revision: D61564467

fbshipit-source-id: 280f0e83de13e10ff2901bbda611d7ba76c8ac68
Summary:
Pull Request resolved: #2326

# context
* want to resolve graph break: [failures_and_restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpKJM3FI/failures_and_restarts.html), P1537573230
```
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time.  You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs.  Could not guard on data-dependent expression Eq(((2*u48)//(u48 + u49)), 0) (unhinted: Eq(((2*u48)//(u48 + u49)), 0)).  (Size-like symbols: u49, u48)

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_refs/__init__.py", line 3950, in unbind
    if guard_size_oblivious(t.shape[dim] == 0):

For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u49,u48"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing

User Stack (most recent call last):
  (snipped, see stack below for prefix)
...
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torchrec/sparse/jagged_tensor.py", line 2241, in to_dict
    _jt_dict = _maybe_compute_kjt_to_jt_dict(
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torchrec/sparse/jagged_tensor.py", line 1226, in _maybe_compute_kjt_to_jt_dict
    split_lengths = torch.unbind(
```
* we added a [shape check](https://fburl.com/code/p02u4mck) (a standalone sketch of this pattern follows this list):
```
        if pt2_guard_size_oblivious(lengths.numel() > 0):
            strided_lengths = lengths.view(-1, stride)
            if not torch.jit.is_scripting() and is_torchdynamo_compiling():
                torch._check(strided_lengths.shape[0] > 0)
                torch._check(strided_lengths.shape[1] > 0)
            split_lengths = torch.unbind(
                strided_lengths,
                dim=0,
            )
```
* however the error is still there
```
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_refs/__init__.py", line 3950, in unbind
    if guard_size_oblivious(t.shape[dim] == 0):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/fx/experimental/symbolic_shapes.py", line 253, in guard_size_oblivious
    return expr.node.guard_size_oblivious("", 0)
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/fx/experimental/sym_node.py", line 503, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
```
* [implementation](https://fburl.com/code/20iue1ib)
```
@register_decomposition(aten.unbind)
def unbind(t: TensorLikeType, dim: int = 0) -> TensorSequenceType:
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

    dim = utils.canonicalize_dim(t.ndim, dim)
    torch._check_index(
        len(t.shape) > 0,
        lambda: "Dimension specified as 0 but tensor has no dimensions",
    )
    if guard_size_oblivious(t.shape[dim] == 0):  # <------- here
        return ()
    else:
        return tuple(
            torch.squeeze(s, dim) for s in torch.tensor_split(t, t.shape[dim], dim)
        )
```
* with D61677207 [no graph break at _maybe_compute_kjt_to_jt_dict](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNcI14t/failures_and_restarts.html)
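
To make the pattern above concrete outside the TorchRec code base, a minimal self-contained sketch (the function name and shapes are illustrative assumptions) of hinting the compiler with torch._check before torch.unbind:
```
import torch

def split_lengths_per_key(lengths: torch.Tensor, stride: int):
    # lengths may have data-dependent (unbacked) sizes under torch.compile;
    # torch._check gives explicit hints so the size-oblivious guard inside
    # torch.unbind (t.shape[dim] == 0) does not trigger a graph break.
    strided_lengths = lengths.view(-1, stride)
    torch._check(strided_lengths.shape[0] > 0)
    torch._check(strided_lengths.shape[1] > 0)
    return torch.unbind(strided_lengths, dim=0)

# eager usage example
print(split_lengths_per_key(torch.tensor([1, 2, 3, 4]), stride=2))
```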

Reviewed By: IvanKobzarev

Differential Revision: D55277785

fbshipit-source-id: d3bbb2439427677dc8e9a58560679acbf6f872d4
Summary:
X-link: ctrl-labs/src2#33884

X-link: pytorch/executorch#4810

Pull Request resolved: #2322

X-link: pytorch/pytorch#134054

See #131429

Reviewed By: laithsakka

Differential Revision: D61493706

fbshipit-source-id: d2b3feeff2abf8610e4e9940a1b93b5f80777dc2
Summary:
Pull Request resolved: #2288

# context
* provide a benchmark function entry point and command line for running torch.compile with a baby model containing an EBC (a standalone sketch follows the results below)
* currently supported arguments:
```
    rank: int = 0,
    world_size: int = 2,
    num_features: int = 5,
    batch_size: int = 10,
```
* run command
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=32
```

# results
* on a DevGPU machine
```
rank: 0, world_size: 2, num_features: 16, batch_size: 10, time: 6.65s
rank: 0, world_size: 2, num_features: 32, batch_size: 10, time: 10.99s
rank: 0, world_size: 2, num_features: 64, batch_size: 10, time: 61.55s
rank: 0, world_size: 2, num_features: 128, batch_size: 10, time: 429.14s
```
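
For orientation, a minimal single-process sketch of compiling a baby EBC model with torch.compile; the table sizes, feature names, and batch layout below are illustrative assumptions, and the real benchmark is the pt2_compile_benchmark target above:
```
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor

num_features, batch_size = 5, 10
tables = [
    EmbeddingBagConfig(
        name=f"table_{i}",
        embedding_dim=8,
        num_embeddings=100,
        feature_names=[f"feature_{i}"],
    )
    for i in range(num_features)
]
ebc = EmbeddingBagCollection(tables=tables, device=torch.device("cpu"))

# one id per (feature, sample) pair, purely for illustration
kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=[f"feature_{i}" for i in range(num_features)],
    values=torch.randint(0, 100, (num_features * batch_size,)),
    lengths=torch.ones(num_features * batch_size, dtype=torch.int32),
)

compiled_ebc = torch.compile(ebc)
pooled = compiled_ebc(kjt)    # KeyedTensor of pooled embeddings
print(pooled.values().shape)  # (batch_size, num_features * embedding_dim)
```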

Reviewed By: IvanKobzarev

Differential Revision: D57501708

fbshipit-source-id: 08c9acc4eb1187d4cf775b18b9ea5cfb28175783
Summary:
Pull Request resolved: #2339

Traced the issue back to a change introduced to __init__.py five months ago. Long story short: never add anything to __init__.py, due to a design flaw in the torch.package importer logic.

more context: https://fb.workplace.com/groups/401230229322235/permalink/490062753772315/

Reviewed By: TroyGarden, gnahzg

Differential Revision: D61839464

fbshipit-source-id: d47b235cce3c895069ae49e8acc5083521eceb53
Summary:
Pull Request resolved: #2341

Add a clippy wrapper to shampoo_wrapper. I added shampoo_pt2_compile_config as an input; this is the main difference from the ads distributed wrapper.

Reviewed By: hjmshi

Differential Revision: D60599611

fbshipit-source-id: 2a20b21f203b0f740e1da6b0d2d4873ac34ee426
Summary:
Pull Request resolved: #2313

Two optimizations applied to the DTensor code path:
(1) Skip clipping grads that have 0 elements.
(2) Effectively perform per-module gradient clipping.
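
A minimal sketch of the two optimizations in plain PyTorch (an illustration only, not the actual DTensor-path implementation):
```
import torch
from typing import Iterable

def clip_grad_norm_per_module(
    modules: Iterable[torch.nn.Module], max_norm: float
) -> None:
    for module in modules:
        grads = [p.grad for p in module.parameters() if p.grad is not None]
        # (1) skip grads with zero elements -- nothing to clip
        grads = [g for g in grads if g.numel() > 0]
        if not grads:
            continue
        # (2) clip each module's gradients independently (per-module clipping)
        total_norm = torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g) for g in grads])
        )
        clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for g in grads:
            g.mul_(clip_coef)
```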

Reviewed By: iamzainhuda

Differential Revision: D61447255

fbshipit-source-id: d734fa41d797715e064461ad32b7b3831e818178
…ocs (#2342)

Summary:
Pull Request resolved: #2342

Ran into two issues while enabling the pipeline for a model:
1) The current pipeline logic for finding and swapping a preproc module only works if the preproc module exists at the model level. If the preproc lives in one of the model's child modules, e.g. `model._sparse_arch._preproc_module`, the logic breaks down: finding the module fails because it used `getattr` on the model, and swapping the module fails because it used `setattr` on the model. Solution (see the sketch after this list):
   - Replaced `getattr` and `setattr` with `_find_preproc_module_recursive` and `_swap_preproc_module_recursive` respectively.

2) The logic doesn't support the case where an arg to a preproc module is a constant (e.g. `self.model.constant_value`), since we skip args that aren't `torch.fx.Node` values. However, we should be able to pipeline these cases. Solution:
    - Add a new field to `ArgInfo` called `objects` of type `List[Optional[object]]`. After fx tracing, args can be fx immutable collections, such as `torch.fx.immutable_dict` for an immutable `Dict`; creating a copy converts it back to the original mutable value, so we capture this value in `ArgInfo`. A potential downside is the extra memory overhead, but for this model in particular it was just a small string value.
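
A simplified sketch of what the recursive helpers for issue 1 do; the names match the diff, but the bodies here are illustrative, not the actual implementation:
```
import torch

def _find_preproc_module_recursive(
    model: torch.nn.Module, path: str
) -> torch.nn.Module:
    # Resolve a dotted path such as "_sparse_arch._preproc_module",
    # which a plain getattr(model, path) cannot do.
    module = model
    for name in path.split("."):
        module = getattr(module, name)
    return module

def _swap_preproc_module_recursive(
    model: torch.nn.Module, new_module: torch.nn.Module, path: str
) -> None:
    # setattr on the *parent* of the target module rather than on the model.
    parent_path, _, attr_name = path.rpartition(".")
    parent = _find_preproc_module_recursive(model, parent_path) if parent_path else model
    setattr(parent, attr_name, new_module)
```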

Reviewed By: xing-liu

Differential Revision: D61891459

fbshipit-source-id: fe5d074f8b8937a6154596221a57e2d5213ffc36
…2337)

Summary:
Pull Request resolved: #2337

KTRegroupAsDict was created as a module to cache certain computations consistent across batches for regrouping KeyedTensors into pooled embeddings for Ads models.

However, there was a bug that caused an NE regression for TorchRec inference use cases: the order of KeyedTensors fed into the regroup module can change after sharding, which invalidates the previous caching and results in an NE issue. D61615045 was a temporary fix, but this diff ensures that the module is reset by the official TorchRec API for sharding inference models.

Reviewed By: dstaay-fb, ZhengkaiZ

Differential Revision: D61703144

fbshipit-source-id: 10a13089654e426cecdc4939101cef8668286449
Summary:
Pull Request resolved: #2346

Add support for DTensor state dict output in TW sharding.

Fixed a fused_params error where a flag was passed to TBE even though TBE does not support it.

Reviewed By: joshuadeng

Differential Revision: D61938361

fbshipit-source-id: f7c24ea5647bc087f8036c2271ec1c0c4014c687
Summary:
Pull Request resolved: #2226

# context
* the new op `permute_multi_embedding` outperforms the original op `_fbgemm_permute_pooled_embs`
* this diff switches to the new op

# benchmark
* more results: D58907223, [traces](https://drive.google.com/drive/folders/1DEYozPihmij2zRAyG9AMxaIbcjTPWRVU?usp=drive_link)
* previous prod
 {F1755206204}
* new prod
 {F1755207013}
* metrics
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bound, allows duplicates|
|**[previous prod] permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2-incompatible `pin_and_move`|
|**[new prod] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|

NOTE: the new op takes in `List[List[str]]` and `List[List[int]]`; it currently does not support dynamic shapes and produces an error like the following:
>     1) SerializeError: Failed serializing node kt_regroup_permutes in graph: %kt_regroup_permutes : [num_users=3] = call_function[target=torch.ops.fbgemm.kt_regroup_permutes.default](args = (%ir_custom_op, [[f1], [f2]], [[3], [5]], [[f1], [f2]]), kwargs = {})
...
Caused by SerializeError: Unsupported list/tuple argument type: [<class 'torch.fx.immutable_collections.immutable_list'>, <class 'torch.fx.immutable_collections.immutable_list'>]
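
For context, a small example of regrouping pooled embeddings through the public KeyedTensor API, which dispatches to the permute op discussed above; the keys and dims are made up, and this assumes the KeyedTensor.regroup helper:
```
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# batch of 2, three pooled features with dims 3, 5 and 4
kt = KeyedTensor(
    keys=["f1", "f2", "f3"],
    length_per_key=[3, 5, 4],
    values=torch.randn(2, 12),
)

# regroup into two groups; under the hood this is where the permute op
# (permute_multi_embedding on the new prod path) comes in
grouped = KeyedTensor.regroup([kt], [["f1", "f3"], ["f2"]])
print([t.shape for t in grouped])  # [torch.Size([2, 7]), torch.Size([2, 5])]
```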

Reviewed By: dstaay-fb

Differential Revision: D55277833

fbshipit-source-id: be47179c62b2df48445c78eabf5d7d44582a495b
facebook-github-bot added the CLA Signed and fx labels on Aug 29, 2024