Merge #2349

Closed
wants to merge 16 commits into from

Conversation

PaulZhang12
Contributor

No description provided.

PaulZhang12 and others added 16 commits August 21, 2024 09:26
Summary: Pull Request resolved: #2302

Reviewed By: joshuadeng

Differential Revision: D61312123

Pulled By: PaulZhang12

fbshipit-source-id: 22be2161d4cfcfe9ebe23309388f3bda769f6fdc
Summary:
Pull Request resolved: #2330

So:

(1) We did not have consistent handling of dtypes between quant and sharded quant modules.
(2) Users correctly identified that on the CPU path we want indices and offsets to have the same dtype (to save templates). After talking to sryap, we want to simply make them consistent and leave them as int32 where possible.
(3) CUDA is always int32. This is likely to become an issue for ZCH models pushing beyond 2B-row shards for inference; for now we just need to keep shards within that size.
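
As a rough sketch of point (2), assuming a hypothetical helper (this is not the actual TorchRec code), the policy looks like:
```
import torch
from typing import Tuple

# Hypothetical helper: keep indices and offsets in one consistent dtype,
# staying in int32 when both already are int32 (point (2) above).
def align_index_dtypes(
    indices: torch.Tensor, offsets: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    if indices.dtype == torch.int32 and offsets.dtype == torch.int32:
        return indices, offsets  # already consistent, stay in int32
    # otherwise fall back to a common wider dtype
    return indices.to(torch.int64), offsets.to(torch.int64)
```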

Reviewed By: sryap

Differential Revision: D61446242

fbshipit-source-id: 779d22822d4ac7656675d9ac3887b5ac3ecbfa39
Summary:
Pull Request resolved: #2329

TSIA

Reviewed By: yuhuishi-convect, joshuadeng

Differential Revision: D61613723

fbshipit-source-id: 343c9631dd714cfd0d64fdac3c3210afc2d744f5
Summary:
Pull Request resolved: #2310

Add new pipeline for CompiledAutograd development.

Reviewed By: dstaay-fb, xmfan, yf225

Differential Revision: D61403499

fbshipit-source-id: 7bf0720e0c1078815315278fffd79c2d7470882f
Summary:
Pull Request resolved: #2331

The initial prod strategy is to use an unsharded MCH module in front of a Q / SQ EC. This generally seems OK, but the biggest issues are:

[2] The uneven sharding flag was not respected for the RW sequence GPU case; easy fix. cc: gnahzg
[3] The get_propogate device handling is a bit unfortunate/messy and will be cleaned up in a follow-up task, but an edge case was found with respect to the CPU path. cc: gnahzg

Reviewed By: gnahzg

Differential Revision: D61572421

fbshipit-source-id: 412ed7947a7cde1991518d6a979db3bc7832fc68
Summary:
Pull Request resolved: #2324

Add missing pipeline_preproc and custom_model_fwd args.

Reviewed By: chrisxcai

Differential Revision: D61564467

fbshipit-source-id: 280f0e83de13e10ff2901bbda611d7ba76c8ac68
Summary:
Pull Request resolved: #2326

# context
* want to resolve graph break: [failures_and_restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpKJM3FI/failures_and_restarts.html), P1537573230
```
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time.  You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs.  Could not guard on data-dependent expression Eq(((2*u48)//(u48 + u49)), 0) (unhinted: Eq(((2*u48)//(u48 + u49)), 0)).  (Size-like symbols: u49, u48)

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_refs/__init__.py", line 3950, in unbind
    if guard_size_oblivious(t.shape[dim] == 0):

For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u49,u48"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing

User Stack (most recent call last):
  (snipped, see stack below for prefix)
...
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torchrec/sparse/jagged_tensor.py", line 2241, in to_dict
    _jt_dict = _maybe_compute_kjt_to_jt_dict(
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torchrec/sparse/jagged_tensor.py", line 1226, in _maybe_compute_kjt_to_jt_dict
    split_lengths = torch.unbind(
```
* we added a [shape check](https://fburl.com/code/p02u4mck) (a standalone sketch of this pattern follows this list):
```
        if pt2_guard_size_oblivious(lengths.numel() > 0):
            strided_lengths = lengths.view(-1, stride)
            if not torch.jit.is_scripting() and is_torchdynamo_compiling():
                torch._check(strided_lengths.shape[0] > 0)
                torch._check(strided_lengths.shape[1] > 0)
            split_lengths = torch.unbind(
                strided_lengths,
                dim=0,
            )
```
* however the error is still there
```
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_refs/__init__.py", line 3950, in unbind
    if guard_size_oblivious(t.shape[dim] == 0):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/fx/experimental/symbolic_shapes.py", line 253, in guard_size_oblivious
    return expr.node.guard_size_oblivious("", 0)
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/fx/experimental/sym_node.py", line 503, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
```
* [implementation](https://fburl.com/code/20iue1ib)
```
@register_decomposition(aten.unbind)
def unbind(t: TensorLikeType, dim: int = 0) -> TensorSequenceType:
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

    dim = utils.canonicalize_dim(t.ndim, dim)
    torch._check_index(
        len(t.shape) > 0,
        lambda: "Dimension specified as 0 but tensor has no dimensions",
    )
    if guard_size_oblivious(t.shape[dim] == 0):  # <------- here
        return ()
    else:
        return tuple(
            torch.squeeze(s, dim) for s in torch.tensor_split(t, t.shape[dim], dim)
        )
```
* with D61677207 [no graph break at _maybe_compute_kjt_to_jt_dict](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNcI14t/failures_and_restarts.html)
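
To make the pattern above concrete outside the TorchRec code base, a minimal self-contained sketch (the function name and shapes are illustrative assumptions) of hinting the compiler with torch._check before torch.unbind:
```
import torch

def split_lengths_per_key(lengths: torch.Tensor, stride: int):
    # lengths may have data-dependent (unbacked) sizes under torch.compile;
    # torch._check gives explicit hints so the size-oblivious guard inside
    # torch.unbind (t.shape[dim] == 0) does not trigger a graph break.
    strided_lengths = lengths.view(-1, stride)
    torch._check(strided_lengths.shape[0] > 0)
    torch._check(strided_lengths.shape[1] > 0)
    return torch.unbind(strided_lengths, dim=0)

# eager usage example
print(split_lengths_per_key(torch.tensor([1, 2, 3, 4]), stride=2))
```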

Reviewed By: IvanKobzarev

Differential Revision: D55277785

fbshipit-source-id: d3bbb2439427677dc8e9a58560679acbf6f872d4
Summary:
X-link: ctrl-labs/src2#33884

X-link: pytorch/executorch#4810

Pull Request resolved: #2322

X-link: pytorch/pytorch#134054

See #131429

Reviewed By: laithsakka

Differential Revision: D61493706

fbshipit-source-id: d2b3feeff2abf8610e4e9940a1b93b5f80777dc2
Summary:
Pull Request resolved: #2288

# context
* provide a benchmark function entry point and command line for running torch.compile with a baby model containing an EBC (a standalone sketch follows the results below)
* currently supported arguments:
```
    rank: int = 0,
    world_size: int = 2,
    num_features: int = 5,
    batch_size: int = 10,
```
* run command
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=32
```

# results
* on a DevGPU machine
```
rank: 0, world_size: 2, num_features: 16, batch_size: 10, time: 6.65s
rank: 0, world_size: 2, num_features: 32, batch_size: 10, time: 10.99s
rank: 0, world_size: 2, num_features: 64, batch_size: 10, time: 61.55s
rank: 0, world_size: 2, num_features: 128, batch_size: 10, time: 429.14s
```
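
For orientation, a minimal single-process sketch of compiling a baby EBC model with torch.compile; the table sizes, feature names, and batch layout below are illustrative assumptions, and the real benchmark is the pt2_compile_benchmark target above:
```
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor

num_features, batch_size = 5, 10
tables = [
    EmbeddingBagConfig(
        name=f"table_{i}",
        embedding_dim=8,
        num_embeddings=100,
        feature_names=[f"feature_{i}"],
    )
    for i in range(num_features)
]
ebc = EmbeddingBagCollection(tables=tables, device=torch.device("cpu"))

# one id per (feature, sample) pair, purely for illustration
kjt = KeyedJaggedTensor.from_lengths_sync(
    keys=[f"feature_{i}" for i in range(num_features)],
    values=torch.randint(0, 100, (num_features * batch_size,)),
    lengths=torch.ones(num_features * batch_size, dtype=torch.int32),
)

compiled_ebc = torch.compile(ebc)
pooled = compiled_ebc(kjt)    # KeyedTensor of pooled embeddings
print(pooled.values().shape)  # (batch_size, num_features * embedding_dim)
```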

Reviewed By: IvanKobzarev

Differential Revision: D57501708

fbshipit-source-id: 08c9acc4eb1187d4cf775b18b9ea5cfb28175783
Summary:
Pull Request resolved: #2339

Traced the issue back to a change introduced to __init__.py five months ago. Long story short: never add anything to __init__.py, due to a design flaw in the torch.package importer logic.

more context: https://fb.workplace.com/groups/401230229322235/permalink/490062753772315/

Reviewed By: TroyGarden, gnahzg

Differential Revision: D61839464

fbshipit-source-id: d47b235cce3c895069ae49e8acc5083521eceb53
Summary:
Pull Request resolved: #2341

Add a clippy wrapper to shampoo_wrapper. I added shampoo_pt2_compile_config as an input; this is the main difference from the ads distributed wrapper.

Reviewed By: hjmshi

Differential Revision: D60599611

fbshipit-source-id: 2a20b21f203b0f740e1da6b0d2d4873ac34ee426
Summary:
Pull Request resolved: #2313

Two optimizations applied to the DTensor code path:
(1) Skip clipping grads that have 0 elements.
(2) Effectively perform per-module gradient clipping.
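
A minimal sketch of the two optimizations in plain PyTorch (an illustration only, not the actual DTensor-path implementation):
```
import torch
from typing import Iterable

def clip_grad_norm_per_module(
    modules: Iterable[torch.nn.Module], max_norm: float
) -> None:
    for module in modules:
        grads = [p.grad for p in module.parameters() if p.grad is not None]
        # (1) skip grads with zero elements -- nothing to clip
        grads = [g for g in grads if g.numel() > 0]
        if not grads:
            continue
        # (2) clip each module's gradients independently (per-module clipping)
        total_norm = torch.linalg.vector_norm(
            torch.stack([torch.linalg.vector_norm(g) for g in grads])
        )
        clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for g in grads:
            g.mul_(clip_coef)
```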

Reviewed By: iamzainhuda

Differential Revision: D61447255

fbshipit-source-id: d734fa41d797715e064461ad32b7b3831e818178
…ocs (#2342)

Summary:
Pull Request resolved: #2342

Ran into two issues while enabling the pipeline for a model:
1) The current pipeline logic for finding and swapping a preproc module only works if the preproc module exists at the model level. If the preproc lives in one of the model's child modules, e.g. `model._sparse_arch._preproc_module`, the logic breaks down: finding the module fails because it used `getattr` on the model, and swapping the module fails because it used `setattr` on the model. Solution (see the sketch after this list):
   - Replaced `getattr` and `setattr` with `_find_preproc_module_recursive` and `_swap_preproc_module_recursive` respectively.

2) The logic doesn't support the case where an arg to a preproc module is a constant (e.g. `self.model.constant_value`), since we skip args that aren't `torch.fx.Node` values. However, we should be able to pipeline these cases. Solution:
    - Add a new field to `ArgInfo` called `objects` of type `List[Optional[object]]`. After fx tracing, args can be fx immutable collections, such as `torch.fx.immutable_dict` for an immutable `Dict`; creating a copy converts it back to the original mutable value, so we capture this value in `ArgInfo`. A potential downside is the extra memory overhead, but for this model in particular it was just a small string value.
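
A simplified sketch of what the recursive helpers for issue 1 do; the names match the diff, but the bodies here are illustrative, not the actual implementation:
```
import torch

def _find_preproc_module_recursive(
    model: torch.nn.Module, path: str
) -> torch.nn.Module:
    # Resolve a dotted path such as "_sparse_arch._preproc_module",
    # which a plain getattr(model, path) cannot do.
    module = model
    for name in path.split("."):
        module = getattr(module, name)
    return module

def _swap_preproc_module_recursive(
    model: torch.nn.Module, new_module: torch.nn.Module, path: str
) -> None:
    # setattr on the *parent* of the target module rather than on the model.
    parent_path, _, attr_name = path.rpartition(".")
    parent = _find_preproc_module_recursive(model, parent_path) if parent_path else model
    setattr(parent, attr_name, new_module)
```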

Reviewed By: xing-liu

Differential Revision: D61891459

fbshipit-source-id: fe5d074f8b8937a6154596221a57e2d5213ffc36
…2337)

Summary:
Pull Request resolved: #2337

KTRegroupAsDict was created as a module to cache certain computations consistent across batches for regrouping KeyedTensors into pooled embeddings for Ads models.

However, there was a bug that caused an NE regression for TorchRec inference use cases: the order of KeyedTensors fed into the regroup module can change after sharding, which invalidates the previous caching and results in an NE issue. D61615045 was a temporary fix, but this diff ensures that the module is reset by the official TorchRec API for sharding inference models.

Reviewed By: dstaay-fb, ZhengkaiZ

Differential Revision: D61703144

fbshipit-source-id: 10a13089654e426cecdc4939101cef8668286449
Summary:
Pull Request resolved: #2346

Add support for DTensor state dict output in TW sharding.

Fixed a fused_params error where a flag was passed to TBE even though TBE does not support it.

Reviewed By: joshuadeng

Differential Revision: D61938361

fbshipit-source-id: f7c24ea5647bc087f8036c2271ec1c0c4014c687
Summary:
Pull Request resolved: #2226

# context
* the new op `permute_multi_embedding` outperforms the original op `_fbgemm_permute_pooled_embs`
* this diff switches to the new op

# benchmark
* more results: D58907223, [traces](https://drive.google.com/drive/folders/1DEYozPihmij2zRAyG9AMxaIbcjTPWRVU?usp=drive_link)
* previous prod
 {F1755206204}
* new prod
 {F1755207013}
* metrics
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bound, allows duplicates|
|**[previous prod] permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2-incompatible `pin_and_move`|
|**[new prod] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|

NOTE: the new op takes in `List[List[str]]` and `List[List[int]]`; it currently does not support dynamic shapes and produces an error like the following:
>     1) SerializeError: Failed serializing node kt_regroup_permutes in graph: %kt_regroup_permutes : [num_users=3] = call_function[target=torch.ops.fbgemm.kt_regroup_permutes.default](args = (%ir_custom_op, [[f1], [f2]], [[3], [5]], [[f1], [f2]]), kwargs = {})
...
Caused by SerializeError: Unsupported list/tuple argument type: [<class 'torch.fx.immutable_collections.immutable_list'>, <class 'torch.fx.immutable_collections.immutable_list'>]
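
For context, a small example of regrouping pooled embeddings through the public KeyedTensor API, which dispatches to the permute op discussed above; the keys and dims are made up, and this assumes the KeyedTensor.regroup helper:
```
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# batch of 2, three pooled features with dims 3, 5 and 4
kt = KeyedTensor(
    keys=["f1", "f2", "f3"],
    length_per_key=[3, 5, 4],
    values=torch.randn(2, 12),
)

# regroup into two groups; under the hood this is where the permute op
# (permute_multi_embedding on the new prod path) comes in
grouped = KeyedTensor.regroup([kt], [["f1", "f3"], ["f2"]])
print([t.shape for t in grouped])  # [torch.Size([2, 7]), torch.Size([2, 5])]
```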

Reviewed By: dstaay-fb

Differential Revision: D55277833

fbshipit-source-id: be47179c62b2df48445c78eabf5d7d44582a495b
facebook-github-bot added the CLA Signed and fx labels on Aug 29, 2024