Revert "Pin PT version: Fix FPX Inductor error" #843

msaroufim · 2024-09-08T10:58:46Z

Reverts #790

The problems are mostly CPU-specific. In particular, this feels like a subclass-on-CPU problem, but I'm not sure. cc @bdhirsh

This reverts commit 287458c.

pytorch-bot · 2024-09-08T10:58:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/843

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 47acb46 with merge base 1b317f9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

msaroufim · 2024-09-08T11:33:55Z

The fpx test is no longer failing, but I see some new failures with bitnet that should probably be skipped before we make the release, @andrewor14. Also, save/load is now failing on nightlies, @jerryzh168.

bdhirsh · 2024-09-10T15:11:16Z

@msaroufim do you have a link to the failing tests? I can try to help root cause (and/or at least figure out if it's subclass related)

msaroufim · 2024-09-10T15:17:21Z

Thank you! <3

They're in test/integration/test_integration.py

test_int8_weight_only_quant_subclass_api
test_int8_weight_only_quant_with_freeze
test_save_load_int8woqtensors

And only fail for cpu

bdhirsh · 2024-09-10T21:59:27Z

A couple things:

(1) I confirmed that python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api_1_cpu fails for me locally

(2) I tweaked the test to compile with a few different backends: compile(... backend="inductor") fails, but compile(..., backend="aot_eager_decomp_partition") runs fine

(3) I checked out @eellison's beautiful inductor bisecting tool from here and ran it. It bisected down to the lowering for prims.convert_element_type as the "problematic" bit of inductor. This is technically useful, but doesn't really give us the full story: that lowering hasn't been changed in months, and avoiding lowering that op probably just prevents inductor from doing some other fusions that are causing the numeric difference (I bet if the bisector could bisect on inductor fusions it could tell us more).

(4) Also confirmed that this only repros with cpu inputs.
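The backend comparison in (2) can be sketched as a small harness. This is only an illustration, not the change actually made to the test: `compare_backends`, `compile_fn`, `max_diff`, and `tol` are hypothetical names, and the compile function is passed in so the sketch does not depend on any particular PyTorch build.

```python
def compare_backends(fn, args, backends, compile_fn, max_diff, tol=1e-2):
    """Run fn eagerly, then under each backend, and report which ones diverge.

    compile_fn(fn, backend=name) must return a callable with fn's signature.
    max_diff(expected, actual) must return a non-negative float.
    """
    expected = fn(*args)  # eager result is treated as the reference
    report = {}
    for backend in backends:
        compiled = compile_fn(fn, backend=backend)
        err = max_diff(expected, compiled(*args))
        status = "ok" if err <= tol else "MISMATCH"
        report[backend] = (status, err)
    return report
```

With PyTorch installed, one plausible call would pass `torch.compile` as `compile_fn`, `["inductor", "aot_eager_decomp_partition"]` as `backends`, and `lambda a, b: (a - b).abs().max().item()` as `max_diff`. A backend that reports `MISMATCH` while the others report `ok` narrows the problem to that backend's code generation, which is the conclusion drawn in (2).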

Idk, @eellison do you know any suspicious commits coming from the cpu-inductor side in the last week that could affect numerics, especially in relation to casting ops?
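For readers unfamiliar with the bisection idea in (3), here is a generic sketch. It is not the actual inductor bisecting tool: `bisect_first_bad` and the `fails` callback are hypothetical, and the sketch assumes a single culprit, so that the failure appears exactly when the culprit is in the enabled prefix.

```python
def bisect_first_bad(components, fails):
    """Return the first component whose enablement makes fails(prefix) true.

    Assumes fails([]) is False, fails(components) is True, and that
    failure is monotone in the enabled prefix (single-culprit assumption).
    """
    lo, hi = 0, len(components)  # invariant: components[:lo] passes, components[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(components[:mid]):
            hi = mid  # culprit is somewhere in the first mid components
        else:
            lo = mid  # first mid components are clean
    return components[hi - 1]
```

As noted in (3), a result like this only names the component that flips the outcome. If disabling it merely prevents some other transformation from firing, the true cause lies elsewhere, which is why the single-culprit assumption matters.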

leslie-fang-intel · 2024-09-12T08:00:51Z

Hi @bdhirsh

> (1) I confirmed that python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api_1_cpu fails for me locally

Is it the same failure reported in #890?

eellison · 2024-09-12T15:35:39Z

@bdhirsh, I was on pto last week so i'm not sure. sounds like @leslie-fang-intel has an idea. Would be easier to bisect maybe.

bdhirsh · 2024-09-12T15:40:36Z

@leslie-fang-intel not sure, but seems unlikely (mainly because the issue you linked is a hard error while the subclass is running at compile time, while this is a runtime / bad numerics error)

leslie-fang-intel · 2024-09-13T00:43:45Z

Thanks @bdhirsh. It sounds like a different error then. Maybe after we resolve the first one, we will hit the numeric error you saw. Do you have any idea about the hard error in #890, and why it didn't happen in your test environment?

----------------- Update ---------------

PyTorch: 86335e913557962bf8d415c80dcb7e615468ba42
AO: 8236a87

Comment out these lines to enable the tests:

ao/test/integration/test_integration.py, lines 820 to 821 in 8236a87:

    if TORCH_VERSION_AT_LEAST_2_5 and device == "cpu":
        self.skipTest("Regression introduced in PT nightlies")

ao/test/integration/test_integration.py, lines 832 to 833 in 8236a87:

    if TORCH_VERSION_AT_LEAST_2_5 and device == "cpu":
        self.skipTest("Regression introduced in PT nightlies")
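To show how such a guard behaves, here is a minimal standalone sketch using only the standard library. `version_at_least`, `GatedQuantTest`, and the hard-coded version string are hypothetical stand-ins; the real test uses the `TORCH_VERSION_AT_LEAST_2_5` flag and a parametrized `device`, and this sketch does not reproduce how that flag is computed.

```python
import unittest


def version_at_least(version, minimum):
    """Compare dotted version strings on their (major, minor) fields only."""
    def parse(v):
        # Drop any local suffix such as "+cpu", then keep major and minor.
        return tuple(int(p) for p in v.split("+")[0].split(".")[:2])
    return parse(version) >= parse(minimum)


class GatedQuantTest(unittest.TestCase):
    # Hypothetical stand-in for torch.__version__ on a nightly build.
    torch_version = "2.5.0.dev20240908+cpu"

    def _run_on(self, device):
        if version_at_least(self.torch_version, "2.5") and device == "cpu":
            self.skipTest("Regression introduced in PT nightlies")
        self.assertIn(device, ("cpu", "cuda"))  # placeholder for the real check

    def test_cpu(self):
        self._run_on("cpu")    # skipped on versions >= 2.5

    def test_cuda(self):
        self._run_on("cuda")   # still runs
```

Commenting out the two guard lines, as suggested above, makes the CPU case run again so the regression can be reproduced.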

We saw 2 different errors when running 2 different unit tests:

python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze throws the hard error reported in #890 ([torchao]NotImplementedError: AffineQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.permute.default).
python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api throws the numerical failure reported in pytorch#135831 ([torchao]AssertionError: tensor(2.3359, dtype=torch.float16) not greater than 40 : _int8wo_api failed when compiled with dtype=torch.float16, (m, k, n)=(32, 64, 32)).
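The "not greater than 40" assertion reads like a signal-to-quantization-noise threshold in decibels. The thread does not state the metric, so treat that as an assumption. Under it, a minimal pure-Python sketch of such a check (with the hypothetical helper name `sqnr_db`) would be:

```python
import math


def sqnr_db(reference, actual):
    """20 * log10(||reference|| / ||reference - actual||) for flat sequences."""
    signal = math.sqrt(sum(r * r for r in reference))
    noise = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, actual)))
    if noise == 0.0:
        return math.inf  # identical outputs: no quantization noise at all
    return 20.0 * math.log10(signal / noise)
```

If the assumption holds, 40 dB corresponds to an error norm of about 1% of the signal norm, while a value near 2.3 dB would mean the error norm is roughly three quarters of the signal norm. That is consistent with calling this a bad-numerics failure rather than a small precision drift.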

leslie-fang-intel · 2024-09-13T03:27:50Z

Hi @bdhirsh, I have created PRs to fix these 2 failures:

PR in Torch: [AO][Inductor] Enable WOQ fusion pattern with permute (pytorch#135928) addresses the numerical failures, which were caused by Inductor missing the fusion of this WOQ int8 pattern and falling back to the reference implementation. With the fusion re-enabled in Inductor for this pattern, the unit tests now pass on my local system.
PR in AO: Fix WOQ int8 failures (#884) addresses the other UT failure, python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze. It seems we still need unwrap_tensor_subclass for Inductor freezing. cc @eellison here.

Please kindly help to review these 2 PRs.
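For context on what a weight-only int8 (WOQ) linear pattern computes, here is a pure-Python sketch of an unfused reference path. It assumes symmetric per-output-channel quantization, which is a common scheme but is not confirmed by this thread. `quantize_per_channel` and `woq_linear_reference` are hypothetical names, not torchao or Inductor functions.

```python
def quantize_per_channel(weight):
    """Quantize each weight row (one per output channel) to int8 values.

    Returns (quantized rows, per-row scales), where w is approximately q * scale.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def woq_linear_reference(x, q_rows, scales):
    """Compute (x @ W_q^T) * scale row by row, keeping activations in float."""
    return [
        [sum(xi * qi for xi, qi in zip(x_row, q_row)) * s
         for q_row, s in zip(q_rows, scales)]
        for x_row in x
    ]
```

A fused kernel would compute the same expression in one pass. The PR description above attributes the test failures to the fused pattern no longer being matched, so execution took the unfused path instead.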

bdhirsh · 2024-09-17T00:06:32Z

@jerryzh168 I vaguely remember it being a problem to re-enable unwrap_tensor_subclass() in the torch.compile path for torchao, since dynamo had a bad interaction with the parametrization code. Does that ring a bell?

Either way, we should fix the freezing interaction (@IvanKobzarev is taking a look)

Revert "Pin PT version: Fix FPX Inductor error (#790)"

dc7039f

This reverts commit 287458c.

facebook-github-bot added the CLA Signed label Sep 8, 2024

Merge branch 'main' into revert-790-msaroufim-patch-17

1571125

Merge branch 'main' into revert-790-msaroufim-patch-17

e9fa2a4

msaroufim mentioned this pull request Sep 9, 2024

Revert "Pin PT version: Fix FPX Inductor error" #818

Closed

msaroufim added 5 commits September 10, 2024 03:16

udpates

f34a0fe

yolo

4499047

yolo

ec2c380

yolo

45969b9

yolo

47acb46

andrewor14 approved these changes Sep 10, 2024

msaroufim merged commit e283743 into main Sep 10, 2024
17 checks passed

msaroufim deleted the revert-790-msaroufim-patch-17 branch September 10, 2024 14:32

jainapurva pushed a commit that referenced this pull request Sep 10, 2024

Revert "Pin PT version: Fix FPX Inductor error" (#843)

f65ad6d

* Revert "Pin PT version: Fix FPX Inductor error (#790)" This reverts commit 287458c. * udpates * yolo * yolo * yolo * yolo

leslie-fang-intel mentioned this pull request Sep 13, 2024

Fix WOQ int8 failures #884

Merged
