Add CPU AMP instance type for Inductor CPU AMP tests (#5089)

zxd1997066 · tylertitsworth · Tyler Titsworth · web-flow · commit 8b3bd34cab72 · 2024-04-30T09:32:44.000-07:00
PR [#116456](pytorch/pytorch#116456) introduced the initial TorchInductor CPU performance smoke test on c5.12xlarge instances into PyTorch CI. The FP32-related issues found during our regular CPU Dashboard refresh tests have been trending downward, but we have observed a recent increase in AMP-related issues. Because the community typically lacks access to SPR (Sapphire Rapids) nodes for debugging, we propose enabling a similar AMP test on c7i instances in the CI pipeline, as outlined in the earlier proposal. To make this possible, we request that SPR instances be added to the PyTorch runner pool. The performance test needs the most stable environment available, so we suggest running it on c7i.metal-24xl if possible.

We counted the launches of the Inductor CPU smoke test from Apr 1st to Apr 7th. The results are below; the average is **174** launches per day. However, many tests are cancelled after launch for various reasons, such as a PR update, so we assume that 70% ~ 100% of launched tests run to completion.

| Date | Apr 1st | Apr 2nd | Apr 3rd | Apr 4th | Apr 5th | Apr 6th | Apr 7th |
| -- | -- | -- | -- | -- | -- | -- | -- |
| Launches | 110 | 187 | 182 | 236 | 253 | 162 | 89 |

Each test takes ~20 minutes and runs on c5.12xlarge, whose base price is **2.04 USD per hour**, so the current Inductor CPU smoke test costs **82.82 ~ 118.32 USD per day** on average. If a c7i.metal-24xl instance is available, we can add the AMP data type to the smoke test; the expected run time is ~30 minutes per run, including the existing FP32 data type. The base price of c7i.metal-24xl is **4.284 USD per hour**, so the average cost of the test would become **260.896 ~ 372.708 USD per day** after adding the c7i.metal-24xl instance and the AMP data type.

We also plan to add an accuracy test to CI for AMP with static shapes and the default wrapper. The accuracy test does not need the same level of stability, so we have two options:

1. run it on a c7i.metal-24xl instance;
2. run it on c7i.16xlarge.

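The smoke-test cost ranges quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper name `daily_cost_range` and the constant names are made up for illustration, and the 70% ~ 100% completion fraction is the assumption stated in the message, not a measured value.

```python
# Launch counts of the Inductor CPU smoke test, Apr 1st to Apr 7th (from the message).
LAUNCHES = [110, 187, 182, 236, 253, 162, 89]

# Hourly base prices in USD, as quoted in the message.
C5_12XLARGE_USD_PER_HOUR = 2.04
C7I_METAL_24XL_USD_PER_HOUR = 4.284


def daily_cost_range(runs_per_day, hours_per_run, usd_per_hour, completion=(0.7, 1.0)):
    """Return (low, high) daily cost in USD.

    `completion` is the assumed fraction of launched runs that finish
    (hypothetical helper; the 70% ~ 100% range comes from the message).
    """
    full = runs_per_day * hours_per_run * usd_per_hour
    return tuple(round(full * fraction, 3) for fraction in completion)


# The 7-day mean is about 174.14; the message uses 174 launches per day.
avg_launches = sum(LAUNCHES) // len(LAUNCHES)

# Current setup: ~20 minutes per run on c5.12xlarge (FP32 only).
current = daily_cost_range(avg_launches, 20 / 60, C5_12XLARGE_USD_PER_HOUR)

# Proposed setup: ~30 minutes per run on c7i.metal-24xl (FP32 + AMP).
proposed = daily_cost_range(avg_launches, 30 / 60, C7I_METAL_24XL_USD_PER_HOUR)

print(avg_launches, current, proposed)
```

Under these assumptions, `current` and `proposed` agree with the 82.82 ~ 118.32 and 260.896 ~ 372.708 USD per day ranges given in the message, up to rounding.
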
We measured the run time of both options; the results are below.

| Option | Instance type | Threads | Precision | Mode | Shapes | Wrappers | TB Acc (h) | HF Acc (h) | TIMM Acc (h) | Total Time (h) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| option1 | c7i.metal-24xl | Multi-thread-single-socket | amp | inference | static | default | 0.6 | 0.3 | 0.6 | 1.5 |
| option2 | c7i.16xlarge | Multi-thread-single-socket | amp | inference | static | default | 0.6 | 0.3 | 0.6 | 1.5 |

We can dispatch the accuracy test to multiple instances to save time. For option 1, the average base cost of the accuracy test is **782.69 ~ 1118.124** USD per day (upper bound: 174 * 1.5 * 4.284; lower bound: 70% of that). The base price of c7i.16xlarge is **2.856 USD per hour**, so for option 2 the average base cost of the accuracy test is **521.79 ~ 745.416** USD per day (upper bound: 174 * 1.5 * 2.856).

For the future nightly performance test, we plan to run multi-thread static shape with the default wrapper and dynamic shape with the CPP wrapper for the fp32 data type, and multi-thread static shape with the default wrapper and dynamic shape with the default wrapper for the AMP data type. We measured the run time of each test; the results are below.


| Instance type | Threads | Precision | Mode | Shapes | Wrappers | TB Perf (h) | TB Acc (h) | HF Perf (h) | HF Acc (h) | TIMM Perf (h) | TIMM Acc (h) | Total Time (h) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| c5.12xlarge | Multi-thread-single-socket | fp32 | inference | static | default | 1.8 | 1 | 2 | 0.5 | 3 | 1 | 9.3 |
| c5.12xlarge | Multi-thread-single-socket | fp32 | inference | dynamic | cpp | 2.7 | 1.3 | 2.3 | 0.8 | 3 | 2 | 12.1 |
| c7i.metal-24xl | Multi-thread-single-socket | fp32 | inference | static | default | 1 | 0.6 | 1 | 0.5 | 1.5 | 0.6 | 5.2 |
| c7i.metal-24xl | Multi-thread-single-socket | fp32 | inference | dynamic | cpp | 1.7 | 1 | 1.2 | 0.5 | 1.8 | 1 | 7.2 |
| c7i.metal-24xl | Multi-thread-single-socket | amp | inference | static | default | 0.8 | 0.6 | 0.7 | 0.3 | 1 | 0.6 | 4 |
| c7i.metal-24xl | Multi-thread-single-socket | amp | inference | dynamic | default | 0.8 | 0.7 | 1 | 0.4 | 1.2 | 0.8 | 4.9 |

The current c5.12xlarge can cover the fp32 nightly test; its base cost is **43.66** USD per day (9.3 * 2.04 + 12.1 * 2.04). If a c7i.metal-24xl instance is available, both the fp32 and the AMP nightly tests can run on it. The base cost of the fp32 nightly test would then become **53.12** USD per day (5.2 * 4.284 + 7.2 * 4.284), and the base cost of the AMP nightly test would be **38.13** USD per day (4 * 4.284 + 4.9 * 4.284).

---------

Co-authored-by: Tyler Titsworth <titswortht@gmail.com>
Co-authored-by: Tyler Titsworth <tyler.titsworth@intel.com>
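The accuracy-test and nightly cost figures in the commit message above follow from the quoted hourly prices and measured run times. The sketch below recomputes them; the function names (`nightly_cost`, `accuracy_cost_range`) and the dictionary layout are illustrative assumptions, and the numbers are copied by hand from the message.

```python
# Hourly base prices in USD, as quoted in the commit message.
USD_PER_HOUR = {
    "c5.12xlarge": 2.04,
    "c7i.16xlarge": 2.856,
    "c7i.metal-24xl": 4.284,
}

# Nightly totals in hours, keyed by (instance, precision, shapes, wrapper),
# copied from the "Total Time" column of the nightly table.
NIGHTLY_HOURS = {
    ("c5.12xlarge", "fp32", "static", "default"): 9.3,
    ("c5.12xlarge", "fp32", "dynamic", "cpp"): 12.1,
    ("c7i.metal-24xl", "fp32", "static", "default"): 5.2,
    ("c7i.metal-24xl", "fp32", "dynamic", "cpp"): 7.2,
    ("c7i.metal-24xl", "amp", "static", "default"): 4.0,
    ("c7i.metal-24xl", "amp", "dynamic", "default"): 4.9,
}


def nightly_cost(instance, precision):
    """Daily base cost in USD of one nightly run of every config for this precision."""
    hours = sum(
        total
        for (inst, prec, _shapes, _wrapper), total in NIGHTLY_HOURS.items()
        if inst == instance and prec == precision
    )
    return round(hours * USD_PER_HOUR[instance], 2)


def accuracy_cost_range(instance, runs_per_day=174, hours_per_run=1.5,
                        completion=(0.7, 1.0)):
    """(low, high) daily cost of the per-PR accuracy test, assuming that only
    70% ~ 100% of launched runs complete (the assumption made in the message)."""
    full = runs_per_day * hours_per_run * USD_PER_HOUR[instance]
    return tuple(round(full * fraction, 3) for fraction in completion)


print(nightly_cost("c5.12xlarge", "fp32"))
print(nightly_cost("c7i.metal-24xl", "fp32"))
print(nightly_cost("c7i.metal-24xl", "amp"))
print(accuracy_cost_range("c7i.metal-24xl"))
print(accuracy_cost_range("c7i.16xlarge"))
```

Keeping the run times in one table-like dictionary makes it easy to re-run the estimate if the measured times or the instance prices change.
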
diff --git a/.github/arc-node-config.yaml b/.github/arc-node-config.yaml
@@ -16,6 +16,34 @@ nodeConfig:
       - key: kubernetes.io/os
         operator: In
         values: ["linux"]
+  - nodeType: compute-amd64-amp
+    requirements:
+      - key: "karpenter.k8s.aws/instance-category"
+        operator: In
+        values: ["c7i"]
+      - key: "karpenter.k8s.aws/instance-cpu"
+        operator: In
+        values: ["16", "32", "48", "64", "96", "192"]
+      - key: "kubernetes.io/arch"
+        operator: In
+        values: ["amd64"]
+      - key: "karpenter.sh/capacity-type"
+        operator: In
+        values: ["on-demand"]
+  - nodeType: compute-amd64-amp-metal
+    requirements:
+      - key: "karpenter.k8s.aws/instance-category"
+        operator: In
+        values: ["c7i.metal"]
+      - key: "karpenter.k8s.aws/instance-cpu"
+        operator: In
+        values: ["96", "192"]
+      - key: "kubernetes.io/arch"
+        operator: In
+        values: ["amd64"]
+      - key: "karpenter.sh/capacity-type"
+        operator: In
+        values: ["on-demand"]
   - nodeType: compute-amd64-avx512
     requirements:
       - key: "karpenter.k8s.aws/instance-category"
diff --git a/.github/arc-runner-config.yaml b/.github/arc-runner-config.yaml
@@ -15,6 +15,22 @@ runnerConfig:
     runnerLabel: linux.2xlarge
     containerMode: dind-rootless
     workingDiskSpace: 150Gi
+  - cpu: 16.0
+    maxRunners: 100
+    memory: 16Gi
+    minRunners: 0
+    nodeType: compute-amd64-amp
+    runnerLabel: linux.16xl.spr
+    containerMode: dind-rootless
+    workingDiskSpace: 200Gi
+  - cpu: 24.0
+    maxRunners: 100
+    memory: 45Gi
+    minRunners: 0
+    nodeType: compute-amd64-amp-metal
+    runnerLabel: linux.24xl.spr-metal
+    containerMode: dind-rootless
+    workingDiskSpace: 200Gi
   - cpu: 8.0
     maxRunners: 300
     memory: 16128Mi
diff --git a/.github/scale-config.yml b/.github/scale-config.yml
@@ -27,6 +27,18 @@ runner_types:
     is_ephemeral: false
     max_available: 1000
     os: linux
+  linux.24xl.spr-metal:
+    disk_size: 200
+    instance_type: c7i.metal-24xl
+    is_ephemeral: false
+    max_available: 30
+    os: linux
+  linux.16xlarge.spr:
+    disk_size: 200
+    instance_type: c7i.16xlarge
+    is_ephemeral: false
+    max_available: 30
+    os: linux
   linux.12xlarge.ephemeral:
     disk_size: 200
     instance_type: c5.12xlarge