Add CPU AMP instance type for Inductor CPU AMP tests (#5089)
PR [#116456](pytorch/pytorch#116456)
successfully introduced the initial TorchInductor CPU performance smoke
test on c5.12xlarge instances into PyTorch CI. The FP32-related issues
identified during our regular CPU Dashboard refresh tests are trending
downward.

However, we have observed a recent increase in AMP-related issues. Since
the community typically lacks access to SPR (Sapphire Rapids) nodes for
debugging, we propose enabling a similar test for AMP on c7i instances
within the CI pipeline, as outlined in the earlier proposal. To facilitate
this, we request the addition of SPR instances to the PyTorch runner pool.
The performance test needs optimal stability, so we suggest using
c7i.metal-24xl for it if possible.

We counted how many times the inductor CPU smoke test was launched from
Apr 1st to Apr 7th. The results are below; the average is **174** launches
per day. However, many tests are cancelled after launch for various
reasons, such as changes to the PR. We assume that 70% ~ 100% of launched
tests run to completion.

date | Apr 1st | Apr 2nd | Apr 3rd | Apr 4th | Apr 5th | Apr 6th | Apr 7th
-- | -- | -- | -- | -- | -- | -- | --
launch times | 110 | 187 | 182 | 236 | 253 | 162 | 89
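
As a quick check, the quoted average can be reproduced from the daily launch counts in the table. This is only an illustrative sketch; the variable names are not part of any CI tooling.

```python
# Launch counts per day, copied from the table (Apr 1st to Apr 7th).
launches = [110, 187, 182, 236, 253, 162, 89]

total_launches = sum(launches)
average_per_day = total_launches / len(launches)

print(total_launches)          # 1219
print(round(average_per_day))  # 174
```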

Each test takes ~20 minutes and runs on c5.12xlarge, whose base price
is **2.04 USD per hour**. The current inductor CPU smoke test therefore
costs **82.82 ~ 118.32 USD per day** on average (174 * 20/60 * 2.04 at
the 100% end, 70% of that at the low end). If the c7i.metal-24xl instance
is available, we can add the AMP data type to the inductor CPU smoke test.
The expected run time is then ~30 minutes per launch, including the
existing FP32 data type. The base price of c7i.metal-24xl is **4.284 USD
per hour**, so the average cost of the test becomes **260.896 ~ 372.708
USD per day** after adding the c7i.metal-24xl instance and the AMP data
type to the inductor CPU smoke test.
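
The smoke test cost ranges above follow from launches per day, run time, hourly price, and the assumed 70% ~ 100% completion rate. A minimal sketch of that arithmetic, using only the figures quoted in the text (the function and constant names are hypothetical):

```python
# Figures quoted in the text; the names below are illustrative only.
LAUNCHES_PER_DAY = 174         # average launches per day (Apr 1st to Apr 7th)
COMPLETION_RATES = (0.7, 1.0)  # assumed share of launches that run to completion


def daily_cost_range(minutes_per_run: float, usd_per_hour: float) -> tuple[float, float]:
    """Return (low, high) USD per day for the assumed completion rates."""
    full_cost = LAUNCHES_PER_DAY * (minutes_per_run / 60) * usd_per_hour
    low, high = (full_cost * rate for rate in COMPLETION_RATES)
    return low, high


# c5.12xlarge, FP32 only, ~20 minutes per run
current = daily_cost_range(minutes_per_run=20, usd_per_hour=2.04)
# c7i.metal-24xl, FP32 + AMP, ~30 minutes per run
proposed = daily_cost_range(minutes_per_run=30, usd_per_hour=4.284)

print(f"current:  {current[0]:.2f} ~ {current[1]:.2f} USD/day")
print(f"proposed: {proposed[0]:.3f} ~ {proposed[1]:.3f} USD/day")
```

The printed ranges should match the bold figures in the paragraph above. Base pricing here means the on-demand hourly rate quoted in the text, and actual billing may differ.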

We also plan to add an accuracy test to CI for the AMP static-shape
default wrapper. The accuracy test does not need optimal stability, so we
have two options: 1. run it on a c7i.metal-24xl instance; 2. run it on
c7i.16xlarge. We measured the time cost of both options; the results are
as follows.


Option | Instance type | Threads | Precision | Mode | Shapes | Wrappers | TB Acc (h) | HF Acc (h) | TIMM Acc (h) | Total Time (h)
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
option1 | c7i.metal-24xl | Multi-thread-single-socket | amp | inference | static | default | 0.6 | 0.3 | 0.6 | 1.5
option2 | c7i.16xlarge | Multi-thread-single-socket | amp | inference | static | default | 0.6 | 0.3 | 0.6 | 1.5

We can dispatch the accuracy test to multiple instances to save time. For
option1, the average base cost of the accuracy test is **782.69 ~
1118.124** USD per day (upper bound: 174 * 1.5 * 4.284; lower bound: 70%
of that). The base price of c7i.16xlarge is **2.856 USD per hour**, so for
option2 the average base cost of the accuracy test is **521.79 ~
745.416** USD per day (upper bound: 174 * 1.5 * 2.856).
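
The two accuracy test options can be compared with the same arithmetic. The sketch below assumes, as the text does, 174 launches per day, 1.5 hours per run, and a 70% ~ 100% completion rate; the dictionary keys are just labels chosen for this example.

```python
LAUNCHES_PER_DAY = 174
HOURS_PER_RUN = 1.5            # total accuracy time from the table
COMPLETION_RATES = (0.7, 1.0)  # assumed share of launches that run to completion

# Base prices in USD per hour, as quoted in the text.
USD_PER_HOUR = {
    "option1": 4.284,  # c7i.metal-24xl
    "option2": 2.856,  # c7i.16xlarge
}

accuracy_cost = {}
for option, price in USD_PER_HOUR.items():
    full_cost = LAUNCHES_PER_DAY * HOURS_PER_RUN * price
    accuracy_cost[option] = tuple(full_cost * rate for rate in COMPLETION_RATES)

for option, (low, high) in accuracy_cost.items():
    print(f"{option}: {low:.2f} ~ {high:.3f} USD/day")
```

Because both options take the same 1.5 hours per run, the cost difference comes entirely from the hourly price.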

For the future nightly performance test, we plan to run the multi-thread
static-shape default wrapper and dynamic-shape CPP wrapper for the fp32
data type, and the multi-thread static-shape default wrapper and
dynamic-shape default wrapper for the AMP data type. We measured the time
cost of each test; the results are as follows.

Instance type | Threads | Precision | Mode | Shapes | Wrappers | TB Perf (h) | TB Acc (h) | HF Perf (h) | HF Acc (h) | TIMM Perf (h) | TIMM Acc (h) | Total Time (h)
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
c5.12xlarge | Multi-thread-single-socket | fp32 | inference | static | default | 1.8 | 1 | 2 | 0.5 | 3 | 1 | 9.3
c5.12xlarge | Multi-thread-single-socket | fp32 | inference | dynamic | cpp | 2.7 | 1.3 | 2.3 | 0.8 | 3 | 2 | 12.1
c7i.metal-24xl | Multi-thread-single-socket | fp32 | inference | static | default | 1 | 0.6 | 1 | 0.5 | 1.5 | 0.6 | 5.2
c7i.metal-24xl | Multi-thread-single-socket | fp32 | inference | dynamic | cpp | 1.7 | 1 | 1.2 | 0.5 | 1.8 | 1 | 7.2
c7i.metal-24xl | Multi-thread-single-socket | AMP | inference | static | default | 0.8 | 0.6 | 0.7 | 0.3 | 1 | 0.6 | 4
c7i.metal-24xl | Multi-thread-single-socket | AMP | inference | dynamic | default | 0.8 | 0.7 | 1 | 0.4 | 1.2 | 0.8 | 4.9

The current c5.12xlarge can cover the fp32 nightly test; its base cost is
**43.66** (9.3 * 2.04 + 12.1 * 2.04) USD per day. If the c7i.metal-24xl
instance is available, both the fp32 and AMP nightly tests can
run on it. The base cost of the fp32 nightly test then becomes **53.12**
(5.2 * 4.284 + 7.2 * 4.284) USD per day, and the base cost of the AMP
nightly test is **38.13** (4 * 4.284 + 4.9 * 4.284) USD per day.
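
The nightly figures can be reproduced from the "Total Time" column of the table and the hourly prices quoted earlier. The sketch below assumes each configuration runs once per day; the data structure is hypothetical and used only for this example.

```python
# (instance type, precision) -> total hours of the two nightly configurations
# (static/dynamic shapes), taken from the "Total Time" column of the table.
NIGHTLY_HOURS = {
    ("c5.12xlarge", "fp32"): [9.3, 12.1],
    ("c7i.metal-24xl", "fp32"): [5.2, 7.2],
    ("c7i.metal-24xl", "amp"): [4.0, 4.9],
}

# Base prices in USD per hour, as quoted in the text.
USD_PER_HOUR = {"c5.12xlarge": 2.04, "c7i.metal-24xl": 4.284}

nightly_cost = {
    key: round(sum(hours) * USD_PER_HOUR[key[0]], 2)
    for key, hours in NIGHTLY_HOURS.items()
}

for (instance, precision), cost in nightly_cost.items():
    print(f"{instance} {precision}: {cost:.2f} USD/day")
```

Although c7i.metal-24xl costs roughly twice as much per hour, the shorter run times keep the fp32 nightly cost within about 10 USD per day of the c5.12xlarge figure.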

---------

Co-authored-by: Tyler Titsworth <[email protected]>
3 people authored Apr 30, 2024
1 parent 6736ce7 commit 8b3bd34
Showing 3 changed files with 56 additions and 0 deletions.
28 changes: 28 additions & 0 deletions .github/arc-node-config.yaml
@@ -16,6 +16,34 @@ nodeConfig:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
  - nodeType: compute-amd64-amp
    requirements:
      - key: "karpenter.k8s.aws/instance-category"
        operator: In
        values: ["c7i"]
      - key: "karpenter.k8s.aws/instance-cpu"
        operator: In
        values: ["16", "32", "48", "64", "96", "192"]
      - key: "kubernetes.io/arch"
        operator: In
        values: ["amd64"]
      - key: "karpenter.sh/capacity-type"
        operator: In
        values: ["on-demand"]
  - nodeType: compute-amd64-amp-metal
    requirements:
      - key: "karpenter.k8s.aws/instance-category"
        operator: In
        values: ["c7i.metal"]
      - key: "karpenter.k8s.aws/instance-cpu"
        operator: In
        values: ["96", "192"]
      - key: "kubernetes.io/arch"
        operator: In
        values: ["amd64"]
      - key: "karpenter.sh/capacity-type"
        operator: In
        values: ["on-demand"]
  - nodeType: compute-amd64-avx512
    requirements:
      - key: "karpenter.k8s.aws/instance-category"
16 changes: 16 additions & 0 deletions .github/arc-runner-config.yaml
@@ -15,6 +15,22 @@ runnerConfig:
    runnerLabel: linux.2xlarge
    containerMode: dind-rootless
    workingDiskSpace: 150Gi
  - cpu: 16.0
    maxRunners: 100
    memory: 16Gi
    minRunners: 0
    nodeType: compute-amd64-amp
    runnerLabel: linux.16xl.spr
    containerMode: dind-rootless
    workingDiskSpace: 200Gi
  - cpu: 24.0
    maxRunners: 100
    memory: 45Gi
    minRunners: 0
    nodeType: compute-amd64-amp-metal
    runnerLabel: linux.24xl.spr-metal
    containerMode: dind-rootless
    workingDiskSpace: 200Gi
  - cpu: 8.0
    maxRunners: 300
    memory: 16128Mi
12 changes: 12 additions & 0 deletions .github/scale-config.yml
@@ -27,6 +27,18 @@ runner_types:
    is_ephemeral: false
    max_available: 1000
    os: linux
  linux.24xl.spr-metal:
    disk_size: 200
    instance_type: c7i.metal-24xl
    is_ephemeral: false
    max_available: 30
    os: linux
  linux.16xlarge.spr:
    disk_size: 200
    instance_type: c7i.16xlarge
    is_ephemeral: false
    max_available: 30
    os: linux
  linux.12xlarge.ephemeral:
    disk_size: 200
    instance_type: c5.12xlarge