chore: bump codecov/codecov-action from 4 to 5 (#1093)
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4 to 5.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v4...v5)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot[bot] authored Nov 18, 2024
1 parent 636c9d1 commit cb0900f
Showing 9 changed files with 16 additions and 16 deletions.
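
Every hunk in this commit makes the same one-line change: the `codecov/codecov-action` step moves from `@v4` to `@v5`, and nothing else in the step is touched. The Python sketch below shows that substitution on a small workflow fragment. It is an illustration only, not how Dependabot performs the update; `bump_action` is a hypothetical helper name and the sample fragment is abbreviated from the hunks in this commit.

```python
import re


def bump_action(text: str, action: str, old: str, new: str) -> tuple[str, int]:
    """Rewrite `uses: <action>@<old>` to `uses: <action>@<new>`.

    Returns the updated text and the number of replacements made.
    (Hypothetical helper for illustration; not part of any tool.)
    """
    pattern = re.compile(
        r"(uses:\s*" + re.escape(action) + r"@)" + re.escape(old)
        # Do not match a longer ref such as v4.1 or v40.
        + r"(?![\w.-])"
    )
    return pattern.subn(lambda m: m.group(1) + new, text)


# Abbreviated workflow fragment modelled on the hunks in this commit.
workflow = """\
- uses: julia-actions/julia-processcoverage@v1
- uses: codecov/codecov-action@v4
  with:
    files: lcov.info
"""

updated, count = bump_action(workflow, "codecov/codecov-action", "v4", "v5")
print(count)
print(updated)
```

Only the pinned ref changes; the `with:` inputs (`files`, `token`) are the same before and after in every hunk of this commit.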
4 changes: 2 additions & 2 deletions .github/workflows/CI.yml
@@ -88,7 +88,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: src,ext,lib/LuxCore/src,lib/LuxCore/ext,lib/MLDataDevices/src,lib/MLDataDevices/ext,lib/WeightInitializers/src,lib/WeightInitializers/ext,lib/LuxLib/src,lib/LuxLib/ext,lib/LuxTestUtils/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -127,7 +127,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: src,ext,lib/LuxCore/src,lib/LuxCore/ext,lib/MLDataDevices/src,lib/MLDataDevices/ext,lib/WeightInitializers/src,lib/WeightInitializers/ext,lib/LuxLib/src,lib/LuxLib/ext,lib/LuxTestUtils/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
2 changes: 1 addition & 1 deletion .github/workflows/CIPreRelease.yml
@@ -87,7 +87,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: src,ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/CI_LuxCUDA.yml
@@ -50,7 +50,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxCUDA/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -77,7 +77,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxCUDA/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/CI_LuxCore.yml
@@ -64,7 +64,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxCore/src,lib/LuxCore/ext,lib/MLDataDevices/src,lib/MLDataDevices/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -105,7 +105,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxCore/src,lib/LuxCore/ext,lib/MLDataDevices/src,lib/MLDataDevices/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/CI_LuxLib.yml
@@ -100,7 +100,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxLib/src,lib/LuxLib/ext,lib/LuxCore/src,lib/LuxCore/ext,lib/MLDataDevices/src,lib/MLDataDevices/ext,lib/LuxTestUtils/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -145,7 +145,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxLib/src,lib/LuxLib/ext,lib/LuxCore/src,lib/LuxCore/ext,lib/MLDataDevices/src,lib/MLDataDevices/ext,lib/LuxTestUtils/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/CI_LuxTestUtils.yml
@@ -53,7 +53,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxTestUtils/src,lib/MLDataDevices/src,lib/MLDataDevices/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -84,7 +84,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/LuxTestUtils/src
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/CI_MLDataDevices.yml
@@ -63,7 +63,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/MLDataDevices/src,lib/MLDataDevices/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -95,7 +95,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/MLDataDevices/src,lib/MLDataDevices/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/CI_WeightInitializers.yml
@@ -53,7 +53,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/WeightInitializers/src,lib/WeightInitializers/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
@@ -84,7 +84,7 @@ jobs:
  - uses: julia-actions/julia-processcoverage@v1
    with:
      directories: lib/WeightInitializers/src,lib/WeightInitializers/ext
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}
2 changes: 1 addition & 1 deletion .github/workflows/Downstream.yml
@@ -76,7 +76,7 @@ jobs:
        exit(0) # Exit immediately, as a success
      end
  - uses: julia-actions/julia-processcoverage@v1
- - uses: codecov/codecov-action@v4
+ - uses: codecov/codecov-action@v5
    with:
      files: lcov.info
      token: ${{ secrets.CODECOV_TOKEN }}

1 comment on commit cb0900f

@github-actions

Lux Benchmarks

Benchmark suite Current: cb0900f Previous: 636c9d1 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3875 ns 4250 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4375 ns 4292 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5083 ns 5000 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4208 ns 3916 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60144 ns 60054 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10625 ns 10833 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10666 ns 10042 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11375 ns 10792 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10334 ns 10666.5 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 421452 ns 425278 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1250 ns 1167 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1292 ns 1208 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1250 ns 1459 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1167 ns 1208 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18149 ns 18417 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4167 ns 3917 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 4000 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4292 ns 4250 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3625 ns 4125 ns 0.88
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 109548 ns 109745 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56166 ns 58250 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46709 ns 46500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46334 ns 46917 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82291 ns 83833.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37127 ns 37085 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2031334 ns 2032104.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2096166.5 ns 2088666 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2086458 ns 2082333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1997167 ns 2021708.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 197158.5 ns 194358 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143042 ns 144167 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145583.5 ns 143333.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146709 ns 145792 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 149500 ns 144875 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166231 ns 166324.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1138708.5 ns 1120666.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1128583 ns 1116812.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1062083.5 ns 1115750 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1115041.5 ns 1153437.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 530934 ns 524143 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3125 ns 3834 ns 0.82
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3458 ns 3625 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4292 ns 4334 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3375 ns 3292 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70464 ns 70563 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9208 ns 9875 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8917 ns 10375 ns 0.86
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9125 ns 9458 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9166 ns 8542 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 483194.5 ns 479347 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15333 ns 15583.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15458 ns 15375 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17333 ns 17000 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17062.5 ns 15250 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 53962 ns 54126 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214583.5 ns 213416 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212667 ns 214083.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214625 ns 215583 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225250 ns 246541.5 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 273370 ns 271347.5 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 458 ns 750 ns 0.61
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 666 ns 625 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 750 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 709 ns 0.71
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17502.5 ns 17843 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1542 ns 1708 ns 0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1667 ns 1541 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1791 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1458 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 101667.5 ns 102473 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7125 ns 7208 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 5833 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5792 ns 5917 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 10292 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23886 ns 23188 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221417 ns 221458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228125 ns 227959 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228666 ns 228500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220500 ns 214729 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 169891 ns 169404 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3916 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3916 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3958 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23537 ns 23907 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16750 ns 16583 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17042 ns 16583 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16875 ns 17042 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16750 ns 16542 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 159725 ns 162027 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 570333 ns 569083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 574000 ns 578041 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 579125 ns 573625 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 571125 ns 570916 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113492 ns 112937.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1428041 ns 1422062.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1422333 ns 1417459 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1423708 ns 1420875 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1423458 ns 1422667 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 208571.5 ns 212002 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1051187.5 ns 1076417 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 971896 ns 970125 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1346062.5 ns 1341062.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1306416 ns 1282542 ns 1.02
lenet(28, 28, 1, 64)/forward/GPU/CUDA 272301 ns 274403 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5990916 ns 5768459 ns 1.04
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4519875 ns 4594917 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4948416.5 ns 4948750 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5523125 ns 5721687.5 ns 0.97
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1070952 ns 1071440 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 583 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23553 ns 23971 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 168963.5 ns 174370 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3875 ns 4042 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4167 ns 4459 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5250 ns 4791.5 ns 1.10
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 3666 ns 3666 ns 1
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65091 ns 65101 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11416 ns 10917 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11292 ns 11500 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12333.5 ns 12333 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11209 ns 11166 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 446962.5 ns 449038 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6458.5 ns 6583 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6792 ns 6312.5 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7833.5 ns 7833 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6250 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52555 ns 52027 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16584 ns 17125 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17791 ns 16959 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17375 ns 18959 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17125 ns 17437.5 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 308634 ns 297375.5 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 666 ns 542 ns 1.23
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 541 ns 1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32320 ns 32771 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8541 ns 8854.5 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9167 ns 9500 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 8958 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9479.5 ns 8541.5 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 159616 ns 158837.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64750 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64625 ns 64417 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64292 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64542 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111041.5 ns 111087 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 292000 ns 277792 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 292084 ns 284292 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 275666 ns 282125 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 275708 ns 286333.5 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 183441 ns 185412.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3191791 ns 3283437 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3043437.5 ns 3018229 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3020437.5 ns 3058917 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4089708 ns 4032979 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 601857 ns 618259 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7582625 ns 7620500 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7473208.5 ns 7434375 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7437833 ns 7258208 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8187292 ns 8312542 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1317154 ns 1382144 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18957000 ns 18771167 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19047250 ns 19155875 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19104542 ns 19055084 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15686625 ns 16613000 ns 0.94
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23902625 ns 23424834 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34420458 ns 34218917 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37002333 ns 37348958 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34848770.5 ns 35414708 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1857006 ns 1860633 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 191696375.5 ns 188862542 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164341792 ns 164640583.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152698167 ns 152867000 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 439655916 ns 449351167 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13895377 ns 13884229 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 292126520.5 ns 289481604.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 340023312 ns 265154292 ns 1.28
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 298857875 ns 299135959 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 335240875 ns 399738312 ns 0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22250 ns 21916 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23083 ns 23750 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23959 ns 25083 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23417 ns 22916.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96101 ns 97130.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103542 ns 103125 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103541 ns 103667 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104791 ns 104771 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 113250 ns 103396 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 512131 ns 503270 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5834 ns 6375 ns 0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 6750 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7000 ns 6792 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 5584 ns 1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68297.5 ns 67691 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15208 ns 14875 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15750 ns 15895.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16583 ns 16520.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15062.5 ns 15416 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 474148.5 ns 474411 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3053958 ns 2993375 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2089500 ns 2048666.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2270042 ns 2260292 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4804875 ns 4882041 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582756 ns 586320.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23872458.5 ns 23515125 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18056937.5 ns 17982770.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17766021 ns 17986666 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35515208 ns 36296250 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3103295.5 ns 3101860 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33801000 ns 33484041.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27630916.5 ns 27547583 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27435750 ns 27396833.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41597458 ns 42046625.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74917 ns 71917 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 72541 ns 73625 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76416 ns 75834 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74375 ns 74042 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 103583 ns 103235 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221146 ns 206687.5 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219166 ns 320208.5 ns 0.68
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 208875 ns 208417 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206542 ns 205625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 560403 ns 548628.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12166 ns 11791 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12208.5 ns 13208 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13167 ns 12625 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12042 ns 11792 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71403 ns 70856 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26979.5 ns 26083 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27167 ns 27104.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27958.5 ns 27584 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26459 ns 26958 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 472464 ns 471996 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12437.5 ns 12125 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12979 ns 12792 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14167 ns 13417 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12125 ns 12459 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53400 ns 52898.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25625 ns 25416 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26292 ns 25833 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26416 ns 26500 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26167 ns 26208 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 306626.5 ns 303358.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180729 ns 180458 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182709 ns 181792 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183875 ns 183125 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180833 ns 179833 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56252.5 ns 56401 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 593541.5 ns 582583.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 593916 ns 583312.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 584021 ns 583709 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582917 ns 584896 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 289288.5 ns 286433.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6500 ns 6084 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6125 ns 6625 ns 0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7792 ns 7209 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6145.5 ns 5709 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70132.5 ns 70607 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14271 ns 13666 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14916 ns 14333 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15500 ns 15209 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14000 ns 14250 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 460852.5 ns 461794 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1175354 ns 1174166.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1353000 ns 1239604 ns 1.09
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1269979 ns 1267334 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1317500 ns 1308146 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302455 ns 301138 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4288500 ns 4120792 ns 1.04
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4366958 ns 4346770.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4543917 ns 4613625 ns 0.98
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4469000 ns 4699020.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1030148 ns 1054798 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23497 ns 24192 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4792 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5041 ns 4875 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4916 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4875 ns 4875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 185923.5 ns 188688 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5500 ns 6125 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6167 ns 5833 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6459 ns 7146 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5583 ns 5500 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 55454.5 ns 55155.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10667 ns 10833 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11750 ns 11000 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11458 ns 11625 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10667 ns 10708 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 337381 ns 328957 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 334 ns 0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22737 ns 23157 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 2750 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 2792 ns 1.07
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3000 ns 3042 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2750 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 157057 ns 159786.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11625 ns 11583 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12250 ns 11500 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12708 ns 12646 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11417 ns 11270.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 56422 ns 57234 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24250 ns 24750 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25208 ns 24833.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25000 ns 25500 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25437.5 ns 24500 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 294376.5 ns 295767.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4250 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24716 ns 25133 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16042 ns 16083 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16250 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16250 ns 16291 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16167 ns 16000 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 193381 ns 196801.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5750 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6083 ns 5792 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5750 ns 5792 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5791 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 33569 ns 33759 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20479.5 ns 21166.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21000 ns 20875 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21208 ns 21375 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21104.5 ns 20792 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 174365.5 ns 177483 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 375416.5 ns 400729.5 ns 0.94
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 374666.5 ns 374229 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 488312.5 ns 489041 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 524187.5 ns 505209 ns 1.04
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66372.5 ns 66692.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 931978.5 ns 976604.5 ns 0.95
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 880291.5 ns 885854.5 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1223791.5 ns 1239959 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1351833.5 ns 1414417 ns 0.96
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 192149.5 ns 190141.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 81312.5 ns 81625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80750 ns 81375 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 80792 ns 81875 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80937 ns 82792 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192807 ns 193437.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1932917 ns 1921958 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1916542 ns 1883688 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1926479 ns 1929792 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1921042 ns 1938584 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 394461 ns 388434 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 333 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22118 ns 22427.5 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1750 ns 1792 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 166019.5 ns 170483 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6250 ns 6354.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7208 ns 6750 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8166 ns 7687.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6312.5 ns 6479 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 57360.5 ns 59523 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8917 ns 9041 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9167 ns 9083 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9208 ns 9333 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9250 ns 8833 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 301535 ns 308338.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 156508063 ns 119707062.5 ns 1.31
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173937500 ns 173955792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148141208 ns 148074917 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106478500 ns 108269666 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5474150 ns 5474309.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 673237875 ns 617752083 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 556883000 ns 555432583 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 453960458.5 ns 451206208 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 759297583 ns 776597541.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38204722 ns 34955587.5 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 701496583 ns 649274125 ns 1.08
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 667076166 ns 665965354.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 586800771 ns 585624583.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 744632000 ns 749969750 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56833 ns 59208 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 48042 ns 47875 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47125 ns 48166 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84541 ns 85167 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37576 ns 37958 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1935541 ns 1929270.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1985208 ns 1968792 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1979834 ns 1987333 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1893771 ns 1916063 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 174934 ns 175872 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267875 ns 267875 ns 1
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 288042 ns 265500 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 270229.5 ns 269958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267250 ns 266125 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 128767 ns 129983.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 665041 ns 585834 ns 1.14
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 668958 ns 595458 ns 1.12
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 589167 ns 587916 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 596209 ns 585000 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 703647.5 ns 697007.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2205417 ns 2148500 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2188541 ns 2209500 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2100166.5 ns 2103416 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2225499.5 ns 2160708 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133307.5 ns 133956 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5538625 ns 5496208 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5527958 ns 5493000 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5503250 ns 5496042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5491271 ns 5572625 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 759584.5 ns 737128.5 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 638667 ns 639417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 640458 ns 657709 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 648875 ns 639917 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 636167 ns 638917 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47137 ns 47806 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1796937.5 ns 1824541 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1724292 ns 1726917 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1720542 ns 1719687.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2104520.5 ns 2101292 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 218174.5 ns 226913 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57000 ns 58583 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46833 ns 45167 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47083 ns 47750 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84542 ns 84958.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28335 ns 29092 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2047750 ns 2034917 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2077083 ns 2064062.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2092083 ns 2093625 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1939979 ns 2025250 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191381.5 ns 192854.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13410020.5 ns 13439541.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12472750 ns 12486020.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12570979 ns 12585000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15234500 ns 15058604.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 512740.5 ns 514768 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47584458 ns 47224604.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41911083 ns 41768333 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41152979.5 ns 40759687.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58152541 ns 59312833 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3249099 ns 3244631 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74313208.5 ns 73979958 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91931958.5 ns 68237542 ns 1.35
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 91156000 ns 90322875 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76595709 ns 77058250 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57334 ns 58792 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47417 ns 47083.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47250 ns 47625 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84375 ns 84834 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48075 ns 47986 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930959 ns 1919646 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1977562.5 ns 1965041.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1977250 ns 1977583 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1816292 ns 1902500 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196217.5 ns 194479 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32756 ns 32641 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6000 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6583 ns 6083 ns 1.08
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6542 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6208 ns 6083 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 178147.5 ns 173275 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31948 ns 32425 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 2625 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2666 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 2917 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2625 ns 2584 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 164100 ns 160394.5 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 323244146 ns 287480500 ns 1.12
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340740458 ns 339790334 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314512041.5 ns 314236083.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271130916 ns 270187875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7115553 ns 7108297.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1053603541.5 ns 989833917 ns 1.06
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 941056333 ns 940591916 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 854610104 ns 853322000.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1162236250 ns 1178549334 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33945165 ns 34044401 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1364084083.5 ns 1316176791.5 ns 1.04
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1705661833 ns 1348661437.5 ns 1.26
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1621953875 ns 1629837083 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1313183229.5 ns 1293144333.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1410000 ns 1406584 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1408291.5 ns 1404458.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1453645.5 ns 1409375 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1407209 ns 1410334 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 127861 ns 127864 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5051959 ns 5021959 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5013583.5 ns 5007792 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5028416.5 ns 5030667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5027271 ns 5052000 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 604299 ns 550210.5 ns 1.10
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 161226250 ns 174975458.5 ns 0.92
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 131446875 ns 131550875 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 127042083 ns 129143375.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 155626750.5 ns 161588000 ns 0.96
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4974919.5 ns 4877735 ns 1.02
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 850481958 ns 666469042 ns 1.28
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 644255791 ns 640200042 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 496077667 ns 534233208 ns 0.93
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 685984875 ns 868077834 ns 0.79
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 15948822 ns 16128771 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 9064833.5 ns 8899521 ns 1.02
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8770396 ns 8695125 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7878104.5 ns 7843000 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10163000 ns 10351917 ns 0.98
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1608837.5 ns 1610313.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 37348729 ns 36519833 ns 1.02
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36970124.5 ns 36646083 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33623167 ns 33248208.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38875729.5 ns 40043375 ns 0.97
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6455570 ns 6450575 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47375 ns 47625 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47750 ns 47375 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47583 ns 47709 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47625 ns 47292 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18855 ns 19138 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50250 ns 50458 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50750 ns 50292 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50416 ns 53334 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50292 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 202264 ns 174370 ns 1.16
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 6834 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7187.5 ns 6584 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8417 ns 7875.5 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6708 ns 6687.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 108599.5 ns 86131.5 ns 1.26
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9604.5 ns 9958 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10209 ns 9584 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10250 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10583 ns 10042 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 610519 ns 515499.5 ns 1.18
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 6500 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 5708 ns 1.12
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7583 ns 7208 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5542 ns 5333 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 131186.5 ns 95565.5 ns 1.37
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12875 ns 12645.5 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13208 ns 12542 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13583 ns 13750 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12875 ns 12937.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 530393 ns 469414.5 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1167 ns 1083 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32479.5 ns 32890 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7833.5 ns 7562.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8042 ns 7750 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8083 ns 8208 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7916 ns 8000 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 216406.5 ns 199158.5 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23042 ns 23083 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23542 ns 23083.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23333 ns 23375 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23375 ns 23250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 19066 ns 18687 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52291.5 ns 52292 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52500 ns 52375 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53166.5 ns 52750 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52125 ns 52375 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 309714.5 ns 257093.5 ns 1.20
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1413917 ns 1406750 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1401104 ns 1401479.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1457583.5 ns 1402562.5 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1402271 ns 1408270.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196285 ns 196328 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5045083 ns 5008146 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4724458 ns 4702667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5023021 ns 5024104 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4706104.5 ns 5039750 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 644560.5 ns 556806.5 ns 1.16
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3086125.5 ns 3030084 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2087104.5 ns 2079354 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2281125 ns 2282729.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4848375 ns 4945062.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580262 ns 581296 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24765000.5 ns 24403958.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18889791.5 ns 18897937.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19005084 ns 18907333 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36681292 ns 37159666 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3253871.5 ns 3184896 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34537875 ns 34104791.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28314500 ns 28272250 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27967000 ns 27994771 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41702500 ns 42199625 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144041208 ns 142261750 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 143168583 ns 143002917 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 124247521 ns 125056229 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173506729 ns 168210729 ns 1.03
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22768605 ns 22549101 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 957619479 ns 924197146 ns 1.04
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1175957479.5 ns 881679833 ns 1.33
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 739734292 ns 679210937 ns 1.09
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 672317125 ns 691445167 ns 0.97
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118020449 ns 118243896 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73979 ns 78875 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75750 ns 73916.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75416 ns 76792 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72854.5 ns 74292 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 300521.5 ns 204610.5 ns 1.47
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 287875 ns 191000 ns 1.51
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 285333 ns 189917 ns 1.50
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 204208 ns 192291 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 287375 ns 285687.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1342742 ns 1149282.5 ns 1.17
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 36185500 ns 35495792 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35466000.5 ns 35648750 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32336688 ns 32319292 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40972250 ns 41619250.5 ns 0.98
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5837876 ns 5843597 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 151179834 ns 147692250 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 151456979 ns 153061583 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 136606104 ns 133656084 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287372208 ns 228263645.5 ns 1.26
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34877857 ns 34880956 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 155986916 ns 120866228.5 ns 1.29
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174507459 ns 174040042 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148111416.5 ns 147840500 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 102908562.5 ns 102334125 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5463707 ns 5477094 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 520380250 ns 470548750 ns 1.11
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 465489750 ns 467155625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 439138000 ns 436800021 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 742252417 ns 762586062.5 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35175845 ns 32255358.5 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 698201250 ns 650080584 ns 1.07
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 654820792 ns 654363708.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 571273229.5 ns 577159563 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 850215250 ns 870785625 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1101520.5 ns 1344042 ns 0.82
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 970208.5 ns 906229 ns 1.07
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 920500 ns 903583.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1945375.5 ns 2053417 ns 0.95
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 580245.5 ns 582678.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2907896 ns 2957750 ns 0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2595708 ns 2598375 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2606333 ns 2617083.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3655000 ns 3768041.5 ns 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1734207 ns 1806655.5 ns 0.96
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6744875 ns 6643083 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6498208 ns 6464750 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6503854.5 ns 6500583.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4423604.5 ns 4561542 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083 ns 6208 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958.5 ns 6167 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9959 ns 10375 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25201 ns 26058 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212291 ns 213208.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220750 ns 220271 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220125 ns 220708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206792 ns 206000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 262467.5 ns 261050.5 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 316552750 ns 311672646 ns 1.02
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 221682708 ns 221886520.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 187257688 ns 182666958 ns 1.03
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 311596375 ns 306867104.5 ns 1.02
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676203 ns 7678016 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1093022833.5 ns 1080144500 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 911616145.5 ns 906381437.5 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 815656375 ns 825800000 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1161401125 ns 1190620500 ns 0.98
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26547253 ns 26457172 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5292 ns 5458 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5667 ns 5125 ns 1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6625 ns 6333 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5125 ns 5083 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 167889.5 ns 152413 ns 1.10
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7083 ns 7292 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 7208.5 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7459 ns 7458 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7437.5 ns 6875 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 650263 ns 620099.5 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 541 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 709 ns 542 ns 1.31
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23809 ns 24884 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9041.5 ns 8833 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9791 ns 8875 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9208.5 ns 9708 ns 0.95
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9042 ns 9250 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 233459 ns 225409.5 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351417 ns 356375 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 352250 ns 354208 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 353063 ns 353708.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 353333 ns 352374.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21613 ns 21669 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 791250 ns 827958 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 808979 ns 775125 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 773625 ns 828062.5 ns 0.93
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 824084 ns 830458.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 305844 ns 270211 ns 1.13
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 314958 ns 335583 ns 0.94
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 333625 ns 334417 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 448667 ns 452750 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 331833 ns 308896 ns 1.07
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17811 ns 17938 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 682125 ns 685917 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 746791.5 ns 740625 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1029167 ns 1037042 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 700937.5 ns 694791 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 273907.5 ns 231915 ns 1.18
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 328083 ns 350459 ns 0.94
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 348979 ns 349333.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 424375 ns 428792 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 370666 ns 351729 ns 1.05
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22237 ns 22606 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 743604 ns 750375 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 750229 ns 744250 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1076375 ns 1079542 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 822541 ns 825979.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 220485.5 ns 213391.5 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3334 ns 3542 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3792 ns 3583 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3625 ns 3666 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns 3520.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18068 ns 17871 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4166 ns 4104.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4542 ns 4167 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4250 ns 4417 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4334 ns 4083 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 278097 ns 238254.5 ns 1.17
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3292 ns 3792 ns 0.87
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3645.5 ns 4125 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4708 ns 4625 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4042 ns 3500 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 212235.5 ns 178544.5 ns 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8042 ns 8417 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 7958 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8792 ns 8666 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 8458 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1255478 ns 1052090.5 ns 1.19
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204000 ns 203583 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211375 ns 214542 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211042 ns 210292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200541 ns 201875 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34367 ns 34516 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 605708.5 ns 607937.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 625021 ns 667208.5 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 620792 ns 667479 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 582583 ns 631687 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 361289.5 ns 291952 ns 1.24
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 973333 ns 972042 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 950209 ns 932792 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 955541 ns 955812.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1286000.5 ns 1334188 ns 0.96
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207830 ns 207894 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4594084 ns 4516458.5 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4500750.5 ns 4463146 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4304583 ns 4308209 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6304625 ns 6464792 ns 0.98
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 925479 ns 938833.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3333 ns 3500 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3583 ns 3583 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4250 ns 4083 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3541 ns 3125 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 240989.5 ns 174208 ns 1.38
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6875 ns 7292 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 7084 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7667 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7042 ns 6959 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1039649.5 ns 935667 ns 1.11
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1636792 ns 1654291.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1175749.5 ns 1178000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1347167 ns 1375667 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2463271 ns 2330125 ns 1.06
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213096 ns 212833 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12388416 ns 12374250 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9551437.5 ns 9567834 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9305937.5 ns 9311229.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18088000 ns 18171313 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1951605 ns 1941291 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17398084 ns 17396166 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14348854.5 ns 14397645.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14347271 ns 14397458 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21112104 ns 21079729.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 94729.5 ns 87646.5 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90667 ns 96250 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92375 ns 94125 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 114395.5 ns 133000 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125574 ns 125997 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2039792 ns 2029500 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1808208.5 ns 2013042 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2033666.5 ns 2027313 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2022500 ns 2051291 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1052869 ns 959259 ns 1.10
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 326041.5 ns 347583 ns 0.94
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 344833 ns 344146 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 396416 ns 399083 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 314708 ns 286854 ns 1.10
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15677 ns 16054 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 701042 ns 705709 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 733209 ns 728500 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1020500 ns 1019208 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 656250 ns 649146 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 196145.5 ns 186898.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7084 ns 7250 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5541 ns 5834 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6084 ns 5916 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10375 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34060 ns 34237 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221166.5 ns 222854 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220916.5 ns 224917 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220167 ns 233500 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217124.5 ns 214667 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 344547 ns 287423 ns 1.20
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22568 ns 22887 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14167 ns 14417 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14375 ns 14417 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14458 ns 14375 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14416 ns 14250 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 487124.5 ns 433354.5 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 97500 ns 138895.5 ns 0.70
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93417 ns 93687.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96687.5 ns 100229.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91875 ns 94083 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 124929 ns 125360 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1940875 ns 1922979 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1919916.5 ns 1925979.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1931229.5 ns 1923562.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1917271.5 ns 1922291 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 955641 ns 884217 ns 1.08
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 854084 ns 877312.5 ns 0.97
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 826333 ns 821375 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1211000 ns 1222916.5 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 955354.5 ns 940166 ns 1.02
lenet(28, 28, 1, 32)/forward/GPU/CUDA 272141 ns 270283.5 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2801124.5 ns 2811896 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2515333 ns 2435875 ns 1.03
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3309625 ns 3368479 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3416625 ns 3411708.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1612126.5 ns 1507174 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17062.5 ns 17667 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16708.5 ns 16271 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18937 ns 18375 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15167 ns 16937.5 ns 0.90
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142123.5 ns 141332.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223437.5 ns 256250 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215958 ns 216042 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216125 ns 257583 ns 0.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255708.5 ns 221917 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 644779 ns 582084 ns 1.11
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222292 ns 222250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221750 ns 221750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222542 ns 222250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 220917 ns 219458 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 271274.5 ns 260776.5 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 509083 ns 509000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 501292 ns 553542 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 496750 ns 559708.5 ns 0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 550583 ns 503250 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1401190 ns 1236203 ns 1.13
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 304437.5 ns 337667 ns 0.90
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 331687.5 ns 332458 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 376292 ns 376750 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 321812.5 ns 297895.5 ns 1.08
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16554 ns 16751 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 708875 ns 715896 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 736875 ns 727792 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1020209 ns 1021333 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 668458 ns 663750 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 196065 ns 191390 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17854 ns 18500 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18520.5 ns 18000 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19667 ns 19125 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16209 ns 17520.5 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146750.5 ns 144715.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 247604 ns 221250 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212500 ns 211979 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212917 ns 225167 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 211750.5 ns 230917 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1011803 ns 877142.5 ns 1.15
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4125 ns 4208 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4125 ns 4875 ns 0.85
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5187.5 ns 5250 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4084 ns 4020.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 201325 ns 180911 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10667 ns 10417 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10875 ns 10479.5 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10500 ns 11167 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10375 ns 10208.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1050725 ns 1008910 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3375 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3541 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4167 ns 3958 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3291 ns 3312.5 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 242454 ns 218262 ns 1.11
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7542 ns 7292 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 7292 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7959 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333 ns 7292 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1067571 ns 1066214 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 24057353.5 ns 23448500 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34753459 ns 35001375 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37792125 ns 37680292 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34828583.5 ns 35380500 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1854184 ns 1853791.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187222542 ns 184430542 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 160010375 ns 159371667 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146721854.5 ns 146466937 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 412776417 ns 422477750 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16508303 ns 16507463.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 437495583 ns 426064458 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253838438 ns 254339271 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232343979.5 ns 232745062.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 483540875 ns 496585354 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183854 ns 183208 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183625 ns 186646 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185334 ns 185333 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184167 ns 183875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 220968 ns 200485.5 ns 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 594000 ns 597479 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 632437.5 ns 587062.5 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586084 ns 635083.5 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628500 ns 621062 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1061303.5 ns 1047688.5 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3892042 ns 3838792 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3642708 ns 3832229 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3572042 ns 3508709 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5353250 ns 5482833 ns 0.98
batchedmm(128, Bsize=512)/forward/GPU/CUDA 549368 ns 553997 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17901624.5 ns 17434062.5 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17281292 ns 17172083.5 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16574875 ns 16682312 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22050250 ns 23187875 ns 0.95
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2630980 ns 2617230 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 584 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31762 ns 32155 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9145.5 ns 8729.5 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9208 ns 8958 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9417 ns 9875 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9208 ns 9000 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 262912.5 ns 263867.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 505346750 ns 497911458 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 429818666.5 ns 429219083.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 433256333.5 ns 375191709 ns 1.15
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 677373875 ns 681622604 ns 0.99
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12487373 ns 12475655 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2066713500 ns 2053970374.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1635890000 ns 1638205959 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1494391792 ns 1496730145.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2208031208.5 ns 2238026229 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49163495.5 ns 49241901 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1632500.5 ns 1664062.5 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1173583 ns 1174542 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1383958 ns 1401625.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2483292 ns 2450125 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214736 ns 217292.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12776042 ns 12720875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9939062.5 ns 9941146 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9686917 ns 9680542 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18349375 ns 18500792 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2056758 ns 2013409 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17758729.5 ns 17699916.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14689958 ns 14725083 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14551125 ns 14579208 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21399666 ns 22341458 ns 0.96
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26333 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24146 ns 24007 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66791 ns 66917 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67292 ns 67500 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68417 ns 67334 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66709 ns 66750 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 391053.5 ns 393891.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204333 ns 204125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210125 ns 211667 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209458 ns 209208 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 198792 ns 199083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26289 ns 26011 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 642083 ns 646792 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 624354.5 ns 622583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621729.5 ns 632375 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 627000.5 ns 636417 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 357106 ns 350976 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 645625 ns 651500 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 636292 ns 633916.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 602667 ns 637250 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 672375 ns 643083.5 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132245.5 ns 132301.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2294979 ns 2256000 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2157208 ns 2122875.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2246208 ns 2253625 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2249458 ns 2306375 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1236985 ns 1180087 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17937.5 ns 17958 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18416.5 ns 18208 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20083 ns 20541 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18895.5 ns 19458 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145580 ns 146027 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 259583 ns 262166.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 261791 ns 219542 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219084 ns 231375 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257520.5 ns 229229.5 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1034996 ns 996296 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 667 ns 584 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23604 ns 23586 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9750 ns 9708 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10292 ns 9542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10250 ns 9875 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9333 ns 9709 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 260113.5 ns 259286.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5083.5 ns 5500 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5792 ns 5875 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6833 ns 7000 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 4792 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 229273.5 ns 225853.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6709 ns 6875 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7250 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7542 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6937.5 ns 7125 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 777061.5 ns 770392.5 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1917 ns 2000 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2500 ns 2125 ns 1.18
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2250 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2250 ns 2312.5 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18340 ns 18125 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6542 ns 6333 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6667 ns 6584 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6666 ns 6542 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6584 ns 6416 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 320616.5 ns 321687.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 750542 ns 746833.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746792 ns 747125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 746916 ns 749895.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 750584 ns 761667 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21795 ns 21408 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 805145.5 ns 791208 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 791604 ns 793145.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 772584 ns 791625 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 810645.5 ns 794917 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 302046.5 ns 294715 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 6959 ns 7250 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5917 ns 5875 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 5833 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10584 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32896 ns 32998 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228770.5 ns 262333 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227709 ns 228875 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228084 ns 237083.5 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225625.5 ns 257792 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 359979 ns 357007.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10250 ns 10334 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10208 ns 10208 ns 1
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11042 ns 11000 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9958 ns 9875 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 245976 ns 239662.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24896 ns 24417 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24000 ns 24292 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25416.5 ns 26292 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24625 ns 24958 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1114734 ns 1082647 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106794687 ns 107049500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 118367979 ns 117776354.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120992291 ns 120966209 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118045833 ns 118076875 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2655666 ns 2634519 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 397097667 ns 392901583 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 368138875 ns 366575334 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 357737125 ns 425037083 ns 0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 483722209 ns 488336500 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15195689 ns 15175401 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 769405854 ns 759144667 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 762934333 ns 580473834 ns 1.31
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 748099729.5 ns 745822521 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 772112770.5 ns 776881791.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 6792 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7375 ns 7334 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8187 ns 7916 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8708.5 ns 7771 ns 1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 243458.5 ns 232079 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13625 ns 13667 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14834 ns 13875 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14834 ns 14666 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14000 ns 13958 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1081512.5 ns 1036141 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5500 ns 6042 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6083.5 ns 6083 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7500 ns 7500 ns 1
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5625 ns 5875 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 236881 ns 228105 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12583 ns 12208 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12750 ns 12292 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13000 ns 12792 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12542 ns 12208 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 792100 ns 752122.5 ns 1.05
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 328937.5 ns 352541.5 ns 0.93
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 345250 ns 343312.5 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 398625 ns 401083 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 315687.5 ns 289708 ns 1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17026 ns 16741 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 701750 ns 707250 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 734417 ns 723354 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1025666 ns 1029646 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 663750 ns 651000 ns 1.02
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 202330 ns 196251.5 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 333 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23795 ns 23593 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6250 ns 6166.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6334 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6542 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6104.5 ns 6145.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 242897.5 ns 238191.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6042 ns 5834 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5917 ns 5917 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24778 ns 24230 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21834 ns 21291 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21542 ns 21167 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 21750 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21417 ns 21312.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 265364.5 ns 261937.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 184375 ns 173854.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 185000 ns 144500 ns 1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149541 ns 150042 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 190750 ns 186208 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168165 ns 167345 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1361667 ns 1326917 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1306875.5 ns 1312688 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1318541.5 ns 1318666 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1332084 ns 1368667 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1372553 ns 1291618 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24458 ns 24520.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22729 ns 22979.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25000 ns 23500 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22374.5 ns 22542 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 355948 ns 288056 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 176958 ns 177417 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 131167 ns 127375 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 126166.5 ns 128042 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 177542 ns 183542 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1491511 ns 1415779 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23138 ns 23671 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6083.5 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6917 ns 6541 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6667 ns 6625 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6250 ns 6084 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 259300 ns 259434 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4458 ns 4750 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4875 ns 5000 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5708.5 ns 5292 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4833 ns 4750 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 258768.5 ns 245363 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9709 ns 9917 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10083 ns 9667 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10417 ns 10375 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10041.5 ns 10292 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1358754 ns 1314812 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1666 ns 1584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1667 ns 1625 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23306 ns 24016 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5625 ns 5666 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6125 ns 5750 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6041 ns 6041 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5625 ns 5625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 275587 ns 277849.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6813916.5 ns 6854521 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6428416 ns 6386541.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6554167 ns 6525687 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7571104.5 ns 7618792 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213811 ns 215416 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24163500 ns 24090500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21359167 ns 21303500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21066083 ns 21036500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29670209 ns 29890395.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2101483 ns 2106254.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37462416 ns 37262000 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45862833.5 ns 34088667 ns 1.35
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45876667 ns 45642416 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38235959 ns 38194208 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5459 ns 5625 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6250 ns 6250 ns 1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6958 ns 7354.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5292 ns 5959 ns 0.89
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 238588.5 ns 230775.5 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7959 ns 7750 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8334 ns 8750 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8250 ns 8917 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 8000 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1068264.5 ns 1028275 ns 1.04
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1529292 ns 1567584 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1266666.5 ns 1259666 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1623709 ns 1635083 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2163750 ns 2158312.5 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 279544 ns 279134 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7968292 ns 7896167 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6533250 ns 6584917 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7125792 ns 7159250 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10479375 ns 10512687 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1874497 ns 1840167 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 320667 ns 345250 ns 0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 346291 ns 346354 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 428584 ns 390416 ns 1.10
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 345375 ns 318708 ns 1.08
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46619.5 ns 47055.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 745958.5 ns 727521 ns 1.03
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 791666.5 ns 784708.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1073208.5 ns 1082625 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 776479 ns 770875 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 311670 ns 301195.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396708.5 ns 396958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287917 ns 288000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288250 ns 288083 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 753417 ns 749292 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44556 ns 43886 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 645167 ns 662916 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 527667 ns 527125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 532000 ns 530875 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974292 ns 974417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 190424 ns 189409.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 668958 ns 654312 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 629749.5 ns 666542 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 544375 ns 634583.5 ns 0.86
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 643396 ns 668771 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132592.5 ns 131826.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2485646 ns 2477937.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2448562.5 ns 2464542 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2450292 ns 2457437.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2461146 ns 2485229.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1408688 ns 1437399.5 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 324000.5 ns 343542 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 344459 ns 340729.5 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 396583 ns 397417 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 314083.5 ns 287750 ns 1.09
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16193 ns 16030 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 700875 ns 705459 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 734292 ns 728458 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1020625 ns 1024291 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 656584 ns 650375 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 201017 ns 194160 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1461042 ns 1463292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1503750 ns 1505458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1504625 ns 1502125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1442917 ns 1441292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40991 ns 39983.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5155750 ns 5129459 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5279833.5 ns 5289916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5308333.5 ns 5304042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4987604 ns 5024750 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 200839 ns 197042 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33187 ns 33420 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14958 ns 15042 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15395.5 ns 15042 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15375 ns 15500 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15083 ns 14916 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 379072.5 ns 365311.5 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71541 ns 71250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71542 ns 71042 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71270.5 ns 71125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71083 ns 71375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112914 ns 113406.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 325333 ns 318209 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 320729.5 ns 319583.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318792 ns 318959 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 317333 ns 324583 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 193733 ns 193354.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1125 ns 1042 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1000 ns 958 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23845 ns 23463 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 8000 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 7875 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8417 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7750 ns 7792 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 262768.5 ns 259238.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 456417 ns 463917 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 472584 ns 473271 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 554479 ns 552125 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 550167 ns 532500 ns 1.03
batchedmm(128, Bsize=32)/forward/GPU/CUDA 128330 ns 129678.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1408750 ns 1387812.5 ns 1.02
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1380958 ns 1377541 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1632666.5 ns 1616416.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1597604 ns 1628791.5 ns 0.98
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274089 ns 274924 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 334 ns 333 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 333 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31588 ns 31727 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6083 ns 6292 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6333 ns 1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6458 ns 6542 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6125 ns 6125 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 263587.5 ns 261595.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1767792 ns 1723041.5 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1726375 ns 1728375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1725708 ns 1726646 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1773250 ns 1770333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168887 ns 168725.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4406958 ns 4375500 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4358916 ns 4362459 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4369792 ns 4361583.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4367125 ns 4405833 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1241756.5 ns 1247425 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6750 ns 6770.5 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7000 ns 6834 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6792 ns 9458 ns 0.72
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6750 ns 6791 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19512 ns 19612 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51584 ns 35416.5 ns 1.46
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 48771 ns 51458 ns 0.95
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33250 ns 64791 ns 0.51
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 52958 ns 51041 ns 1.04
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210086 ns 292950.5 ns 0.72
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 328750 ns 357250 ns 0.92
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 344958 ns 346916 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 408250 ns 408708 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 323500 ns 296583 ns 1.09
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18058 ns 18608 ns 0.97
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 719583.5 ns 717145.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 735666.5 ns 736354 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1034250 ns 1037916 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 684646 ns 678583 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 345041 ns 330667.5 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75459 ns 75250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75292 ns 74125 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75167 ns 75125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75333 ns 75417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46969 ns 47294 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 332833 ns 323916 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 325833 ns 328833 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 324583 ns 324333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 323834 ns 330666.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 207979 ns 212001.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1487708 ns 1489375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1530375 ns 1532667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1530750 ns 1529750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1466417 ns 1466083 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51505.5 ns 52463 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5146312.5 ns 5117041.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5151604.5 ns 5286709 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5003270.5 ns 5278500 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984709 ns 5018959 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 205494.5 ns 207470 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28334 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28333 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns 28292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24407 ns 25137 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66500 ns 66375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66375 ns 66417 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67458 ns 66333 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66417 ns 66417 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 525547 ns 515009.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1383749.5 ns 1500833 ns 0.92
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1059771 ns 1126208 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1061458 ns 1072167 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2248687.5 ns 2256270.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 581876.5 ns 588519 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3035479 ns 3091229.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2745250 ns 2749250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2740958 ns 2739000 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3811500 ns 3873500 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2064611 ns 2038545 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8921042 ns 8835083.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8776625 ns 8781542 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8768729.5 ns 8779291.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6359583 ns 6522313 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82083.5 ns 80792 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81562.5 ns 80833 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83125 ns 83792 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80583 ns 130000.5 ns 0.62
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192403.5 ns 193569 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2040625 ns 2027583 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1935354.5 ns 1958291 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2023083 ns 2028042 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2003562.5 ns 2038833.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 805958 ns 794524 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.
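
The last column in the rows above appears to be the first timing divided by the second, rounded to two decimals (for example, 33187 / 33420 ≈ 0.99). Below is a minimal sketch for re-checking that arithmetic on a single row. The helper names `parse_row` and `check_row` are hypothetical and not part of github-action-benchmark, and the row layout (name, two `ns` timings, ratio) is assumed from the lines shown here.

```python
import re

# Assumed row layout: "<benchmark name> <first> ns <second> ns <ratio>".
ROW = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<first>\d+(?:\.\d+)?) ns\s+"
    r"(?P<second>\d+(?:\.\d+)?) ns\s+"
    r"(?P<ratio>\d+(?:\.\d+)?)$"
)


def parse_row(line):
    """Split one benchmark row into (name, first_ns, second_ns, reported_ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return (
        m.group("name"),
        float(m.group("first")),
        float(m.group("second")),
        float(m.group("ratio")),
    )


def check_row(line, tol=0.005 + 1e-9):
    """Recompute first/second and compare it with the reported ratio.

    The tolerance allows for the reported value being rounded to two decimals.
    Returns (name, computed_ratio, matches_reported).
    """
    name, first, second, reported = parse_row(line)
    computed = first / second
    return name, computed, abs(computed - reported) <= tol


if __name__ == "__main__":
    row = "bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33250 ns 64791 ns 0.51"
    name, computed, ok = check_row(row)
    print(name, round(computed, 2), ok)
```

Rows whose ratio is far from 1 (such as the `bias_activation` entries at 0.51 and 1.46) are the ones worth re-running before concluding anything, since small kernels measured in the tens of microseconds can be noisy on shared CI runners.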
