chore: bump codecov/codecov-action from 4 to 5 (#1093)
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4 to 5.

- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v4...v5)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
1 parent 636c9d1, commit cb0900f
Showing 9 changed files with 16 additions and 16 deletions.
Lux Benchmarks
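Each benchmark entry that follows lists a current timing, a previous timing, and a ratio. The reported ratios are consistent with the current timing divided by the previous one, rounded to two decimals, so values below 1 indicate a faster run. A minimal sketch that checks this reading against a few of the reported values (the function name `timing_ratio` is hypothetical, and the rounding convention is an assumption inferred from the numbers):

```python
def timing_ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous rounded to two decimals (assumed convention)."""
    return round(current_ns / previous_ns, 2)


# (current ns, previous ns, reported ratio), copied from entries in this report
samples = [
    (3875, 4250, 0.91),
    (340023312, 265154292, 1.28),
    (219166, 320208.5, 0.68),
]

for current, previous, reported in samples:
    # Each recomputed ratio should match the value shown in the report.
    assert timing_ratio(current, previous) == reported
```

Single-run differences on nanosecond-scale CPU benchmarks are often noise, so a ratio far from 1 on a very small timing is not by itself evidence of a regression or a speedup.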
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
3875
ns4250
ns0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4375
ns4292
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5083
ns5000
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4208
ns3916
ns1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60144
ns60054
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10625
ns10833
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10666
ns10042
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
11375
ns10792
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10334
ns10666.5
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
421452
ns425278
ns0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1250
ns1167
ns1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1292
ns1208
ns1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1250
ns1459
ns0.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1167
ns1208
ns0.97
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18149
ns18417
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4167
ns3917
ns1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4042
ns4000
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4292
ns4250
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3625
ns4125
ns0.88
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
109548
ns109745
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
56166
ns58250
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46709
ns46500
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46334
ns46917
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82291
ns83833.5
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37127
ns37085
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2031334
ns2032104.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2096166.5
ns2088666
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2086458
ns2082333
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1997167
ns2021708.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
197158.5
ns194358
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
143042
ns144167
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
145583.5
ns143333.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
146709
ns145792
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
149500
ns144875
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166231
ns166324.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1138708.5
ns1120666.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1128583
ns1116812.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1062083.5
ns1115750
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1115041.5
ns1153437.5
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
530934
ns524143
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3125
ns3834
ns0.82
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3458
ns3625
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4292
ns4334
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3375
ns3292
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70464
ns70563
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9208
ns9875
ns0.93
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8917
ns10375
ns0.86
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9125
ns9458
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9166
ns8542
ns1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
483194.5
ns479347
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15333
ns15583.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15458
ns15375
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
17333
ns17000
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17062.5
ns15250
ns1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
53962
ns54126
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
214583.5
ns213416
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
212667
ns214083.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214625
ns215583
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
225250
ns246541.5
ns0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
273370
ns271347.5
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
458
ns750
ns0.61
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
666
ns625
ns1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
750
ns750
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns709
ns0.71
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17502.5
ns17843
ns0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1542
ns1708
ns0.90
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1667
ns1541
ns1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1834
ns1791
ns1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1375
ns1458
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
101667.5
ns102473
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7125
ns7208
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5917
ns5833
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5792
ns5917
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9917
ns10292
ns0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23886
ns23188
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221417
ns221458
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228125
ns227959
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228666
ns228500
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
220500
ns214729
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
169891
ns169404
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3958
ns3916
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3958
ns3916
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3875
ns3916
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3875
ns3958
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23537
ns23907
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16750
ns16583
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
17042
ns16583
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16875
ns17042
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16750
ns16542
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
159725
ns162027
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
570333
ns569083
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
574000
ns578041
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
579125
ns573625
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
571125
ns570916
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113492
ns112937.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1428041
ns1422062.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1422333
ns1417459
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1423708
ns1420875
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1423458
ns1422667
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
208571.5
ns212002
ns0.98
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1051187.5
ns1076417
ns0.98
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
971896
ns970125
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1346062.5
ns1341062.5
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1306416
ns1282542
ns1.02
lenet(28, 28, 1, 64)/forward/GPU/CUDA
272301
ns274403
ns0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5990916
ns5768459
ns1.04
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4519875
ns4594917
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4948416.5
ns4948750
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5523125
ns5721687.5
ns0.97
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1070952
ns1071440
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
583
ns583
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23553
ns23971
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2084
ns2083
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2167
ns2125
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2167
ns2167
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2083
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
168963.5
ns174370
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
3875
ns4042
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4167
ns4459
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5250
ns4791.5
ns1.10
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
3666
ns3666
ns1
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
65091
ns65101
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11416
ns10917
ns1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11292
ns11500
ns0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12333.5
ns12333
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11209
ns11166
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
446962.5
ns449038
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6458.5
ns6583
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6792
ns6312.5
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7833.5
ns7833
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6250
ns6250
ns1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52555
ns52027
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16584
ns17125
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17791
ns16959
ns1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17375
ns18959
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
17125
ns17437.5
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
308634
ns297375.5
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
625
ns583
ns1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
666
ns542
ns1.23
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
583
ns625
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
625
ns541
ns1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32320
ns32771
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8541
ns8854.5
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9167
ns9500
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9500
ns8958
ns1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9479.5
ns8541.5
ns1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
159616
ns158837.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64750
ns64542
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64625
ns64417
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64292
ns64542
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64542
ns64750
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111041.5
ns111087
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
292000
ns277792
ns1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
292084
ns284292
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
275666
ns282125
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
275708
ns286333.5
ns0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
183441
ns185412.5
ns0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3191791
ns3283437
ns0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3043437.5
ns3018229
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3020437.5
ns3058917
ns0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4089708
ns4032979
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
601857
ns618259
ns0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7582625
ns7620500
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7473208.5
ns7434375
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7437833
ns7258208
ns1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8187292
ns8312542
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1317154
ns1382144
ns0.95
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18957000
ns18771167
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19047250
ns19155875
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19104542
ns19055084
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15686625
ns16613000
ns0.94
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23902625
ns23424834
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
34420458
ns34218917
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37002333
ns37348958
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34848770.5
ns35414708
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1857006
ns1860633
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
191696375.5
ns188862542
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
164341792
ns164640583.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152698167
ns152867000
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
439655916
ns449351167
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13895377
ns13884229
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
292126520.5
ns289481604.5
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
340023312
ns265154292
ns1.28
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
298857875
ns299135959
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
335240875
ns399738312
ns0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22250
ns21916
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
23083
ns23750
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
23959
ns25083
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
23417
ns22916.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
96101
ns97130.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103542
ns103125
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103541
ns103667
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104791
ns104771
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
113250
ns103396
ns1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
512131
ns503270
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5834
ns6375
ns0.92
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6375
ns6750
ns0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7000
ns6792
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6125
ns5584
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68297.5
ns67691
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15208
ns14875
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15750
ns15895.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16583
ns16520.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15062.5
ns15416
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
474148.5
ns474411
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3053958
ns2993375
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2089500
ns2048666.5
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2270042
ns2260292
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4804875
ns4882041
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
582756
ns586320.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23872458.5
ns23515125
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18056937.5
ns17982770.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
17766021
ns17986666
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35515208
ns36296250
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3103295.5
ns3101860
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33801000
ns33484041.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27630916.5
ns27547583
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27435750
ns27396833.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41597458
ns42046625.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
74917
ns71917
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
72541
ns73625
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
76416
ns75834
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
74375
ns74042
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
103583
ns103235
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221146
ns206687.5
ns1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
219166
ns320208.5
ns0.68
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
208875
ns208417
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
206542
ns205625
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
560403
ns548628.5
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12166
ns11791
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12208.5
ns13208
ns0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13167
ns12625
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12042
ns11792
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
71403
ns70856
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26979.5
ns26083
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
27167
ns27104.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27958.5
ns27584
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26459
ns26958
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
472464
ns471996
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12437.5
ns12125
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12979
ns12792
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14167
ns13417
ns1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12125
ns12459
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
53400
ns52898.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25625
ns25416
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26292
ns25833
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26416
ns26500
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26167
ns26208
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
306626.5
ns303358.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
180729
ns180458
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
182709
ns181792
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
183875
ns183125
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
180833
ns179833
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56252.5
ns56401
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
593541.5
ns582583.5
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
593916
ns583312.5
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
584021
ns583709
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
582917
ns584896
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
289288.5
ns286433.5
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6500
ns6084
ns1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6125
ns6625
ns0.92
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7792
ns7209
ns1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6145.5
ns5709
ns1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70132.5
ns70607
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14271
ns13666
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14916
ns14333
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15500
ns15209
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14000
ns14250
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
460852.5
ns461794
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1175354
ns1174166.5
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1353000
ns1239604
ns1.09
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1269979
ns1267334
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1317500
ns1308146
ns1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA
302455
ns301138
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4288500
ns4120792
ns1.04
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4366958
ns4346770.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4543917
ns4613625
ns0.98
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4469000
ns4699020.5
ns0.95
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1030148
ns1054798
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1792
ns1792
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1833
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
ns1833
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23497
ns24192
ns0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4834
ns4792
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
5041
ns4875
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4875
ns4916
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
185923.5
ns188688
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5500
ns6125
ns0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6167
ns5833
ns1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6459
ns7146
ns0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5583
ns5500
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
55454.5
ns55155.5
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10667
ns10833
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11750
ns11000
ns1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11458
ns11625
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10667
ns10708
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
337381
ns328957
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
375
ns333
ns1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
333
ns333
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
292
ns334
ns0.87
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22737
ns23157
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2708
ns2750
ns0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
3000
ns2792
ns1.07
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3000
ns3042
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2750
ns2750
ns1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
157057
ns159786.5
ns0.98
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11625 ns | 11583 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12250 ns | 11500 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12708 ns | 12646 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11417 ns | 11270.5 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 56422 ns | 57234 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24250 ns | 24750 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25208 ns | 24833.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25000 ns | 25500 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25437.5 ns | 24500 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 294376.5 ns | 295767.5 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4167 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 ns | 4250 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4250 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4208 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24716 ns | 25133 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16042 ns | 16083 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16417 ns | 16250 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16250 ns | 16291 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16167 ns | 16000 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 193381 ns | 196801.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5750 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5792 ns | 1.05 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5750 ns | 5792 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5791 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33569 ns | 33759 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20479.5 ns | 21166.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21000 ns | 20875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21208 ns | 21375 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21104.5 ns | 20792 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 174365.5 ns | 177483 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 375416.5 ns | 400729.5 ns | 0.94 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 374666.5 ns | 374229 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 488312.5 ns | 489041 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 524187.5 ns | 505209 ns | 1.04 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66372.5 ns | 66692.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 931978.5 ns | 976604.5 ns | 0.95 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 880291.5 ns | 885854.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1223791.5 ns | 1239959 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1351833.5 ns | 1414417 ns | 0.96 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 192149.5 ns | 190141.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81312.5 ns | 81625 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80750 ns | 81375 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 80792 ns | 81875 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80937 ns | 82792 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192807 ns | 193437.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1932917 ns | 1921958 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1916542 ns | 1883688 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926479 ns | 1929792 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1921042 ns | 1938584 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 394461 ns | 388434 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22118 ns | 22427.5 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1750 ns | 1792 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1833 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 166019.5 ns | 170483 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6250 ns | 6354.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7208 ns | 6750 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8166 ns | 7687.5 ns | 1.06 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6312.5 ns | 6479 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 57360.5 ns | 59523 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 9041 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9167 ns | 9083 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9208 ns | 9333 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 8833 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 301535 ns | 308338.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 156508063 ns | 119707062.5 ns | 1.31 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173937500 ns | 173955792 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148141208 ns | 148074917 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106478500 ns | 108269666 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5474150 ns | 5474309.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 673237875 ns | 617752083 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 556883000 ns | 555432583 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 453960458.5 ns | 451206208 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 759297583 ns | 776597541.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38204722 ns | 34955587.5 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 701496583 ns | 649274125 ns | 1.08 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 667076166 ns | 665965354.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 586800771 ns | 585624583.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 744632000 ns | 749969750 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56833 ns | 59208 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 48042 ns | 47875 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47125 ns | 48166 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84541 ns | 85167 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37576 ns | 37958 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1935541 ns | 1929270.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1985208 ns | 1968792 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1979834 ns | 1987333 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1893771 ns | 1916063 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 174934 ns | 175872 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267875 ns | 267875 ns | 1 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 288042 ns | 265500 ns | 1.08 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 270229.5 ns | 269958 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267250 ns | 266125 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 128767 ns | 129983.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 665041 ns | 585834 ns | 1.14 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 668958 ns | 595458 ns | 1.12 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589167 ns | 587916 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 596209 ns | 585000 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 703647.5 ns | 697007.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2205417 ns | 2148500 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2188541 ns | 2209500 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2100166.5 ns | 2103416 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2225499.5 ns | 2160708 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133307.5 ns | 133956 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5538625 ns | 5496208 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5527958 ns | 5493000 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5503250 ns | 5496042 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5491271 ns | 5572625 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 759584.5 ns | 737128.5 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 638667 ns | 639417 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 640458 ns | 657709 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 648875 ns | 639917 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 636167 ns | 638917 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47137 ns | 47806 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1796937.5 ns | 1824541 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1724292 ns | 1726917 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1720542 ns | 1719687.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2104520.5 ns | 2101292 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 218174.5 ns | 226913 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57000 ns | 58583 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46833 ns | 45167 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47083 ns | 47750 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84542 ns | 84958.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28335 ns | 29092 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2047750 ns | 2034917 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2077083 ns | 2064062.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2092083 ns | 2093625 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1939979 ns | 2025250 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191381.5 ns | 192854.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13410020.5 ns | 13439541.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12472750 ns | 12486020.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12570979 ns | 12585000 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15234500 ns | 15058604.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 512740.5 ns | 514768 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47584458 ns | 47224604.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41911083 ns | 41768333 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41152979.5 ns | 40759687.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58152541 ns | 59312833 ns | 0.98 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3249099 ns | 3244631 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74313208.5 ns | 73979958 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91931958.5 ns | 68237542 ns | 1.35 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 91156000 ns | 90322875 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76595709 ns | 77058250 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57334 ns | 58792 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47417 ns | 47083.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47250 ns | 47625 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84375 ns | 84834 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48075 ns | 47986 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1930959 ns | 1919646 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1977562.5 ns | 1965041.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1977250 ns | 1977583 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1816292 ns | 1902500 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196217.5 ns | 194479 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 334 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 292 ns | 1.43 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32756 ns | 32641 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6000 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6583 ns | 6083 ns | 1.08 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6542 ns | 6542 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6208 ns | 6083 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 178147.5 ns | 173275 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 291 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31948 ns | 32425 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2625 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2875 ns | 2666 ns | 1.08 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2917 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2584 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 164100 ns | 160394.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 323244146 ns | 287480500 ns | 1.12 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340740458 ns | 339790334 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314512041.5 ns | 314236083.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 271130916 ns | 270187875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7115553 ns | 7108297.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1053603541.5 ns | 989833917 ns | 1.06 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 941056333 ns | 940591916 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 854610104 ns | 853322000.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1162236250 ns | 1178549334 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33945165 ns | 34044401 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1364084083.5 ns | 1316176791.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1705661833 ns | 1348661437.5 ns | 1.26 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1621953875 ns | 1629837083 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1313183229.5 ns | 1293144333.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1410000 ns | 1406584 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1408291.5 ns | 1404458.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1453645.5 ns | 1409375 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1407209 ns | 1410334 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127861 ns | 127864 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5051959 ns | 5021959 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5013583.5 ns | 5007792 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5028416.5 ns | 5030667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5027271 ns | 5052000 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 604299 ns | 550210.5 ns | 1.10 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 161226250 ns | 174975458.5 ns | 0.92 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 131446875 ns | 131550875 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 127042083 ns | 129143375.5 ns | 0.98 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 155626750.5 ns | 161588000 ns | 0.96 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4974919.5 ns | 4877735 ns | 1.02 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 850481958 ns | 666469042 ns | 1.28 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 644255791 ns | 640200042 ns | 1.01 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 496077667 ns | 534233208 ns | 0.93 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 685984875 ns | 868077834 ns | 0.79 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 15948822 ns | 16128771 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9064833.5 ns | 8899521 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8770396 ns | 8695125 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7878104.5 ns | 7843000 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10163000 ns | 10351917 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1608837.5 ns | 1610313.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 37348729 ns | 36519833 ns | 1.02 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36970124.5 ns | 36646083 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33623167 ns | 33248208.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38875729.5 ns | 40043375 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6455570 ns | 6450575 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47375 ns | 47625 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47750 ns | 47375 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47583 ns | 47709 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47625 ns | 47292 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18855 ns | 19138 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50458 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50750 ns | 50292 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50416 ns | 53334 ns | 0.95 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50292 ns | 50416 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 202264 ns | 174370 ns | 1.16 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 6834 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7187.5 ns | 6584 ns | 1.09 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8417 ns | 7875.5 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6708 ns | 6687.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 108599.5 ns | 86131.5 ns | 1.26 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9604.5 ns | 9958 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10209 ns | 9584 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10292 ns | 10250 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10583 ns | 10042 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 610519 ns | 515499.5 ns | 1.18 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 6500 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6375 ns | 5708 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7583 ns | 7208 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5542 ns | 5333 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 131186.5 ns | 95565.5 ns | 1.37 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12875 ns | 12645.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13208 ns | 12542 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13583 ns | 13750 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12875 ns | 12937.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 530393 ns | 469414.5 ns | 1.13 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 958 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1167 ns | 1083 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1042 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1042 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32479.5 ns | 32890 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7833.5 ns | 7562.5 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 7750 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8083 ns | 8208 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7916 ns | 8000 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 216406.5 ns | 199158.5 ns | 1.09 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23042 ns | 23083 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23542 ns | 23083.5 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23333 ns | 23375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23375 ns | 23250 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 19066 ns | 18687 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52291.5 ns | 52292 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52500 ns | 52375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53166.5 ns | 52750 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52125 ns | 52375 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 309714.5 ns | 257093.5 ns | 1.20 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1413917 ns | 1406750 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1401104 ns | 1401479.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1457583.5 ns | 1402562.5 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1402271 ns | 1408270.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196285 ns | 196328 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5045083 ns | 5008146 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4724458 ns | 4702667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5023021 ns | 5024104 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4706104.5 ns | 5039750 ns | 0.93 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 644560.5 ns | 556806.5 ns | 1.16 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3086125.5 ns | 3030084 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2087104.5 ns | 2079354 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2281125 ns | 2282729.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4848375 ns | 4945062.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580262 ns | 581296 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24765000.5 ns | 24403958.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18889791.5 ns | 18897937.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19005084 ns | 18907333 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36681292 ns | 37159666 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3253871.5 ns | 3184896 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34537875 ns | 34104791.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28314500 ns | 28272250 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27967000 ns | 27994771 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41702500 ns | 42199625 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144041208 ns | 142261750 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 143168583 ns | 143002917 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 124247521 ns | 125056229 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173506729 ns | 168210729 ns | 1.03 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22768605 ns | 22549101 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 957619479 ns | 924197146 ns | 1.04 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1175957479.5 ns | 881679833 ns | 1.33 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 739734292 ns | 679210937 ns | 1.09 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 672317125 ns | 691445167 ns | 0.97 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118020449 ns | 118243896 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73979 ns | 78875 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75750 ns | 73916.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75416 ns | 76792 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72854.5 ns | 74292 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 300521.5 ns | 204610.5 ns | 1.47 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 287875 ns | 191000 ns | 1.51 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 285333 ns | 189917 ns | 1.50 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 204208 ns | 192291 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 287375 ns | 285687.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1342742 ns | 1149282.5 ns | 1.17 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 36185500 ns | 35495792 ns | 1.02 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35466000.5 ns | 35648750 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32336688 ns | 32319292 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40972250 ns | 41619250.5 ns | 0.98 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5837876 ns | 5843597 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 151179834 ns | 147692250 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 151456979 ns | 153061583 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 136606104 ns | 133656084 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287372208 ns | 228263645.5 ns | 1.26 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34877857 ns | 34880956 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 155986916 ns | 120866228.5 ns | 1.29 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174507459 ns | 174040042 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148111416.5 ns | 147840500 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 102908562.5 ns | 102334125 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5463707 ns | 5477094 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 520380250 ns | 470548750 ns | 1.11 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 465489750 ns | 467155625 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 439138000 ns | 436800021 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 742252417 ns | 762586062.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35175845 ns | 32255358.5 ns | 1.09 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 698201250 ns | 650080584 ns | 1.07 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 654820792 ns | 654363708.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 571273229.5 ns | 577159563 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 850215250 ns | 870785625 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1101520.5 ns | 1344042 ns | 0.82 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 970208.5 ns | 906229 ns | 1.07 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 920500 ns | 903583.5 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1945375.5 ns | 2053417 ns | 0.95 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 580245.5 ns | 582678.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2907896 ns | 2957750 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2595708 ns | 2598375 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2606333 ns | 2617083.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3655000 ns | 3768041.5 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1734207 ns | 1806655.5 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6744875 ns | 6643083 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6498208 ns | 6464750 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6503854.5 ns | 6500583.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4423604.5 ns | 4561542 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7167 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 6208 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958.5 ns | 6167 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9959 ns | 10375 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25201 ns | 26058 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212291 ns | 213208.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220750 ns | 220271 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220125 ns | 220708 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206792 ns | 206000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 262467.5 ns | 261050.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 316552750 ns | 311672646 ns | 1.02 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 221682708 ns | 221886520.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 187257688 ns | 182666958 ns | 1.03 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 311596375 ns | 306867104.5 ns | 1.02 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7676203 ns | 7678016 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1093022833.5 ns | 1080144500 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 911616145.5 ns | 906381437.5 ns | 1.01 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 815656375 ns | 825800000 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1161401125 ns | 1190620500 ns | 0.98 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26547253 ns | 26457172 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5292 ns | 5458 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5667 ns | 5125 ns | 1.11 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6625 ns | 6333 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5125 ns | 5083 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 167889.5 ns | 152413 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7083 ns | 7292 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns | 7208.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7459 ns | 7458 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7437.5 ns | 6875 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 650263 ns | 620099.5 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 541 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 709 ns | 542 ns | 1.31 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 625 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23809 ns | 24884 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9041.5 ns | 8833 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9791 ns | 8875 ns | 1.10 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9208.5 ns | 9708 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 9250 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 233459 ns | 225409.5 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351417 ns | 356375 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352250 ns | 354208 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 353063 ns | 353708.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 353333 ns | 352374.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21613 ns | 21669 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 791250 ns | 827958 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 808979 ns | 775125 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 773625 ns | 828062.5 ns | 0.93 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 824084 ns | 830458.5 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 305844 ns | 270211 ns | 1.13 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 314958 ns | 335583 ns | 0.94 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 333625 ns | 334417 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 448667 ns | 452750 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 331833 ns | 308896 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17811 ns | 17938 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 682125 ns | 685917 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 746791.5 ns | 740625 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1029167 ns | 1037042 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 700937.5 ns | 694791 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 273907.5 ns | 231915 ns | 1.18 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 328083 ns | 350459 ns | 0.94 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 348979 ns | 349333.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 424375 ns | 428792 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 370666 ns | 351729 ns | 1.05 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22237 ns | 22606 ns | 0.98 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 743604 ns | 750375 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 750229 ns | 744250 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1076375 ns | 1079542 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 822541 ns | 825979.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 220485.5 ns | 213391.5 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3334 ns | 3542 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3792 ns | 3583 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3625 ns | 3666 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3520.5 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 18068 ns | 17871 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4166 ns | 4104.5 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4542 ns | 4167 ns | 1.09 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4417 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4334 ns | 4083 ns | 1.06 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 278097 ns | 238254.5 ns | 1.17 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3292 ns | 3792 ns | 0.87 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3645.5 ns | 4125 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4708 ns | 4625 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4042 ns | 3500 ns | 1.15 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 212235.5 ns | 178544.5 ns | 1.19 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8042 ns | 8417 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8417 ns | 7958 ns | 1.06 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8792 ns | 8666 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8167 ns | 8458 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1255478 ns | 1052090.5 ns | 1.19 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204000 ns | 203583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 211375 ns | 214542 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211042 ns | 210292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200541 ns | 201875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34367 ns | 34516 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 605708.5 ns | 607937.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 625021 ns | 667208.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 620792 ns | 667479 ns | 0.93 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582583 ns | 631687 ns | 0.92 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 361289.5 ns | 291952 ns | 1.24 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 973333 ns | 972042 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 950209 ns | 932792 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 955541 ns | 955812.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1286000.5 ns | 1334188 ns | 0.96 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207830 ns | 207894 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4594084 ns | 4516458.5 ns | 1.02 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4500750.5 ns | 4463146 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4304583 ns | 4308209 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6304625 ns | 6464792 ns | 0.98 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 925479 ns | 938833.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3333 ns | 3500 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3583 ns | 3583 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4083 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3541 ns | 3125 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 240989.5 ns | 174208 ns | 1.38 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6875 ns | 7292 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7542 ns | 7084 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7667 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7042 ns | 6959 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1039649.5 ns | 935667 ns | 1.11 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1636792 ns | 1654291.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1175749.5 ns | 1178000 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1347167 ns | 1375667 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2463271 ns | 2330125 ns | 1.06 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213096 ns | 212833 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12388416 ns | 12374250 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9551437.5 ns | 9567834 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9305937.5 ns | 9311229.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18088000 ns | 18171313 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1951605 ns | 1941291 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17398084 ns | 17396166 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14348854.5 ns | 14397645.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14347271 ns | 14397458 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21112104 ns | 21079729.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 94729.5 ns | 87646.5 ns | 1.08 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90667 ns | 96250 ns | 0.94 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92375 ns | 94125 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 114395.5 ns | 133000 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125574 ns | 125997 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2039792 ns | 2029500 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1808208.5 ns | 2013042 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2033666.5 ns | 2027313 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2022500 ns | 2051291 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1052869 ns | 959259 ns | 1.10 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 326041.5 ns | 347583 ns | 0.94 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 344833 ns | 344146 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 396416 ns | 399083 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 314708 ns | 286854 ns | 1.10 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15677 ns | 16054 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 701042 ns | 705709 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 733209 ns | 728500 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1020500 ns | 1019208 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 656250 ns | 649146 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 196145.5 ns | 186898.5 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7084 ns | 7250 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5541 ns | 5834 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 5916 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10375 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34060 ns | 34237 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221166.5 ns | 222854 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220916.5 ns | 224917 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220167 ns | 233500 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217124.5 ns | 214667 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 344547 ns | 287423 ns | 1.20 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3708 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3709 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22568 ns | 22887 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14167 ns | 14417 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14375 ns | 14417 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14458 ns | 14375 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14416 ns | 14250 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 487124.5 ns | 433354.5 ns | 1.12 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 97500 ns | 138895.5 ns | 0.70 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 93417 ns | 93687.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96687.5 ns | 100229.5 ns | 0.96 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 91875 ns | 94083 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 124929 ns | 125360 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1940875 ns | 1922979 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1919916.5 ns | 1925979.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1931229.5 ns | 1923562.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1917271.5 ns | 1922291 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 955641 ns | 884217 ns | 1.08 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 854084 ns | 877312.5 ns | 0.97 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 826333 ns | 821375 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1211000 ns | 1222916.5 ns | 0.99 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 955354.5 ns | 940166 ns | 1.02 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 272141 ns | 270283.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2801124.5 ns | 2811896 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2515333 ns | 2435875 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3309625 ns | 3368479 ns | 0.98 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3416625 ns | 3411708.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1612126.5 ns | 1507174 ns | 1.07 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17062.5 ns | 17667 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16708.5 ns | 16271 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18937 ns | 18375 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15167 ns | 16937.5 ns | 0.90 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142123.5 ns | 141332.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223437.5 ns | 256250 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215958 ns | 216042 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216125 ns | 257583 ns | 0.84 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255708.5 ns | 221917 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 644779 ns | 582084 ns | 1.11 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222292 ns | 222250 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221750 ns | 221750 ns | 1 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222542 ns | 222250 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220917 ns | 219458 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 271274.5 ns | 260776.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 509083 ns | 509000 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 501292 ns | 553542 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 496750 ns | 559708.5 ns | 0.89 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 550583 ns | 503250 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1401190 ns | 1236203 ns | 1.13 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 304437.5 ns | 337667 ns | 0.90 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 331687.5 ns | 332458 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 376292 ns | 376750 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 321812.5 ns | 297895.5 ns | 1.08 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16554 ns | 16751 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 708875 ns | 715896 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 736875 ns | 727792 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1020209 ns | 1021333 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 668458 ns | 663750 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 196065 ns | 191390 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17854 ns | 18500 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18520.5 ns | 18000 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19667 ns | 19125 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16209 ns | 17520.5 ns | 0.93 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146750.5 ns | 144715.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 247604 ns | 221250 ns | 1.12 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212500 ns | 211979 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212917 ns | 225167 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 211750.5 ns | 230917 ns | 0.92 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1011803 ns | 877142.5 ns | 1.15 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4125 ns | 4208 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4125 ns | 4875 ns | 0.85 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5187.5 ns | 5250 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4084 ns | 4020.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 201325 ns | 180911 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10667 ns | 10417 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10875 ns | 10479.5 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 11167 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10375 ns | 10208.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1050725 ns | 1008910 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3375 ns | 3375 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3625 ns | 3541 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4167 ns | 3958 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3291 ns | 3312.5 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 242454 ns | 218262 ns | 1.11 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7292 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7666 ns | 7292 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7959 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333 ns | 7292 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1067571 ns | 1066214 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24057353.5 ns | 23448500 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34753459 ns | 35001375 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37792125 ns | 37680292 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34828583.5 ns | 35380500 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1854184 ns | 1853791.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 187222542 ns | 184430542 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 160010375 ns | 159371667 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146721854.5 ns | 146466937 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 412776417 ns | 422477750 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16508303 ns | 16507463.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 437495583 ns | 426064458 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253838438 ns | 254339271 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232343979.5 ns | 232745062.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 483540875 ns | 496585354 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183854 ns | 183208 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183625 ns | 186646 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185334 ns | 185333 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 184167 ns | 183875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 220968 ns | 200485.5 ns | 1.10 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 594000 ns | 597479 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 632437.5 ns | 587062.5 ns | 1.08 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586084 ns | 635083.5 ns | 0.92 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 628500 ns | 621062 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1061303.5 ns | 1047688.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3892042 ns | 3838792 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3642708 ns | 3832229 ns | 0.95 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3572042 ns | 3508709 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5353250 ns | 5482833 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 549368 ns | 553997 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17901624.5 ns | 17434062.5 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17281292 ns | 17172083.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16574875 ns | 16682312 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22050250 ns | 23187875 ns | 0.95 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2630980 ns | 2617230 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 542 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 583 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 542 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31762 ns | 32155 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9145.5 ns | 8729.5 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9208 ns | 8958 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9875 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9208 ns | 9000 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 262912.5 ns | 263867.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 505346750 ns | 497911458 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 429818666.5 ns | 429219083.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 433256333.5 ns | 375191709 ns | 1.15 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 677373875 ns | 681622604 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12487373 ns | 12475655 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2066713500 ns | 2053970374.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1635890000 ns | 1638205959 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1494391792 ns | 1496730145.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2208031208.5 ns | 2238026229 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49163495.5 ns | 49241901 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1632500.5 ns | 1664062.5 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1173583 ns | 1174542 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1383958 ns | 1401625.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2483292 ns | 2450125 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214736 ns | 217292.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12776042 ns | 12720875 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9939062.5 ns | 9941146 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9686917 ns | 9680542 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18349375 ns | 18500792 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2056758 ns | 2013409 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17758729.5 ns | 17699916.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14689958 ns | 14725083 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14551125 ns | 14579208 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21399666 ns | 22341458 ns | 0.96 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26333 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24146 ns | 24007 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66791 ns | 66917 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67292 ns | 67500 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68417 ns | 67334 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66709 ns | 66750 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 391053.5 ns | 393891.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204333 ns | 204125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210125 ns | 211667 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209458 ns | 209208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 198792 ns | 199083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26289 ns | 26011 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 642083 ns | 646792 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 624354.5 ns | 622583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621729.5 ns | 632375 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 627000.5 ns | 636417 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 357106 ns | 350976 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 645625 ns | 651500 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 636292 ns | 633916.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 602667 ns | 637250 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 672375 ns | 643083.5 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132245.5 ns | 132301.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2294979 ns | 2256000 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2157208 ns | 2122875.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2246208 ns | 2253625 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2249458 ns | 2306375 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1236985 ns | 1180087 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17937.5 ns | 17958 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18416.5 ns | 18208 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20083 ns | 20541 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18895.5 ns | 19458 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145580 ns | 146027 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 259583 ns | 262166.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 261791 ns | 219542 ns | 1.19 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219084 ns | 231375 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257520.5 ns | 229229.5 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1034996 ns | 996296 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
542
ns583
ns0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
667
ns584
ns1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
583
ns500
ns1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
23604
ns23586
ns1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
9750
ns9708
ns1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
10292
ns9542
ns1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
10250
ns9875
ns1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
9333
ns9709
ns0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
260113.5
ns259286.5
ns1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5083.5
ns5500
ns0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5792
ns5875
ns0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6833
ns7000
ns0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5375
ns4792
ns1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
229273.5
ns225853.5
ns1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6709
ns6875
ns0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7667
ns7250
ns1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7583
ns7542
ns1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6937.5
ns7125
ns0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
777061.5
ns770392.5
ns1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1917
ns2000
ns0.96
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
2500
ns2125
ns1.18
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
2208
ns2250
ns0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
2250
ns2312.5
ns0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA
18340
ns18125
ns1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
6542
ns6333
ns1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
6667
ns6584
ns1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6666
ns6542
ns1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
6584
ns6416
ns1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA
320616.5
ns321687.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
750542
ns746833.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
746792
ns747125
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
746916
ns749895.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
750584
ns761667
ns0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
21795
ns21408
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
805145.5
ns791208
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
791604
ns793145.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
772584
ns791625
ns0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
810645.5
ns794917
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
302046.5
ns294715
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
6959
ns7250
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5917
ns5875
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6000
ns5833
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10167
ns10584
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
32896
ns32998
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
228770.5
ns262333
ns0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
227709
ns228875
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228084
ns237083.5
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
225625.5
ns257792
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
359979
ns357007.5
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10250
ns10334
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
10208
ns10208
ns1
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
11042
ns11000
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
9958
ns9875
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
245976
ns239662.5
ns1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24896
ns24417
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24000
ns24292
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25416.5
ns26292
ns0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24625
ns24958
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1114734
ns1082647
ns1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
106794687
ns107049500
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
118367979
ns117776354.5
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
120992291
ns120966209
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
118045833
ns118076875
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2655666
ns2634519
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
397097667
ns392901583
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
368138875
ns366575334
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
357737125
ns425037083
ns0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
483722209
ns488336500
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15195689
ns15175401
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
769405854
ns759144667
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
762934333
ns580473834
ns1.31
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
748099729.5
ns745822521
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
772112770.5
ns776881791.5
ns0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6417
ns6792
ns0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7375
ns7334
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8187
ns7916
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8708.5
ns7771
ns1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
243458.5
ns232079
ns1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13625
ns13667
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14834
ns13875
ns1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14834
ns14666
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14000
ns13958
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1081512.5
ns1036141
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5500
ns6042
ns0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6083.5
ns6083
ns1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7500
ns7500
ns1
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5625
ns5875
ns0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
236881
ns228105
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12583
ns12208
ns1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
12750
ns12292
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13000
ns12792
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12542
ns12208
ns1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
792100
ns752122.5
ns1.05
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
328937.5
ns352541.5
ns0.93
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
345250
ns343312.5
ns1.01
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
398625
ns401083
ns0.99
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
315687.5
ns289708
ns1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA
17026
ns16741
ns1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
701750
ns707250
ns0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
734417
ns723354
ns1.02
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
1025666
ns1029646
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
663750
ns651000
ns1.02
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
202330
ns196251.5
ns1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns292
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
417
ns333
ns1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
416
ns375
ns1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns292
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
23795
ns23593
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6250
ns6166.5
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6750
ns6334
ns1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6500
ns6542
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6104.5
ns6145.5
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
242897.5
ns238191.5
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5875
ns5792
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6042
ns5834
ns1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
5917
ns5917
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5875
ns5833
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
24778
ns24230
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
21834
ns21291
ns1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21542
ns21167
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21750
ns21750
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21417
ns21312.5
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
265364.5
ns261937.5
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
184375
ns173854.5
ns1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
185000
ns144500
ns1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
149541
ns150042
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
190750
ns186208
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168165
ns167345
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1361667
ns1326917
ns1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1306875.5
ns1312688
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1318541.5
ns1318666
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1332084
ns1368667
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1372553
ns1291618
ns1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
24458
ns24520.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22729
ns22979.5
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25000
ns23500
ns1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22374.5
ns22542
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
355948
ns288056
ns1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
176958
ns177417
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
131167
ns127375
ns1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
126166.5
ns128042
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
177542
ns183542
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1491511
ns1415779
ns1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
ns292
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
417
ns375
ns1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
333
ns292
ns1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23138
ns23671
ns0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6125
ns6083.5
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6917
ns6541
ns1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6667
ns6625
ns1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6250
ns6084
ns1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
259300
ns259434
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4458
ns4750
ns0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4875
ns5000
ns0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5708.5
ns5292
ns1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4833
ns4750
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
258768.5
ns245363
ns1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9709
ns9917
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10083
ns9667
ns1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10417
ns10375
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10041.5
ns10292
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1358754
ns1314812
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1625
ns1583
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1666
ns1584
ns1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1667
ns1625
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1583
ns1625
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
23306
ns24016
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5625
ns5666
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
6125
ns5750
ns1.07
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6041
ns6041
ns1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5625
ns5625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
275587
ns277849.5
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6813916.5
ns6854521
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6428416
ns6386541.5
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6554167
ns6525687
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7571104.5
ns7618792
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
213811
ns215416
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24163500
ns24090500
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21359167
ns21303500
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
21066083
ns21036500
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29670209
ns29890395.5
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2101483
ns2106254.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
37462416
ns37262000
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
45862833.5
ns34088667
ns1.35
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45876667
ns45642416
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
38235959
ns38194208
ns1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5459
ns5625
ns0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6250
ns6250
ns1
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6958
ns7354.5
ns0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5292
ns5959
ns0.89
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
238588.5
ns230775.5
ns1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7959
ns7750
ns1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8334
ns8750
ns0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8250
ns8917
ns0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8250
ns8000
ns1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
1068264.5
ns1028275
ns1.04
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1529292
ns1567584
ns0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1266666.5
ns1259666
ns1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1623709
ns1635083
ns0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2163750
ns2158312.5
ns1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA
279544
ns279134
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7968292
ns7896167
ns1.01
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6533250
ns6584917
ns0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7125792
ns7159250
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10479375
ns10512687
ns1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1874497
ns1840167
ns1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
320667
ns345250
ns0.93
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
346291
ns346354
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
428584
ns390416
ns1.10
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
345375
ns318708
ns1.08
batchedmm(128, Bsize=4)/forward/GPU/CUDA
46619.5
ns47055.5
ns0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
745958.5
ns727521
ns1.03
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
791666.5
ns784708.5
ns1.01
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1073208.5
ns1082625
ns0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
776479
ns770875
ns1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
311670
ns301195.5
ns1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
396708.5
ns396958
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
287917
ns288000
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288250
ns288083
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
753417
ns749292
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
44556
ns43886
ns1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
645167
ns662916
ns0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
527667
ns527125
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
532000
ns530875
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
974292
ns974417
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
190424
ns189409.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
668958
ns654312
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
629749.5
ns666542
ns0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
544375
ns634583.5
ns0.86
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
643396
ns668771
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
132592.5
ns131826.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2485646
ns2477937.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2448562.5
ns2464542
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2450292
ns2457437.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2461146
ns2485229.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1408688
ns1437399.5
ns0.98
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
324000.5
ns343542
ns0.94
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
344459
ns340729.5
ns1.01
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
396583
ns397417
ns1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
314083.5
ns287750
ns1.09
batchedmm(2, Bsize=32)/forward/GPU/CUDA
16193
ns16030
ns1.01
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
700875
ns705459
ns0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
734292
ns728458
ns1.01
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
1020625
ns1024291
ns1.00
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
656584
ns650375
ns1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
201017
ns194160
ns1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1461042
ns1463292
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1503750
ns1505458
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1504625
ns1502125
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1442917
ns1441292
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
40991
ns39983.5
ns1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5155750
ns5129459
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5279833.5
ns5289916
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5308333.5
ns5304042
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4987604
ns5024750
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
200839
ns197042
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3750
ns3667
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3709
ns3708
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3667
ns3708
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3708
ns3708
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
33187
ns33420
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
14958
ns15042
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15395.5
ns15042
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15375
ns15500
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15083
ns14916
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
379072.5
ns365311.5
ns1.04
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
71541
ns71250
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
71542
ns71042
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
71270.5
ns71125
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
71083
ns71375
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
112914
ns113406.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
325333
ns318209
ns1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
320729.5
ns319583.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
318792
ns318959
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
317333
ns324583
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
193733
ns193354.5
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
1000
ns958
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1125
ns1042
ns1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1083
ns1083
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
1000
ns958
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
23845
ns23463
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7750
ns8000
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8583
ns7875
ns1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8500
ns8417
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7750
ns7792
ns0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
262768.5
ns259238.5
ns1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
456417
ns463917
ns0.98
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
472584
ns473271
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
554479
ns552125
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
550167
ns532500
ns1.03
batchedmm(128, Bsize=32)/forward/GPU/CUDA
128330
ns129678.5
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
1408750
ns1387812.5
ns1.02
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1380958
ns1377541
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1632666.5
ns1616416.5
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
1597604
ns1628791.5
ns0.98
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
274089
ns274924
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
334
ns333
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
417
ns375
ns1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
333
ns333
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
31588
ns31727
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6083
ns6292
ns0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6750
ns6333
ns1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6458
ns6542
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6125
ns6125
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
263587.5
ns261595.5
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1767792
ns1723041.5
ns1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1726375
ns1728375
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1725708
ns1726646
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1773250
ns1770333
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168887
ns168725.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4406958
ns4375500
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4358916
ns4362459
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4369792
ns4361583.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4367125
ns4405833
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1241756.5
ns1247425
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6750
ns6770.5
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
7000
ns6834
ns1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
6792
ns9458
ns0.72
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6750
ns6791
ns0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
19512
ns19612
ns0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
51584
ns35416.5
ns1.46
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
48771
ns51458
ns0.95
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
33250
ns64791
ns0.51
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
52958
ns51041
ns1.04
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
210086
ns292950.5
ns0.72
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
328750
ns357250
ns0.92
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
344958
ns346916
ns0.99
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
408250
ns408708
ns1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
323500
ns296583
ns1.09
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18058
ns18608
ns0.97
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
719583.5
ns717145.5
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
735666.5
ns736354
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
1034250
ns1037916
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
684646
ns678583
ns1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
345041
ns330667.5
ns1.04
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75459 ns | 75250 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75292 ns | 74125 ns | 1.02 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75167 ns | 75125 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75333 ns | 75417 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46969 ns | 47294 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 332833 ns | 323916 ns | 1.03 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 325833 ns | 328833 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324583 ns | 324333 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 323834 ns | 330666.5 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 207979 ns | 212001.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1487708 ns | 1489375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1530375 ns | 1532667 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1530750 ns | 1529750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1466417 ns | 1466083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51505.5 ns | 52463 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5146312.5 ns | 5117041.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5151604.5 ns | 5286709 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5003270.5 ns | 5278500 ns | 0.95 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4984709 ns | 5018959 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 205494.5 ns | 207470 ns | 0.99 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28125 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28334 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28333 ns | 28208 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28167 ns | 28292 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24407 ns | 25137 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66500 ns | 66375 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66375 ns | 66417 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67458 ns | 66333 ns | 1.02 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66417 ns | 66417 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 525547 ns | 515009.5 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1383749.5 ns | 1500833 ns | 0.92 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1059771 ns | 1126208 ns | 0.94 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1061458 ns | 1072167 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2248687.5 ns | 2256270.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 581876.5 ns | 588519 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3035479 ns | 3091229.5 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2745250 ns | 2749250 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2740958 ns | 2739000 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3811500 ns | 3873500 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2064611 ns | 2038545 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8921042 ns | 8835083.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8776625 ns | 8781542 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8768729.5 ns | 8779291.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6359583 ns | 6522313 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82083.5 ns | 80792 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81562.5 ns | 80833 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83125 ns | 83792 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80583 ns | 130000.5 ns | 0.62 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192403.5 ns | 193569 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2040625 ns | 2027583 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1935354.5 ns | 1958291 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2023083 ns | 2028042 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2003562.5 ns | 2038833.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 805958 ns | 794524 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
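The last number in each benchmark row appears to be the first timing divided by the second, rounded to two decimals, so a value below 1 would mean the first measurement was faster. The sketch below recomputes that figure for three rows copied from the results above. Reading the two timings as "current" and "previous" is an assumption inferred from the numbers, and `ratio` is a hypothetical helper, not part of github-action-benchmark.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous, rounded to two decimals.

    Assumption: the first timing in a row is the current run and the
    second is the previous run; the rows above are consistent with this.
    """
    return round(current_ns / previous_ns, 2)


# (first timing, second timing) in nanoseconds, copied from the rows above.
rows = {
    "dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)": (75292, 74125),
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)": (1383749.5, 1500833),
    "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)": (80583, 130000.5),
}

for name, (current_ns, previous_ns) in rows.items():
    print(f"{name}: {ratio(current_ns, previous_ns):.2f}")
```

Run against these three rows, the recomputed values match the reported 1.02, 0.92, and 0.62.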