Move strided batch pointer conversion to GPU #2608
9efb28b to 6d9b85d
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #2608 +/- ##
===========================================
+ Coverage 9.27% 73.51% +64.23%
===========================================
Files 157 157
Lines 15025 15220 +195
===========================================
+ Hits 1394 11189 +9795
+ Misses 13631 4031 -9600
☔ View full report in Codecov by Sentry.
CUDA.jl Benchmarks
Benchmark suite | Current: 8bac8cc | Previous: a0c2f4b | Ratio |
---|---|---|---|
latency/precompile | 45289278120.5 ns | 45297385295 ns | 1.00 |
latency/ttfp | 6403157132.5 ns | 6375596178 ns | 1.00 |
latency/import | 3034963126.5 ns | 3036561495 ns | 1.00 |
integration/volumerhs | 9566982 ns | 9567419 ns | 1.00 |
integration/byval/slices=1 | 146872 ns | 146746 ns | 1.00 |
integration/byval/slices=3 | 425466 ns | 425517.5 ns | 1.00 |
integration/byval/reference | 144701 ns | 145010 ns | 1.00 |
integration/byval/slices=2 | 286086 ns | 286216 ns | 1.00 |
integration/cudadevrt | 103540.5 ns | 103513 ns | 1.00 |
kernel/indexing | 14046 ns | 14419 ns | 0.97 |
kernel/indexing_checked | 15243 ns | 15499 ns | 0.98 |
kernel/occupancy | 683.4090909090909 ns | 748.2734375 ns | 0.91 |
kernel/launch | 2197.5555555555557 ns | 2194.6666666666665 ns | 1.00 |
kernel/rand | 15096 ns | 17335 ns | 0.87 |
array/reverse/1d | 19407.5 ns | 19412 ns | 1.00 |
array/reverse/2d | 24582 ns | 24576 ns | 1.00 |
array/reverse/1d_inplace | 10291 ns | 11029 ns | 0.93 |
array/reverse/2d_inplace | 12294 ns | 13223 ns | 0.93 |
array/copy | 20667 ns | 20740 ns | 1.00 |
array/iteration/findall/int | 158515 ns | 158179 ns | 1.00 |
array/iteration/findall/bool | 138431.5 ns | 138583 ns | 1.00 |
array/iteration/findfirst/int | 153794 ns | 153423 ns | 1.00 |
array/iteration/findfirst/bool | 154565 ns | 154821 ns | 1.00 |
array/iteration/scalar | 76025.5 ns | 77451 ns | 0.98 |
array/iteration/logical | 212805 ns | 216735 ns | 0.98 |
array/iteration/findmin/1d | 40734.5 ns | 41556.5 ns | 0.98 |
array/iteration/findmin/2d | 93523.5 ns | 94128 ns | 0.99 |
array/reductions/reduce/1d | 40736.5 ns | 42013 ns | 0.97 |
array/reductions/reduce/2d | 45975.5 ns | 51911 ns | 0.89 |
array/reductions/mapreduce/1d | 36066.5 ns | 39275 ns | 0.92 |
array/reductions/mapreduce/2d | 51367.5 ns | 49505.5 ns | 1.04 |
array/broadcast | 21420 ns | 21668 ns | 0.99 |
array/copyto!/gpu_to_gpu | 13403 ns | 11569 ns | 1.16 |
array/copyto!/cpu_to_gpu | 211626 ns | 211873 ns | 1.00 |
array/copyto!/gpu_to_cpu | 244722.5 ns | 245423 ns | 1.00 |
array/accumulate/1d | 109041.5 ns | 108388.5 ns | 1.01 |
array/accumulate/2d | 79870 ns | 79823 ns | 1.00 |
array/construct | 1190.05 ns | 1208.35 ns | 0.98 |
array/random/randn/Float32 | 47714 ns | 43873.5 ns | 1.09 |
array/random/randn!/Float32 | 26376 ns | 25937 ns | 1.02 |
array/random/rand!/Int64 | 26999 ns | 27271 ns | 0.99 |
array/random/rand!/Float32 | 8518.333333333334 ns | 8766.666666666666 ns | 0.97 |
array/random/rand/Int64 | 37559 ns | 29637 ns | 1.27 |
array/random/rand/Float32 | 12986 ns | 12723 ns | 1.02 |
array/permutedims/4d | 66577 ns | 66923 ns | 0.99 |
array/permutedims/2d | 56402 ns | 56439 ns | 1.00 |
array/permutedims/3d | 59002 ns | 58867 ns | 1.00 |
array/sorting/1d | 2920806 ns | 2933352 ns | 1.00 |
array/sorting/by | 3485108.5 ns | 3500830 ns | 1.00 |
array/sorting/2d | 1084755 ns | 1085059 ns | 1.00 |
cuda/synchronization/stream/auto | 1013.5 ns | 1038.4 ns | 0.98 |
cuda/synchronization/stream/nonblocking | 6452 ns | 6432 ns | 1.00 |
cuda/synchronization/stream/blocking | 789.0515463917526 ns | 807.5918367346939 ns | 0.98 |
cuda/synchronization/context/auto | 1173.7 ns | 1194.1 ns | 0.98 |
cuda/synchronization/context/nonblocking | 6578.2 ns | 6649.8 ns | 0.99 |
cuda/synchronization/context/blocking | 917.6136363636364 ns | 886.6415094339623 ns | 1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
464b2bf to db22eea
db22eea to 8bac8cc
Thanks! Couple of clean-ups and micro-optimizations:
This improves kernel execution time from 5 µs to 3.5 µs on your example.
Co-authored-by: Tim Besard <[email protected]>
Replacement of #2601 due to botched rebase.
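For readers unfamiliar with what a "strided batch pointer conversion" computes, here is a rough sketch of the arithmetic in plain Python. It is not the code from this PR: the function name `strided_batch_pointers`, its parameters, and the example base address are all hypothetical, and it assumes a dense column-major array whose last dimension is the batch index. On the GPU, each thread would presumably compute one entry of the result instead of the host looping over the batch.

```python
from math import prod

def strided_batch_pointers(base_ptr: int, dims: tuple, elsize: int) -> list:
    """Hypothetical helper: byte address of each batch slice in a dense,
    column-major array whose last dimension indexes the batch."""
    batch = dims[-1]
    stride = prod(dims[:-1])  # number of elements in one batch slice
    # One address per slice; a GPU kernel would assign one index i per thread.
    return [base_ptr + i * stride * elsize for i in range(batch)]

# Example: three 4x4 Float32 matrices (4 bytes per element) at a made-up address.
ptrs = strided_batch_pointers(0x1000, (4, 4, 3), 4)
```

Consecutive entries differ by `stride * elsize` bytes, 64 in this example, so the only inputs needed on the device are the base pointer and the slice stride.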