A faster copyto_unaliased! #41434
```julia
julia> a = randn(50,50); b = randn(50,50); # 2D case

# before
julia> @btime $a[1:end,1:end] .= $b;
  4.900 μs (0 allocations: 0 bytes)

julia> @btime $a[:,:] .= $b;
  721.642 ns (0 allocations: 0 bytes)

# after
julia> @btime $a[1:end,1:end] .= $b;
  355.238 ns (0 allocations: 0 bytes)

julia> @btime $a[:,:] .= $b;
  351.643 ns (0 allocations: 0 bytes)

julia> a = randn(50*50); b = randn(50*50); # 1D case

# before
julia> @btime $a .= $b;
  406.566 ns (0 allocations: 0 bytes)

# after
julia> @btime $a .= $b;
  270.607 ns (0 allocations: 0 bytes)
```
The failure in mpfr seems unrelated?
Can you add a 3d benchmark? Other than that, this looks good.
It seems the speed of the general broadcast kernel is strongly affected by the array's shape.
Why did you close this? |
Just to revert the branch, and try to speed up
Oops, it seems we can't use
base/broadcast.jl (Outdated)

```diff
@@ -991,11 +991,17 @@ preprocess_args(dest, args::Tuple{}) = ()
 # Specialize this method if all you want to do is specialize on typeof(dest)
 @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
     axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
-    # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
+    # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match.
+    # However copyto!(dest, A) is very slow in many cases, implement a faster version here.
```
Wouldn't it be better to speed up `copyto!(dest, A)`?
In fact, I tried, but the problem seems to be a little complicated, given the following facts:

- `copyto!(::Array, ::Array)` calls C's `memmove`, but it seems slower than a single loop:

```julia
julia> a = randn(1000); b = similar(a);

julia> @btime copyto!($b, $a); # this calls C's memmove
  120.524 ns (0 allocations: 0 bytes)

julia> @btime copyto!(IndexLinear(), $b, IndexLinear(), $a); # this calls Julia's copyto_unalias!
  74.512 ns (0 allocations: 0 bytes)
```

It seems we could use C's `memcpy` instead of `memmove` to solve this.
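As an illustration of that idea (my own sketch, not code from this PR — `memcpy_copyto!` and its guards are hypothetical), a non-aliasing copy via `memcpy` might look like:

```julia
# Sketch: copy bits-type Arrays via C's memcpy, which, unlike memmove,
# is allowed to assume dest and src do not overlap. Only safe when the
# caller has already established that the two arrays do not alias.
function memcpy_copyto!(dest::Array{T}, src::Array{T}) where {T}
    isbitstype(T) || throw(ArgumentError("memcpy is only safe for bits types"))
    length(dest) == length(src) || throw(DimensionMismatch("lengths must match"))
    GC.@preserve dest src ccall(:memcpy, Ptr{Cvoid},
        (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
        pointer(dest), pointer(src), sizeof(src))
    return dest
end
```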
- For a general `AbstractArray`, I can't find a way to accelerate `copyto_unalias!(dest, A)` without `@simd`, as:

```julia
a = randn(40,40); b = similar(a);

f1!(a,b) = @inbounds for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end

f2!(a,b) = @inbounds @simd for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end

@btime f1!($a, $b) # 247.132 ns (0 allocations: 0 bytes)
@btime f2!($a, $b) # 220.548 ns (0 allocations: 0 bytes) a little faster
@btime @views f1!($a[1:end,1:end], $b[1:end,1:end]) # 1.710 μs (0 allocations: 0 bytes)
@btime @views f2!($a[1:end,1:end], $b[1:end,1:end]) # 262.500 ns (0 allocations: 0 bytes) much faster
```
Why can't you use `@simd`?
The buildbot throws an error that `@simd` is not defined. I'm not familiar with the build system, but it seems to be a world age problem? (abstractarray.jl is included before simdloop.jl in Base.jl)
Sounds just like a bootstrapping problem, i.e. `@simd` is not defined until later in the build process. That should be fixable, e.g. by moving the `copyto!` definition to a file loaded later.
Put SimdLoop in advance to avoid bootstrapping (I hope this works)
Focus on copyto_unalias!, put SimdLoop in advance.
Move some copyto! definitions to other place.
I think this is #38073, cc @kimikage

On 1.5:

On 1.6:
Some more benchmarks. It seems 1.6.1 is stably slower than 1.5.4 even with a `view`:

```julia
julia> a = randn(4,4,4,4,4,4); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);

julia> @btime foo!($b,$a);     # 1.6.1: 9.200 μs    1.5.4: 3.112 μs

julia> @btime simdfoo!($b,$a); # 1.6.1: 3.763 μs    1.5.4: 6.040 μs

julia> a = randn(5,5,5,5,5); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);

julia> @btime foo!($b,$a);     # 1.6.1: 6.000 μs    1.5.4: 2.144 μs

julia> @btime simdfoo!($b,$a); # 1.6.1: 2.378 μs    1.5.4: 3.287 μs

julia> a = randn(8,8,8,8); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);

julia> @btime foo!($b,$a);     # 1.6.1: 6.420 μs    1.5.4: 1.840 μs

julia> @btime simdfoo!($b,$a); # 1.6.1: 1.990 μs    1.5.4: 2.678 μs

julia> a = randn(16,16,16); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);

julia> @btime foo!($b,$a);     # 1.6.1: 5.833 μs    1.5.4: 961.857 ns

julia> @btime simdfoo!($b,$a); # 1.6.1: 747.619 ns  1.5.4: 999.900 ns

julia> a = randn(64,64); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);

julia> @btime foo!($b,$a);     # 1.6.1: 3.737 μs    1.5.4: 5.650 μs

julia> @btime simdfoo!($b,$a); # 1.6.1: 512.500 ns  1.5.4: 699.306 ns
```
I tried to remove every `@simd`: the "safe" version (`safesimd` below) only expands the loop manually (like `for i in 1:10 ... end`) but does not mark it as SIMD. Some benchmarks:

```
size:(8, 8, 8, 8) -> size:(16, 16, 16)
Car -> Car : before: 23.8μs  -> safesimd: 8.27μs  -> simd: 7.6μs
Car -> Lin : before: 11.8μs  -> safesimd: 1.91μs  -> simd: 1.9μs
Lin -> Car : before: 10.5μs  -> safesimd: 0.917μs -> simd: 1.03μs
Lin -> Lin : before: 1.25μs  -> safesimd: 0.535μs -> simd: 0.535μs

size:(8, 8, 8, 7) -> size:(16, 16, 16)
Car -> Car : before: 20.9μs  -> safesimd: 7.38μs  -> simd: 6.78μs
Car -> Lin : before: 10.5μs  -> safesimd: 1.7μs   -> simd: 1.7μs
Lin -> Car : before: 9.2μs   -> safesimd: 0.926μs -> simd: 0.928μs
Lin -> Lin : before: 1.04μs  -> safesimd: 0.489μs -> simd: 0.493μs

size:(15, 16, 16) -> size:(8, 8, 8, 8)
Car -> Car : before: 20.5μs  -> safesimd: 8.33μs  -> simd: 7.08μs
Car -> Lin : before: 9μs     -> safesimd: 1.62μs  -> simd: 1.63μs
Lin -> Car : before: 11.8μs  -> safesimd: 2.29μs  -> simd: 2.3μs
Lin -> Lin : before: 1.11μs  -> safesimd: 0.511μs -> simd: 0.514μs

size:(4, 4, 4, 4, 4, 4) -> size:(4, 4, 4, 4, 4, 4)
Car -> Car : before: 9.6μs   -> safesimd: 3.96μs  -> simd: 3.98μs
Car -> Lin : before: 18.9μs  -> safesimd: 3.38μs  -> simd: 3.58μs
Lin -> Car : before: 19.2μs  -> safesimd: 5.57μs  -> simd: 5.68μs
Lin -> Lin : before: 1.24μs  -> safesimd: 0.539μs -> simd: 0.541μs

size:(5, 5, 5, 5, 5) -> size:(5, 5, 5, 5, 5)
Car -> Car : before: 6.24μs  -> safesimd: 2.42μs  -> simd: 2.56μs
Car -> Lin : before: 13μs    -> safesimd: 2.19μs  -> simd: 2.33μs
Lin -> Car : before: 11.1μs  -> safesimd: 2.74μs  -> simd: 2.73μs
Lin -> Lin : before: 0.913μs -> safesimd: 0.414μs -> simd: 0.413μs

size:(8, 8, 8, 8) -> size:(8, 8, 8, 8)
Car -> Car : before: 7.45μs  -> safesimd: 2.1μs   -> simd: 2.1μs
Car -> Lin : before: 12.1μs  -> safesimd: 2.03μs  -> simd: 1.92μs
Lin -> Car : before: 12.9μs  -> safesimd: 2.49μs  -> simd: 2.61μs
Lin -> Lin : before: 1.93μs  -> safesimd: 0.55μs  -> simd: 0.546μs

size:(16, 16, 16) -> size:(16, 16, 16)
Car -> Car : before: 4.3μs   -> safesimd: 0.83μs  -> simd: 0.822μs
Car -> Lin : before: 9.6μs   -> safesimd: 0.914μs -> simd: 0.737μs
Lin -> Car : before: 11.4μs  -> safesimd: 0.983μs -> simd: 1μs
Lin -> Lin : before: 1.93μs  -> safesimd: 0.541μs -> simd: 0.538μs

size:(64, 64) -> size:(64, 64)
Car -> Car : before: 2.19μs  -> safesimd: 0.54μs  -> simd: 0.537μs
Car -> Lin : before: 7.82μs  -> safesimd: 0.534μs -> simd: 0.538μs
Lin -> Car : before: 9.7μs   -> safesimd: 0.585μs -> simd: 0.586μs
Lin -> Lin : before: 1.24μs  -> safesimd: 0.537μs -> simd: 0.53μs
```
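For reference, a minimal sketch of what separating the Cartesian loop means: split off the first dimension so the inner loop is a plain range that the compiler can vectorize without `@simd`. This is my own illustration, not the PR's actual implementation, and `split_copyto!` is a hypothetical name (it assumes at least one dimension and matching axes):

```julia
# Sketch: iterate the trailing dimensions as CartesianIndices and run a
# simple inner loop over the first dimension. The inner loop is a plain
# unit range, which LLVM can often vectorize without any @simd annotation.
function split_copyto!(dest::AbstractArray{T,N}, src::AbstractArray{T,N}) where {T,N}
    axes(dest) == axes(src) || throw(DimensionMismatch("axes must match"))
    inds  = CartesianIndices(src)
    ax1   = inds.indices[1]                            # first-dimension range
    outer = CartesianIndices(Base.tail(inds.indices))  # remaining dimensions
    @inbounds for Itail in outer
        for i in ax1                                   # easily vectorized inner loop
            dest[i, Itail] = src[i, Itail]
        end
    end
    return dest
end
```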
remove `@simd`
fix for 0d Cartesian AbstractArray. This version should be fast enough if the size of Cartesian array's first dim is larger than 16 (eltype Float64).
white space
Fix for other IndexStyle. Only use manually expanded version when the size of 1st dim >=16
fix typo error.
add test for expanded version.
white space
fix white space, typo error; add test;
At present, `a .= b` falls back to `copyto!(a, b)` if `axes(a) == axes(b)`, as a performance optimization. However, as mentioned in #40962 and #39345, the performance of `copyto!(a, b)` is very low in many cases. Thus, I think we can implement a faster version here, based on `@simd`, to solve the problem to some extent. (I tried to optimize `copyto_unalias!`, but much of the generated code is not vectorized without `@simd`.)

This PR tries to speed up `copyto_unaliased!` in many cases:

- Use `Base.OneTo` as the iterator instead of `LinearIndices` for faster speed.
- A `zip()`-based iterator is used to speed up the most general case.
- The loop over `CartesianIndices` is separated manually like `@simd` (but adding no `Expr(:loopinfo)`) to speed up the copy if the 1st dim's length is larger than 16.

The above benchmarks, using `SubArray{Float64}`, were done on 1.7.0-beta3, where the performance regression on `CartesianIndices` iteration (#38073) seems to be fixed.
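The `zip()`-based general case described above can be sketched as follows (a simplified illustration, not the PR's code; `zip_copyto!` is a hypothetical name, and the length check stands in for the PR's real shape handling):

```julia
# Sketch: walk both arrays' native index spaces in lockstep, so each side
# iterates in its own efficient order (linear for IndexLinear arrays,
# Cartesian for everything else), without converting indices between styles.
function zip_copyto!(dest::AbstractArray, src::AbstractArray)
    length(dest) == length(src) || throw(DimensionMismatch("lengths must match"))
    @inbounds for (Idest, Isrc) in zip(eachindex(dest), eachindex(src))
        dest[Idest] = src[Isrc]
    end
    return dest
end
```

The point of `zip` here is that neither array is forced to use the other's index type, which avoids the expensive linear-to-Cartesian index conversion in the mixed-style case.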