
~7x performance regression of BioSequence on Julia 0.5 (RC2) #18135

Closed
bicycle1885 opened this issue Aug 19, 2016 · 12 comments
Labels
help wanted (Indicates that a maintainer wants help on an issue or pull request), performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@bicycle1885
Member

bicycle1885 commented Aug 19, 2016

I'm now working on supporting Julia 0.5 in the Bio.jl package. However, I found a significant performance regression in a very simple benchmark shown below:

benchmark.jl:

using Bio.Seq
using BenchmarkTools

function count_a(seq)
    n = 0
    for nt in seq
        n += nt == DNA_A
    end
    return n
end

srand(1234)
seq = randdnaseq(1_000_000)
println(@benchmark count_a(seq))

println("--- baseline ---")
seq = collect(seq)
println(@benchmark count_a(seq))

Julia 0.4.6 (BioJulia/Bio.jl@f5481fe):

~/.j/v/Bio (julia-v0.5|…) $ julia benchmark.jl
BenchmarkTools.Trial:
  samples:          2526
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     1.79 ms (0.00% GC)
  median time:      1.81 ms (0.00% GC)
  mean time:        1.98 ms (0.00% GC)
  maximum time:     4.12 ms (0.00% GC)
--- baseline ---
BenchmarkTools.Trial:
  samples:          5666
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     806.94 μs (0.00% GC)
  median time:      809.80 μs (0.00% GC)
  mean time:        880.51 μs (0.00% GC)
  maximum time:     3.19 ms (0.00% GC)

Julia 0.5 RC2 (BioJulia/Bio.jl@b50b425):

~/.j/v/Bio (julia-v0.5|…) $ julia5 benchmark.jl
WARNING: Method definition require(Symbol) in module Base at loading.jl:317 overwritten in
 module Main at /Users/kenta/.julia/v0.5/Requires/src/require.jl:12.
BenchmarkTools.Trial:
  samples:          367
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     12.13 ms (0.00% GC)
  median time:      13.39 ms (0.00% GC)
  mean time:        13.62 ms (0.00% GC)
  maximum time:     19.61 ms (0.00% GC)
--- baseline ---
BenchmarkTools.Trial:
  samples:          5004
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     860.65 μs (0.00% GC)
  median time:      953.10 μs (0.00% GC)
  mean time:        997.09 μs (0.00% GC)
  maximum time:     2.83 ms (0.00% GC)

This benchmark calls Bio.Seq.inbounds_getindex for each element in the loop, so I suspect the problem lies in the optimization of that function. The machine code generated on Julia 0.5 looks much more complicated than that of Julia 0.4.

Julia 0.4.6:

julia> using Bio.Seq

julia> @code_native Bio.Seq.inbounds_getindex(randdnaseq(4), 1)
        .section        __TEXT,__text,regular,pure_instructions
Filename: /Users/kenta/.julia/v0.4/Bio/src/seq/bioseq.jl
Source line: 307
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 307
        addq    8(%rdi), %rsi
        leaq    -8(,%rsi,4), %rcx
Source line: 308
        movq    %rcx, %rax
        sarq    $6, %rax
        movq    (%rdi), %rdx
        movq    (%rdx), %rdx
        movq    (%rdx,%rax,8), %rax
        andb    $60, %cl
        shrq    %cl, %rax
        andb    $15, %al
        popq    %rbp
        ret

Julia 0.5 RC2:

julia> using Bio.Seq
WARNING: Method definition require(Symbol) in module Base at loading.jl:317 overwritten in module Main at /Users/kenta/.julia/v0.5/Requires/src/require.jl:12.

julia> @code_native Bio.Seq.inbounds_getindex(randdnaseq(4), 1)
        .section        __TEXT,__text,regular,pure_instructions
Filename: bioseq.jl
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx
        subq    $32, %rsp
        movq    %rsi, %rbx
        movq    %rdi, %r12
        movabsq $4368469648, %r14       ## imm = 0x104618E90
        movabsq $jl_get_ptls_states_fast, %rax
        callq   *%rax
        movq    %rax, %r15
        movq    $0, -40(%rbp)
        movq    $2, -56(%rbp)
        movq    (%r15), %rax
        movq    %rax, -48(%rbp)
        leaq    -56(%rbp), %rax
        movq    %rax, (%r15)
Source line: 301
        addq    8(%r12), %rbx
Source line: 186
        leaq    -8(,%rbx,4), %rax
Source line: 305
        movq    %rax, -64(%rbp)
Source line: 306
        movabsq $index, %rax
        leaq    -64(%rbp), %rdi
        callq   *%rax
        movq    (%r12), %rdx

I may still need to narrow the problem down to a smaller reproducible case; if that would help, I can try later.

@bicycle1885
Member Author

Here is more precise environment information.

Julia 0.4.6:

julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce* (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Julia 0.5 RC2:

julia> versioninfo()
Julia Version 0.5.0-rc2+0
Commit 0350e57* (2016-08-12 11:25 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@yuyichao
Contributor

Seems to be caused by the indexing function failing to inline.

Note that the assembly code looks incomplete; it's usually more useful to post the LLVM IR instead.

@bicycle1885
Member Author

Thank you. Here is the LLVM code:

Julia 0.4.6:

julia> @code_llvm Bio.Seq.inbounds_getindex(randdnaseq(4), 1)

define i8 @julia_inbounds_getindex_21682(%jl_value_t*, i64) {
L9:
  %2 = getelementptr %jl_value_t* %0, i64 1
  %3 = bitcast %jl_value_t* %2 to i64*
  %4 = load i64* %3, align 8
  %5 = add i64 %4, %1
  %6 = shl i64 %5, 2
  %7 = add i64 %6, -8
  %8 = getelementptr inbounds %jl_value_t* %0, i64 0, i32 0
  %9 = load %jl_value_t** %8, align 8
  %10 = ashr i64 %7, 6
  %11 = bitcast %jl_value_t* %9 to i8**
  %12 = load i8** %11, align 8
  %13 = bitcast i8* %12 to i64*
  %14 = getelementptr i64* %13, i64 %10
  %15 = load i64* %14, align 8
  %16 = and i64 %7, 60
  %17 = lshr i64 %15, %16
  %.tr = trunc i64 %17 to i8
  %18 = and i8 %.tr, 15
  ret i8 %18
}

Julia 0.5 RC2:

julia> @code_llvm Bio.Seq.inbounds_getindex(randdnaseq(4), 1)

define i8 @julia_inbounds_getindex_65982(%jl_value_t*, i64) #0 {
L5:
  %2 = call %jl_value_t*** @jl_get_ptls_states() #1
  %3 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %3, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %5 = bitcast [3 x %jl_value_t*]* %3 to i64*
  store i64 2, i64* %5, align 8
  %6 = bitcast %jl_value_t*** %2 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %3, i64 0, i64 1
  %9 = bitcast %jl_value_t** %8 to i64*
  store i64 %7, i64* %9, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %2, align 8
  %j = alloca %BitIndex, align 8
  %10 = getelementptr %jl_value_t, %jl_value_t* %0, i64 1
  %11 = bitcast %jl_value_t* %10 to i64*
  %12 = load i64, i64* %11, align 8
  %13 = add i64 %12, %1
  %14 = shl i64 %13, 2
  %15 = add i64 %14, -8
  %16 = getelementptr inbounds %BitIndex, %BitIndex* %j, i64 0, i32 0
  store i64 %15, i64* %16, align 8
  %17 = call i64 @julia_index_65974(%BitIndex* nonnull %j) #0
  %18 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 0, i32 0
  %19 = load %jl_value_t*, %jl_value_t** %18, align 8
  store %jl_value_t* %19, %jl_value_t** %4, align 8
  %20 = load i64, i64* %16, align 8
  %21 = and i64 %20, 63
  %22 = bitcast %jl_value_t* %19 to i64**
  %23 = load i64*, i64** %22, align 8
  %24 = add i64 %17, -1
  %25 = getelementptr i64, i64* %23, i64 %24
  %26 = load i64, i64* %25, align 8
  %27 = lshr i64 %26, %21
  %28 = call i64 @julia_mask_65975(i64 4) #0
  %29 = and i64 %27, %28
  %30 = and i64 %29, 255
  %31 = icmp eq i64 %30, %29
  br i1 %31, label %L7, label %fail

L7:                                               ; preds = %L5
  %32 = trunc i64 %29 to i8
  %33 = icmp ult i8 %32, 16
  br i1 %33, label %L8, label %if13

L8:                                               ; preds = %L7
  %34 = load i64, i64* %9, align 8
  store i64 %34, i64* %6, align 8
  ret i8 %32

fail:                                             ; preds = %L5
  call void @jl_throw(%jl_value_t* inttoptr (i64 4586147472 to %jl_value_t*))
  unreachable

if13:                                             ; preds = %L7
  %35 = bitcast %jl_value_t*** %2 to i8*
  %36 = call %jl_value_t* @jl_gc_pool_alloc(i8* %35, i32 1384, i32 16)
  %37 = getelementptr inbounds %jl_value_t, %jl_value_t* %36, i64 -1, i32 0
  store %jl_value_t* inttoptr (i64 4787371024 to %jl_value_t*), %jl_value_t** %37, align 8
  %38 = bitcast %jl_value_t* %36 to i8*
  store i8 %32, i8* %38, align 8
  call void @jl_throw(%jl_value_t* %36)
  unreachable
}

@ViralBShah added the performance (Must go faster) and potential benchmark (Could make a good benchmark in BaseBenchmarks) labels Aug 19, 2016
@ViralBShah added this to the 0.5.x milestone Aug 19, 2016
@JeffBezanson
Member

I believe this is due to the introduction of branches into the bit shift operators in 5a9717b. Well, now you know why languages like to have undefined behaviors :)

I'll try putting inline declarations on them, and/or switching the branch to ifelse.
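
The difference can be sketched with simplified, illustrative definitions (not Base's actual ones): the branch form puts control flow into a tiny hot function, while `ifelse` evaluates both arms and lets LLVM select the result without a branch.

```julia
# Guarded logical right shift, two ways. Illustrative sketches only,
# not the actual Base definitions.

# Branch form: correct, but the explicit branch in a one-line hot
# function can hurt inlining and vectorization in callers.
shr_branch(x::UInt64, n::UInt64) = n >= 64 ? zero(UInt64) : x >> (n & 63)

# ifelse form: both arms are evaluated unconditionally and one result
# is selected, so LLVM can emit a select/cmov with no control flow.
shr_ifelse(x::UInt64, n::UInt64) = ifelse(n >= 64, zero(UInt64), x >> (n & 63))
```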

@JeffBezanson
Member

Using ifelse instead of a branch helps a lot, but it looks like we still have an unnecessary gc frame (#15369):

julia> @code_llvm Bio.Seq.inbounds_getindex(randdnaseq(4), 1)

define i8 @julia_inbounds_getindex_67106(%jl_value_t*, i64) #0 {
L2:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-2672, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #3
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
  %2 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %2, i64 0, i64 0
  %3 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %2, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %3, align 8
  %4 = bitcast [3 x %jl_value_t*]* %2 to i64*
  store i64 2, i64* %4, align 8
  %5 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %2, i64 0, i64 1
  %6 = bitcast i8* %ptls_i8 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = bitcast %jl_value_t** %5 to i64*
  store i64 %7, i64* %8, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %ptls, align 8
  %9 = getelementptr %jl_value_t, %jl_value_t* %0, i64 1
  %10 = bitcast %jl_value_t* %9 to i64*
  %11 = load i64, i64* %10, align 8
  %12 = add i64 %11, %1
  %13 = shl i64 %12, 2
  %14 = add i64 %13, -8
  %15 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 0, i32 0
  %16 = load %jl_value_t*, %jl_value_t** %15, align 8
  store %jl_value_t* %16, %jl_value_t** %3, align 8
  %17 = bitcast %jl_value_t* %16 to i64**
  %18 = load i64*, i64** %17, align 8
  %19 = ashr i64 %14, 6
  %20 = getelementptr i64, i64* %18, i64 %19
  %21 = load i64, i64* %20, align 8
  %22 = and i64 %14, 60
  %23 = lshr i64 %21, %22
  %.tr = trunc i64 %23 to i8
  %24 = and i8 %.tr, 15
  %25 = load i64, i64* %8, align 8
  store i64 %25, i64* %6, align 8
  ret i8 %24
}

@timholy
Member

timholy commented Aug 19, 2016

If you can use Unsigned numbers in the shift operation, that would sidestep the issue.
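
A small demo of why the unsigned case is cheaper to guard (this illustrates `>>`'s documented semantics, not Base internals): a signed count needs an extra case because negative counts shift the other way, while an unsigned count only needs the count-too-large guard.

```julia
# With a signed count, >> must handle a third case: negative counts
# shift in the OTHER direction.
@assert (8 >> -1) == 16            # negative signed count shifts left
@assert (8 >> 1)  == 4

# With an unsigned count that case cannot occur; only the
# count >= nbits guard remains, and oversized counts saturate to zero.
@assert (8 >> UInt(1)) == 4
@assert (UInt8(0xff) >> UInt(10)) == 0x00
```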

@JeffBezanson added the regression (Regression in behavior compared to a previous version) label Aug 19, 2016
JeffBezanson added a commit that referenced this issue Aug 19, 2016
replace branch in bit shift operators, helps #18135
tkelman pushed a commit that referenced this issue Aug 20, 2016
@bicycle1885
Member Author

bicycle1885 commented Aug 20, 2016

The RC3 branch (tk/backports-0.5.0-rc3), which includes e02692f, improved the performance significantly, but I still see a performance regression of about 4x:

~/.j/v/Bio (julia-v0.5|…) $ ~/vendor/julia/julia benchmark.jl
WARNING: Method definition require(Symbol) in module Base at loading.jl:317 overwritten in
 module Main at /Users/kenta/.julia/v0.5/Requires/src/require.jl:12.
BenchmarkTools.Trial:
  samples:          672
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     6.82 ms (0.00% GC)
  median time:      6.97 ms (0.00% GC)
  mean time:        7.44 ms (0.00% GC)
  maximum time:     11.49 ms (0.00% GC)
--- baseline ---
BenchmarkTools.Trial:
  samples:          5313
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     860.54 μs (0.00% GC)
  median time:      864.82 μs (0.00% GC)
  mean time:        940.22 μs (0.00% GC)
  maximum time:     3.43 ms (0.00% GC)

@ViralBShah
Member

@JeffBezanson Are we going to be able to do something here for the 0.5 release?

@timholy
Member

timholy commented Aug 29, 2016

I don't think it's (any longer) the bit-shift operators in Base:

Julia 0.4:

julia> using BenchmarkTools

julia> function foo(r, n)
           s = zero(eltype(r))
           for i in r
               s += i<<n
           end
           s
       end
foo (generic function with 1 method)

julia> @benchmark foo(1:10^6, UInt(3))
BenchmarkTools.Trial: 
  samples:          4606
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     1.00 ms (0.00% GC)
  median time:      1.04 ms (0.00% GC)
  mean time:        1.08 ms (0.00% GC)
  maximum time:     1.95 ms (0.00% GC)

julia> @benchmark foo(1:10^6, 3)
BenchmarkTools.Trial: 
  samples:          4789
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     1.00 ms (0.00% GC)
  median time:      1.04 ms (0.00% GC)
  mean time:        1.04 ms (0.00% GC)
  maximum time:     1.63 ms (0.00% GC)

julia> @benchmark foo(UInt(1):UInt(10^6), UInt(3))
BenchmarkTools.Trial: 
  samples:          4885
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     1.00 ms (0.00% GC)
  median time:      1.02 ms (0.00% GC)
  mean time:        1.02 ms (0.00% GC)
  maximum time:     2.09 ms (0.00% GC)

Master:

julia> @benchmark foo(1:10^6, UInt(3))
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     163.74 μs (0.00% GC)
  median time:      169.65 μs (0.00% GC)
  mean time:        172.10 μs (0.00% GC)
  maximum time:     245.19 μs (0.00% GC)

julia> @benchmark foo(1:10^6, 3)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     125.74 μs (0.00% GC)
  median time:      130.16 μs (0.00% GC)
  mean time:        132.26 μs (0.00% GC)
  maximum time:     164.10 μs (0.00% GC)

julia> @benchmark foo(UInt(1):UInt(10^6), UInt(3))
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     163.74 μs (0.00% GC)
  median time:      175.87 μs (0.00% GC)
  mean time:        174.41 μs (0.00% GC)
  maximum time:     460.21 μs (0.00% GC)

I suspect this is a package-level detail; perhaps one needs to dispatch to a different version of the operator?
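
One way a package could dispatch onto the cheaper path, sketched with hypothetical names (this is not Bio.jl's actual code): keep the bit offset unsigned so the hot shift only ever sees an unsigned count.

```julia
# Hypothetical package-level workaround (illustrative names, not
# Bio.jl's actual code): an unsigned bit offset means the shift takes
# the cheap unsigned-count path with no negative-count case to guard.
@inline function packed_get(data::Vector{UInt64}, i::Int)
    bitidx = UInt64(i - 1) << 2       # 4 bits per symbol
    word   = (bitidx >> 6) % Int + 1  # 1-based word index
    off    = bitidx & 0x3f            # unsigned shift count in [0, 63]
    @inbounds (data[word] >> off) % UInt8 & 0x0f
end
```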

@bicycle1885
Member Author

I'll check it again on RC3 or 4.

mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
@StefanKarpinski added and then removed the help wanted (Indicates that a maintainer wants help on an issue or pull request) label Oct 27, 2016
@JeffBezanson
Member

Just tried this quickly on 0.6 and the numbers seem pretty close. Is this still an issue?

@bicycle1885
Member Author

No. Some benchmarks suggest there is no longer any performance degradation.

@KristofferC removed the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Oct 31, 2018
7 participants