
Non-contiguous tensor iteration optimization #659

Merged: 6 commits into mratsim:master on Jul 17, 2024

Conversation

@darkestpigeon (Contributor) commented Jul 10, 2024

What?

Speeding up iteration over a single non-contiguous tensor by looping explicitly over the last two dimensions. (Also changed reshape to use map_inline instead of apply2_inline.)

Why?

This operation is key to making non-contiguous tensors contiguous, and all other operations are typically much faster on contiguous tensors. The performance difference before and after the optimization can be 10x or more in some cases.

How?

Calling advanceStridedIteration on every step prevents proper vectorization, so instead we loop explicitly over the last two axes. This change is almost trivial when we iterate over a complete tensor, but is a bit tricky when iter_offset != 0 or iter_size < t.size. Most of the code handles the "ragged" ends of the tensor.
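
For illustration, a minimal standalone sketch of the loop structure (plain Nim seqs instead of Arraymancer's tensor type; rank >= 2 is assumed, and none of the iter_offset/iter_size or ragged-end handling of the real stridedIterationLoop is shown):

# Minimal sketch, not the actual stridedIterationLoop: materialize a strided
# view by looping explicitly over its last two axes. The leading axes are
# advanced with odometer-style counters; the innermost loop has a uniform
# stride, so the compiler can vectorize it.
proc copyStrided[T](data: seq[T], shape, strides: seq[int], offset: int): seq[T] =
  assert shape.len >= 2
  let prevLen = shape[^2]
  let lastLen = shape[^1]
  let prevStride = strides[^2]
  let lastStride = strides[^1]
  var blocks = 1                          # number of (prev, last) planes
  for i in 0 ..< shape.len - 2:
    blocks *= shape[i]
  result = newSeq[T](blocks * prevLen * lastLen)
  var outer = newSeq[int](shape.len - 2)  # counters for the leading axes
  var dst = 0
  for b in 0 ..< blocks:
    var base = offset                     # flat offset of the current plane
    for i in 0 ..< outer.len:
      base += outer[i] * strides[i]
    for j in 0 ..< prevLen:               # explicit loop over the second-to-last axis
      let rowStart = base + j * prevStride
      for k in 0 ..< lastLen:             # innermost loop: uniform stride
        result[dst] = data[rowStart + k * lastStride]
        inc dst
    for i in countdown(outer.len - 1, 0): # advance the leading-axis counters
      inc outer[i]
      if outer[i] < shape[i]: break
      outer[i] = 0

when isMainModule:
  # a 3x4 row-major matrix viewed transposed: shape (4, 3), strides (1, 4)
  let data = @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
  doAssert copyStrided(data, @[4, 3], @[1, 4], 0) == @[0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]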

We also reduce the rank of the tensor by coalescing axes together where possible. Contiguous and uniformly-strided tensors become rank-1, while non-contiguous tensors with non-uniform strides are at least rank-2, so they always have two axes to loop over. Coalescing also makes the last two axes as large as possible, so the gain from looping is maximal.
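
A rough standalone sketch of the coalescing step (again plain seqs, not the actual implementation; rank > 0 assumed): adjacent axes i and i+1 can be merged whenever strides[i] == shape[i+1] * strides[i+1], i.e. stepping once along axis i lands exactly where shape[i+1] steps along axis i+1 would.

# Minimal sketch: merge adjacent axes whose strides line up, so the coalesced
# view has as few (and as large) trailing axes as possible.
proc coalesceAxes(shape, strides: seq[int]): (seq[int], seq[int]) =
  var newShape = @[shape[0]]
  var newStrides = @[strides[0]]
  for i in 1 ..< shape.len:
    if newStrides[^1] == shape[i] * strides[i]:
      # axis i continues the previous axis without a gap: merge them
      newShape[^1] *= shape[i]
      newStrides[^1] = strides[i]
    else:
      newShape.add(shape[i])
      newStrides.add(strides[i])
  (newShape, newStrides)

when isMainModule:
  # a C-contiguous (2, 3, 4) tensor collapses to a single axis of 24 elements
  doAssert coalesceAxes(@[2, 3, 4], @[12, 4, 1]) == (@[24], @[1])
  # a (1000, 1000) slice of a (1920, 1080) buffer keeps its two axes
  doAssert coalesceAxes(@[1000, 1000], @[1080, 1]) == (@[1000, 1000], @[1080, 1])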

When the last axes have a very small number of elements, specialization is used to remove the loops completely at compile time.

Benchmark

Code

bench.nim
import algorithm
import std/times
import std/strutils
import arraymancer


template echo_code(body: untyped) =
  echo astToStr(body)
  body

template timeit(body: untyped): untyped =
  block:
    var t = 0.0
    var t_values: seq[float]
    while t_values.len < 10 or t < 1:
      let start_t = cpuTime()
      block:
        body
      let end_t = cpuTime()
      t_values.add(end_t - start_t)
      t += end_t - start_t

    t_values.sort()

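    # report the midpoint of the ~10th..90th percentile interval, plus/minus half its width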
    let low_p = (t_values.len.float * 0.1).round.int
    let high_p = (t_values.len.float * 0.9).round.int - 1

    let mean = 0.5*(t_values[low_p] + t_values[high_p])
    let err = 0.5*(t_values[high_p] - t_values[low_p])

    echo astToStr(body)
    echo '\t',
      (1e3*mean).formatFloat(ffDecimal, 5), " \u00B1 ",
      (1e3*err).formatFloat(ffDecimal, 5), " ms"

block:
  echo_code:
    let x = randomTensor(1_000_000, max=1000).astype(float)
    let y = sin(x)
    assert block:
      var equal = true
      for i in 0..<1_000_000:
        equal = equal and (y[i] == sin(x[i]))
      equal

  timeit:
    discard sin(x)

block:
  echo_code:
    let x = randomTensor(1920, 1080, 3, max=255).astype(uint8)
    assert x == x.clone()

  timeit:
    discard x.clone()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 3, max=255).astype(uint8)
    let y = x[1..999, 1..999]
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 3, max=255).astype(uint8)
    let y = x[1..999, 1..999, 0..1]
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 3, 3, max=255).astype(uint8)
    let y = x[1..999, 1..999, 0..1, 1..2]
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 5, max=255).astype(uint8)
    let y = x[1..999, 1..999, 1..3]
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 20, max=255).astype(uint8)
    let y = x[1..999, 1..999, 3..6]
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 3, max=255).astype(uint8)
    let y = x.permute(2, 0, 1)
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = randomTensor(1920, 1080, 3, max=255).astype(uint8)
    let y = x[_, _, _.._|-1]
    assert y == y.asContiguous()

  timeit:
    discard y.asContiguous()

block:
  echo_code:
    let x = arange(0, 15*411*44).astype(float32).reshape(15, 411, 44).permute(1, 0, 2) #(411, 15, 44)
    let y = x.reshape(411, 5, 132) # now contiguous

    assert x == y.reshape(411, 15, 44) # non-copying, stridedIteration not involved

  timeit:
    discard x.reshape(411, 5, 132)

Results (for reference)

With -d:release -d:danger

original
let x = randomTensor(1000000, max = 1000).astype(float)
let y = sin(x)
assert block:
  var equal = true
  for i in 0 ..< 1000000:
    equal = equal and (y[i] == sin(x[i]))
  equal

discard sin(x)
        10.31089 ± 0.05542 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
assert x == x.clone()

discard x.clone()
        0.16996 ± 0.00206 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999]
assert y == y.asContiguous()

discard y.asContiguous()
        1.86853 ± 0.02510 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1]
assert y == y.asContiguous()

discard y.asContiguous()
        1.32417 ± 0.01723 ms

let x = randomTensor(1920, 1080, 3, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1, 1 .. 2]
assert y == y.asContiguous()

discard y.asContiguous()
        2.86708 ± 0.04316 ms

let x = randomTensor(1920, 1080, 5, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 1 .. 3]
assert y == y.asContiguous()

discard y.asContiguous()
        1.87422 ± 0.02434 ms

let x = randomTensor(1920, 1080, 20, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 3 .. 6]
assert y == y.asContiguous()

discard y.asContiguous()
        2.40025 ± 0.09084 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x.permute(2, 0, 1)
assert y == y.asContiguous()

discard y.asContiguous()
        7.95384 ± 0.05741 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[_, _, _ .. _ |- 1]
assert y == y.asContiguous()

discard y.asContiguous()
        4.89026 ± 0.04802 ms

let x = arange(0, 15 * 411 * 44).astype(float32).reshape(15, 411, 44).permute(1, 0, 2)
let y = x.reshape(411, 5, 132)
assert x == y.reshape(411, 15, 44)

discard x.reshape(411, 5, 132)
        0.33438 ± 0.02463 ms
optimized
let x = randomTensor(1000000, max = 1000).astype(float)
let y = sin(x)
assert block:
  var equal = true
  for i in 0 ..< 1000000:
    equal = equal and (y[i] == sin(x[i]))
  equal

discard sin(x)
        10.46194 ± 0.09679 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
assert x == x.clone()

discard x.clone()
        0.17786 ± 0.00987 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999]
assert y == y.asContiguous()

discard y.asContiguous()
        0.08712 ± 0.00139 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1]
assert y == y.asContiguous()

discard y.asContiguous()
        0.24955 ± 0.01521 ms

let x = randomTensor(1920, 1080, 3, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1, 1 .. 2]
assert y == y.asContiguous()

discard y.asContiguous()
        1.41385 ± 0.00954 ms

let x = randomTensor(1920, 1080, 5, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 1 .. 3]
assert y == y.asContiguous()

discard y.asContiguous()
        0.36874 ± 0.01580 ms

let x = randomTensor(1920, 1080, 20, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 3 .. 6]
assert y == y.asContiguous()

discard y.asContiguous()
        1.53922 ± 0.04587 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x.permute(2, 0, 1)
assert y == y.asContiguous()

discard y.asContiguous()
        1.14373 ± 0.04919 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[_, _, _ .. _ |- 1]
assert y == y.asContiguous()

discard y.asContiguous()
        0.84792 ± 0.03781 ms

let x = arange(0, 15 * 411 * 44).astype(float32).reshape(15, 411, 44).permute(1, 0, 2)
let y = x.reshape(411, 5, 132)
assert x == y.reshape(411, 15, 44)

discard x.reshape(411, 5, 132)
        0.02671 ± 0.00100 ms

With -d:release -d:danger -d:openmp --exceptions:setjmp

original
let x = randomTensor(1000000, max = 1000).astype(float)
let y = sin(x)
assert block:
  var equal = true
  for i in 0 ..< 1000000:
    equal = equal and (y[i] == sin(x[i]))
  equal

discard sin(x)
        0.72820 ± 0.00724 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
assert x == x.clone()

discard x.clone()
        0.02334 ± 0.00027 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999]
assert y == y.asContiguous()

discard y.asContiguous()
        0.19307 ± 0.00054 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1]
assert y == y.asContiguous()

discard y.asContiguous()
        0.12059 ± 0.00033 ms

let x = randomTensor(1920, 1080, 3, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1, 1 .. 2]
assert y == y.asContiguous()

discard y.asContiguous()
        0.31961 ± 0.00707 ms

let x = randomTensor(1920, 1080, 5, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 1 .. 3]
assert y == y.asContiguous()

discard y.asContiguous()
        0.19341 ± 0.00065 ms

let x = randomTensor(1920, 1080, 20, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 3 .. 6]
assert y == y.asContiguous()

discard y.asContiguous()
        0.28808 ± 0.01241 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x.permute(2, 0, 1)
assert y == y.asContiguous()

discard y.asContiguous()
        0.42130 ± 0.01610 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[_, _, _ .. _ |- 1]
assert y == y.asContiguous()

discard y.asContiguous()
        0.39559 ± 0.00167 ms

let x = arange(0, 15 * 411 * 44).astype(float32).reshape(15, 411, 44).permute(1, 0, 2)
let y = x.reshape(411, 5, 132)
assert x == y.reshape(411, 15, 44)

discard x.reshape(411, 5, 132)
        0.02266 ± 0.00045 ms
optimized
let x = randomTensor(1000000, max = 1000).astype(float)
let y = sin(x)
assert block:
  var equal = true
  for i in 0 ..< 1000000:
    equal = equal and (y[i] == sin(x[i]))
  equal

discard sin(x)
        0.72824 ± 0.00665 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
assert x == x.clone()

discard x.clone()
        0.02336 ± 0.00024 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999]
assert y == y.asContiguous()

discard y.asContiguous()
        0.01309 ± 0.00022 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1]
assert y == y.asContiguous()

discard y.asContiguous()
        0.02233 ± 0.00099 ms

let x = randomTensor(1920, 1080, 3, 3, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 0 .. 1, 1 .. 2]
assert y == y.asContiguous()

discard y.asContiguous()
        0.08743 ± 0.00033 ms

let x = randomTensor(1920, 1080, 5, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 1 .. 3]
assert y == y.asContiguous()

discard y.asContiguous()
        0.03073 ± 0.00067 ms

let x = randomTensor(1920, 1080, 20, max = 255).astype(uint8)
let y = x[1 .. 999, 1 .. 999, 3 .. 6]
assert y == y.asContiguous()

discard y.asContiguous()
        0.18865 ± 0.00334 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x.permute(2, 0, 1)
assert y == y.asContiguous()

discard y.asContiguous()
        0.10158 ± 0.00160 ms

let x = randomTensor(1920, 1080, 3, max = 255).astype(uint8)
let y = x[_, _, _ .. _ |- 1]
assert y == y.asContiguous()

discard y.asContiguous()
        0.07126 ± 0.00276 ms

let x = arange(0, 15 * 411 * 44).astype(float32).reshape(15, 411, 44).permute(1, 0, 2)
let y = x.reshape(411, 5, 132)
assert x == y.reshape(411, 15, 44)

discard x.reshape(411, 5, 132)
        0.00654 ± 0.00026 ms

@mratsim (Owner) left a comment

This is very nice, thank you!

I think we need in-code comments to describe the algorithm, which would also be helpful for future maintenance.

@@ -166,25 +207,133 @@ template stridedIterationYield*(strider: IterKind, data, i, iter_pos: typed) =
elif strider == IterKind.Iter_Values: yield (i, data[iter_pos])
elif strider == IterKind.Offset_Values: yield (iter_pos, data[iter_pos]) ## TODO: remove workaround for C++ backend

template stridedIterationLoop*(strider: IterKind, data, t, iter_offset, iter_size, prev_d, last_d: typed) =
@mratsim (Owner):

This probably needs a comment describing the algorithm

@darkestpigeon (Contributor, Author):

Refactored the code a bit and added a comment

@darkestpigeon (Contributor, Author) commented Jul 17, 2024

By the way, the same optimization can be applied to dualStridedIteration and tripleStridedIteration in the case when the tensor shapes can be broadcast to the same shape. This should bring similar speed-ups to operations like apply2_inline when non-contiguous tensors are involved.

Also, are 0-rank tensors supported? And can they be iterated upon? Right now I added an assert to make sure that the rank is >0.

@mratsim (Owner) commented Jul 17, 2024

> Also, are 0-rank tensors supported? And can they be iterated upon? Right now I added an assert to make sure that the rank is >0.

iirc I tried to make them work and ran into nasty compiler issues, but that was 7+ years ago. It's fine to assume they're unsupported for now.

@mratsim (Owner) commented Jul 17, 2024

> By the way, the same optimization can be applied to dualStridedIteration and tripleStridedIteration in the case when the tensor shapes can be broadcast to the same shape. This should bring similar speed-ups to operations like apply2_inline when non-contiguous tensors are involved.

Note that I have a refactoring to allow parallel iteration on a variadic number of tensors here: https://github.com/mratsim/Arraymancer/blob/v0.7.32/src/arraymancer/laser/strided_iteration/foreach_common.nim#L101-L119

Instead of doing the dual/triple versions, it would be more future-proof to modify those and then replace the old iteration procs.

This was motivated by GRU / LSTM needing iterations on 4 tensors at once: https://github.com/mratsim/Arraymancer/blob/v0.7.32/src/arraymancer/nn_primitives/nnp_gru.nim#L138-L142

@darkestpigeon (Contributor, Author):

Cool, I'll check out the variadic version. Might take a while, I'm new to Nim and I see that the code is macro-heavy.
Also, regarding the iteration: is it safe to assume that all the tensors we're iterating on have compatible shapes (e.g. can be broadcast to a common shape)?

@mratsim (Owner) commented Jul 17, 2024

> Is it safe to assume that all the tensors we're iterating on have compatible shapes (e.g. can be broadcast to a common shape)?

No, we can add a check and fall back to a slow path.

> Might take a while, I'm new to Nim and I see that the code is macro-heavy.

No worries, unfortunately that was iirc the cleanest solution.

The reference code I used while generalizing the macros is this one: https://github.com/mratsim/laser/blob/master/benchmarks/loop_iteration/iter05_fusedpertensor.nim#L9-L143

@mratsim merged commit 35adfc1 into mratsim:master on Jul 17, 2024
@darkestpigeon deleted the non-cont-iteration-optimization branch on August 20, 2024, 19:58
Vindaar added a commit that referenced this pull request on Sep 20, 2024:

    By not resetting the offset here, operating on a Tensor view without
    cloning could cause undefined behavior, because we would be accessing
    elements outside the tensor buffer.

Vindaar added a commit that referenced this pull request on Sep 20, 2024
@Vindaar (Collaborator) commented Sep 20, 2024

This PR caused a small regression related to reshape that was not caught due to the broken CI. I fixed it in #666. The issue was that the new reshape_with_copy implementation did not reset the offset of the input tensor.
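
To make the failure mode concrete, a tiny hypothetical sketch (plain seqs, not the actual reshape_with_copy code): the copy owns a fresh contiguous buffer, so its offset must be reset to 0; carrying the source's non-zero offset over makes later accesses index past the end of that buffer.

# Hypothetical sketch of the offset bug, not Arraymancer code.
type View = object
  data: seq[int]
  offset: int
  size: int

proc materialize(v: View): View =
  var buf = newSeq[int](v.size)
  for i in 0 ..< v.size:
    buf[i] = v.data[v.offset + i]
  # Correct: the copy starts at element 0 of its own buffer. Reusing v.offset
  # here would make later accesses of the form data[offset + i] read past the
  # end of the freshly allocated buffer.
  View(data: buf, offset: 0, size: v.size)

when isMainModule:
  let base = @[10, 11, 12, 13, 14, 15]
  let v = View(data: base, offset: 3, size: 3)  # view of the last 3 elements
  let m = materialize(v)
  for i in 0 ..< m.size:
    doAssert m.data[m.offset + i] == v.data[v.offset + i]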

Vindaar added a commit that referenced this pull request on Sep 20, 2024:
* explicitly allow `openArray` in `[]`, `[]=` for tensors

This was simply an oversight obviously

* fix CI by compiling tests with `-d:ssl`

* need a space, duh

* use AWS mirror from PyTorch for MNIST download

* fix regression caused by PR #659

By not resetting the offset here, operating on a Tensor view without
cloning could cause undefined behavior, because we would be accessing
elements outside the tensor buffer.

* add test case for regression of #659