Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select function on CPU takes 10% of time on tiny students, can be optimized #684

Open
kpu opened this issue Jul 25, 2020 · 2 comments · May be fixed by #687
Open

Select function on CPU takes 10% of time on tiny students, can be optimized #684

kpu opened this issue Jul 25, 2020 · 2 comments · May be fixed by #687
Assignees

Comments

@kpu
Copy link
Member

kpu commented Jul 25, 2020

Just ran a profiler on the enes student http://statmt.org/bergamot/models/ . This was in the intgemm_reintegrated_computestats branch, but the Select function is the same in master.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 35.25     21.98    21.98                             void intgemm::OMPParallelWrap<intgemm::callbacks::UnquantizeAndAddBiasAndWrite, intgemm::AVX512VNNI_8bit, signed char>(signed char const*, signed char const*, unsigned int, unsigned int, unsigned int, intgemm::callbacks::UnquantizeAndAddBiasAndWrite)
 10.79     28.71     6.73                             marian::cpu::Select(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, int)
  8.52     34.02     5.31                             mkl_blas_avx512_sgemm_kernel_0_b0
  7.84     38.91     4.89                             mkl_blas_avx512_sgemm_scopy_right48_ea
  5.79     42.52     3.61                             marian::cpu::LayerNormalization(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, float)
  5.20     45.76     3.24                             mkl_blas_avx512_xsgemv

That Select function is 10.79% of the time:

void Select(Tensor out,

And as the comments say, an optimized version is TODO. All of the calls in the student are axis=2. There was an attempt at optimization for this case but it's been disabled and the comments claim it doesn't work.

#if 0 // buggy but not really used?
if(axisCPU == 2 && outShape == idxShape) // specialization for axis==2 when there is no broadcasting, @TODO to be removed once we have a faster implementation below
return SelectAxis2(out, in, indices);
#endif

Also in all the student calls, outShape != inShape so indeed that version wouldn't be called.

@emjotde
Copy link
Member

emjotde commented Jul 25, 2020

I think the Element(...) function might be a good template how to approach this. There I am using template meta-programming to unroll over N (template parameter) number of dimensions. Computation of the current index is progressive, therefor a lot cheaper than re-calculating from the elemental dimensions at each time step.

https://github.com/marian-nmt/marian-dev/blob/master/src/tensors/cpu/element.h

Using that approach it should also be possible to write one solution for all possibilities.

@kpu
Copy link
Member Author

kpu commented Jul 25, 2020

If I'm allowed to assume indices is 1x1x...x (axis) x1x1... and consecutive memory then

void Select(Tensor out,
            const Tensor in,
            const Tensor indices,
            int axis) {
  matchOrAbort<IndexType>(indices->type());
  
  functional::Shape outShape = out->shape();
  functional::Shape inShape  = in->shape();
  functional::Shape idxShape = indices->shape();
  
  int axisCPU = (int)(axis + functional::Shape::size() - out->shape().size());
  
  // Reduce the problem to beforeAxis x idxShape[axisCPU] x afterAxis.
  int beforeAxis = 1;
  for (int i = 0; i < axisCPU; ++i) {
    beforeAxis *= outShape[i];
  }
  int afterAxis = 1;
  for (int i = axisCPU + 1; i < functional::Shape::size(); ++i) {
    afterAxis *= outShape[i];
  }
  // Stride to use for the beforeAxis dimension in the input and output tensors.
  int inBeforeStride = axisCPU ? inShape.stride(axisCPU - 1) : inShape.elements();
  int outBeforeStride = axisCPU ? outShape.stride(axisCPU - 1) : outShape.elements();

  for (int beforeIdx = 0; beforeIdx < beforeAxis; ++beforeIdx) {
    for (int axisIdx = 0; axisIdx < idxShape[axisCPU]; ++axisIdx) {
      // This is the value to read along.
      int index = indices->data<IndexType>()[axisIdx];
      auto inBase = in->data() + beforeIdx * inBeforeStride + index * afterAxis;
      auto outBase = out->data() + beforeIdx * outBeforeStride + axisIdx * afterAxis;
      std::copy(inBase, inBase + afterAxis, outBase);
    }
  }
}

In the profiler it reduces to:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.02     55.35     0.01                             marian::cpu::Select(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, int)

What assumptions are allowed here?

kpu added a commit that referenced this issue Jul 26, 2020
Fixes #684

Measured:
enes.student.tiny11
xzcat sources.shuf.xz |head -n 10000
var (Cascade Lake) single core
based on intgemm_reintegrated_computestats branch

Before Total time: 66.69077s wall
After Total time: 61.20206s wall
@kpu kpu linked a pull request Jul 26, 2020 that will close this issue
4 tasks
@kpu kpu assigned kpu and unassigned afaji Jul 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants