Select function on CPU takes 10% of time on tiny students, can be optimized #684

kpu · 2020-07-25T11:27:00Z

Just ran a profiler on the enes student (http://statmt.org/bergamot/models/). This was in the intgemm_reintegrated_computestats branch, but the Select function is the same in master.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 35.25     21.98    21.98                             void intgemm::OMPParallelWrap<intgemm::callbacks::UnquantizeAndAddBiasAndWrite, intgemm::AVX512VNNI_8bit, signed char>(signed char const*, signed char const*, unsigned int, unsigned int, unsigned int, intgemm::callbacks::UnquantizeAndAddBiasAndWrite)
 10.79     28.71     6.73                             marian::cpu::Select(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, int)
  8.52     34.02     5.31                             mkl_blas_avx512_sgemm_kernel_0_b0
  7.84     38.91     4.89                             mkl_blas_avx512_sgemm_scopy_right48_ea
  5.79     42.52     3.61                             marian::cpu::LayerNormalization(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, float)
  5.20     45.76     3.24                             mkl_blas_avx512_xsgemv

That Select function is 10.79% of the time:

marian-dev/src/tensors/cpu/tensor_operators.cpp

Line 692 in b28905a

void Select(Tensor out,

And as the comments say, an optimized version is TODO. All of the calls in the student are axis=2. There was an attempt at optimization for this case but it's been disabled and the comments claim it doesn't work.

marian-dev/src/tensors/cpu/tensor_operators.cpp

Lines 708 to 711 in b28905a

    
    #if 0 // buggy but not really used?
      if(axisCPU == 2 && outShape == idxShape) // specialization for axis==2 when there is no broadcasting, @TODO to be removed once we have a faster implementation below
        return SelectAxis2(out, in, indices);
    #endif

Also in all the student calls, outShape != inShape so indeed that version wouldn't be called.


emjotde · 2020-07-25T22:54:10Z

I think the Element(...) function might be a good template for how to approach this. There I am using template meta-programming to unroll over N (a template parameter) dimensions. The current index is computed progressively, which is a lot cheaper than recalculating it from the per-dimension indices at each step.

https://github.com/marian-nmt/marian-dev/blob/master/src/tensors/cpu/element.h

Using that approach it should also be possible to write one solution for all possibilities.
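As a rough illustration of the progressive index idea (this is not the code in element.h; the names `Loop` and `collectOffsets` are hypothetical and only for this sketch), the flat offset can be carried through a compile-time-unrolled loop nest and updated with one addition per step, instead of being recomputed from all coordinates for every element:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not Marian's API. Loop<K, N> handles the last K of N
// dimensions; the recursion over K is resolved at compile time.
template <std::size_t K, std::size_t N>
struct Loop {
  template <class F>
  static void run(const std::array<int, N>& shape,
                  const std::array<int, N>& strides,
                  int offset, F&& f) {
    for (int i = 0; i < shape[N - K]; ++i) {
      Loop<K - 1, N>::run(shape, strides, offset, f);
      offset += strides[N - K];  // progressive update: one add per step
    }
  }
};

// Base case: all dimensions are fixed, so hand the flat offset to the callback.
template <std::size_t N>
struct Loop<0, N> {
  template <class F>
  static void run(const std::array<int, N>&, const std::array<int, N>&,
                  int offset, F&& f) {
    f(offset);
  }
};

// Collects the flat offsets visited for a given shape and stride set.
// A stride of 0 models broadcasting along that dimension.
template <std::size_t N>
std::vector<int> collectOffsets(const std::array<int, N>& shape,
                                const std::array<int, N>& strides) {
  std::vector<int> offsets;
  Loop<N, N>::run(shape, strides, 0,
                  [&](int offset) { offsets.push_back(offset); });
  return offsets;
}
```

Because broadcasting is just a zero stride here, the same loop nest covers the broadcast and non-broadcast cases, which is the sense in which one solution could serve all possibilities.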

kpu · 2020-07-25T23:24:48Z

If I'm allowed to assume that indices has shape 1x1x...x(axis)x1x1... (i.e. it is non-trivial only along axis) and that all tensors are in consecutive memory, then:

void Select(Tensor out,
            const Tensor in,
            const Tensor indices,
            int axis) {
  matchOrAbort<IndexType>(indices->type());
  
  functional::Shape outShape = out->shape();
  functional::Shape inShape  = in->shape();
  functional::Shape idxShape = indices->shape();
  
  int axisCPU = (int)(axis + functional::Shape::size() - out->shape().size());
  
  // Reduce the problem to beforeAxis x idxShape[axisCPU] x afterAxis.
  int beforeAxis = 1;
  for (int i = 0; i < axisCPU; ++i) {
    beforeAxis *= outShape[i];
  }
  int afterAxis = 1;
  for (int i = axisCPU + 1; i < functional::Shape::size(); ++i) {
    afterAxis *= outShape[i];
  }
  // Stride to use for the beforeAxis dimension in the input and output tensors.
  int inBeforeStride = axisCPU ? inShape.stride(axisCPU - 1) : inShape.elements();
  int outBeforeStride = axisCPU ? outShape.stride(axisCPU - 1) : outShape.elements();

  for (int beforeIdx = 0; beforeIdx < beforeAxis; ++beforeIdx) {
    for (int axisIdx = 0; axisIdx < idxShape[axisCPU]; ++axisIdx) {
      // This is the value to read along.
      int index = indices->data<IndexType>()[axisIdx];
      auto inBase = in->data() + beforeIdx * inBeforeStride + index * afterAxis;
      auto outBase = out->data() + beforeIdx * outBeforeStride + axisIdx * afterAxis;
      std::copy(inBase, inBase + afterAxis, outBase);
    }
  }
}

In the profiler it reduces to:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.02     55.35     0.01                             marian::cpu::Select(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase>, int)

What assumptions are allowed here?
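To make the stated assumptions concrete, here is a stand-alone sketch of the same before x axis x after reduction on plain `std::vector`s rather than Marian tensors. The function name `selectAxis` and its parameters are hypothetical. It assumes contiguous row-major storage and a one-dimensional index list, which are exactly the assumptions being asked about:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration, not Marian code.
// `in` is viewed as before x inAxis x after (contiguous, row-major).
// The result is before x indices.size() x after.
std::vector<float> selectAxis(const std::vector<float>& in,
                              std::size_t before,
                              std::size_t inAxis,
                              std::size_t after,
                              const std::vector<std::size_t>& indices) {
  assert(in.size() == before * inAxis * after);
  std::vector<float> out(before * indices.size() * after);

  // Distance between consecutive "before" slices in the input and the output.
  const std::size_t inBeforeStride = inAxis * after;
  const std::size_t outBeforeStride = indices.size() * after;

  for (std::size_t b = 0; b < before; ++b) {
    for (std::size_t a = 0; a < indices.size(); ++a) {
      assert(indices[a] < inAxis);
      // Each selected index contributes one contiguous run of `after` floats.
      const float* inBase = in.data() + b * inBeforeStride + indices[a] * after;
      float* outBase = out.data() + b * outBeforeStride + a * after;
      std::copy(inBase, inBase + after, outBase);
    }
  }
  return out;
}
```

For example, with a 2x3x2 input holding 0..11 and indices {2, 0}, the output is 4, 5, 0, 1, 10, 11, 6, 7. If indices varied along any other dimension, or if either tensor were broadcast or strided, a single flat index array and one contiguous copy per index would no longer be enough.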

Fixes #684
Measured: enes.student.tiny11, xzcat sources.shuf.xz |head -n 10000, var (Cascade Lake), single core, based on the intgemm_reintegrated_computestats branch
Before: Total time: 66.69077s wall
After: Total time: 61.20206s wall

kpu added the performance label Jul 25, 2020

kpu assigned afaji Jul 25, 2020

kpu linked a pull request Jul 26, 2020 that will close this issue: Fast implementation of Select for most cases on CPU #687 (Open, 4 tasks)

kpu assigned kpu and unassigned afaji Jul 26, 2020
