Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast implementation of Select for most cases on CPU #687

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kpu
Copy link
Member

@kpu kpu commented Jul 26, 2020

The existing implementation of Select for CPU is very slow.

Reimplemented the Select function for CPU using std::copy for the case when stuff to copy after the axis is contiguous, which it always is in practice.

Fixes #684

Measured:
enes.student.tiny11
xzcat sources.shuf.xz |head -n 10000
var (Cascade Lake) single core
based on intgemm_reintegrated_computestats branch

Before Total time: 66.69077s wall
After Total time: 61.20206s wall

Testing

Ran Marian unit tests (which already includes a tesf tor Select)
Ran enes student on Linux with cmake .. -DCMAKE_BUILD_TYPE=Debug -DCOMPILE_CUDA=OFF -DUSE_FBGEMM=ON -DUSE_SENTENCEPIECE=ON -DCOMPILE_TESTS=ON

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

kpu added 4 commits July 26, 2020 19:26
Fixes #684

Measured:
enes.student.tiny11
xzcat sources.shuf.xz |head -n 10000
var (Cascade Lake) single core
based on intgemm_reintegrated_computestats branch

Before Total time: 66.69077s wall
After Total time: 61.20206s wall
@emjotde emjotde self-assigned this Jul 27, 2020
@emjotde
Copy link
Member

emjotde commented Jul 27, 2020

@snukky Cool, the checks are already doing their job.

idxBase += idxStride;
}
});
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do an if {} else {} here instead of the empty return. Could you also mention that the code below is execute if the above conditions are not met and that this code is generic but slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's also add a positive and a negative example where each branch is used? There are a couple of them in tests_operators, I think at least one will still go to the default version?

const functional::Shape &inShape,
const functional::Shape &idxShape,
int axisCPU,
Backend backend) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a different name than Backend. We use backend for CPU/GPU device things, had me confused for a sec.

@emjotde
Copy link
Member

emjotde commented Aug 1, 2020

Nice change. You mentioned in the issue that this improves things for your tiny models, but not for the WNGT models? Can you briefly say here what the difference is?

Copy link
Contributor

@frankseide frankseide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change indeed. LGTM!

@snukky
Copy link
Member

snukky commented Aug 26, 2020

Ping @kpu

@emjotde
Copy link
Member

emjotde commented Aug 31, 2020

@kpu if you address the review comments this is ready to merge.

@kpu
Copy link
Member Author

kpu commented Aug 31, 2020

I know, will fix it this week.

@emjotde
Copy link
Member

emjotde commented Oct 26, 2020

Ping :P

@emjotde
Copy link
Member

emjotde commented Mar 23, 2021

@kpu Beep bop boop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Select function on CPU takes 10% of time on tiny students, can be optimized
4 participants