Small optimizations #768

Open · wants to merge 14 commits into base: master
Conversation

@rhenry-nv (Contributor) commented Dec 2, 2020

Description

This PR splits out some small optimizations from PR #743.

Performance improvements from this PR as measured on a Titan V using a proxy transformer model:

Times with 1 stream

| Batch | Initial time (s) | Current time (s) | Runtime reduction | Speedup factor |
|---|---|---|---|---|
| 1 | 204.41 | 166.74 | 18.4% | 1.226 |
| 2 | 136.77 | 114.46 | 16.3% | 1.195 |
| 4 | 84.76 | 69.39 | 18.1% | 1.222 |
| 8 | 50.32 | 41.52 | 17.5% | 1.212 |
| 16 | 29.94 | 25.07 | 16.3% | 1.194 |
| 32 | 18.51 | 15.98 | 13.7% | 1.158 |
| 64 | 12.09 | 10.57 | 12.6% | 1.144 |
| 128 | 8.19 | 7.15 | 12.7% | 1.145 |
| 256 | 6.02 | 5.39 | 10.5% | 1.117 |

Times with 2 streams

| Batch | Initial time (s) | Current time (s) | Runtime reduction | Speedup factor |
|---|---|---|---|---|
| 1 | 156.74 | 119.03 | 24.1% | 1.317 |
| 2 | 106.13 | 79.63 | 25.0% | 1.333 |
| 4 | 65.10 | 48.64 | 25.3% | 1.338 |
| 8 | 38.46 | 29.07 | 24.4% | 1.323 |
| 16 | 22.68 | 17.29 | 23.8% | 1.312 |
| 32 | 14.31 | 10.98 | 23.3% | 1.303 |
| 64 | 9.51 | 7.19 | 24.4% | 1.323 |
| 128 | 6.66 | 5.18 | 22.2% | 1.286 |
| 256 | 5.05 | 4.06 | 19.6% | 1.244 |
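The derived columns in both tables follow directly from the measured times. As a sanity check, here is the arithmetic for batch 1 with two streams (a small illustrative script, not part of the PR):

```python
# Recompute the derived columns for batch 1, two streams.
initial, current = 156.74, 119.03  # measured seconds, from the table

runtime_reduction = (initial - current) / initial  # fraction of runtime saved
speedup = initial / current                        # how many times faster

print(f"reduction={runtime_reduction:.4f} speedup={speedup:.4f}")
# reduction=0.2406 speedup=1.3168
```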

List of changes:

  • Uses the per-thread default stream for cuBLAS.
  • Uses the strided batched GEMM cuBLAS call when possible for batchedGemm.
  • In the general batchedGemm case, reduces the number of memcpy calls from 3 to 1.
  • Rounds the width of input batches up to a multiple of 8 when the GPU backend is used, to make better use of tensor cores on Volta architectures and newer.
  • Adds contribution notices to the changed files.
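On the strided-batched point: cuBLAS offers two batched GEMM entry points, one taking an array of per-matrix pointers (cublasGemmBatchedEx) and one taking a base pointer plus a constant stride (cublasGemmStridedBatchedEx). The strided call applies when the per-matrix addresses form an arithmetic sequence. A minimal Python sketch of that eligibility check (the function name and structure are illustrative, not the actual Marian code):

```python
def can_use_strided_batched(ptrs):
    """Return the constant stride if the address list is an arithmetic
    sequence (so a strided batched GEMM could replace the pointer-array
    batched GEMM), else None."""
    if len(ptrs) < 2:
        return 0  # a single matrix is trivially strided
    stride = ptrs[1] - ptrs[0]
    for prev, cur in zip(ptrs, ptrs[1:]):
        if cur - prev != stride:
            return None
    return stride

# Contiguously packed batch of 4 matrices, 4096 bytes apart:
print(can_use_strided_batched([0x1000, 0x2000, 0x3000, 0x4000]))  # 4096
# Scattered matrices fall back to the general batched path:
print(can_use_strided_batched([0x1000, 0x2000, 0x5000]))  # None
```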
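On the width-rounding point: the padding is plain round-up arithmetic, sketched below (helper name hypothetical; Marian's actual implementation lives in the batch-construction code):

```python
def round_up_to_multiple(width, multiple=8):
    """Round width up to the nearest multiple; 8 satisfies the
    tensor-core alignment requirement on Volta and newer."""
    return ((width + multiple - 1) // multiple) * multiple

for w in (1, 7, 8, 9, 17):
    print(w, "->", round_up_to_multiple(w))
# 1 -> 8, 7 -> 8, 8 -> 8, 9 -> 16, 17 -> 24
```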

How to test

I ran the regression tests. On Volta, some regression tests fail due to the additional use of tensor cores: prior to CUDA 11, cuBLAS does not use tensor cores if matrices fail to meet its alignment requirements, a restriction lifted in CUDA 11 and later.

The differences in the float output from the failing regression tests are small. The failing tests are listed here:

  • tests/interface/input-tsv/test_tsv_train_with_align.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_and_weights.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_and_weights_inputtypes.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_pos0.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_shuffle_in_ram.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_stdin.sh
  • tests/scorer/nbest/test_compare_parallel_and_nbest.sh
  • tests/training/features/mixed-ensembles/test_ensemble_of_different_s2s.sh
  • tests/training/features/mixed-ensembles/test_ensemble_of_s2s_and_transformer.sh
  • tests/training/features/guided-alignment/test_guided_alignment_rnn.sh
  • tests/training/features/guided-alignment/test_guided_alignment_transformer.sh

OS: Ubuntu 18.04.3 LTS
Compiler: gcc 7.5.0
nvcc version: 10.1.243

cmake command:

cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

…n. Reduced the memcpys from 3 to 1 for the general case.
… default stream. This has each thread issue calls to its own stream, instead of the global default stream, when Marian is compiled to use a default stream per thread.
@rhenry-nv mentioned this pull request Dec 4, 2020
src/data/corpus.cpp (review thread resolved)
@@ -1,3 +1,8 @@
/* Part of this file was contributed by NVIDIA under license:
* Copyright (C) 2020 NVIDIA Corporation
* SPDX-License-Identifier: MIT
Contributor:
What is this for? It seems the only change is the rounding of maxDims. What NVidia contribution was made here?

In general, I would be opposed to changing the comment style for the license. Marian source code does not have license information in the source files directly, but rather in a separate license file. Please let's continue to follow that pattern.

Contributor (author):

Yes, that was the only contribution. However, I was told to include a notice even in files where I make one-line changes. This is just me following instructions.

Contributor (author):

Also, I just saw the second part of your comment. Is there a process for adding NVIDIA to the license file? I'm not sure what the solution is here.

Member:
The one that hasn't been updated since 2016 and doesn't even name Microsoft? I guess add it to your PR and shame Marcin.

Contributor (author):

@kpu I think this is the one Frank was referring to. I'll ask Marcin if this is ok. I will also need to check to see if we are internally ok with removing the notices assuming we are added to the license file.

@emjotde What do you suggest as the way forward?

@rhenry-nv (Contributor, author) commented Dec 15, 2020:

I checked internally and I can add NVIDIA to the main license file and remove the notices in all the other files!

I will take care of this in all the PRs I have submitted.

Edit: Let me know if the license change looks ok.

Expr indices;
// I think this doesn't work if model split among gpus but not sure if it matters

for (auto& state : states_) {
Contributor:

Please add a comment what this logic does, as it is not obvious why the old code is not working.

@rhenry-nv (Contributor, author) commented Dec 4, 2020:

There is nothing wrong with the old code. This just needs to check if we need to ship indices to the GPU. I will add a comment explaining what this does.

Contributor (author):

I added a comment in a recent commit but I'm not sure if it's clear enough. If so, feel free to resolve this.

if (state.output) {
indices = state.output->graph()->indices(selIdx);
break;
}
Contributor:

What happens if neither? Is that a valid condition? If not, let's change this to else

Contributor (author):

I think neither being set is a valid condition.

Contributor:

Ah, because it's a loop, sorry. But what happens if it never matches any of the conditions? Then indices ends up being NULL. Is it worth an ABORT_IF for that?

Contributor (author):

Good point. I think indices can only end up being NULL if all of the states' output and cell fields are null. In that case, we will return a vector of nulls which was the behavior of the original code. (Also, I think these values are NULL on the first run of a network so we want this behavior).

src/tensors/gpu/prod.cpp (outdated; review thread resolved)
4 participants