Small optimizations #768

Open · wants to merge 14 commits into base: master
Conversation

@rhenry-nv (Contributor) commented Dec 2, 2020

Description

This PR splits out some small optimizations from PR #743.

Performance improvements from this PR as measured on a Titan V using a proxy transformer model:

Times with 1 stream

| Batch | Initial time (s) | Current time (s) | Runtime reduction | Speedup factor |
|---|---|---|---|---|
| 1 | 204.41 | 166.74 | 18.4% | 1.226 |
| 2 | 136.77 | 114.46 | 16.3% | 1.195 |
| 4 | 84.76 | 69.39 | 18.1% | 1.222 |
| 8 | 50.32 | 41.52 | 17.5% | 1.212 |
| 16 | 29.94 | 25.07 | 16.3% | 1.194 |
| 32 | 18.51 | 15.98 | 13.7% | 1.158 |
| 64 | 12.09 | 10.57 | 12.6% | 1.144 |
| 128 | 8.19 | 7.15 | 12.7% | 1.145 |
| 256 | 6.02 | 5.39 | 10.5% | 1.117 |

Times with 2 streams

| Batch | Initial time (s) | Current time (s) | Runtime reduction | Speedup factor |
|---|---|---|---|---|
| 1 | 156.74 | 119.03 | 24.1% | 1.317 |
| 2 | 106.13 | 79.63 | 25.0% | 1.333 |
| 4 | 65.10 | 48.64 | 25.3% | 1.338 |
| 8 | 38.46 | 29.07 | 24.4% | 1.323 |
| 16 | 22.68 | 17.29 | 23.8% | 1.312 |
| 32 | 14.31 | 10.98 | 23.3% | 1.303 |
| 64 | 9.51 | 7.19 | 24.4% | 1.323 |
| 128 | 6.66 | 5.18 | 22.2% | 1.286 |
| 256 | 5.05 | 4.06 | 19.6% | 1.244 |
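The derived columns in both tables follow directly from the measured times. As a sanity check, here is the arithmetic for batch 1 with two streams (a small illustrative script, not part of the PR):

```python
# Recompute the derived columns for batch 1, two streams.
initial, current = 156.74, 119.03  # measured seconds, from the table

runtime_reduction = (initial - current) / initial  # fraction of runtime saved
speedup = initial / current                        # how many times faster

print(f"reduction={runtime_reduction:.4f} speedup={speedup:.4f}")
# reduction=0.2406 speedup=1.3168
```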

List of changes:

  • Uses the per-thread default stream for cuBLAS.
  • Uses the strided batched GEMM cuBLAS call when possible for batchedGemm.
  • In the general batchedGemm case, reduces the number of memcpy calls from 3 to 1.
  • Rounds the width of input batches up to a multiple of 8 when the GPU backend is used, to make better use of tensor cores on Volta architectures and newer.
  • Adds contribution notices to the changed files.
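On the strided-batched point: cuBLAS offers two batched GEMM entry points, one taking an array of per-matrix pointers (cublasGemmBatchedEx) and one taking a base pointer plus a constant stride (cublasGemmStridedBatchedEx). The strided call applies when the per-matrix addresses form an arithmetic sequence. A minimal Python sketch of that eligibility check (the function name and structure are illustrative, not the actual Marian code):

```python
def can_use_strided_batched(ptrs):
    """Return the constant stride if the address list is an arithmetic
    sequence (so a strided batched GEMM could replace the pointer-array
    batched GEMM), else None."""
    if len(ptrs) < 2:
        return 0  # a single matrix is trivially strided
    stride = ptrs[1] - ptrs[0]
    for prev, cur in zip(ptrs, ptrs[1:]):
        if cur - prev != stride:
            return None
    return stride

# Contiguously packed batch of 4 matrices, 4096 bytes apart:
print(can_use_strided_batched([0x1000, 0x2000, 0x3000, 0x4000]))  # 4096
# Scattered matrices fall back to the general batched path:
print(can_use_strided_batched([0x1000, 0x2000, 0x5000]))  # None
```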
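On the width-rounding point: the padding is plain round-up arithmetic, sketched below (helper name hypothetical; Marian's actual implementation lives in the batch-construction code):

```python
def round_up_to_multiple(width, multiple=8):
    """Round width up to the nearest multiple; 8 satisfies the
    tensor-core alignment requirement on Volta and newer."""
    return ((width + multiple - 1) // multiple) * multiple

for w in (1, 7, 8, 9, 17):
    print(w, "->", round_up_to_multiple(w))
# 1 -> 8, 7 -> 8, 8 -> 8, 9 -> 16, 17 -> 24
```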

How to test

I ran the regression tests. On Volta, some regression tests fail due to the additional use of tensor cores: prior to CUDA 11, cuBLAS does not use tensor cores if matrices fail to meet its alignment requirements, a restriction lifted in CUDA 11 and later.

The differences in the float output from the failing regression tests are small. The failing tests are listed here:

  • tests/interface/input-tsv/test_tsv_train_with_align.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_and_weights.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_and_weights_inputtypes.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_pos0.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_shuffle_in_ram.sh
  • tests/interface/input-tsv/test_tsv_train_with_align_stdin.sh
  • tests/scorer/nbest/test_compare_parallel_and_nbest.sh
  • tests/training/features/mixed-ensembles/test_ensemble_of_different_s2s.sh
  • tests/training/features/mixed-ensembles/test_ensemble_of_s2s_and_transformer.sh
  • tests/training/features/guided-alignment/test_guided_alignment_rnn.sh
  • tests/training/features/guided-alignment/test_guided_alignment_transformer.sh

OS: Ubuntu 18.04.3 LTS
Compiler: gcc 7.5.0
nvcc version: 10.1.243

cmake command:

cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

…n. Reduced the memcpys from 3 to 1 for the general case.
… default stream. This has each thread issue calls to its own stream, instead of the global default stream, when Marian is compiled to use a default stream per thread.
@rhenry-nv mentioned this pull request Dec 4, 2020
src/data/corpus.cpp (review thread resolved)
@@ -1,3 +1,8 @@
/* Part of this file was contributed by NVIDIA under license:
* Copyright (C) 2020 NVIDIA Corporation
* SPDX-License-Identifier: MIT
Contributor:
What is this for? It seems the only change is the rounding of maxDims. What NVidia contribution was made here?

In general, I would be opposed to changing the comment style for the license. Marian source code does not have license information in the source files directly, but rather in a separate license file. Please let's continue to follow that pattern.

Contributor (author):

Yes, that was the only contribution. However, I was told to include a notice even in files where I make one-line changes. This is just me following instructions.

Contributor (author):

Also, I just saw the second part of your comment. Is there a process for adding NVIDIA to the license file? I'm not sure what the solution is here.

Member:
The one that hasn't been updated since 2016 and doesn't even name Microsoft? I guess add it to your PR and shame Marcin.

Contributor (author):

@kpu I think this is the one Frank was referring to. I'll ask Marcin if this is ok. I will also need to check to see if we are internally ok with removing the notices assuming we are added to the license file.

@emjotde What do you suggest as the way forward?

@rhenry-nv (Contributor, author) commented Dec 15, 2020:

I checked internally and I can add NVIDIA to the main license file and remove the notices in all the other files!

I will take care of this in all the PRs I have submitted.

Edit: Let me know if the license change looks ok.

Expr indices;
// I think this doesn't work if model split among gpus but not sure if it matters

for (auto& state : states_) {
Contributor:

Please add a comment what this logic does, as it is not obvious why the old code is not working.

@rhenry-nv (Contributor, author) commented Dec 4, 2020:

There is nothing wrong with the old code. This just needs to check if we need to ship indices to the GPU. I will add a comment explaining what this does.

Contributor (author):

I added a comment in a recent commit but I'm not sure if it's clear enough. If so, feel free to resolve this.

if (state.output) {
indices = state.output->graph()->indices(selIdx);
break;
}
Contributor:

What happens if neither? Is that a valid condition? If not, let's change this to else

Contributor (author):

I think neither being set is a valid condition.

Contributor:

Ah, because it's a loop, sorry. But what happens if it never matches any of the conditions? Then indices ends up being NULL. Is it worth an ABORT_IF for that?

Contributor (author):

Good point. I think indices can only end up being NULL if all of the states' output and cell fields are null. In that case, we will return a vector of nulls which was the behavior of the original code. (Also, I think these values are NULL on the first run of a network so we want this behavior).

src/tensors/gpu/prod.cpp (outdated; review thread resolved)
4 participants