
Add optimised max_element implementation as a specific case of nth_element with beam = 1 #981

Open
wants to merge 4 commits into `master`

Conversation

XapaJIaMnu
Contributor

Description

Add optimised max_element implementation as a specific case of nth_element with n = 1

Depending on the compiler used, this should speed up the max_element step of beam search by a factor of 2 to 10. A synthetic benchmark can be found at https://github.com/XapaJIaMnu/maxelem_test
A summary:

Cascade Lake results

| | GCC 11.2 | clang 14 | icc 2022.1.0 |
| --- | --- | --- | --- |
| std::max_element | 2.6696s | 0.4221s | 0.4662s |
| sequential | 1.0831s | 1.1924s | 1.1472s |
| AVX512 max + max_reduce | 0.2412s | 0.2152s | 0.2142s |
| AVX512 max_reduce only | 0.2570s | 0.2629s | 0.2325s |
| AVX512 cmp_ps_mask | 0.1884s | 0.1826s | 0.1833s |
| AVX512 cmp_ps_mask + vectorized overhang | 0.2097s | 0.2089s | 0.2072s |
| AVX cmp_ps + movemask | 0.2181s | 0.1697s | 0.1702s |
| SSE cmplt_ps + movemask | 0.2692s | 0.2051s | 0.2221s |

Ryzen 9 5900HS results

| | GCC 11.2 | clang 14 |
| --- | --- | --- |
| std::max_element | 2.4101s | 0.7221s |
| sequential | 0.3252s | 0.3518s |
| AVX cmp_ps + movemask | 0.1476s | 0.1214s |
| SSE cmplt_ps + movemask | 0.1693s | 0.1468s |
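
For reference, here is a minimal sketch of the AVX `cmp_ps + movemask` variant benchmarked above (my own illustration, not the exact code in this PR); it assumes `n` is a non-zero multiple of 8 and leaves overhang handling aside:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of the AVX "cmp_ps + movemask" argmax: keep the current best value as
// a broadcast vector and only fall back to a scalar scan of an 8-float chunk
// when the comparison mask says some lane beats the current best.
size_t argmax_avx(const float* data, size_t n) {
  size_t best = 0;
  float bestVal = data[0];
  __m256 bestVec = _mm256_set1_ps(bestVal);
  for (size_t i = 0; i < n; i += 8) {
    __m256 chunk = _mm256_loadu_ps(data + i);
    __m256 gt = _mm256_cmp_ps(chunk, bestVec, _CMP_GT_OQ);
    if (_mm256_movemask_ps(gt)) {  // rare path: a new maximum is in this chunk
      for (size_t j = i; j < i + 8; ++j) {
        if (data[j] > bestVal) { bestVal = data[j]; best = j; }
      }
      bestVec = _mm256_set1_ps(bestVal);
    }
  }
  return best;
}
```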

Added dependencies: none

How to test

Load any model with the new code path and decode with a beam size of 1. In our testing this reduced overall runtime by about 1%.
I didn't run all regression tests because something in them is currently broken.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@XapaJIaMnu
Contributor Author

XapaJIaMnu commented Apr 17, 2023

Apologies for the late reply @hieuhoang, I finally have some benchmarks.
I used a tiny11.tied.w configuration from WMT a few years ago, tested on 15.5k WMT sentences from the last several years (avg BLEU 36.{2,4,7} depending on configuration), with and without a shortlist, using the upstream fbgemm/intgemm/fp32 CPU backends. Here are the results:

tl;dr the beam1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.

| | Beam1Opt | Master |
| --- | --- | --- |
| AVX512, fbgemm, shortlist | 94.02s | 96.47s |
| AVX512, fbgemm, no shortlist | 182.80s | 188.05s |
| AVX2, fbgemm, shortlist | 114.26s | 115.75s |
| AVX2, fbgemm, no shortlist | 195.78s | 200.80s |
| AVX512, intgemm, shortlist | 106.20s | 108.71s |
| AVX512, intgemm, no shortlist | 194.08s | 200.36s |
| AVX2, intgemm, shortlist | 119.08s | 120.79s |
| AVX2, intgemm, no shortlist | 200.11s | 205.18s |
| AVX512, shortlist | 118.29s | 120.85s |
| AVX512, no shortlist | 209.82s | 216.63s |
| AVX2, shortlist | 135.16s | 136.96s |
| AVX2, no shortlist | 215.94s | 221.30s |

To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz
The AVX512 results are from a single-core Cascade Lake CPU, and the AVX2 results from a single-core Kaby Lake.

@hieuhoang
Collaborator

hieuhoang commented Apr 17, 2023

I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.

| | master | nick's max_element |
| --- | --- | --- |
| Big Machine | 52.85 | 51.74 |
| | 53 | 52.49 |
| | 51.11 | 52.4 |
| +bin model | 154.16 | 154.2 |
| | 150.89 | 153.32 |
| | 150.77 | 155.23 |
| base+LSH | 135.02 | 137.37 |
| | 135.27 | 136.12 |
| | 133.43 | 135.39 |
| bin model+LSH | 52.79 | 53.34 |
| | 53.08 | 52.58 |
| | 52.89 | 54.04 |
| bin model, beam2 | 36.12 | 35.51 |
| bin model, beam3 | 40.43 | 40.73 |
| bin model, beam4 | 62.01 | 62.77 |

@XapaJIaMnu
Contributor Author

I'd like to do some more tests first.
At the very least, can you reproduce the results on my model? (I have packaged runme scripts; only the paths to the Marian branches need to be adjusted.)

@hieuhoang
Collaborator

hieuhoang commented Apr 18, 2023

I may do so when I have time, but your results are your own. I review PRs to protect the results on my models.

@XapaJIaMnu
Contributor Author

Extra findings: I tested on a proper student model, and the conclusion is that my changes don't help when using LSH, but they do consistently offer better performance in all other cases. How does the LSH output layer differ from a shortlisted output layer?

@XapaJIaMnu
Contributor Author

I am getting some misaligned memory accesses from the LSH scores, which take more cycles to load; could that be one of the issues?

@hieuhoang
Collaborator

hieuhoang commented Apr 26, 2023

LSH needs to be used with an output layer that has no bias; the shortlist doesn't have that restriction.

Not sure where the misaligned memory is coming from. If the LSH params are set to alignment-friendly values, e.g. --output-approx-knn 128 1024, you should have aligned memory.

If max_element increases speed by 5% without LSH/shortlist, then I'm not surprised that there's no noticeable effect when using LSH/shortlist, since the vocab over which you need to find the max is much smaller.

@XapaJIaMnu
Contributor Author

I see improvements with the word-alignment-based shortlist, but not with the LSH-based shortlist, where I am consistently slower. I also get misaligned addresses only when using the LSH-based shortlist.

How big are your output layers typically? I used a 50 50 shortlist for my previous test.

I can't get an improvement with 100 1024 LSH. What settings do you use?

When it comes to alignment, I'd expect the array to be 256-aligned at the start; I don't care about the end, as I don't attempt to vectorise the overhang.
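
To make that concrete, here is a hedged sketch of what I mean (illustrative only, not the code in this PR, and assuming "256-aligned" means 32-byte/256-bit alignment for AVX loads): check the alignment of the buffer up front, and finish the last `n % 8` elements with a plain scalar loop rather than vectorising them.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative only: the vectorised fast path expects the score buffer to
// start on a 32-byte (256-bit) boundary; an LSH score buffer that violates
// this forces slower unaligned loads.
inline bool isAvxAligned(const float* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// The last n % 8 elements (the overhang) are handled with a scalar loop
// instead of being vectorised.
inline void scalarOverhangMax(const float* data, std::size_t n,
                              float& bestVal, std::size_t& bestIdx) {
  for (std::size_t i = n - (n % 8); i < n; ++i) {
    if (data[i] > bestVal) { bestVal = data[i]; bestIdx = i; }
  }
}
```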
