
Add optimised max_element implementation as a specific case of nth_element with beam = 1 #981

Open
wants to merge 4 commits into `master`

Conversation

XapaJIaMnu
Contributor

Description

Add optimised max_element implementation as a specific case of nth_element with n = 1

Depending on the compiler used, this should speed up the max_element step of beam search by a factor of 2 to 10. A synthetic benchmark can be found at https://github.com/XapaJIaMnu/maxelem_test
A summary:

Cascade Lake results

| | GCC 11.2 | clang 14 | icc 2022.1.0 |
| --- | --- | --- | --- |
| std::max_element | 2.6696s | 0.4221s | 0.4662s |
| sequential | 1.0831s | 1.1924s | 1.1472s |
| AVX512 max + max_reduce | 0.2412s | 0.2152s | 0.2142s |
| AVX512 max_reduce only | 0.2570s | 0.2629s | 0.2325s |
| AVX512 cmp_ps_mask | 0.1884s | 0.1826s | 0.1833s |
| AVX512 cmp_ps_mask + vectorized overhang | 0.2097s | 0.2089s | 0.2072s |
| AVX cmp_ps + movemask | 0.2181s | 0.1697s | 0.1702s |
| SSE cmplt_ps + movemask | 0.2692s | 0.2051s | 0.2221s |

Ryzen 9 5900HS results

| | GCC 11.2 | clang 14 |
| --- | --- | --- |
| std::max_element | 2.4101s | 0.7221s |
| sequential | 0.3252s | 0.3518s |
| AVX cmp_ps + movemask | 0.1476s | 0.1214s |
| SSE cmplt_ps + movemask | 0.1693s | 0.1468s |
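
For reference, here is a minimal sketch of the AVX `cmp_ps + movemask` variant benchmarked above (my own illustration, not the exact code in this PR); it assumes `n` is a non-zero multiple of 8 and leaves overhang handling aside:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of the AVX "cmp_ps + movemask" argmax: keep the current best value as
// a broadcast vector and only fall back to a scalar scan of an 8-float chunk
// when the comparison mask says some lane beats the current best.
size_t argmax_avx(const float* data, size_t n) {
  size_t best = 0;
  float bestVal = data[0];
  __m256 bestVec = _mm256_set1_ps(bestVal);
  for (size_t i = 0; i < n; i += 8) {
    __m256 chunk = _mm256_loadu_ps(data + i);
    __m256 gt = _mm256_cmp_ps(chunk, bestVec, _CMP_GT_OQ);
    if (_mm256_movemask_ps(gt)) {  // rare path: a new maximum is in this chunk
      for (size_t j = i; j < i + 8; ++j) {
        if (data[j] > bestVal) { bestVal = data[j]; best = j; }
      }
      bestVec = _mm256_set1_ps(bestVal);
    }
  }
  return best;
}
```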

Added dependencies: none

How to test

Load any model with the new code path and decode with a beam size of 1. In our testing this reduced overall runtime by about 1%.
I didn't run all regression tests because something in them is currently broken.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@XapaJIaMnu
Contributor Author

XapaJIaMnu commented Apr 17, 2023

Apologies for the late reply @hieuhoang, I finally have some benchmarks.
I used a tiny11.tied.w configuration from WMT a few years ago, tested on 15.5k WMT sentences from the last several years (avg BLEU 36.{2,4,7} depending on configuration), with and without a shortlist, using the upstream fbgemm/intgemm/fp32 CPU backends. Here are the results:

tl;dr the beam1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.

| | Beam1Opt | Master |
| --- | --- | --- |
| AVX512, fbgemm, shortlist | 94.02s | 96.47s |
| AVX512, fbgemm, no shortlist | 182.80s | 188.05s |
| AVX2, fbgemm, shortlist | 114.26s | 115.75s |
| AVX2, fbgemm, no shortlist | 195.78s | 200.80s |
| AVX512, intgemm, shortlist | 106.20s | 108.71s |
| AVX512, intgemm, no shortlist | 194.08s | 200.36s |
| AVX2, intgemm, shortlist | 119.08s | 120.79s |
| AVX2, intgemm, no shortlist | 200.11s | 205.18s |
| AVX512, shortlist | 118.29s | 120.85s |
| AVX512, no shortlist | 209.82s | 216.63s |
| AVX2, shortlist | 135.16s | 136.96s |
| AVX2, no shortlist | 215.94s | 221.30s |

To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz
The AVX512 results are from a single-core Cascade Lake CPU, and the AVX2 results from a single-core Kaby Lake.

@hieuhoang
Collaborator

hieuhoang commented Apr 17, 2023

I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.

| | master | nick's max_element |
| --- | --- | --- |
| Big Machine | 52.85 | 51.74 |
| | 53 | 52.49 |
| | 51.11 | 52.4 |
| +bin model | 154.16 | 154.2 |
| | 150.89 | 153.32 |
| | 150.77 | 155.23 |
| base+LSH | 135.02 | 137.37 |
| | 135.27 | 136.12 |
| | 133.43 | 135.39 |
| bin model+LSH | 52.79 | 53.34 |
| | 53.08 | 52.58 |
| | 52.89 | 54.04 |
| bin model, beam2 | 36.12 | 35.51 |
| bin model, beam3 | 40.43 | 40.73 |
| bin model, beam4 | 62.01 | 62.77 |

@XapaJIaMnu
Contributor Author

I'd like to do some more tests first.
At the very least, can you reproduce the results on my model? (I have packaged runme scripts; only the paths to the Marian branches need to be adjusted.)

@hieuhoang
Collaborator

hieuhoang commented Apr 18, 2023

I may do so when I have time, but your results are your own. I review PRs to protect the results on my models.

@XapaJIaMnu
Contributor Author

Extra findings: I tested on a proper student model, and the conclusion is that my changes don't help when using LSH, but they do consistently offer better performance in all other cases. How does the LSH output layer differ from a shortlisted output layer?

@XapaJIaMnu
Contributor Author

I am getting some misaligned memory accesses from the LSH scores, which take more cycles to load; could that be one of the issues?

@hieuhoang
Collaborator

hieuhoang commented Apr 26, 2023

LSH needs to be used with an output layer that has no bias; the shortlist doesn't have that restriction.

Not sure where the misaligned memory is coming from. If the LSH params are set to alignment-friendly values, e.g. --output-approx-knn 128 1024, you should have aligned memory.

If max_element increases speed by 5% without LSH/shortlist, then I'm not surprised that there's no noticeable effect when using LSH/shortlist, since the vocab over which you need to find the max is much smaller.

@XapaJIaMnu
Contributor Author

I see improvements with the word-alignment-based shortlist, but not with the LSH-based shortlist, where I am consistently slower. I also get misaligned addresses only when using the LSH-based shortlist.

How big are your output layers typically? I used a 50 50 shortlist for my previous test.

I can't get an improvement with 100 1024 LSH. What settings do you use?

When it comes to alignment, I'd expect the array to be 256-aligned at the start; I don't care about the end, as I don't attempt to vectorise the overhang.
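
To make that concrete, here is a hedged sketch of what I mean (illustrative only, not the code in this PR, and assuming "256-aligned" means 32-byte/256-bit alignment for AVX loads): check the alignment of the buffer up front, and finish the last `n % 8` elements with a plain scalar loop rather than vectorising them.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative only: the vectorised fast path expects the score buffer to
// start on a 32-byte (256-bit) boundary; an LSH score buffer that violates
// this forces slower unaligned loads.
inline bool isAvxAligned(const float* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// The last n % 8 elements (the overhang) are handled with a scalar loop
// instead of being vectorised.
inline void scalarOverhangMax(const float* data, std::size_t n,
                              float& bestVal, std::size_t& bestIdx) {
  for (std::size_t i = n - (n % 8); i < n; ++i) {
    if (data[i] > bestVal) { bestVal = data[i]; bestIdx = i; }
  }
}
```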
