Avx512 Validation #45

Merged
merged 7 commits into from
Jun 20, 2024

@Nick-Nuon Nick-Nuon commented Jun 19, 2024

Here is the Avx512 Validation.
Some things could probably be polished further (the C++ code uses the ternary_logic instruction, which could save us a few more instructions, and there are also instructions that the API doesn't seem to expose), but in the spirit of smaller, more frequent PRs, as well as recent comments: it works.
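
For readers unfamiliar with the ternary-logic instruction mentioned above: it evaluates an arbitrary three-input boolean function per bit, selected by an 8-bit truth table, so an expression such as `a | b | c` takes one instruction instead of two. The snippet below is only a scalar emulation on 64-bit values to show how the truth-table byte works. The class and method names are hypothetical, and it does not claim anything about which intrinsic this PR or the C++ code actually uses.

```csharp
using System;

static class TernaryLogicSketch
{
    // Truth tables for two common three-input functions.
    // Bit i of the table is the output for the input combination i = (a << 2) | (b << 1) | c.
    public const byte OrOfThree = 0xFE;   // 0 only when a = b = c = 0
    public const byte XorOfThree = 0x96;  // 1 when an odd number of inputs are 1

    // Scalar emulation: for each of the 64 bit positions, look up the result
    // in the 8-entry truth table `control`, using the three input bits as the index.
    public static ulong TernaryLogic(ulong a, ulong b, ulong c, byte control)
    {
        ulong result = 0;
        for (int bit = 0; bit < 64; bit++)
        {
            int index = (int)((((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1));
            ulong outBit = (ulong)((control >> index) & 1);
            result |= outBit << bit;
        }
        return result;
    }
}
```

A handy check: with the inputs `0xF0`, `0xCC`, `0xAA`, the low byte of the result equals the truth-table byte itself, which is how such constants are usually derived.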

EDIT: I saw #44 shortly after publishing this one.

These are the benchmarks on the server:

Method FileName Mean Error StdDev Speed (GB/s)
SIMDUtf8ValidationRealDataAvx2 data/Arabic-Lipsum.utf8.txt 10,137.4 ns 85.02 ns 4.66 ns 8.06
SIMDUtf8ValidationRealDataAvx2 data/Chinese-Lipsum.utf8.txt 8,662.5 ns 43.74 ns 2.40 ns 8.06
SIMDUtf8ValidationRealDataAvx2 data/Emoji-Lipsum.utf8.txt 8,121.8 ns 25.02 ns 1.37 ns 8.07
SIMDUtf8ValidationRealDataAvx2 data/Hebrew-Lipsum.utf8.txt 8,278.8 ns 30.10 ns 1.65 ns 8.03
SIMDUtf8ValidationRealDataAvx2 data/Hindi-Lipsum.utf8.txt 18,107.1 ns 18,788.77 ns 1,029.88 ns 4.86
SIMDUtf8ValidationRealDataAvx2 data/Japanese-Lipsum.utf8.txt 8,365.4 ns 284.57 ns 15.60 ns 8.11
SIMDUtf8ValidationRealDataAvx2 data/Korean-Lipsum.utf8.txt 8,261.5 ns 81.05 ns 4.44 ns 8.06
SIMDUtf8ValidationRealDataAvx2 data/Latin-Lipsum.utf8.txt 1,135.5 ns 22.44 ns 1.23 ns 76.57
SIMDUtf8ValidationRealDataAvx2 data/Russian-Lipsum.utf8.txt 14,730.5 ns 56,181.67 ns 3,079.51 ns 7.11
SIMDUtf8ValidationRealDataAvx2 data/arabic.utf8.txt 66,011.5 ns 1,510.07 ns 82.77 ns 8.09
SIMDUtf8ValidationRealDataAvx2 data/chinese.utf8.txt 25,716.2 ns 63,979.72 ns 3,506.94 ns 7.05
SIMDUtf8ValidationRealDataAvx2 data/czech.utf8.txt 14,952.6 ns 95.66 ns 5.24 ns 10.21
SIMDUtf8ValidationRealDataAvx2 data/english.utf8.txt 25,157.6 ns 310.73 ns 17.03 ns 15.52
SIMDUtf8ValidationRealDataAvx2 data/esperanto.utf8.txt 6,722.1 ns 10.32 ns 0.57 ns 12.94
SIMDUtf8ValidationRealDataAvx2 data/french.utf8.txt 105,968.9 ns 245,134.86 ns 13,436.67 ns 4.22
SIMDUtf8ValidationRealDataAvx2 data/german.utf8.txt 15,640.2 ns 116.63 ns 6.39 ns 13.16
SIMDUtf8ValidationRealDataAvx2 data/greek.utf8.txt 17,982.8 ns 393.02 ns 21.54 ns 10.08
SIMDUtf8ValidationRealDataAvx2 data/hebrew.utf8.txt 20,402.7 ns 1,294.21 ns 70.94 ns 9.32
SIMDUtf8ValidationRealDataAvx2 data/hindi.utf8.txt 39,937.4 ns 636.20 ns 34.87 ns 9.93
SIMDUtf8ValidationRealDataAvx2 data/japanese.utf8.txt 16,833.3 ns 31.83 ns 1.74 ns 9.76
SIMDUtf8ValidationRealDataAvx2 data/korean.utf8.txt 10,639.6 ns 1,524.43 ns 83.56 ns 9.20
SIMDUtf8ValidationRealDataAvx2 data/persan.utf8.txt 16,204.9 ns 240.51 ns 13.18 ns 9.64
SIMDUtf8ValidationRealDataAvx2 data/portuguese.utf8.txt 32,543.2 ns 2,024.50 ns 110.97 ns 8.62
SIMDUtf8ValidationRealDataAvx2 data/russian.utf8.txt 46,393.5 ns 5,818.04 ns 318.91 ns 8.77
SIMDUtf8ValidationRealDataAvx2 data/thai.utf8.txt 69,299.8 ns 3,565.29 ns 195.43 ns 8.57
SIMDUtf8ValidationRealDataAvx2 data/turkish.utf8.txt 23,016.1 ns 102,029.82 ns 5,592.60 ns 8.48
SIMDUtf8ValidationRealDataAvx2 data/vietnamese.utf8.txt 36,518.6 ns 4,671.64 ns 256.07 ns 8.74
SIMDUtf8ValidationRealDataAvx512 data/Arabic-Lipsum.utf8.txt 6,820.6 ns 86.80 ns 4.76 ns 11.98
SIMDUtf8ValidationRealDataAvx512 data/Chinese-Lipsum.utf8.txt 6,116.2 ns 56.26 ns 3.08 ns 11.42
SIMDUtf8ValidationRealDataAvx512 data/Emoji-Lipsum.utf8.txt 5,456.8 ns 28.24 ns 1.55 ns 12.01
SIMDUtf8ValidationRealDataAvx512 data/Hebrew-Lipsum.utf8.txt 6,027.7 ns 3,844.90 ns 210.75 ns 11.03
SIMDUtf8ValidationRealDataAvx512 data/Hindi-Lipsum.utf8.txt 7,794.0 ns 11.89 ns 0.65 ns 11.29
SIMDUtf8ValidationRealDataAvx512 data/Japanese-Lipsum.utf8.txt 5,913.1 ns 5.38 ns 0.30 ns 11.47
SIMDUtf8ValidationRealDataAvx512 data/Korean-Lipsum.utf8.txt 5,874.6 ns 28.47 ns 1.56 ns 11.34
SIMDUtf8ValidationRealDataAvx512 data/Latin-Lipsum.utf8.txt 1,830.7 ns 5.59 ns 0.31 ns 47.49
SIMDUtf8ValidationRealDataAvx512 data/Russian-Lipsum.utf8.txt 11,364.3 ns 47,422.31 ns 2,599.38 ns 9.22
SIMDUtf8ValidationRealDataAvx512 data/arabic.utf8.txt 39,222.3 ns 335.19 ns 18.37 ns 13.61
SIMDUtf8ValidationRealDataAvx512 data/chinese.utf8.txt 14,327.0 ns 66.59 ns 3.65 ns 12.66
SIMDUtf8ValidationRealDataAvx512 data/czech.utf8.txt 11,846.1 ns 8.33 ns 0.46 ns 12.89
SIMDUtf8ValidationRealDataAvx512 data/english.utf8.txt 18,218.8 ns 66,402.30 ns 3,639.73 ns 21.43
SIMDUtf8ValidationRealDataAvx512 data/esperanto.utf8.txt 5,419.8 ns 125.89 ns 6.90 ns 16.05
SIMDUtf8ValidationRealDataAvx512 data/french.utf8.txt 31,563.0 ns 307.41 ns 16.85 ns 14.16
SIMDUtf8ValidationRealDataAvx512 data/german.utf8.txt 12,794.2 ns 498.18 ns 27.31 ns 16.08
SIMDUtf8ValidationRealDataAvx512 data/greek.utf8.txt 13,762.4 ns 85.81 ns 4.70 ns 13.18
SIMDUtf8ValidationRealDataAvx512 data/hebrew.utf8.txt 16,335.0 ns 29,068.55 ns 1,593.35 ns 11.64
SIMDUtf8ValidationRealDataAvx512 data/hindi.utf8.txt 29,788.0 ns 238.93 ns 13.10 ns 13.31
SIMDUtf8ValidationRealDataAvx512 data/japanese.utf8.txt 12,964.3 ns 41.56 ns 2.28 ns 12.68
SIMDUtf8ValidationRealDataAvx512 data/korean.utf8.txt 7,874.1 ns 90.45 ns 4.96 ns 12.43
SIMDUtf8ValidationRealDataAvx512 data/persan.utf8.txt 15,290.9 ns 60,966.07 ns 3,341.76 ns 10.22
SIMDUtf8ValidationRealDataAvx512 data/portuguese.utf8.txt 19,620.1 ns 12,876.00 ns 705.78 ns 14.30
SIMDUtf8ValidationRealDataAvx512 data/russian.utf8.txt 31,804.6 ns 182.54 ns 10.01 ns 12.80
SIMDUtf8ValidationRealDataAvx512 data/thai.utf8.txt 43,122.9 ns 617.69 ns 33.86 ns 13.77
SIMDUtf8ValidationRealDataAvx512 data/turkish.utf8.txt 13,750.4 ns 58.26 ns 3.19 ns 14.19
SIMDUtf8ValidationRealDataAvx512 data/vietnamese.utf8.txt 24,828.3 ns 964.78 ns 52.88 ns 12.85
DotnetRuntimeUtf8ValidationRealData data/Arabic-Lipsum.utf8.txt 55,401.0 ns 21,990.47 ns 1,205.37 ns 1.47
DotnetRuntimeUtf8ValidationRealData data/Chinese-Lipsum.utf8.txt 18,919.9 ns 725.61 ns 39.77 ns 3.69
DotnetRuntimeUtf8ValidationRealData data/Emoji-Lipsum.utf8.txt 71,969.6 ns 842.15 ns 46.16 ns .91
DotnetRuntimeUtf8ValidationRealData data/Hebrew-Lipsum.utf8.txt 29,432.7 ns 1,275.79 ns 69.93 ns 2.26
DotnetRuntimeUtf8ValidationRealData data/Hindi-Lipsum.utf8.txt 41,157.2 ns 257.42 ns 14.11 ns 2.14
DotnetRuntimeUtf8ValidationRealData data/Japanese-Lipsum.utf8.txt 18,760.8 ns 484.99 ns 26.58 ns 3.61
DotnetRuntimeUtf8ValidationRealData data/Korean-Lipsum.utf8.txt 55,965.0 ns 252,823.94 ns 13,858.13 ns 1.19
DotnetRuntimeUtf8ValidationRealData data/Latin-Lipsum.utf8.txt 915.9 ns 7.12 ns 0.39 ns 94.93
DotnetRuntimeUtf8ValidationRealData data/Russian-Lipsum.utf8.txt 82,021.6 ns 23,870.83 ns 1,308.44 ns 1.28
DotnetRuntimeUtf8ValidationRealData data/arabic.utf8.txt 340,792.7 ns 19,903.61 ns 1,090.98 ns 1.57
DotnetRuntimeUtf8ValidationRealData data/chinese.utf8.txt 96,788.6 ns 237,243.54 ns 13,004.12 ns 1.87
DotnetRuntimeUtf8ValidationRealData data/czech.utf8.txt 48,871.4 ns 3,868.00 ns 212.02 ns 3.12
DotnetRuntimeUtf8ValidationRealData data/english.utf8.txt 22,857.3 ns 235.70 ns 12.92 ns 17.08
DotnetRuntimeUtf8ValidationRealData data/esperanto.utf8.txt 10,296.1 ns 139.32 ns 7.64 ns 8.45
DotnetRuntimeUtf8ValidationRealData data/french.utf8.txt 114,348.8 ns 5,024.67 ns 275.42 ns 3.91
DotnetRuntimeUtf8ValidationRealData data/german.utf8.txt 27,639.6 ns 120,013.08 ns 6,578.32 ns 7.45
DotnetRuntimeUtf8ValidationRealData data/greek.utf8.txt 87,059.7 ns 2,557.58 ns 140.19 ns 2.08
DotnetRuntimeUtf8ValidationRealData data/hebrew.utf8.txt 122,241.9 ns 1,744.12 ns 95.60 ns 1.56
DotnetRuntimeUtf8ValidationRealData data/hindi.utf8.txt 252,360.3 ns 43,712.06 ns 2,396.01 ns 1.57
DotnetRuntimeUtf8ValidationRealData data/japanese.utf8.txt 67,274.8 ns 5,856.09 ns 320.99 ns 2.44
DotnetRuntimeUtf8ValidationRealData data/korean.utf8.txt 56,054.9 ns 137,078.89 ns 7,513.76 ns 1.75
DotnetRuntimeUtf8ValidationRealData data/persan.utf8.txt 78,697.7 ns 21,130.72 ns 1,158.25 ns 1.98
DotnetRuntimeUtf8ValidationRealData data/portuguese.utf8.txt 55,472.5 ns 11,424.28 ns 626.20 ns 5.06
DotnetRuntimeUtf8ValidationRealData data/russian.utf8.txt 274,754.8 ns 15,428.52 ns 845.69 ns 1.48
DotnetRuntimeUtf8ValidationRealData data/thai.utf8.txt 190,944.7 ns 7,580.04 ns 415.49 ns 3.11
DotnetRuntimeUtf8ValidationRealData data/turkish.utf8.txt 57,592.7 ns 282,176.80 ns 15,467.06 ns 3.39
DotnetRuntimeUtf8ValidationRealData data/vietnamese.utf8.txt 283,430.4 ns 12,219.31 ns 669.78 ns 1.13
Utf8ValidationRealDataScalar data/Arabic-Lipsum.utf8.txt 63,536.4 ns 341.19 ns 18.70 ns 1.29
Utf8ValidationRealDataScalar data/Chinese-Lipsum.utf8.txt 60,855.3 ns 498.71 ns 27.34 ns 1.15
Utf8ValidationRealDataScalar data/Emoji-Lipsum.utf8.txt 56,561.4 ns 83,535.15 ns 4,578.84 ns 1.16
Utf8ValidationRealDataScalar data/Hebrew-Lipsum.utf8.txt 51,848.2 ns 1,108.96 ns 60.79 ns 1.28
Utf8ValidationRealDataScalar data/Hindi-Lipsum.utf8.txt 96,320.7 ns 42,366.84 ns 2,322.27 ns .91
Utf8ValidationRealDataScalar data/Japanese-Lipsum.utf8.txt 59,364.0 ns 16,945.89 ns 928.86 ns 1.14
Utf8ValidationRealDataScalar data/Korean-Lipsum.utf8.txt 53,720.0 ns 2,329.92 ns 127.71 ns 1.24
Utf8ValidationRealDataScalar data/Latin-Lipsum.utf8.txt 54,860.0 ns 925.25 ns 50.72 ns 1.58
Utf8ValidationRealDataScalar data/Russian-Lipsum.utf8.txt 162,837.6 ns 446,029.09 ns 24,448.36 ns .64
Utf8ValidationRealDataScalar data/arabic.utf8.txt 591,749.5 ns 46,791.13 ns 2,564.78 ns .90
Utf8ValidationRealDataScalar data/chinese.utf8.txt 195,110.4 ns 16,390.58 ns 898.42 ns .93
Utf8ValidationRealDataScalar data/czech.utf8.txt 149,503.3 ns 374.97 ns 20.55 ns 1.02
Utf8ValidationRealDataScalar data/english.utf8.txt 251,103.8 ns 7,510.90 ns 411.70 ns 1.55
Utf8ValidationRealDataScalar data/esperanto.utf8.txt 66,051.1 ns 85,049.79 ns 4,661.87 ns 1.32
Utf8ValidationRealDataScalar data/french.utf8.txt 285,057.2 ns 15,087.89 ns 827.02 ns 1.57
Utf8ValidationRealDataScalar data/german.utf8.txt 148,723.7 ns 10,200.15 ns 559.10 ns 1.38
Utf8ValidationRealDataScalar data/greek.utf8.txt 205,564.6 ns 14,293.06 ns 783.45 ns .88
Utf8ValidationRealDataScalar data/hebrew.utf8.txt 252,585.2 ns 2,724.71 ns 149.35 ns .75
Utf8ValidationRealDataScalar data/hindi.utf8.txt 463,545.7 ns 49,923.28 ns 2,736.46 ns .86
Utf8ValidationRealDataScalar data/japanese.utf8.txt 161,064.1 ns 27,864.82 ns 1,527.37 ns 1.02
Utf8ValidationRealDataScalar data/korean.utf8.txt 117,066.8 ns 558.20 ns 30.60 ns .84
Utf8ValidationRealDataScalar data/persan.utf8.txt 179,968.4 ns 5,710.02 ns 312.99 ns .87
Utf8ValidationRealDataScalar data/portuguese.utf8.txt 217,844.7 ns 647.70 ns 35.50 ns 1.29
Utf8ValidationRealDataScalar data/russian.utf8.txt 526,490.0 ns 89,822.49 ns 4,923.47 ns .77
Utf8ValidationRealDataScalar data/thai.utf8.txt 539,945.7 ns 3,131.04 ns 171.62 ns 1.10
Utf8ValidationRealDataScalar data/turkish.utf8.txt 161,740.3 ns 4,280.48 ns 234.63 ns 1.21
Utf8ValidationRealDataScalar data/vietnamese.utf8.txt 471,340.2 ns 2,275,023.21 ns 124,701.70 ns .68
SIMDUtf8ValidationRealDataSse data/Arabic-Lipsum.utf8.txt 18,899.5 ns 347.74 ns 19.06 ns 4.32
SIMDUtf8ValidationRealDataSse data/Chinese-Lipsum.utf8.txt 16,338.5 ns 163.09 ns 8.94 ns 4.27
SIMDUtf8ValidationRealDataSse data/Emoji-Lipsum.utf8.txt 14,897.3 ns 302.15 ns 16.56 ns 4.40
SIMDUtf8ValidationRealDataSse data/Hebrew-Lipsum.utf8.txt 15,394.4 ns 166.48 ns 9.13 ns 4.32
SIMDUtf8ValidationRealDataSse data/Hindi-Lipsum.utf8.txt 20,502.8 ns 502.51 ns 27.54 ns 4.29
SIMDUtf8ValidationRealDataSse data/Japanese-Lipsum.utf8.txt 15,872.0 ns 58.99 ns 3.23 ns 4.27
SIMDUtf8ValidationRealDataSse data/Korean-Lipsum.utf8.txt 15,404.0 ns 200.73 ns 11.00 ns 4.32
SIMDUtf8ValidationRealDataSse data/Latin-Lipsum.utf8.txt 1,767.3 ns 20.46 ns 1.12 ns 49.19
SIMDUtf8ValidationRealDataSse data/Russian-Lipsum.utf8.txt 24,934.2 ns 140.44 ns 7.70 ns 4.20
SIMDUtf8ValidationRealDataSse data/arabic.utf8.txt 119,518.4 ns 3,013.52 ns 165.18 ns 4.47
SIMDUtf8ValidationRealDataSse data/chinese.utf8.txt 32,132.3 ns 3,737.64 ns 204.87 ns 5.64
SIMDUtf8ValidationRealDataSse data/czech.utf8.txt 42,566.7 ns 1,832.55 ns 100.45 ns 3.59
SIMDUtf8ValidationRealDataSse data/english.utf8.txt 48,431.4 ns 227.80 ns 12.49 ns 8.06
SIMDUtf8ValidationRealDataSse data/esperanto.utf8.txt 11,393.0 ns 39.71 ns 2.18 ns 7.63
SIMDUtf8ValidationRealDataSse data/french.utf8.txt 119,575.1 ns 24,103.53 ns 1,321.20 ns 3.74
SIMDUtf8ValidationRealDataSse data/german.utf8.txt 26,453.7 ns 231.94 ns 12.71 ns 7.78
SIMDUtf8ValidationRealDataSse data/greek.utf8.txt 32,165.2 ns 9,724.94 ns 533.06 ns 5.64
SIMDUtf8ValidationRealDataSse data/hebrew.utf8.txt 41,494.2 ns 4,674.52 ns 256.23 ns 4.58
SIMDUtf8ValidationRealDataSse data/hindi.utf8.txt 82,003.0 ns 8,793.81 ns 482.02 ns 4.84
SIMDUtf8ValidationRealDataSse data/japanese.utf8.txt 30,986.7 ns 6,717.56 ns 368.21 ns 5.30
SIMDUtf8ValidationRealDataSse data/korean.utf8.txt 17,630.1 ns 518.81 ns 28.44 ns 5.55
SIMDUtf8ValidationRealDataSse data/persan.utf8.txt 27,523.1 ns 1,802.55 ns 98.80 ns 5.68
SIMDUtf8ValidationRealDataSse data/portuguese.utf8.txt 57,748.9 ns 21,180.81 ns 1,160.99 ns 4.86
SIMDUtf8ValidationRealDataSse data/russian.utf8.txt 92,969.0 ns 13,638.38 ns 747.57 ns 4.38
SIMDUtf8ValidationRealDataSse data/thai.utf8.txt 122,815.2 ns 2,025.15 ns 111.01 ns 4.83
SIMDUtf8ValidationRealDataSse data/turkish.utf8.txt 32,433.2 ns 883.04 ns 48.40 ns 6.01
SIMDUtf8ValidationRealDataSse data/vietnamese.utf8.txt 80,863.5 ns 4,768.17 ns 261.36 ns 3.95

Almost there!

lemire commented Jun 20, 2024

@Nick-Nuon Obviously, we can fold this in if it is ready. :-)

int start_point = processedLength;

// The block goes from processedLength to processedLength/16*16.
int asciibytes = 0; // number of ascii bytes in the block (could also be called n1)

Please see the other PR, we do not need asciibytes. ;-)

// important: we just update asciibytes if there was no error.
// We count the number of ascii bytes in the block using just some simple arithmetic
// and no expensive operation:
asciibytes += (int)(64 - Popcnt.X64.PopCount(mask));

We don't actually need to update asciibytes. :-)
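
For context, the arithmetic in the quoted line works because the mask has one bit per byte of the 64-byte block, set when that byte has its high bit set (non-ASCII), so `64 - popcount(mask)` is the number of ASCII bytes. A minimal scalar sketch follows. It uses the portable `BitOperations.PopCount` rather than `Popcnt.X64.PopCount`, and the class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

static class AsciiCountSketch
{
    // Builds a 64-bit mask with bit i set when block[i] is non-ASCII (high bit set).
    // A SIMD implementation would obtain this mask from a single vector operation.
    public static ulong NonAsciiMask(ReadOnlySpan<byte> block)
    {
        if (block.Length != 64) throw new ArgumentException("expected a 64-byte block");
        ulong mask = 0;
        for (int i = 0; i < 64; i++)
        {
            mask |= (ulong)(block[i] >> 7) << i;
        }
        return mask;
    }

    // Number of ASCII bytes in the block: 64 minus the number of non-ASCII bytes.
    public static int CountAscii(ReadOnlySpan<byte> block)
        => 64 - BitOperations.PopCount(NonAsciiMask(block));
}
```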

return invalidBytePointer;
}
prevIncomplete = Vector512<byte>.Zero;
}

If you look at the other PR, I do some extra work right after...

prevIncomplete = Vector512<byte>.Zero;

... because in a lot of data files, we might have streams of ASCII. So I think we need to grab the twitter.json file and run tests to see whether my optimization applies here. I suspect it does.

(It is quite simple.)
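
The other PR is not shown in this thread, so the following is only a guess at the general shape of such an optimization: once a block turns out to be pure ASCII, keep advancing through further pure-ASCII blocks in a tight loop before returning to the full validation path. The names are hypothetical and the check is scalar for clarity.

```csharp
using System;

static class AsciiStreamSketch
{
    // True when no byte in the block has its high bit set.
    private static bool IsAsciiBlock(ReadOnlySpan<byte> block)
    {
        int acc = 0;
        foreach (byte b in block) acc |= b;
        return (acc & 0x80) == 0;
    }

    // Starting at `offset`, skips consecutive all-ASCII 64-byte blocks.
    // Returns the offset of the first block that needs full validation,
    // or the offset at which fewer than 64 bytes remain.
    public static int SkipAsciiBlocks(ReadOnlySpan<byte> data, int offset)
    {
        while (offset + 64 <= data.Length && IsAsciiBlock(data.Slice(offset, 64)))
        {
            offset += 64;
        }
        return offset;
    }
}
```

Whether this pays off depends on the data: files such as twitter.json, with long ASCII runs, are where it would be expected to help.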

src/UTF8.cs (outdated)

// We skip any ASCII characters at the start of the buffer
int asciirun = 0;
for (; asciirun + 128 <= inputLength; asciirun += 128)

Here you unroll in blocks of 128 bytes which might be the right choice, but we would like to run some checks. We should be certain that 128 bytes is better than 64 bytes.
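
One way to run that check is to make the block size a parameter and benchmark both settings on the same files. The sketch below is a scalar stand-in for the loop in the diff, with hypothetical names; it ORs 8-byte words together and tests the high bits once per block.

```csharp
using System;
using System.Buffers.Binary;

static class AsciiPrefixSketch
{
    // Returns the length of the leading ASCII run, measured in whole blocks of
    // `blockSize` bytes (blockSize must be a positive multiple of 8).
    public static int AsciiPrefixLength(ReadOnlySpan<byte> input, int blockSize)
    {
        if (blockSize <= 0 || blockSize % 8 != 0)
            throw new ArgumentException("blockSize must be a positive multiple of 8");

        int asciirun = 0;
        for (; asciirun + blockSize <= input.Length; asciirun += blockSize)
        {
            // OR all 8-byte words of the block, then test every high bit at once.
            ulong acc = 0;
            for (int i = 0; i < blockSize; i += 8)
            {
                acc |= BinaryPrimitives.ReadUInt64LittleEndian(input.Slice(asciirun + i, 8));
            }
            if ((acc & 0x8080808080808080UL) != 0) break;
        }
        return asciirun;
    }
}
```

The trade-off is visible even without timing: a larger block means fewer branches per byte on long ASCII runs, but a single non-ASCII byte discards the whole block, so with a non-ASCII byte at offset 100 a 64-byte block skips 64 bytes while a 128-byte block skips none.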

@lemire lemire left a comment

Wow. I was giving up on the AVX-512 because I was concerned we would need to mess around and we would get bad results, but we do not. The results are actually quite good.

Very good work. I am impressed !!!

Ok. Now we have a good problem in that we have two PRs to merge. But that's not overly difficult.

lemire commented Jun 20, 2024

❤️

lemire commented Jun 20, 2024

I am merging your PR. Let us rebase the other one.

@lemire lemire merged commit 92da59a into main Jun 20, 2024
6 checks passed
Nick-Nuon commented

> Wow. I was giving up on the AVX-512 because I was concerned we would need to mess around and we would get bad results, but we do not. The results are actually quite good.
>
> Very good work. I am impressed !!!
>
> Ok. Now we have a good problem in that we have two PRs to merge. But that's not overly difficult.

Glad to be of service! ^_^
