Prebuilt binary with PGO here #141

Open
kkocdko opened this issue May 24, 2024 · 8 comments

Comments

@kkocdko

kkocdko commented May 24, 2024

Update 2024-09-02: use this newer version, then run ./zcodecs ect xxx.

Profile-Guided Optimizations enabled.

[kkocdko@klf misc]$ ./hyperfine -w 1 -r 5 './ect_flto -5 1.png 2.png 3.png '
Benchmark 1: ./ect_flto -5 1.png 2.png 3.png 
  Time (mean ± σ):      5.400 s ±  0.011 s    [User: 5.351 s, System: 0.035 s]
  Range (min … max):    5.389 s …  5.411 s    5 runs
 
[kkocdko@klf misc]$ ./hyperfine -w 1 -r 5 './ect_flto_pgo -5 1.png 2.png 3.png '
Benchmark 1: ./ect_flto_pgo -5 1.png 2.png 3.png 
  Time (mean ± σ):      4.481 s ±  0.014 s    [User: 4.428 s, System: 0.042 s]
  Range (min … max):    4.469 s …  4.503 s    5 runs
 
[kkocdko@klf misc]$ 

The real result depends on your workload.
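
As a rough check on the two runs above, the relative speedup and its uncertainty can be worked out from the reported means and standard deviations. The sketch below uses first-order error propagation for a ratio; the function name `speedup` is only illustrative.

```python
import math

def speedup(base_mean, base_sd, new_mean, new_sd):
    """Return (ratio, sd) for base/new using first-order error propagation."""
    ratio = base_mean / new_mean
    rel = math.sqrt((base_sd / base_mean) ** 2 + (new_sd / new_mean) ** 2)
    return ratio, ratio * rel

# Means and standard deviations copied from the hyperfine output above.
ratio, sd = speedup(5.400, 0.011, 4.481, 0.014)
saved = 1 - 4.481 / 5.400
print(f"{ratio:.2f} +/- {sd:.2f} times faster, {saved:.1%} less wall time")
```

For these numbers that works out to about 1.21 times faster, or roughly 17% less wall time at -5.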

#120

@ghtm2

ghtm2 commented Aug 24, 2024

For anyone who might stumble upon this:
Adding an x86-64 microarchitecture level can squeeze out some more performance, depending on the compression level and hardware capabilities.

Benchmark 1 = plain build
Benchmark 2 = the binary linked above
Benchmark 3 = build with LTO, PGO, and the x86-64-v3 level
Benchmark 4 = build with LTO, PGO, and the x86-64-v4 level

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      3.139 s ±  0.014 s    [User: 3.102 s, System: 0.031 s]
  Range (min … max):    3.120 s …  3.159 s    5 runs
 
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.911 s ±  0.013 s    [User: 2.878 s, System: 0.026 s]
  Range (min … max):    2.895 s …  2.926 s    5 runs
 
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.865 s ±  0.005 s    [User: 2.828 s, System: 0.030 s]
  Range (min … max):    2.858 s …  2.871 s    5 runs
 
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      2.880 s ±  0.002 s    [User: 2.843 s, System: 0.030 s]
  Range (min … max):    2.878 s …  2.882 s    5 runs
 
Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 /tmp/tst; rm -rf /tmp/tst ran
    1.01 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 /tmp/tst; rm -rf /tmp/tst
    1.02 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert /tmp/tst; rm -rf /tmp/tst
    1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect /tmp/tst; rm -rf /tmp/tst

At default settings the difference is negligible; if that is all you use, don't bother.

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      6.439 s ±  0.096 s    [User: 6.389 s, System: 0.037 s]
  Range (min … max):    6.334 s …  6.548 s    5 runs
 
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      5.213 s ±  0.017 s    [User: 5.164 s, System: 0.037 s]
  Range (min … max):    5.193 s …  5.230 s    5 runs
 
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      4.358 s ±  0.016 s    [User: 4.307 s, System: 0.040 s]
  Range (min … max):    4.340 s …  4.379 s    5 runs
 
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):      4.258 s ±  0.010 s    [User: 4.208 s, System: 0.040 s]
  Range (min … max):    4.251 s …  4.276 s    5 runs
 
Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -5 /tmp/tst; rm -rf /tmp/tst ran
    1.02 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -5 /tmp/tst; rm -rf /tmp/tst
    1.22 ± 0.00 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -5 /tmp/tst; rm -rf /tmp/tst
    1.51 ± 0.02 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -5 /tmp/tst; rm -rf /tmp/tst

At -5, the microarchitecture level does almost as much as adding PGO did.

Benchmark 1: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     65.767 s ±  0.164 s    [User: 65.578 s, System: 0.052 s]
  Range (min … max):   65.602 s … 66.035 s    5 runs
 
Benchmark 2: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     43.676 s ±  0.030 s    [User: 43.521 s, System: 0.052 s]
  Range (min … max):   43.637 s … 43.711 s    5 runs
 
Benchmark 3: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     27.658 s ±  0.162 s    [User: 27.531 s, System: 0.056 s]
  Range (min … max):   27.488 s … 27.927 s    5 runs
 
Benchmark 4: mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst
  Time (mean ± σ):     25.154 s ±  0.079 s    [User: 25.034 s, System: 0.054 s]
  Range (min … max):   25.058 s … 25.256 s    5 runs
 
Summary
  mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v4 -9 /tmp/tst; rm -rf /tmp/tst ran
    1.10 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_v3 -9 /tmp/tst; rm -rf /tmp/tst
    1.74 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect_clevert -9 /tmp/tst; rm -rf /tmp/tst
    2.61 ± 0.01 times faster than mkdir -p /tmp/tst; cp *.png /tmp/tst/; ./ect -9 /tmp/tst; rm -rf /tmp/tst

Quite a bump: at -9 the v3 build shaves off at least 16 seconds compared to the binary linked above, and both leveled builds more than halve the time of the plain build.
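
To see how much each step contributes at each compression level, the reported means can be split into the time saved going from the plain build to the binary linked above, and from that binary to the v3 build. This is only arithmetic on the numbers in this comment. The second step may also include other differences between the two builds (such as the compiler used), not just the microarchitecture level.

```python
# Mean wall times in seconds, copied from the benchmarks above:
# (plain, binary linked above, v3 build, v4 build)
means = {
    "default": (3.139, 2.911, 2.865, 2.880),
    "-5": (6.439, 5.213, 4.358, 4.258),
    "-9": (65.767, 43.676, 27.658, 25.154),
}

def step_savings(plain, linked, v3, v4):
    """Seconds saved by each step: plain -> linked binary, linked binary -> v3."""
    return plain - linked, linked - v3

for level, times in means.items():
    first, second = step_savings(*times)
    # How many times longer the plain build takes than the fastest leveled build.
    factor = times[0] / min(times[2], times[3])
    print(f"{level:>7}: saves {first:.3f} s, then {second:.3f} s "
          f"(plain build takes {factor:.2f}x as long as the best leveled build)")
```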

@kkocdko
Author

kkocdko commented Aug 25, 2024

@ghtm2 Could you provide your binary? Back then, I tested 256-bit AVX and AVX-512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use for your benchmark?

@kkocdko
Author

kkocdko commented Aug 27, 2024

@ghtm2 Hi, did you have nasm installed while building the binary?

@ghtm2

ghtm2 commented Aug 31, 2024

@ghtm2 Could you provide your binary? Back then, I tested 256-bit AVX and AVX-512 builds, but they ran even slower on my machine (AMD R5 5600U, Zen 3). If enabling AVX makes it faster, that's quite a big bump! Also, which CPU did you use for your benchmark?

Sure, here are the v3 and v4 binaries: ect.tar.gz
You'll need at least glibc 2.38 installed though.
The CPU used is an AMD Ryzen 7 7840U, so Zen 4.
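
Which level a given machine supports depends on its CPU features (Zen 3, for example, has AVX2 but no AVX-512, while Zen 4 has both). As a minimal sketch, assuming Linux on x86-64 where /proc/cpuinfo has a "flags" line, the highest supported level can be estimated from the feature flags. The flag sets follow the x86-64 psABI level definitions using the kernel's flag names (for example "pni" for SSE3 and "abm" for LZCNT); the function names are illustrative, and the baseline level is assumed rather than checked.

```python
from pathlib import Path

# Flags each level adds on top of the previous one, as named in /proc/cpuinfo.
LEVELS = [
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]

def highest_level(flags):
    """Return the highest level whose flags (and all lower levels' flags) are present."""
    level = "x86-64"  # baseline is assumed, not verified
    for name, required in LEVELS:
        if not required <= flags:
            break
        level = name
    return level

def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the feature flags of the first CPU listed, or an empty set."""
    cpuinfo = Path(path)
    if not cpuinfo.exists():
        return set()
    for line in cpuinfo.read_text().splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

if __name__ == "__main__":
    print(highest_level(read_cpu_flags()))
```

A binary built for a level above what this reports is not expected to run correctly on that machine.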

@ghtm2 Hi, did you have nasm installed while building the binary?

Yes.

@kkocdko
Author

kkocdko commented Aug 31, 2024

@ghtm2 Awesome! Your binary is much faster. How did you do that? I appended -march=x86-64-v3 -mavx2 here, but the result is even slower: my benchmark went from 48s to 1m27s, while your ect_v3 binary takes 26s.

if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang"
    OR CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang" OR CMAKE_CXX_COMPILER_ID STREQUAL "ARMClang")
  if(CPU_TYPE STREQUAL "x86_64" OR CPU_TYPE STREQUAL "i386")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mpclmul -msse4.2")
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -mpclmul -msse4.2")

My whole build script is below. I built with LLVM 19; did you use GCC?

https://github.com/clevert-app/clevert/blob/main/.github/workflows/asset_zcodecs.yml#L171

I really, really want to replicate your success.

@kkocdko
Author

kkocdko commented Aug 31, 2024

I ran objdump on your binary. Was it built with GCC 14.2.1?

@kkocdko
Author

kkocdko commented Sep 1, 2024

I reproduced your benchmark. It's faster using GCC instead of Clang. I will try to tweak it more. Thank you!

@ghtm2

ghtm2 commented Sep 2, 2024

Sorry for the glacial response times, I'm quite busy at the moment.

Yes, I've built it with GCC 14.2.1, as that is what's currently shipped on Arch.
I can also confirm that Clang produces noticeably slower ect binaries, no matter the flags.

I've made a small howto to reproduce the build for Arch and derivatives: howto.tar.gz

I'm pretty sure that there is still some performance to be had with the appropriate flags and better input for PGO.
One might also want to try to optimize further with BOLT, but I currently don't have the time to try.
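
For readers who have not done a PGO build before, the general GCC cycle is: build with instrumentation, run a representative workload, then rebuild using the collected profile. The sketch below only assembles and prints the commands; it is not the contents of the howto above. The source directory, build directory, binary path, and training image are placeholders, while `-flto`, `-march=`, `-fprofile-generate`, `-fprofile-use`, and `-fprofile-correction` are standard GCC options.

```python
import shlex

def pgo_build_steps(source_dir, build_dir, march, training_cmd):
    """Return the command lines for a generic GCC PGO cycle driven through CMake."""
    def configure(profile_flags):
        flags = f"-flto -march={march} {profile_flags}"
        return [
            "cmake", "-S", source_dir, "-B", build_dir,
            "-DCMAKE_BUILD_TYPE=Release",
            "-DCMAKE_C_COMPILER=gcc", "-DCMAKE_CXX_COMPILER=g++",
            f"-DCMAKE_C_FLAGS={flags}", f"-DCMAKE_CXX_FLAGS={flags}",
        ]
    # Rebuilding in the same build directory keeps object paths stable, so the
    # second pass can find the profile data written next to the objects.
    build = ["cmake", "--build", build_dir, "--clean-first"]
    return [
        configure("-fprofile-generate"),                  # 1. instrumented build
        build,
        list(training_cmd),                               # 2. run a training workload
        configure("-fprofile-use -fprofile-correction"),  # 3. optimized rebuild
        build,
    ]

# Placeholder paths and inputs (assumptions, not taken from the thread).
steps = pgo_build_steps("src", "build", "x86-64-v3", ["build/ect", "-9", "sample.png"])
for step in steps:
    print(shlex.join(step))
```

The training workload matters: the profile only reflects the inputs and options it was collected with, so it should resemble the files and compression levels used in practice.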
