Speed is much slower in 3.7.5 with icpx than in 3.6.5 with icpc for both PBE and EXX calculations #5103

Closed · 9 tasks · xdzhu opened this issue Sep 14, 2024 · 13 comments
Labels: Performance


xdzhu commented Sep 14, 2024

Recently, I performed an SOC + EXX calculation. You can check the INPUT and output files in
hse-3.6vs3.7-lowerspeed.zip

With version 3.6.5, the speed is acceptable: each PBE step costs 13 s and each EXX step costs 178 s, although the PBE steps in between the EXX loops are somewhat slower.
[screenshot: 3.6.5 timing output]

With version 3.7.5, the speed drops badly: each PBE step costs 43 s and each EXX step costs 270 s, i.e. more than 3x slower for PBE and about 1.5x slower for EXX than in 3.6.5 above.
[screenshot: 3.7.5 timing output]

Task list for Issue attackers (only for developers)

  • Reproduce the performance issue on a similar system or environment.
  • Identify the specific section of the code causing the performance issue.
  • Investigate the issue and determine the root cause.
  • Research best practices and potential solutions for the identified performance issue.
  • Implement the chosen solution to address the performance issue.
  • Test the implemented solution to ensure it improves performance without introducing new issues.
  • Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • Review and incorporate any relevant feedback from users or developers.
  • Merge the improved solution into the main codebase and notify the issue reporter.
xdzhu added the Performance label Sep 14, 2024
xdzhu (Author) commented Sep 14, 2024

Even in the nspin=1 case, 3.7.5 shows a large speed regression compared with 3.6.5.

As you can see, 3.7.5 gives:
[screenshot: 3.7.5 timing output]

while 3.6.5 gives:
[screenshot: 3.6.5 timing output]

xdzhu (Author) commented Sep 14, 2024

When I set ks_solver to scalapack_gvx instead of genelpa, the slowdown remains:

3.7.5:
[screenshot: timing output]

3.6.5:
[screenshot: timing output]
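
For reference, the solver switch described above is made in the ABACUS INPUT file. A minimal excerpt (all other parameters omitted) would look like:

INPUT_PARAMETERS
ks_solver    scalapack_gvx   # previously: genelpa

Since both eigensolvers show the same slowdown, the regression is unlikely to sit in the diagonalization backend itself.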

QuantumMisaka (Collaborator) commented

@xdzhu What are your ABACUS installation dependencies?

xdzhu (Author) commented Sep 14, 2024

I compared the time costs of these two versions. The extra time seems to arise from the ESolver_KS_LCAO runner and HSolverLCAO solve modules.

3.7.5与3.6.5时间对比测试.xlsx (timing comparison of 3.7.5 vs. 3.6.5)

[screenshot: timer comparison table]
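
Those module names come from the time statistics table that ABACUS prints at the end of its running log. One way to pull the same table from both runs for a side-by-side comparison (the default output path OUT.ABACUS/running_scf.log and the exact header string are assumptions that may vary by version and suffix):

grep -A 30 'TIME STATISTICS' OUT.ABACUS/running_scf.log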

xdzhu (Author) commented Sep 14, 2024

@xdzhu What are your ABACUS installation dependencies?

Both versions were built with Intel oneAPI 2023.1.0 and GCC 13.1.0.

3.6.5 with LibRI_0.1.0_loop3
3.7.5 with LibRI_0.2.0

xdzhu (Author) commented Sep 15, 2024

I have noticed that for the 3.7.x version I used the icpx and mpicxx compilers, instead of the icpc and mpiicpc that I used to compile 3.6.5.

When I change CXX and MPI_CXX to icpc and mpiicpc and recompile 3.7.5, it runs faster than the icpx build, and the performance is nearly the same as 3.6.5 (a rebuild sketch follows the screenshots below).

3.7.5 with icpc:
[screenshot: timing output]

3.7.5 with icpx:
[screenshot: timing output]

3.6.5 with icpc:
[screenshot: timing output]
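
A minimal sketch of such a recompilation, assuming the standard CMake workflow (machine-specific ABACUS configure options are omitted; CMAKE_CXX_COMPILER and MPI_CXX_COMPILER are standard CMake/FindMPI variables):

# remove the stale build tree so no icpx-compiled objects are reused
rm -rf build
# configure with the classic Intel compilers, then rebuild
cmake -B build -DCMAKE_CXX_COMPILER=icpc -DMPI_CXX_COMPILER=mpiicpc
cmake --build build -j 8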

QuantumMisaka (Collaborator) commented

@xdzhu What is your hardware setup?

xdzhu (Author) commented Sep 16, 2024

@QuantumMisaka The compute node has Intel(R) Xeon(R) Gold 6248 CPUs @ 2.50 GHz (2x20C, 40 cores in total), and I run ABACUS with the following command:
mpirun -np 10 -genv OMP_THREADS_NUM=4 abacus
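
Note that the standard OpenMP environment variable is OMP_NUM_THREADS; OMP_THREADS_NUM as written above is not recognized by OpenMP runtimes, so each rank may not actually be limited to 4 threads. A corrected launch line would be:

mpirun -np 10 -genv OMP_NUM_THREADS=4 abacus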

xdzhu changed the title from "Speed is much slower in version 3.7.x than in 3.6.5 for both PBE and EXX calculations" to "Speed is much slower in 3.7.5 with icpx than in 3.6.5 with icpc for both PBE and EXX calculations" Sep 16, 2024
WHUweiqingzhou (Collaborator) commented

Could you:

  1. delete the old ./build directory before you build a new one, and
  2. run some tests with OMP_NUM_THREADS=1? (A sketch of both checks follows below.)
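
A minimal sketch of those two checks, assuming a CMake build (toolchain-specific configure options go on the cmake line):

# 1. clean rebuild: delete the old build tree first
rm -rf build
cmake -B build
cmake --build build -j 8

# 2. single-thread run to rule out OpenMP threading effects
OMP_NUM_THREADS=1 mpirun -np 10 abacus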

jinzx10 (Collaborator) commented Sep 26, 2024

If the results are correct and the only issue is performance, then according to this official guide, one possible reason listed in its "Performance" section is that "-O3" alone no longer suffices to enable advanced loop optimization and vectorization under icpx; "-xhost" might be necessary. Do we have any benchmark on this compiler flag? @caic99
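
For example, the flag could be tried through the CMake build as follows (a sketch; CMAKE_CXX_FLAGS is a standard CMake variable, and the flag is spelled -xHost in Intel's documentation):

# rebuild with icpx plus host-specific vectorization enabled
cmake -B build -DCMAKE_CXX_COMPILER=icpx -DCMAKE_CXX_FLAGS="-O3 -xHost"
cmake --build build -j 8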

jinzx10 (Collaborator) commented Sep 26, 2024

The test case in the zip file takes a very long time... do you have a smaller example with the same issue? @xdzhu

caic99 (Member) commented Sep 26, 2024

"-xhost" might be necessary. Do we have any benchmark on this compiler flag?

@jinzx10 I've tested it on a previous version of ABACUS, and it does not help much (about -1% in time), since the heavy lifting happens inside the math libraries (here MKL and ELPA).
I would suggest we first align the versions of the compilers and their dependencies, and then run further tests in an up-to-date environment.

jinzx10 (Collaborator) commented Sep 26, 2024

I have another concern about the compilers: "mpicxx" might be a wrapper around g++, while the wrapper for icpx should be "mpiicpx". On my local PC (WSL2, Ubuntu 22.04), where the Intel compilers are installed via apt, mpicxx is clearly a wrapper around g++, as shown below:

zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ which mpicxx
/opt/intel/oneapi/mpi/2021.13/bin/mpicxx
zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ mpicxx -v
mpigxx for the Intel(R) MPI Library 2021.13 for Linux*
Copyright Intel Corporation.
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.4.0-1ubuntu1~22.04' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-XeT9lY/gcc-11-11.4.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

while mpiicpx is clearly different from mpicxx:

zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ which mpiicpx
/opt/intel/oneapi/mpi/2021.13/bin/mpiicpx
zuxin@legion:/opt/intel/oneapi/mpi/2021.13/bin$ mpiicpx -v
mpiicpx for the Intel(R) MPI Library @IMPI_OFFICIALVERSION@ for Linux*
Copyright Intel Corporation.
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.4 (2023.2.4.20240127)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.4/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.4/linux/bin-llvm/../bin/icpx.cfg
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda, version
icpx: warning: argument unused during compilation: '-I /opt/intel/oneapi/mpi/2021.13/include' [-Wunused-command-line-argument]

I think it might be worth trying mpiicpx instead of mpicxx.
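
A quick way to confirm which underlying compiler a given Intel MPI wrapper drives, without compiling anything (-show is the standard Intel MPI wrapper option that prints the underlying command line without executing it):

mpicxx -show    # should print a g++ command line here
mpiicpx -show   # should print an icpx command line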
