Optimize memory access #67

sfiligoi · 2025-02-12T02:39:12Z

The weighted method benefits drastically from a transposed access pattern in CPU mode, due to the shorter vector length and finer-grained logic.
Also tuned the CPU vs GPU emb sizes a bit.
Newer NVCC also seems to optimize better without an explicit vector_length in OpenACC.
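
To illustrate what "transposed access pattern" means here, below is a minimal sketch, not the code in this PR. It assumes the embedded proportions are stored as a flat array indexed by embedding and sample, and that the weighted distance for a sample pair sums `|u_k - u_l| * length` over embeddings. All names (`weighted_straight`, `weighted_transposed`, `transpose`) are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: names and layouts are assumptions,
// not the actual internals touched by this PR.

// Straight layout: emb[e * n_samples + s].
// For a fixed sample pair (k, l), the loop over e strides by n_samples.
double weighted_straight(const std::vector<double>& emb,
                         const std::vector<double>& lengths,
                         std::size_t n_samples,
                         std::size_t k, std::size_t l) {
    const std::size_t n_embs = lengths.size();
    double sum = 0.0;
    for (std::size_t e = 0; e < n_embs; ++e) {
        const double diff = emb[e * n_samples + k] - emb[e * n_samples + l];
        sum += std::fabs(diff) * lengths[e];
    }
    return sum;
}

// Convert the straight layout into the transposed one:
// emb_t[s * n_embs + e] = emb[e * n_samples + s].
std::vector<double> transpose(const std::vector<double>& emb,
                              std::size_t n_embs, std::size_t n_samples) {
    std::vector<double> emb_t(emb.size());
    for (std::size_t e = 0; e < n_embs; ++e)
        for (std::size_t s = 0; s < n_samples; ++s)
            emb_t[s * n_embs + e] = emb[e * n_samples + s];
    return emb_t;
}

// Transposed layout: emb_t[s * n_embs + e].
// For a fixed sample pair (k, l), the loop over e reads two contiguous
// rows, which is friendlier to CPU caches and SIMD units.
double weighted_transposed(const std::vector<double>& emb_t,
                           const std::vector<double>& lengths,
                           std::size_t k, std::size_t l) {
    const std::size_t n_embs = lengths.size();
    const double* row_k = &emb_t[k * n_embs];
    const double* row_l = &emb_t[l * n_embs];
    double sum = 0.0;
    for (std::size_t e = 0; e < n_embs; ++e) {
        sum += std::fabs(row_k[e] - row_l[e]) * lengths[e];
    }
    return sum;
}
```

Both functions return the same value; only the memory stride of the inner loop differs. On a GPU the straight layout can remain preferable, since neighboring threads typically work on neighboring samples, which would be consistent with keeping the old logic there.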

sfiligoi · 2025-02-12T15:26:57Z

Some benchmark numbers:
On an 8-core AMD Ryzen 9 7940HS (16 threads, using AVX512)
EMP weighted normalized times went from 822s to 273s (vs 682s in v1.4)
EMP unweighted times went from 155s to 139s (vs 190s in v1.4)

On an NVIDIA RTX4060 GPU
EMP weighted normalized times went from 247s to 173s
EMP unweighted times went from 41s to 37s

sfiligoi · 2025-02-12T17:57:18Z

On an Apple M2 Pro CPU (12 threads, ARM)
EMP weighted normalized times went from 690s to 331s
EMP unweighted times went from 164s to 161s

sfiligoi · 2025-02-15T00:29:55Z

On the barnacle2 b2-006 node, which has 2x AMD EPYC 7302 CPUs

A single 16-core AMD EPYC 7302 CPU (32 threads, AVX2)
EMP weighted normalized times went from 602s to 200s (vs 507s in v1.4)
EMP unweighted times went from 120s to 112s (vs 147s in v1.4)

Using both 16-core AMD EPYC 7302 CPUs (64 threads total, AVX2)
EMP weighted normalized times went from 341s to 116s (vs 291s in v1.4)
EMP unweighted times went from 65s to 63s (vs 80s in v1.4)
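
For quick comparison, the speedups implied by the weighted timings in the comments above can be computed with a small helper. This is just a convenience sketch: the `Timing` struct and function names are hypothetical, and the numbers are copied from the comments above.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper for summarizing the reported timings.
struct Timing {
    const char* label;   // platform description
    double before_s;     // EMP weighted normalized time before this PR, seconds
    double after_s;      // EMP weighted normalized time with this PR, seconds
};

// Speedup factor: values above 1.0 mean the new code is faster.
inline double speedup(const Timing& t) { return t.before_s / t.after_s; }

// EMP weighted normalized timings as reported in this conversation.
static const Timing kWeighted[] = {
    {"AMD Ryzen 9 7940HS (16 threads)", 822.0, 273.0},
    {"NVIDIA RTX4060 GPU",              247.0, 173.0},
    {"Apple M2 Pro (12 threads)",       690.0, 331.0},
    {"1x AMD EPYC 7302 (32 threads)",   602.0, 200.0},
    {"2x AMD EPYC 7302 (64 threads)",   341.0, 116.0},
};

static const std::size_t kNumWeighted = sizeof(kWeighted) / sizeof(kWeighted[0]);

// Print one line per platform with the speedup factor.
void print_weighted_speedups() {
    for (std::size_t i = 0; i < kNumWeighted; ++i) {
        std::printf("%-34s %6.0fs -> %6.0fs  (%.2fx)\n",
                    kWeighted[i].label, kWeighted[i].before_s,
                    kWeighted[i].after_s, speedup(kWeighted[i]));
    }
}
```

Calling `print_weighted_speedups()` from `main` prints the per-platform factors: roughly 3x on the x86 CPUs, and smaller gains on the ARM CPU and the GPU.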

sfiligoi and others added 30 commits January 23, 2025 14:38

Add su::biom::load_n_samples and use in su.cpp

3c1b044

Remove unnecessary header includes

6838090

Expose the partial functions in libssu

1efade6

Make libssu trully thread-safe

897ce3e

Make ssu and faithpd use the shared library

817b7d7

Add missing header include

518c319

Use explicit object list in rapi_test dependency

b45d9dc

Export dl_read_partial and properly clean between rebuilds

1f65ef6

Use shared library in test_api. Deprecate merge_partial, since not use anywhere anymore

ef32989

test_ska does not use the unifrac object files

f9a2aaf

Create an intermediate archive file to avoid dependencies for functions we do not use

2d106e4

Add API-based unifrac tests in test_su

a1a2c5b

Add API-based faithpd tests in test_su

b45a9c7

Add test_su_api to exercise the optimiza code paths (test_su exercize the cpu-only internals)

ddceeee

Handle NOGPU at top-level Makefile, so nv not required

e53236e

Add support for parallel make

ba82d84

Add clean_install make option

267d158

Fix test_su_api checks

7c9c6ac

Add cpu-only builds

a74e67c

Explicitly create all variants of unifrac_task_noclass

13f7026

Use internally a C API layer for the accelerated code

b1b1f69

Fix Makefile for acc

fc1c73d

Explicity use h5c++ for everything but acc code. Use CXX for acc code

df97366

Modularize ld loading in src/ssu_ld.c

aac7d13

Move handling of UNIFRAC_CPU_INFO into ssu_ld.c

bec652e

Put gpu code into its own shared library. No need for pgi h5c++ wrapper anymore

042ad3f

Put gpu code into its own shared library. No need for pgi h5c++ wrapper anymore - Fix typo

5a0ce0c

Add symlink unifrac_task.cpp to make R happy

0fab9aa

Do not check for /proc/driver/nvidia/gpus, not needed

c65289b

Make loading of GPU shared library optional during initial test

2ff25ef

sfiligoi added 26 commits February 3, 2025 12:08

Always build acc_nv on Linux

e58031d

Add support for AMD GPU compilation

c8dedf1

Add install_amd_clang.sh and use in CI

3b571f1

Make BUILD_NV_OFFLOAD and BUILD_AMD_OFFLOAD optional

5e521bd

Fix typo

7e0dfc6

Clarify which GPU is used

ba0f00a

Properly build test_binaries last on Linux

e4a1ae1

Use curl instead of wget

e3684ee

Enable curl link following, and use apt install

f529184

Better handle failure in installing gpu compilers

408309a

Update README with AMD complier instructions

adf33aa

Improve installation instructions in README

23b542a

Add uprivileged install fallback in scripts/install_amd_clang.sh

155853d

Add support for UNIFRAC_USE_NVIDIA_GPU and UNIFRAC_USE_AMD_GPU. Document clean_install

664a38c

Fix typo

3cb1977

Update NV setup script to download the latest version

25beb33

Improve build docs

99dd51b

Remove popcnt-based selection

9389792

Transpose embedded_proportions for non-Vaw and non-binary variants

0f87955

Tune emb size for new access pattern

58c97c2

Fix preprocesssor check

f5407d9

NVCC happier when one does not specify vector_length explicitly

e84beae

Explicitly request vectorization in WeightedVal1

e158991

Use the (old) straight logic for the GPUs

432a28e

Tune RECOMMENDED_MAX_EMBS_BOOL for CPUs

64ff847

NVCC happier when one does not specify vector_length explicitly

82c9345

sfiligoi marked this pull request as draft February 12, 2025 02:39
