Optimize memory access #67
base: main
…e anywhere anymore
… the cpu-only internals)
…er anymore - Fix typo
…ent clean_install
Some benchmark numbers:
On an NVIDIA RTX4060 GPU
On an Apple M2 Pro CPU (12 threads, ARM)
On barnacle2 b2-006 node, which has 2x AMD EPYC 7302 CPUs:
A single 16-core AMD EPYC 7302 CPU (32 threads, AVX2)
Using both 16-core AMD EPYC 7302 CPUs (64 threads total, AVX2)
The weighted method benefits drastically from a transposed access pattern in CPU mode, due to the shorter vector length and finer-grained logic.
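To illustrate what a transposed access pattern means here, below is a minimal sketch, not the code from this PR. It assumes a flat embedding buffer and a weighted sum of the form `length[k] * |emb_i[k] - emb_j[k]|`. All function names and the buffer layout are hypothetical. In the branch-major layout, consecutive iterations of the inner loop are `n_samples` elements apart. In the transposed (sample-major) layout they are adjacent in memory, which tends to suit short CPU vector units better.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical branch-major layout: emb[k * n_samples + i].
// The inner loop over branches k walks memory with stride n_samples.
double weighted_pair_branch_major(const std::vector<double>& emb,
                                  const std::vector<double>& lengths,
                                  std::size_t n_samples,
                                  std::size_t i, std::size_t j) {
    double sum = 0.0;
    for (std::size_t k = 0; k < lengths.size(); ++k) {
        sum += lengths[k] *
               std::fabs(emb[k * n_samples + i] - emb[k * n_samples + j]);
    }
    return sum;
}

// Hypothetical transposed (sample-major) layout: emb_t[i * n_branches + k].
// The inner loop over branches k reads two contiguous rows.
double weighted_pair_sample_major(const std::vector<double>& emb_t,
                                  const std::vector<double>& lengths,
                                  std::size_t i, std::size_t j) {
    const std::size_t n_branches = lengths.size();
    const double* row_i = emb_t.data() + i * n_branches;
    const double* row_j = emb_t.data() + j * n_branches;
    double sum = 0.0;
    for (std::size_t k = 0; k < n_branches; ++k) {
        sum += lengths[k] * std::fabs(row_i[k] - row_j[k]);
    }
    return sum;
}

// Convert the branch-major buffer into the sample-major one.
std::vector<double> transpose_embedding(const std::vector<double>& emb,
                                        std::size_t n_branches,
                                        std::size_t n_samples) {
    assert(emb.size() == n_branches * n_samples);
    std::vector<double> emb_t(emb.size());
    for (std::size_t k = 0; k < n_branches; ++k) {
        for (std::size_t i = 0; i < n_samples; ++i) {
            emb_t[i * n_branches + k] = emb[k * n_samples + i];
        }
    }
    return emb_t;
}
```

Both functions compute the same value. Only the memory stride of the inner loop differs, so any speed difference comes from cache behaviour and vectorization rather than from the arithmetic.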
Also optimized the CPU vs GPU sizes a bit.
Newer NVCC also seems to optimize better without explicit vector_size in OpenACC.