Skip to content

CPUArch dispatching and unified shared library #113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 61 commits into
base: main
Choose a base branch
from

Conversation

Alexsandruss
Copy link
Contributor

@Alexsandruss Alexsandruss commented Apr 23, 2025

Changes:

  • CPUID (x86_64), MSR (Linux-ARM) and brand string (MacOS-ARM) readers to get supported instructions
  • MicroArchEnvironment singleton to get optimal supported arch for the machine
  • Dispatching macros to instantiate required template class instances and dispatch to optimal branch
  • CMake config changes to enable dispatcher for all build targets
  • Rework of python binding for compilation of unified shared library

Current limitations:

  • x86_64 architectures are limited to server platforms listed in archspec
  • ARM support has basic implementation only (2 targets with simple checks per Linux and MacOS)

@napetrov napetrov requested review from ahuber21, ethanglaser, mihaic, dian-lun-lin and ibhati and removed request for ahuber21 and ethanglaser April 23, 2025 19:32
@napetrov
Copy link
Contributor

/intelci

Copy link
Contributor

@ibhati ibhati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Inner product and L2 distances as well. Thanks!

@ibhati
Copy link
Contributor

ibhati commented Apr 23, 2025

/intelci

@Alexsandruss Alexsandruss marked this pull request as ready for review April 24, 2025 14:18
@dian-lun-lin
Copy link
Contributor

Thank you so much for the effort! How do we test if the correct supported instructions are called? Will we have a pipeline to test different architectures?

@Alexsandruss
Copy link
Contributor Author

Thank you so much for the effort! How do we test if the correct supported instructions are called? Will we have a pipeline to test different architectures?

I'm planning to write a testing script which disassembles dispatched functions and validates usage of correct instructions.

# "icelake_server",
"sapphirerapids",
# "graniterapids",
# "graniterapids_d",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@napetrov I remember we had a discussion earlier on how many microarchs to build the Python package with as the time/size increase with each new entry here. Do we need all of them? Also, we need to keep them in sync with the private repository for building the shared library

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we should be careful with adding those - we not necessary need to build entire library for codepath to utilize some of the features.

We can call for bf16 even from sse codepath - the only thing that compilation for particular target would do is compiler code optimization and i doubt that there is perf difference for generic code despite specific functionality usage -this is usually works out to be just run.

Also we would need older isa set - prior to broadwell for clean run on systems that doesn't support AVX2

@ibhati
Copy link
Contributor

ibhati commented Apr 25, 2025

Can you please run clang formatting with the below commands from the repository's root?

sudo apt install clang-format-15
bash tools/clang-format.sh clang-format-15

@@ -146,7 +146,7 @@ endif()

add_library(svs_native_options INTERFACE)
add_library(svs::native_options ALIAS svs_native_options)
target_compile_options(svs_native_options INTERFACE -march=native -mtune=native)
target_compile_options(svs_native_options INTERFACE -DSVS_CPUARCH_NATIVE -march=native -mtune=native)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, SVS_CPUARCH_NATIVE should be an option similar to the other options, such as SVS_NO_AVX512. By default, it can be set to OFF. In this case, it will compile for all x86_64 architectures if building on an x86 platform.

@Alexsandruss Alexsandruss force-pushed the dev/unified-so branch 2 times, most recently from d25c704 to b3e2ecc Compare May 6, 2025 07:34
Copy link
Contributor

@ahuber21 ahuber21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you summarize how the dispatching will be used or configured? Maybe run me through an example where the library was built with all optimizations instantiated and is then executed on an M2 MacBook.

class MicroArchEnvironment {
public:
static MicroArchEnvironment& get_instance() {
// TODO: ensure thread safety
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How/when will this TODO be addressed?


#if defined(__x86_64__)

#define SVS_DISPATCH_CLASS_BY_MICROARCH(cls, method, args) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid these macros and replace them with templated functions?

* limitations under the License.
*/

#define SVS_PACK_ARGS(...) __VA_ARGS__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, just wondering if this could be handled differently & with templates in modern C++?

Copy link
Contributor

@ahuber21 ahuber21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way of defining and retrieving available microarchs can be improved.

return supported_archs_;
}

const std::vector<MicroArch>& get_compiled_microarchs() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The (to be) compiled microarchs are known at compile time and should therefore be constexpr. The relevant bits here should be rewritten. This will make testing easier, as it should allow us to loop over the archs and test all available distance specializations.

out << std::endl;

// Print all compiled microarchitectures
const auto& compiled_archs = arch_env.get_compiled_microarchs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in the other comments, this should be retrievable via

constexpr auto& compiled archs = /* whatever */;

@@ -139,9 +141,17 @@ float compute(DistanceCosineSimilarity distance, std::span<Ea, Da> a, std::span<
assert(a.size() == b.size());
constexpr size_t extent = lib::extract_extent(Da, Db);
if constexpr (extent == Dynamic) {
return CosineSimilarity::compute(a.data(), b.data(), distance.norm_, a.size());
SVS_DISPATCH_CLASS_BY_MICROARCH(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like the usage of preprocessor macros here. We lose the benefits of the type system here. compute is processed as a string literal and downstream you invoke cls<uarch>::method. That's very different from the style this library is written in. It's probably not trivial, but we should rewrite this and make use of the compiler as much as possible, i.e. provide templated classes. If you think this is not possible, please provide some justification for the implementation you chose.

The relevant piece of code that must be executed is svs::arch::MicroArchEnvironment::get_instance().get_microarch() and there must be better ways of bringing it into the plumbing.

What I particularly dislike is that we do have some compile-time dispatching in this very function (if constexpr (extent == Dynamic)), and the we clutter it with runtime dispatching. This will nullify all the benefits of the compile-time dispatching for extent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants