Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance playground #183

Merged
merged 132 commits into from
Sep 20, 2024
Merged

Conversation

asmeurer
Copy link
Member

@asmeurer asmeurer commented Jul 3, 2024

Looking into improving the performance of ndindex by creating an extension module (#181).

Since one of the biggest performance issues right now is the overhead of type checking when creating new objects, I've focused on this for now. This should also give the best idea of what the various options will look like in terms of maintainability.

I've created a Python file simple_slice which contains an object SimpleSlice that does type checking and nothing else. I've also created simple_slice_cython, simple_slice_pybind11, and simple_slice_rust.

I picked Slice because

  1. It's one of the more complicated to type check, since we have to check that arguments are None or integers. This should also give an idea of how type handling works since the arguments are a union type.
  2. In versioned-hdf5 it's the most common type of object to be created.

To compile, run

conda install cython rust pybind11 maturin

and

python setup_simple_slice.py build_ext --inplace
maturin develop

you can then run the tests with

pytest --no-cov test_slice.py

and the benchmarks with

python time_slice.py

On my machine, this is the result of the benchmarks:

Benchmarking SimpleSlice implementations
----------------------------------------

Object Creation:
Creation - ndindex.slice: 528 ns
Creation - simple_slice: 528 ns
Creation - simple_slice_cython: 130 ns
Creation - simple_slice_pybind11: 1.17 μs
Creation - simple_slice_rust: 3.63 μs

Args Access:
Access - ndindex.slice: 13.6 ns
Access - simple_slice: 13.8 ns
Access - simple_slice_cython: 21 ns
Access - simple_slice_pybind11: 118 ns
Access - simple_slice_rust: 662 ns

Most of this code was generated by Claude/ChatGPT.

I still need to investigate whether more optimizations can be made in these implementations. I will say so far that:

  • Cython is by far the simplest implementation. It's also the fastest and the only one that is actually faster than pure Python.
  • ChatGPT and Claude have an extremely difficult time writing Rust. I had a hard time just getting them to produce a correct version here that compiled (and the raw method is still wrong), and I gave up getting them to do any sorts of optimizations. So I'm sure there probably are optimizations that can be made here. However, unless someone can give some good arguments why I should stick with Rust, I'm not really feeling it is a good option.
  • I'm also sure that the C++ (and maybe even Cython) implementations can be improved. I'm not really an expert in Cython, C++, or Rust, so it may be a good idea to have a human who is provide some suggestions here.

For instance, in Cython there is a has_stop, trick to avoid None which can probably be used in C++ and Rust to allow the arguments to be typed as integers.

I will play around with this some more next week, and maybe also look at how some methods like reduce or as_subindex can be improved.

Another question that needs to be looked at is how much I can get away with compiling only critical stuff and leaving the rest in Python, and how easy these different implementations make doing that.

@peytondmurray @ArvidJB

asmeurer added 27 commits July 3, 2024 13:36
This is slower, and it's better if we can avoid it anyways.

This reverts commit 08821c0.
Note that .raw is currently broken in this implementation.
@asmeurer asmeurer marked this pull request as draft July 3, 2024 23:34
@ArvidJB
Copy link

ArvidJB commented Jul 4, 2024

Looks promising! I agree, Cython is probably the easiest to use and if it's the fastest as well we can focus on that.

Also, I think we can make things even simpler by always converting slices to three Py_ssize_t numbers via PySlice_Unpack I think (see https://docs.python.org/3/c-api/slice.html). That would eliminate tracking of None.

Lastly, I think I figured out why ndindex is downloaded so many times: It's used in python-blosc2!

@asmeurer
Copy link
Member Author

asmeurer commented Jul 4, 2024

You can only eliminate None from a slice if you know the array shape (i.e. call reduce() with shape).

And anyways it's an intentional design decision to not do any sort of canonicalization in the constructors, which I'd like to maintain.

Dev Cython is breaking the coverage reporting in the tests. The purpose of the
build is only to test dev NumPy. Maybe testing dev Cython (and others) would
be useful in the future, but for now this is breaking this CI.
@asmeurer
Copy link
Member Author

This is ready to go, if anyone wants to do a final review. Otherwise, I plan to do a release later this week. Hopefully it will go smoothly. The release process did get updated because we are now building wheels.

@asmeurer asmeurer marked this pull request as ready for review September 17, 2024 20:38
@asmeurer asmeurer enabled auto-merge September 20, 2024 21:02
@asmeurer asmeurer merged commit 85a1588 into Quansight-Labs:main Sep 20, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants