
Processing with numba #246

Open

hcwinsemius opened this issue Jul 20, 2022 · 26 comments

@hcwinsemius

Is your feature request related to a problem? Please describe.

The fact that PIV works over independent interrogation windows, which can be processed in parallel, seems to make OpenPIV very suitable for multi-core processing. I propose implementing numba as an orchestrator with JIT compilation, so that users stay flexible in the platforms they use. This may make OpenPIV much more scalable to larger problems, letting them run faster on platforms with plenty of CPU/memory and run (albeit more slowly) without memory problems on smaller systems.

Describe the solution you'd like
Make the functions that perform analysis over interrogation windows numba-compatible. We should then analyse at what level the parallelisation should be organised to really speed up the code. Numba works on a numpy/scipy code basis, so this should be quite feasible.
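A rough sketch of the kind of numba-compatible, window-parallel loop this implies, shown here only for the peak-finding stage; the function names and displacement convention are illustrative, not existing OpenPIV API:

import numpy as np
from numba import njit, prange

@njit
def peak_position(corr):
    # integer location of the correlation peak in one correlation plane
    k = np.argmax(corr)
    return k // corr.shape[1], k % corr.shape[1]

@njit(parallel=True)
def displacements(correlations, window_size):
    # every interrogation window is independent, so the loop over windows
    # can be spread across cores with numba's prange
    n = correlations.shape[0]
    u = np.empty(n)
    v = np.empty(n)
    for k in prange(n):
        i, j = peak_position(correlations[k])
        u[k] = j - window_size // 2
        v[k] = i - window_size // 2
    return u, v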

Describe alternatives you've considered
An alternative is to process several pairs of images in parallel (e.g. through dask arrays), but this will not help when a single image pair already requires a lot of memory.

Additional context
The use case for this request is scalability of #pyopenrivercam (see https://localdevices.github.io/pyorc). We noticed that users with relatively large areas of interest run into memory problems. Furthermore, we would like to improve the ability to parallelise on clusters.

@alexlib
Member

alexlib commented Jul 20, 2022

Hi @hcwinsemius - I believe it's a great idea. We would only like to ensure that people can still install openpiv-python without numba, i.e. make the import and the algorithm optional, so the package remains installable on unusual computing units or on systems without compiler support, etc.

@ErichZimmer - could you please be in touch with @hcwinsemius and show him the branch of numba you've been working on?

thanks

@alexlib
Member

alexlib commented Jul 20, 2022

@hcwinsemius if clusters are an option, please take a look at https://github.com/OpenPIV/openpiv-python-gpu

@ErichZimmer
Contributor

Hi all,
One issue I ran into when using numba is the lack of support for two-dimensional FFTs. One way around this is numba.objmode, but that resulted in a slower cross-correlation algorithm. Other than that, numba provided a decent performance boost on a single thread, especially for the validation algorithms.
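For reference, a minimal sketch of that objmode workaround (the type annotation assumes float64 windows; this is not the exact code that was benchmarked):

import numpy as np
from numba import njit, objmode

@njit
def fft_correlate(window_a, window_b):
    # nopython mode has no 2D FFT, so drop back to object mode just for
    # the transforms -- this is the slow workaround described above
    with objmode(corr='float64[:,:]'):
        f2a = np.conj(np.fft.rfft2(window_a))
        f2b = np.fft.rfft2(window_b)
        corr = np.fft.irfft2(f2a * f2b)
    return corr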

@hcwinsemius
Author

@ErichZimmer really good point. I started some research on this today, and indeed the cross-correlation estimation may well be the bottleneck. Did you also try implementing the convolution with two for-loops to see whether that yields additional performance? I would love to hear what your experience so far is. I was also curious whether you know which part of the chain is most expensive (my guess was the cross-correlation, but I haven't profiled yet).

@alexlib your point on dependencies is also very good. Numba is not easy on edge systems like the Raspberry Pi. If numba seems promising, I could also write wrappers around the existing functions and put these in a separate library such as openpiv-numba, which could then be installed on top of openpiv-python. That would entirely separate the core functionality from any numba compilation. How do you feel about this?

@ErichZimmer
Contributor

@hcwinsemius
I performed a quick performance test on a normalized spatial correlation algorithm based on the literature on direct PIV correlation. For window sizes of 32x32 pixels, the direct method is as fast as the FFT method for search radii of up to 5 pixels (allowing for velocities of up to +/- 4 pixels). If the search radius increases beyond 5 pixels, there is a very steep performance hit.
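For context, an un-normalised sketch of a direct correlation restricted to a search radius; it shows why the cost grows roughly with the square of the radius (the normalisation used in the actual test is omitted for brevity):

import numpy as np
from numba import njit

@njit(cache=True)
def direct_correlate(window_a, window_b, radius=4):
    # direct (spatial) cross-correlation evaluated only within +/- radius;
    # each extra pixel of radius adds a whole ring of shifts to evaluate
    rows, cols = window_a.shape
    size = 2 * radius + 1
    corr = np.zeros((size, size))
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            s = 0.0
            for i in range(rows):
                for j in range(cols):
                    ii = i + di
                    jj = j + dj
                    if ii >= 0 and ii < rows and jj >= 0 and jj < cols:
                        s += window_a[i, j] * window_b[ii, jj]
            corr[di + radius, dj + radius] = s
    return corr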

@ErichZimmer
Contributor

@hcwinsemius I'll create a branch for my numba version tomorrow (hopefully) so you can get a good idea of what I did and how to keep compatibility with or without numba.

@hcwinsemius
Author

@ErichZimmer you mean performance increases steeply when the search radius increases? That would be excellent news for river observations where the search radius usually is about the size of the interrogation window. Eager to give this a try.

@alexlib
Member

alexlib commented Jul 21, 2022

Please take a look at what the fluiddyn project is doing (they are on Bitbucket).

@alexlib
Member

alexlib commented Jul 21, 2022

Hi, please take a look at the work done for the project https://github.com/fluiddyn/fluidimage - they claim some CPU/GPU support using Pythran or other libraries.

@timdewhirst
Member

If I could chip in at this point: this is perhaps an area where the openpiv-python and openpiv-c++ projects could merge? The original observation is correct: the calculation of displacement by evaluation of independent interrogation areas is "embarrassingly parallel" and well suited to multicore processing, and by extension to GPUs.

We have support for multicore processing in openpiv-c++, and running on a CPU (M1 Pro - 8 cores, of which 6 are performance cores) we get something like ~10 µs per interrogation area - it's slower on an i7, probably due to non-unified memory - but still quite fast.

The key, really, to high-performance code is to ensure that data is transferred/transformed as little as possible and that operations are done in a predictable, non-random fashion - this lets the branch predictors and caches give maximal benefit.

We have some work ongoing at the moment to write python bindings for openpiv-c++ which would allow python to be used for high-level structure and c++ for high performance.

Thoughts?

@alexlib
Member

alexlib commented Jul 21, 2022

This is a very good discussion. I'm already dreaming of an openpiv2 :) that merges all your ideas and effort into a solid package.

@ErichZimmer
Contributor

Pythran (one of the tools fluidimage uses) allows for 1D FFTs from numpy (64-bit floating-point arithmetic, so it may still be slower than our current implementation using scipy). Since FFTs are separable, we can do FFTs along each dimension to replicate a 2D FFT.
On using OpenPIV with a C++ backend, I am slightly worried about producing enough wheels for distribution. One thing I love about OpenPIV-Python is that all platform-dependent concerns are handled by third-party packages. This means there is no compiling of code and, therefore, a good deal of cross-platform compatibility.
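As a quick illustration of the separability point above, using plain numpy:

import numpy as np

a = np.random.rand(32, 32)

# a 2D FFT is separable: a 1D FFT along each axis in turn reproduces
# np.fft.fft2, so a 1D-only FFT (such as the one pythran exposes) suffices
two_step = np.fft.fft(np.fft.fft(a, axis=0), axis=1)
assert np.allclose(two_step, np.fft.fft2(a))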

@alexlib
Member

alexlib commented Jul 21, 2022

Pythran (one of the tools fluidimage uses) allows for 1D FFTs from numpy (64-bit floating-point arithmetic, so it may still be slower than our current implementation using scipy). Since FFTs are separable, we can do FFTs along each dimension to replicate a 2D FFT.
On using OpenPIV with a C++ backend, I am slightly worried about producing enough wheels for distribution. One thing I love about OpenPIV-Python is that all platform-dependent concerns are handled by third-party packages. This means there is no compiling of code and, therefore, a good deal of cross-platform compatibility.

I have reasonable experience with the cibuildwheel system - it automatically generates binary wheels for all the major platforms and uploads them to PyPI or elsewhere. If we manage to build it on GitHub Actions, we might be fine for desktop users. We still won't be able to create something for drones, Raspberry Pi, etc., so we need a fallback that prepares a pure-Python version with installable libraries, plus optional use of C++ high-speed computation when a binary wheel is available.

@ErichZimmer
Contributor

@hcwinsemius FYI, for the direct correlation method, I meant a very steep performance loss compared to FFTs when the search radius is greater than 5 pixels.

As for my numba-accelerated version, I am not able to upload it at this time due to the beginnings of a natural disaster (I won't have access to my laptop for approximately 2 weeks). However, I'll try to explain what I did. Each numba-accelerated module contains a large try/except statement: one branch uses the numba functions if numba is present, the other falls back to a "pure" Python implementation. The only accelerated modules were validation and process (except the correlation algorithms), where the functions were decomposed into for-loops so numba can take advantage of SIMD intrinsics.
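In outline, that fallback can look something like the following sketch (the kernel name and its contents are illustrative, not the actual branch code):

import numpy as np

try:
    from numba import njit
except ImportError:
    # numba not installed: turn @njit into a no-op so the same source
    # runs as plain Python/NumPy, only slower
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]):
            return args[0]            # used as @njit
        return lambda func: func      # used as @njit(...)

@njit(cache=True)
def velocity_magnitude_mask(u, v, threshold):
    # example validation kernel decomposed into an explicit loop so that
    # numba can apply SIMD; works unchanged without numba
    mask = np.empty(u.shape, dtype=np.bool_)
    for i in range(u.shape[0]):
        mask[i] = np.sqrt(u[i] * u[i] + v[i] * v[i]) > threshold
    return mask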

On using a C++ backend, I looked at a few packages using scikit-build and most had somewhat complex packaging systems. One thing we have to realise when using C++ as a backend (via openpiv-c--qt) is that five dynamic libraries have to be built for each platform (jpeg62, liblzma, openpivcore, tiff, and zlib1). That is the part of cross-platform compilation I am confused about (unless we were to precompile openpiv-c--qt for each platform and somehow link the libraries to the Python frontend).

@timdewhirst
Member

timdewhirst commented Jul 22, 2022

On C++ (or rather, non-pure-Python) modules: dependencies are always a possibility, but this can be worked around - any moderately complex Python module or package that loads images will probably have the same set of dependencies somewhere - just take a look at the guts of a conda install!

However, one solution is to link statically - dynamically linked libraries don't have to be used (unless there is a licence requirement) - and e.g. the default build on OS X does just this (it's probably the same on Linux, but I don't have a box at hand to check!):

% otool -L libopenpivcore.dylib 
libopenpivcore.dylib:
	@rpath/libopenpivcore.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.23.0)

i.e. the only dynamic libraries are system libraries.

The same could be done for the module as a whole, i.e. statically link against libopenpivcore, which in turn is statically linked, thereby removing any (including transitive) dependencies. This is one of the reasons I wanted to make openpiv-c++ as dependency-light as possible.

@ErichZimmer
Contributor

ErichZimmer commented Jul 25, 2022

Implementing a normalized direct correlation algorithm based on Lewis, J.P. (2001), Fast Normalized Cross-Correlation, the direct method is as fast as or faster than the FFT method until the search radius reaches 6 pixels (returning a 12x12 px correlation matrix). I think this is a reasonable size for a correlation matrix.
I may as well put together a numba/pythran version of OpenPIV, as it seems pretty easy to do from experience.

@ErichZimmer
Contributor

@hcwinsemius should we use pythran as it supports 1D r2c and c2r FFTs?

@hcwinsemius
Author

I am currently experimenting a bit with JAX, which is both CPU and GPU capable and easy to install. Some of the processing steps were quite simple to port and sped up by a factor of 5 to 10. I haven't optimised the cross-correlation yet, but I am also considering replacing it (for larger search windows) with an optimisation-based search algorithm.

This is really only experimentation, so that we can have a proper discussion on whether it would be an interesting alternative to pure scipy/numpy or not.
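For concreteness, a minimal sketch of the kind of batched, JIT-compiled FFT correlation being experimented with (function and argument names are illustrative):

import jax
import jax.numpy as jnp

@jax.jit
def fft_correlate_stack(win_a, win_b):
    # win_a, win_b: (n_windows, rows, cols) stacks; the FFTs act on the
    # last two axes, so the whole stack is correlated in one call
    a = win_a - win_a.mean(axis=(-2, -1), keepdims=True)
    b = win_b - win_b.mean(axis=(-2, -1), keepdims=True)
    corr = jnp.fft.irfft2(jnp.conj(jnp.fft.rfft2(a)) * jnp.fft.rfft2(b))
    return jnp.fft.fftshift(corr, axes=(-2, -1))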

I don't know pythran. Will have a look.

@ErichZimmer
Contributor

I performed some more tests with direct non-normalized correlation (the fastest variant) using numba and pythran and found it quite sluggish compared to FFTs unless the window size is less than 32x32 px and the search radius is less than 5 px. Additionally, single-precision FFTW in a C++ test script completely blows everything away, thanks to plan reuse and the multi-threading mechanism of openpiv-c--qt (in fact, it was quite a bit faster than PIVview 3.9 when not performing repeated correlations). So I would focus more on C/C++, as it seems reasonably easy to program while keeping Python dependencies and the resulting Docker image size small. However, if we want a pythran/numba framework like FluidImage, then I would not mind allocating off-time, as PIV and its associated programs are part of my hobby.

@timdewhirst
Member

Very interesting! On the topic of performance, GPUs aren't really a magic bullet that will automatically make everything fast; rather, they are great if you have an "embarrassingly parallel" problem to solve and you can structure your code so that everything can be done in parallel on the GPU. The reason is that, unless you're using an M1 chip or a new Grace Hopper chip, there is a non-negligible cost to transferring data from host memory to the GPU for processing, and this will be especially bad if you're transferring lots of very small things instead of one large thing.

I've not looked at the numba or pythran code, but I would guess that the rough architecture to follow should be (see the sketch after the list):

  • transfer complete images to GPU (not individual interrogation areas)
  • perform correlations in parallel on GPU
  • write results into one large block of memory local to GPU
  • copy back all data to host
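A sketch of that data-movement pattern, using cupy purely for illustration (it is not part of the toolchain discussed in this thread, and the window_slices argument is hypothetical):

import cupy as cp

def correlate_on_gpu(frame_a, frame_b, window_slices):
    # 1. transfer the complete frames once, not window by window
    a = cp.asarray(frame_a, dtype=cp.float32)
    b = cp.asarray(frame_b, dtype=cp.float32)
    # 2. build the window stacks on the device and correlate them with one
    #    batched FFT over the last two axes
    win_a = cp.stack([a[s] for s in window_slices])
    win_b = cp.stack([b[s] for s in window_slices])
    corr = cp.fft.irfft2(cp.conj(cp.fft.rfft2(win_a)) * cp.fft.rfft2(win_b))
    # 3./4. results stay in one device buffer and are copied back to the
    #       host in a single transfer
    return cp.asnumpy(corr)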

Probably an area for more research. I'd be interested in seeing the results for FFTW - I've long toyed with the idea of incorporating it within openpiv-c--qt to see what the performance is like. I would imagine it might be about 2x as fast as my very basic recursive implementation, as it makes extensive use of SIMD as well as the tricks I've used, i.e. plans, thread-local data, etc.

@ErichZimmer
Contributor

@hcwinsemius any progress with your testing with numba?

@hcwinsemius
Author

hcwinsemius commented Nov 26, 2022 via email

@hcwinsemius
Author

Hi all,

I noticed this issue is still open. I finally have an update on using numba (instead of numpy or pythran). I have a working version of a preliminary library that resolves the cross-correlation with FFTs, using the FFT implementation made by @styfenschaer in the rocket-fft package and the OpenPIV implementation of most of the functionality. I of course acknowledge this project for that in the README.md.

The FFT functions used are not yet part of the core numba package, which makes this a little risky, but if they are included at some point and maintained properly, this can lead to a 10-fold increase in speed, provided that a large enough problem is sent to the cross-correlation solver at once. I therefore implemented options to resolve the cross-correlation for many image pairs at once; typically that is done with videos several seconds long.
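A rough sketch of the approach, assuming rocket-fft is installed (per its documentation it registers numpy.fft support with numba's nopython mode); batching over many image pairs is what makes the problem large enough to benefit:

import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def fft_correlate_pairs(windows_a, windows_b):
    # windows_a / windows_b: (n_pairs * n_windows, rows, cols) stacks built
    # from many image pairs at once; np.fft inside njit requires rocket-fft
    n = windows_a.shape[0]
    corr = np.empty_like(windows_a)
    for k in prange(n):
        fa = np.conj(np.fft.fft2(windows_a[k]))
        fb = np.fft.fft2(windows_b[k])
        corr[k] = np.real(np.fft.ifft2(fa * fb))
    return corr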

If you want to stay updated, I can post a link once I have a first release.

@alexlib
Member

alexlib commented Oct 9, 2024

Yes please, @hcwinsemius, keep us posted. Numba is a promising direction because it retains the simplicity and clarity of Python code for new users. OpenPIV in Python was never planned to be super fast; that's why we have the C++ packages. The Python version is for new features, new ideas, and rapid coding.

@ErichZimmer
Contributor

@hcwinsemius This is great news. I struggled to implement a numba-friendly FFT library or find one with promising results and flexibility. This was the only real technical difficulty I had when playing around with numba for JIT. Other than that, transcribing OpenPIV-Python to a numba extension with a fallback to the standard OpenPIV library is trivial and may only take an afternoon to code once the proper engineering is complete. If needed, we can collaborate sometime in late January/early February and get things rolling, since I have some previous experience using numba (e.g., prototype Tomo-PIV and marker detection implementations).

@ErichZimmer
Contributor

Also, chunking the 3D correlation stack (if stacked interrogation windows are used) allows for an embarrassingly parallel setup and further performance increases, something I think you touched on in your previous comment. I actually implemented a serial version of a chunked correlation algorithm so I can process 4512x4512 px PIV images using little memory and with some decent performance increases. I simply forgot about that feature and will make a pull request for it at a future date. Anyhow, numba provides several means of parallelising workflows, and this seems to be a promising avenue to investigate.
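A simple sketch of the chunking idea, with the chunk size chosen to bound memory use (the names and the injected correlate callable are illustrative):

import numpy as np

def correlate_in_chunks(windows_a, windows_b, correlate, chunk_size=1024):
    # process the 3D window stacks chunk by chunk so that only chunk_size
    # correlation planes are resident in memory at any one time
    out = []
    for start in range(0, windows_a.shape[0], chunk_size):
        stop = start + chunk_size
        out.append(correlate(windows_a[start:stop], windows_b[start:stop]))
    return np.concatenate(out, axis=0)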
