-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(wip): cupy histograms #1095
base: master
Are you sure you want to change the base?
Conversation
0668e63
to
a0aa63a
Compare
(cf https://docs.cupy.dev/en/stable/reference/generated/cupy.histogramdd.html , which is the standard regular-array API conforming to the numpy version - no boost or anything) |
@martindurant - cool, it appears to use the same techniques (fill by atomics) so it'll be subject to the same scaling limitations I'm seeing. However, the number of calls to fill is a bit more lean so maybe it's worth backing an implementation with it. I'll have to try some benchmarks. Otherwise - there's significant functionality missing from the cupy hists that we'll still have to add on top, if it turns out to run faster in the first place. |
Yes, I expect this is the case. dask-histogram also uses boost and adds a numpy- compatible API on top for those that expect it; and of course, it's designed to work with awkward. I expect there's an amount of sharing and refactoring that can eventually be done. |
Yeah - I think it is all possible. Right now is really getting all the pipe-fittings in place. I'll let you know if there's any clear win in the benchmarks. |
@Saransh-cpp: @jpivarski and I talked last Friday and it came up that you might be interested in taking this "pilot project" and turning it into a full-blow UHI compatible histogramming interface (a la scikit-hep/hist), but for cupy/cuda histograms. What's in this PR has the necessary functionality for HEP and we can convert to scikit-hep/hist, but it would be nice to have a cohesive ecosystem and only convert to CPU memory-space at the last moment. This would grant us more or less infinite scaling. We also has some ideas towards warpwise-distributed histograms where a (collection) of warps would tend to a sub-range of bins so that filling can be done more in parallel. This old implementation description demonstrates that if you stick to a warp (i.e. 32 bins) and replicate histograms to do filling parallel you can reach 10GB/s filling rates, because there's no use of atomics. This also has interesting parallels to cluster-distributed histograms where a (relatively enormous) histogram could be distributed across a whole dask cluster and achieve scaling to 100s of GBs in size or more. This would effectively remove scaling limitations for histograms for the foreseeable future and is probably important for achieving precision HL-LHC analyses. Anyway - please let us know if you are interested in turning this into a more mature package and possibly adding features to it! We're happy to answer any questions you may have. |
HI @lgray, apologies for the late reply. I have submitted a proposal for the I still need to go through this PR in detail, but I will pick it up very soon and discuss the work in awkward-dask meetings. Thank you again for the ping! :) |
Hi @lgray @jpivarski, I am experimenting with inheriting In [1]: import cuda_histogram
In [2]: cuda_histogram.hist_tools.Regular(10, 0, 1) # the repr is inherited, works!
Out[2]: Regular(10, 0, 1)
In [3]: cuda_histogram.hist_tools.Regular(10, 0, 1).extent # inherited property, works!
Out[3]: 12
In [4]: type(cuda_histogram.hist_tools.Regular(10, 0, 1).edges) # overriden by `Bin` to return a cupy array
Out[4]: cupy.ndarray
In [5]: type(cuda_histogram.hist_tools.Regular(10, 0, 1).widths) # not overriden, inherited, thus a numpy array
Out[5]: numpy.ndarray The class inheritance looks like this right now - class Bin(hist.axis.Regular, family="cuda_histogram"): # has common methods for Regular and Variable axis but I might split it
...
class Regular(Bin, family="cuda_histogram"):
... |
@Saransh-cpp I've been mulling over this for a while, and I think that really only the filling should be done on the GPU, from the user's perspective. I would say that for any other user facing thing like axes and such it should produce np arrays, since axes and other arrays tend to be very small and shipping them to/from GPU is inefficient. What do you think? |
If the histogram bin edges/descriptions are immutable, you could have a copy of them on both CPU and GPU. Then they'd be available on the GPU for the filling operations, and available on the CPU for user queries. The only thing that would need to be transferred is the bin data. It can be a new |
@jpivarski that's just about what I'm thinking, but fill shouldn't trigger an xfer to CPU (as then on multiple fills you xfer to CPU each fill which syncronizes the device). So it should xfer to CPU on any non-fill access of the bin/weight contents, IMO. I think this is possible as well? |
That would work. Although if |
Multiple fills being able to proceed in parallel is better than sync'ing on each call, I'd say. |
Thanks for the detailed discussion, @lgray @jpivarski! I have been working on the API and right now it looks like this - In [1]: import cuda_histogram; import numpy as np
In [2]: a = cuda_histogram.axis.Regular(10, 0, 1, name="r")
In [3]: c = cuda_histogram.Hist(a)
In [4]: a, c
Out[4]:
(Regular(10, 0, 1, name='r', label='Axis 0'),
Hist(Regular(10, 0, 1), storage=Double()))
In [5]: c.fill(r=np.random.normal(size=1_000_000)) # the signature of fill needs to be changed to match other libraries
In [6]: c.values()
Out[6]: array([39441., 38312., 37423., 35879., 34110., 32467., 30292.])
In [7]: type(c.values())
Out[7]: cupy.ndarray
In [8]: type(a.edges)
Out[8]: numpy.ndarray Only the fill method executes on GPU with this design. Could you please let me know if this is the right direction or should I change something in this example? |
35285fa
to
8100467
Compare
Hi @lgray, the flow behavior in this PR works like this right now - In [1]: import cuda_histogram; import numpy as np; import hist
In [2]: a = cuda_histogram.axis.Bin("n", "l", 10, 0, 1)
In [3]: c = cuda_histogram.Hist("c", a)
In [4]: c.fill(n=np.random.normal(size=1_000_000))
In [5]: c.values()
Out[5]:
{(): array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
72873., 73828.])}
In [6]: c.values("over")
Out[6]:
{(): (array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
72873., 73828.]),
array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
72873., 73828.]))} I am not sure if this is the intended behavior. The flow behaviors in other histogramming libraries for a 1D histogram just adds overflow and underflow values to the end of the array instead of duplicating the entire array. Could you please let me know if this is the intended behavior? |
Yes, coffea histograms also have "NaN-flow" which is a special overflow bin for NaNs, so that counts and weights don't get affected by floating point problems. I personally like this feature, but none of the other histogram libraries care to implement it. Insofar as the filling workflow you've shown, does it work without copies using GPU-only arrays? |
Yes! The fill method uses the exact logic in this PR and makes it compatible with other histogramming libraries - Here is a more detailed usage example - https://github.com/Saransh-cpp/cuda-histogram?tab=readme-ov-file#usage I have updated the repository with documentation, a few tests, and CI. The API is now very similar to boost-histogram's API. The Categorical axis is the only major thing missing (or left for refactoring). I have requested to transfer the repository to Scikit-HEP and will create a |
work in progress on using cupy in the old coffea-hist package as a demonstrator