Numpy support #1431
Hello, we should document this better.
One spot that I think could be better supported for numpy is the caiman.mmapping.load_memmap function. Currently, the function restricts memmap loading to files with a .mmap extension. However, there's no technical limitation preventing the use of .npy files. By switching the memmap constructor on line 66 in mmapping.py from np.memmap to np.lib.format.open_memmap, .npy files can be loaded without additional code modifications. This change offers two key benefits: existing NumPy arrays saved as .npy can be loaded without resaving them in the .mmap format, and the shape, dtype, and ordering are read from the .npy header rather than having to be encoded in the filename.
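To illustrate the proposed swap (a minimal sketch; the .mmap filename and its parsed dimensions are illustrative of CaImAn's shape-in-filename convention, not an exact file):

```python
import numpy as np

# Current: headerless raw memmap; dtype/shape/order must be recovered
# from the filename (illustrative name in the spirit of CaImAn's convention)
Yr = np.memmap('Yr_d1_512_d2_512_order_C_frames_1000_.mmap',
               mode='r', dtype=np.float32, shape=(512 * 512, 1000), order='C')

# Proposed: open_memmap reads dtype/shape/order from the .npy header itself
Yr = np.lib.format.open_memmap('Yr.npy', mode='r')
```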
I'd be happy to submit a pull request implementing this change if the maintainers are interested.
In general, I believe it is best practice to use the standard .npy extension instead of .mmap. While changing this throughout the codebase would be somewhat tedious, the most immediate improvement would be to modify the load_memmap function to support .npy files. This would spare users from unnecessarily resaving existing NumPy arrays just to work with the function.
To be clear, these are not the same format on disk, and if we were to actually use the header information on shape and data type, we'd need a separate code path from the one for the (headerless, raw) mmap format that np.memmap uses. I don't think it's a bad idea to do this for intermediate files (I assume this is mostly what you're talking about) - we've occasionally thought about switching data formats for intermediate files before (I was thinking about zarr or n5 some time back - both have a lot of appeal, and both have cross-language support). It may also be worth thinking about making a set of standard outputs in a post-run directory (or even a mid-run directory) to make gluing CaImAn into other software easier, although if we're doing that, neither npy nor mmap is particularly friendly, while zarr/n5 both seem pretty compelling. If you're thinking more about initial input files to the pipeline, adding support for more formats isn't a bad idea (but we already support npy files).
I believe that, other than the 128-byte header, the on-disk format for .mmap and .npy files is exactly the same. But you are right that np.memmap expects no header, whereas np.load (or np.lib.format.open_memmap) extracts the header information. In my use case, I am using only some parts of the pipeline. I have .npy files that I need to feed into the CNMF step. Part of this step calls mmapping.load_memmap, which throws an error when I give it an .npy file -- even though the .npy file holds data in exactly the same format as a .mmap file and could be used interchangeably.
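A quick way to see this (a self-contained sketch using only public numpy.lib.format helpers; assumes a version 1.0 .npy header, which np.save writes by default):

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
np.save('demo.npy', a)

# Read past the .npy header to find where the raw data begins
with open('demo.npy', 'rb') as f:
    np.lib.format.read_magic(f)  # magic string + version bytes
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    offset = f.tell()  # typically 128 bytes for small version-1.0 headers

# Everything after the header is the same raw buffer np.memmap expects
m = np.memmap('demo.npy', mode='r', dtype=dtype, shape=shape,
              order='F' if fortran_order else 'C', offset=offset)
assert np.array_equal(m, a)
```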
This is how I think support for npy files could easily be added. Adding support for npy files in this way would not affect any code outside of this function, since np.lib.format.open_memmap returns a numpy.memmap object, the same type np.memmap returns:
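Something along these lines (a sketch, not CaImAn's actual code or the eventual PR; for the .mmap branch, the metadata that CaImAn normally parses out of the filename is passed in explicitly to keep the sketch self-contained):

```python
import numpy as np

def load_memmap(filename, mode='r', dtype=np.float32, shape=None, order='C'):
    """Sketch: dispatch on extension (illustrative only)."""
    if filename.endswith('.npy'):
        # .npy carries shape/dtype/order in its own header
        return np.lib.format.open_memmap(filename, mode=mode)
    elif filename.endswith('.mmap'):
        # headerless raw buffer: metadata must be supplied externally
        return np.memmap(filename, mode=mode, dtype=dtype,
                         shape=shape, order=order)
    raise ValueError(f'Unsupported extension: {filename}')
```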
Let me think about this for a bit. You have an example of a use case that might benefit (although another option would be for your calling code to save in the headerless format and name files the way caiman expects). Still, the code change isn't too intrusive (assuming you only want to provide input; it'd be more intrusive to change the saved formats over, and I'm still leaning towards a more language-neutral format if we get around to that). I'll try to get back to you within the next few days on this - I might also check in with Kushal and see how our general ideas on handling temporary files and output dirs might mesh. Thanks for reaching out to us either way - we'll be in touch.
As far as I understand, as long as you have something sufficiently numpy-like in the right shape and order, you should be able to give it to https://github.com/flatironinstitute/CaImAn/blob/main/caiman/source_extraction/cnmf/cnmf.py#L408 IMO the better way to do this is to decouple file loading from the array objects that get passed to the algos.
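In demo terms, that decoupling would let any array-like be handed straight to fit (a sketch following the pattern in CaImAn's demos; the parameter values are placeholders, the movie is assumed to be (frames, Y, X), and whether every downstream step accepts it is exactly what this thread is about):

```python
import numpy as np
from caiman.source_extraction.cnmf import cnmf, params

# Any (T, Y, X) numpy-like array, memmapped or in-memory
images = np.load('movie.npy', mmap_mode='r')

opts = params.CNMFParams(params_dict={'fr': 30, 'p': 1, 'K': 50})  # placeholders
cnm = cnmf.CNMF(n_processes=1, params=opts)
cnm = cnm.fit(images)
```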
Agreed, there don't seem to be any technical barriers preventing numpy-like arrays in fit. Working within the current numpy array memmapping scheme, the core issue is that load_memmap rejects anything without a .mmap extension, even when the underlying data is compatible.
Thanks for the expedient response @pgunn. I can confirm that the change works on my end. That being said, if you're following along with the examples like I was, the demos still fail further down the pipeline.
I just reinstalled caiman in a fresh environment, as I didn't remember what I changed to get it working last week. This specific failure can be blamed on the lack of .npy support in get_file_size(). When I went looking for the failure point, I did notice the parameter validation is a bit dense. Have you considered modernizing your validation (e.g., with pydantic)? It's obviously a non-trivial refactor (albeit one not intrusive to the rest of the codebase) and another dependency, but it would make your validation much more straightforward, easier to debug, and easier to extend with additional constraints. You could also eventually decorate your classes, which would pass the parameters through the validator before instantiation and liberate them from lots of duplicated validation logic. For example:
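Something in this spirit (a sketch with hypothetical field names, not CaImAn's actual CNMFParams; uses the pydantic v2 API):

```python
from typing import Optional
from pydantic import BaseModel, Field, field_validator

class PatchParams(BaseModel):
    """Hypothetical validated parameter block (illustrative field names)."""
    rf: Optional[int] = Field(default=None, ge=1)      # patch half-size
    stride: Optional[int] = Field(default=None, ge=0)  # patch overlap
    n_processes: int = Field(default=1, ge=1)

    @field_validator('stride')
    @classmethod
    def stride_fits_patch(cls, v, info):
        # Cross-field constraint: overlap must be smaller than the patch
        rf = info.data.get('rf')
        if v is not None and rf is not None and v >= 2 * rf:
            raise ValueError('stride must be smaller than the patch size (2 * rf)')
        return v

PatchParams(rf=20, stride=10)    # ok; bad values raise ValidationError at construction
```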
If desired, this has the added benefit of providing all of your defaults from one location (the CNMFParams), so you wouldn't have to chase down inconsistencies there. If you like, we can discuss this in a separate issue.
@darikoneil There's a lot of low-hanging fruit in terms of simplifying and improving code quality; better validation is just one of probably dozens of examples. Hopefully we'll either get to fixing this, or some future, more substantial overhaul will see us replace (rather than fix) a lot of the bad code. The lack of support in get_file_size() is something I'll get to soon, though (if you want to do a PR against the dev branch, that's fine - some people really like to be on the contributor lists of tons of projects, and while that way of thinking is alien to me, I don't really mind it, plus it will save me a small amount of time).
I'm not sure I agree with this argument; the fact that they're very similar formats doesn't matter much, given how trivially convertible most file formats are to each other. If it takes a few lines of code to convert an hdf5 file to the memmap format, or a few lines to convert an npy file to the memmap format, does the similarity really matter? In some ways, the hdf5 file is closer to the npy format in terms of practical handling, because hdf5 files support introspection on what's inside and memmap does not. That said, if information can be moved from filenames into internal headers, then whatever format we pick with internal headers feels like a better thing to be parsing (more structure, less likelihood of running into filename-length or naming limits, plus the filenames are fairly ugly). Looking back, I think it probably was a mistake to encode information in filenames and go headerless (although any long-lived software project accumulates hindsight-revealed regrets). Still thinking this through though.
One caveat is contiguity in memory and type coercion. Chances are the data will be provided as a frames x Y x X tensor of unsigned integers, since most imaging software records data this way and most libraries load imaging stacks/videos this way. If so, the array is going to need to be rewritten anyway.
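Concretely (a sketch; the float32, pixels-x-frames, Fortran-ordered target reflects my understanding of CaImAn's flattened memmap layout, though the exact order depends on settings):

```python
import numpy as np

# Typical acquisition layout: (frames, Y, X), unsigned integers
movie = np.random.randint(0, 4096, size=(1000, 256, 256), dtype=np.uint16)

# Assumed target layout: float32, (pixels, frames), Fortran order,
# so a full rewrite/coercion happens regardless of the container format
T = movie.shape[0]
Yr = np.asfortranarray(movie.reshape(T, -1).T.astype(np.float32))
```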
Another thing I'm wary of is effectively making a promise to preserve formats for temporary files across a run; this is something I think it'd be good to change, and if we produce a documented format that's friendly to external use before we're ready, we may be effectively stuck supporting it long-term, which could hamper us when we decide to move to something better (whether that's zarr/hdf5, some dask format, or something else). (Maybe this ship has already sailed?)
Sounds good, I'll make a pull request now. Re: validation, setting up a more developer-friendly validation class is pretty straightforward and something I wouldn't mind starting. I'm pretty busy in the immediate future with a thesis defense in ~3 months, but I'll check back in the spring and see if it's still needed.
Let's hold this issue open just a little longer while we consider the concerns/ideas raised by fdeguiere03.
I think my argument is that the conversion shouldn't be necessary in the first place; for large recordings, resaving means duplicating the data on disk just to change the extension and strip a header. In my case, even though I have the data prepared in an equivalent on-disk layout, I still have to rewrite it before the pipeline will accept it.
@darikoneil type validation is important, but a separate issue. It would fit with a larger refactor, perhaps with #1391. Some params would be easy to do this for, but there are many that may require digging into the nested calls to determine all the possible types - but if you want to do it, that'd be great. Caiman's lack of input and output validation makes it very hard to create reliable visualizations.
@darikoneil Just checked out your PR. One small suggestion: rather than loading the entire array just to report its size, read only the header (or open with mmap_mode) to get the shape without touching the data.
In theory we should probably look over all the file types we support in get_file_size() for this kind of thing.
Based on a quick look, the tiff method looks good. I haven't used the other file types much, so I can't make a quick judgement on their methodology. Would you like me to push that fix for the npy file size?
There's a public method; it's not good practice to use private methods for these types of things: https://numpy.org/doc/stable/reference/generated/numpy.lib.format.read_array_header_2_0.html
Sure, the method just changes based on the npy file version. Could use something like this:
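Reconstructing the idea (a sketch using only the public readers; handles .npy header versions 1.x and 2.x, and real code should check newer versions explicitly):

```python
import numpy as np

def npy_header_info(path):
    """Read shape/dtype from an .npy header without loading the data (sketch)."""
    with open(path, 'rb') as f:
        major, _minor = np.lib.format.read_magic(f)
        if major == 1:
            shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
        else:
            # version 2.x header layout
            shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
    return shape, dtype

shape, dtype = npy_header_info('movie.npy')
```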
I thought the version handling was taken care of internally, but I suppose checking explicitly is safer.
Yup, checking for versions is the right way to do it, even if it takes more lines of code; libraries can and do change private methods anytime they need to, without warning.
Oops, 100% should have been using the mmap_mode keyword there. I think the PR went through already, so I'll make another, I guess. I vote that the keyword be preferred over reading the header, since the private method presumably doesn't have a stability guarantee.
@kushalkolar Cool, I'll check #1391 out. Definitely don't mind doing it! It would also improve the documentation, since the constraints for a validator/parameter class would be self-documenting. Not sure how fast I can do it with my defense coming up, but I think I could implement it fairly quickly once I have some time for it.
@pgunn @kushalkolar any preference on opening the file with mmap_mode versus reading the header? Also wanted to check in on what you were thinking about adding .npy support to load_memmap.
On the first part, I think reading the header will probably have the library do less work and should be preferred, even though the code will be less legible. Still thinking about the second.
@pgunn Happy 2025! I wanted to check back in on what you were thinking about adding .npy support to load_memmap. Let me know what you're thinking!
I've gone back and forth on this but, failing to reach a firm conclusion, I think it makes sense to accept the suggestion and implement the feature.
Great, I really appreciate it! I'll work on making the pull request this evening.
Documentation implies support for numpy arrays (and numpy memmaps, considering they are essentially drop-in replacements), but this does not appear to be the case (or at least it's not clear to me how). If I don't just have reading-comprehension issues, the inability to pass data directly is a pretty significant deterrent to using the package as-is, particularly for real-time/online usage, where your data is likely to be flat binary.
Aside from the performance benefits of avoiding multiple memory swaps, accepting numpy data gives users a lot of flexibility in how they provide their data. From an end-user perspective, it seems a bit silly/convoluted to take some data, convert it to .tif or .hdf5, only to load and then convert it again. Plus, converting to .tif might leave someone with dozens of files to keep track of, and .hdf5 is notorious for corruption (or at least, it was). Obviously I can and have edited the functions to accept numpy arrays locally, but other users might not be as savvy or motivated.
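The kind of direct hand-off I have in mind (a sketch; the filename, dimensions, and dtype are placeholders for whatever an acquisition system writes):

```python
import numpy as np

# Flat binary straight from an acquisition system: no .tif/.hdf5 round-trip
T, Y, X = 1000, 256, 256
frames = np.memmap('acquisition.bin', mode='r', dtype=np.uint16, shape=(T, Y, X))

# Ideally this could be handed to the pipeline as-is (after any needed coercion)
images = frames.astype(np.float32)
```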