Numpy support #1431

Open · darikoneil opened this issue Dec 3, 2024 · 31 comments

Comments

@darikoneil

Your setup:

  1. Operating System (Linux, MacOS, Windows): Windows 10
  2. Hardware type (x86, ARM..) and RAM: x86, 128 GB RAM, 24 GB VRAM
  3. Python Version (e.g. 3.9): 3.10.8
  4. Caiman version (e.g. 1.9.12): 1.11.3
  5. Which demo exhibits the problem (if applicable): N/A, but insert paths to numpy array or pass array directly
  6. How you installed Caiman (pure conda, conda + compile, colab, ..): conda
  7. Details:

Documentation implies support for numpy arrays (and numpy memmaps, considering they are essentially drop-in), but this does not seem to be the case (or at least it's not clear to me how to do it). If I don't just have reading comprehension issues, an inability to pass data directly is a pretty significant deterrent to using the package as-is, particularly for real-time / online usage where your data is likely to be flat binary.

Aside from the performance benefits of avoiding multiple memory swaps, accepting numpy data gives users a lot of flexibility in how they provide their data. From an end-user perspective, it seems a bit silly/convoluted to take some data, convert it to .tif or .hdf5, only to load and then convert it again. Plus, converting to .tif might give someone dozens of files to keep track of, and .hdf5 is notorious for corruption (or at least, it was). Obviously I can and have edited the functions to accept numpy arrays locally, but perhaps other users might not be as savvy or motivated.

@pgunn
Member

pgunn commented Dec 3, 2024

Hello,
The load() function in caiman.base.movies should be able to load a .npy file; right now I don't think we have docs or example code for loading in-memory numpy arrays, but I think calling the movie constructor around an ndarray will (probably) do the right thing.

We should document this better.
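
Something like this is the idea (an untested sketch; the array and frame rate below are placeholders, and casting to float32 mirrors what load() does when reading from disk):

import numpy as np
import caiman as cm

# wrap an in-memory (T x Y x X) ndarray in a caiman movie
arr = np.random.rand(100, 64, 64).astype(np.float32)
m = cm.movie(arr, fr=30)
print(m.shape, m.fr)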

@fdeguire03

One spot that I think could be better supported for numpy is the caiman.mmapping.load_memmap function.

Currently, the function restricts memmap loading to files with a .mmap extension. However, there's no technical limitation preventing the use of .npy files. By switching the memmap constructor on line 66 in mmapping.py from np.memmap to np.lib.format.open_memmap, .npy files can be loaded without additional code modifications. This change offers two key benefits:

  1. Increased file type flexibility
  2. Preservation of memmap shape and data type information directly in the file header, eliminating the need to decode these details from the filename

I'd be happy to submit a pull request implementing this change if the maintainers are interested.
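
For illustration, the swap amounts to something like this (a sketch; the file paths and shape below are made up):

import numpy as np

# current behaviour: shape, dtype, and order must be supplied by the caller,
# decoded in caiman from the .mmap filename
Yr_mmap = np.memmap('example.mmap', mode='r', dtype=np.float32,
                    shape=(512 * 512, 1000), order='C')

# proposed behaviour: the .npy header already stores shape, dtype, and order,
# so only the path and mode are needed
Yr_npy = np.lib.format.open_memmap('example.npy', mode='r')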

@fdeguire03

In general, I believe it is best practice to use the standard .npy extension instead of .mmap. While changing this throughout the codebase would be somewhat tedious, the most immediate improvement would be to modify the load_memmap function to support .npy files. This would prevent users from having to unnecessarily resave existing NumPy arrays just to work with the function.
The changes are straightforward—essentially just switching from np.memmap to np.lib.format.open_memmap—but would improve user convenience.

@pgunn
Member

pgunn commented Dec 12, 2024

To be clear, these are not the same format on-disk, and if we were to actually use the header information on shape and data type, we'd have to have separate code paths to do that versus the (headerless/raw) mmap format that np.memmap uses.

I don't think it's a bad idea to do this for intermediate files (I assume this is mostly what you're talking about) - we've occasionally thought about switching data formats before for intermediate files (I was thinking about zarr or n5 sometime back - both have a lot of appeal and also have cross-language support). It may also be worth thinking about making a set of standard outputs in a post-run directory (or even mid-run directory) to make gluing it into other software easier, although if we're doing that, neither npy nor mmap is particularly friendly while zarr/n5 both seem pretty compelling.

If you're more thinking about initial input files to the pipeline, adding support for more formats isn't a bad idea (but we already support npy files).

@fdeguire03

I believe that other than the 128-byte header, the .mmap and .npy formats are exactly the same on disk. But you are right that np.memmap expects no header, whereas np.load (or np.lib.format.open_memmap) extracts the header information.

In my use case, I am using only some parts of the pipeline. I have npy files that I need to feed into the CNMF step. Part of this step calls mmapping.load_memmap, which throws an error when I give it an npy file -- even though the npy file is in the exact same format as the .mmap file and could be used interchangeably.

@fdeguire03

This is how I think support for npy files could easily be added.

Adding support for npy files in this way would not affect any code outside of this function, as the object returned by np.memmap is identical to the object returned by np.load:

# Excerpt from caiman.mmapping.load_memmap(), with the proposed .npy branch added
extension = pathlib.Path(filename).suffix
if extension not in ('.mmap', '.npy'):
    logger.error(f"Unknown extension for file {filename}")
    raise ValueError(f'Unknown file extension for file {filename} (should be .mmap or .npy)')
# Strip path components and use CAIMAN_DATA/example_movies
# TODO: Eventually get the code to save these in a different dir
#fn_without_path = os.path.split(filename)[-1]
#fpart = fn_without_path.split('_')[1:]  # The filename encodes the structure of the map
decoded_fn = caiman.paths.decode_mmap_filename_dict(filename)
d1    = decoded_fn['d1']
d2    = decoded_fn['d2']
d3    = decoded_fn['d3']
T     = decoded_fn['T']
order = decoded_fn['order']

#d1, d2, d3, T, order = int(fpart[-9]), int(fpart[-7]), int(fpart[-5]), int(fpart[-1]), fpart[-3]

filename = caiman.paths.fn_relocated(filename)
if extension == '.mmap':
    Yr = np.memmap(filename, mode=mode, shape=prepare_shape((d1 * d2 * d3, T)), dtype=np.float32, order=order)
elif extension == '.npy':
    Yr = np.load(filename, mmap_mode=mode)  # returns a numpy.memmap when mmap_mode is set
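
As a quick sanity check of that claim (a sketch with a throwaway file):

import numpy as np

arr = np.arange(12, dtype=np.float32).reshape(3, 4)
np.save('check.npy', arr)
loaded = np.load('check.npy', mmap_mode='r')
# np.load with a mmap_mode returns a numpy.memmap, the same type np.memmap returns
assert isinstance(loaded, np.memmap)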

@pgunn
Member

pgunn commented Dec 12, 2024

Let me think about this for a bit. You have an example of a use case that might benefit (although another option would be for your calling code to save in the headerless format and name files the way caiman expects). Still, the code change isn't too intrusive (assuming you only want to provide input; it'd be more intrusive to change the saved formats over, and I'm still leaning towards a more language-neutral format if we get around to that).

I'll try to get back to you within the next few days on this - might also check in with Kushal and see how our general ideas of how to handle temporary files and output dirs might mesh.

Thanks for reaching out to us either way - we'll be in touch.

@kushalkolar
Collaborator

As far as I understand, as long as you have something that's sufficiently numpy-like in the right shape and order, you should be able to give it to CNMF.fit()?

https://github.com/flatironinstitute/CaImAn/blob/main/caiman/source_extraction/cnmf/cnmf.py#L408

IMO the better way to do this is to decouple file loading and array objects that get passed to algos.

@fdeguire03

fdeguire03 commented Dec 12, 2024

Agreed, there don't seem to be any technical barriers preventing numpy-like arrays in CNMF.fit(). However, the current parallelization framework relies on load_memmap() for efficient data loading across child processes, and this function raises an exception if the file type is not .mmap. This is likely a deliberate design choice to ensure consistent and efficient data handling in parallel processing. As Paul mentioned above, there is some discussion to be had about the best file type for this efficient loading in child processes.

Working within the current numpy array memmapping scheme, the core issue is that .npy files are essentially equivalent to .mmap files and should be acceptable in load_memmap() without raising an exception.

@darikoneil
Author

Thanks for the expedient response @pgunn. I can confirm that the load() & load_movie_chain() functions handle .npy files appropriately. The MotionCorrect and OnAcid classes handle this appropriately as well. I can also confirm that directly instantiating seems to work, although one has to take care to cast the data to a float (since loading files does that automatically for the user). Ultimately, I ended up just reformatting the data into pixels x frames mmaps since they were about to be put into that form anyway.

That being said, if you're following along with the examples like I was, the CNMFParams class will fail to validate when passed .npy files. Here's a minimal reproducible example:

import numpy as np
from caiman.source_extraction.cnmf.params import CNMFParams

params = CNMFParams(params_dict={"fnames": [
"H:\\E002\\Retrieval\\results\\testimages00.npy",
"H:\\E002\\Retrieval\\results\\testimages01.npy"
]})

I just reinstalled caiman in a fresh environment as I didn't remember what I changed to get it working last week. This specific failure can be blamed on the caiman.base.movies.get_file_size function not supporting .npy files. It's an easy fix, a 2-3 line change. I can submit a pull request, but since it's extremely trivial it's probably easier for you to just add it to whatever your current development branch is.

When I went looking for the failure point, I did notice the parameter validation is a bit dense. Have you considered modernizing your validation (e.g., with pydantic)? It's obviously a non-trivial refactor (albeit one not intrusive to the rest of the codebase) and another dependency, but it would make your validation much more straightforward, easier to debug, and easier to extend with additional constraints. You could also eventually decorate your classes so that parameters pass through the validator before instantiation, which would free them from a lot of duplicated validation logic.

For example,

from pydantic import BaseModel, Field, field_validator, model_validator

class CNMFParams(BaseModel):
    decay_time: float = Field(default=1.0, gt=0.0)  # validated as a non-zero, positive number
    dxy: tuple[float, float] | None = None
    dims: tuple[int, ...] | None = None
    is3D: bool = False
    gSig: tuple[int, ...] | None = None

    @field_validator("dxy", mode="before")
    @classmethod
    def validate_dxy(cls, dxy):
        # coerce a scalar or length-1 sequence to two dimensions
        if isinstance(dxy, (int, float)):
            return dxy, dxy
        elif isinstance(dxy, (list, tuple)) and len(dxy) == 1:
            return dxy[0], dxy[0]
        else:
            # will raise ValidationError if not tuple[float, float]
            return dxy

    @model_validator(mode="after")
    def validate_dimension_dependent_parameters(self):
        # validate that dimension-dependent parameters are consistent
        if self.is3D:
            assert len(self.gSig) == 3
        return self

# validate_with decorator implementation not shown for simplicity

class SomeCaimanClass:
    @validate_with(CNMFParams)
    def __init__(self, decay_time, dxy, dims, is3D, gSig):
        ...

class SomeOtherCaimanClass:
    @validate_with(CNMFParams)
    def __init__(self, dxy, dims):
        ...

If desired, this has the added benefit of providing all of your defaults from one location (the CNMFParams), so you wouldn't have to chase down inconsistencies there. If you like, we can discuss this in a separate issue.

@pgunn
Member

pgunn commented Dec 12, 2024

@darikoneil
Ah, that's right. get_file_size() should probably live with and share code with the two other functions that parse input, and it's probably worth going over all the formats (particularly those that don't have CI tests) and making sure they have everything they need in each function.

There's a lot of low-hanging fruit in terms of simplifying and improving code quality; better validation is just one of probably dozens of examples. Hopefully we'll either get to fixing this or some future more substantial overhaul will see us replace (rather than fix) a lot of the bad code.

The lack of support in get_file_size() is something I'll get to soon though (if you want to do a PR against the dev branch, that's fine - some people really like to be on a list of contributors for tons of projects and while that way of thinking is alien to me, I don't really mind it, plus it will save me a small amount of time).

@pgunn
Member

pgunn commented Dec 12, 2024

> Agreed, there don't seem to be any technical barriers preventing numpy-like arrays in CNMF.fit(). However, the current parallelization framework relies on load_memmap() for efficient data loading across child processes, and this function raises an exception if the file type is not .mmap. This is likely a deliberate design choice to ensure consistent and efficient data handling in parallel processing. As Paul mentioned above, there is some discussion to be had about the best file type for this efficient loading in child processes.
>
> Working within the current numpy array memmapping scheme, the core issue is that .npy files are essentially equivalent to .mmap files and should be acceptable in load_memmap() without raising an exception.

I'm not sure I agree with this argument; the fact that they're very similar formats doesn't matter a lot given how trivially convertible most file formats are to each other; if it'd take a few lines of code to convert an hdf5 file to the memmap format or a few lines of code to convert a npy file to the memmap format, does it really matter? In some ways, the hdf5 file is closer to the npy format in terms of practical handling because hdf5 files support introspection on what's inside and memmap does not.
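
For illustration, such a conversion might look roughly like this (a sketch; the paths are made up, and the shape/order information caiman expects in the filename isn't handled here):

import numpy as np

# strip the .npy header by writing the array back out as raw bytes;
# note that tofile() always writes in C order
arr = np.load('input.npy', mmap_mode='r')
arr.tofile('converted.mmap')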

That said, if information can be moved from filenames into internal headers, then whichever format we pick with internal headers feels like a better thing to be parsing (more structure, less likelihood of running into filename length or naming limits, plus the filenames are fairly ugly). Looking back, I think it probably was a mistake to encode information in filenames and stay headerless (although any long-lived software project will have a lot of hindsight-revealed regrets).

Still thinking this through though.

@darikoneil
Author

> In general, I believe it is best practice to use the standard .npy extension instead of .mmap. While changing this throughout the codebase would be somewhat tedious, the most immediate improvement would be to modify the load_memmap function to support .npy files. This would prevent users from having to unnecessarily resave existing NumPy arrays just to work with the function. The changes are straightforward—essentially just switching from np.memmap to np.lib.format.open_memmap—but would improve user convenience.

One caveat is memory contiguity & type coercion. Chances are the data will be provided as a frames x Y x X tensor of unsigned integers, as most imaging software records data this way and most libraries load imaging stacks / videos this way. If that's the case, the array will need to be rewritten anyway.

@pgunn
Member

pgunn commented Dec 12, 2024

Another thing I'm wary of is effectively promising to preserve the format of temporary files across a run; that's something I think it'd be good to change, and if we produce a documented, externally-friendly format before we're ready, we may be stuck supporting it long-term, which could hamper us when we decide to move to something better (whether that's zarr/hdf5, some dask format, or something else).

(maybe this ship has already sailed?)

@darikoneil
Author

> @darikoneil Ah, that's right. get_file_size() should probably live with and share code with the two other functions that parse input, and it's probably worth going over all the formats (particularly those that don't have CI tests) and making sure they have everything they need in each function.
>
> There's a lot of low-hanging fruit in terms of simplifying and improving code quality; better validation is just one of probably dozens of examples. Hopefully we'll either get to fixing this or some future more substantial overhaul will see us replace (rather than fix) a lot of the bad code.
>
> The lack of support in get_file_size() is something I'll get to soon though (if you want to do a PR against the dev branch, that's fine - some people really like to be on a list of contributors for tons of projects and while that way of thinking is alien to me, I don't really mind it, plus it will save me a small amount of time).

Sounds good, I'll make a pull request now. Re: validation, setting up a more developer-friendly validation class is pretty straightforward and something I wouldn't mind starting. I'm pretty busy in the immediate future with a thesis defense in ~3 months, but I'll check back in the spring and see if it's still needed.

@pgunn
Member

pgunn commented Dec 12, 2024

Let's hold this issue open just a little longer while we consider the concerns/ideas raised by @fdeguire03.

@fdeguire03

> I'm not sure I agree with this argument; the fact that they're very similar formats doesn't matter a lot given how trivially convertible most file formats are to each other; if it'd take a few lines of code to convert an hdf5 file to the memmap format or a few lines of code to convert a npy file to the memmap format, does it really matter? In some ways, the hdf5 file is closer to the npy format in terms of practical handling because hdf5 files support introspection on what's inside and memmap does not.

I think my argument is that .npy and .mmap are not just very similar formats - they are fundamentally the same format. .npy simply has a small header attached to hold some metadata. You can even load an .npy file with the np.memmap constructor if you add an offset:

mmap1 = np.memmap('<file_path>.mmap', shape=shape, dtype=np.float32, mode='r')
mmap2 = np.memmap('<file_path>.npy', shape=shape, dtype=np.float32, mode='r', offset=128)  # skip the .npy header

In my case, even though I have the data prepared in an .npy file in the correct format, I am forced to copy it into an identical .mmap file just to pass it through the load_memmap() function.

@kushalkolar
Collaborator

@darikoneil type validation is important and a separate issue. It would fit with a larger refactor, perhaps with #1391

Some params would be easy to do this for, but there are many that may require digging into the nested calls to determine all the possible types; if you want to do it, that'd be great. Caiman's lack of input and output validation makes it very hard to create reliable visualizations.

@fdeguire03

@darikoneil Just checked out your PR. One small suggestion: using np.load loads the entire .npy file into memory just to extract the shape. This isn't a big deal for small files, but the worst-case time is very bad. Two options to avoid the unnecessary load:

  1. Read only the file header:

     with open(file_name, 'rb') as f:
         # Read the header information
         version = np.lib.format.read_magic(f)
         shape, _, _ = np.lib.format._read_array_header(f, version)

  2. Open the file in memory-mapped mode:

     shape = np.load(file_name, allow_pickle=False, mmap_mode='r').shape

@pgunn
Member

pgunn commented Dec 12, 2024

In theory we should probably look over all file types we support in get_file_size() for this kind of thing.

@fdeguire03

Based on a quick look, the tiff method looks good. I haven't used the other file types much, so can't make a quick judgement on their methodology.

Would you want me to push that fix on the npy file size?

@kushalkolar
Collaborator

There's a public method; it's not good practice to use private methods for these types of things: https://numpy.org/doc/stable/reference/generated/numpy.lib.format.read_array_header_2_0.html

@fdeguire03

fdeguire03 commented Dec 12, 2024

Sure, the method just changes based on the npy file version. Could use something like this:

shape, _, _ = np.lib.format.read_array_header_1_0(f) if version == (1, 0) else np.lib.format.read_array_header_2_0(f)

I thought the _read_array_header(f, version) method was slightly cleaner, but up to you on what you think looks good. Could also use the np.load with mmap_mode and avoid using these read array header methods.

@kushalkolar
Collaborator

Yup, checking for versions is the right way to do it even if it takes more lines of code; libraries can and do change private methods anytime they need to, without warning.
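
For instance, a version-aware header read could look like this (a sketch; the helper name is made up):

import numpy as np

def npy_shape(file_name):
    # read only the .npy header to get the array shape
    with open(file_name, 'rb') as f:
        version = np.lib.format.read_magic(f)
        if version == (1, 0):
            shape, _, _ = np.lib.format.read_array_header_1_0(f)
        elif version == (2, 0):
            shape, _, _ = np.lib.format.read_array_header_2_0(f)
        else:
            # unknown/newer header version: fall back to a memory-mapped open
            shape = np.load(file_name, mmap_mode='r', allow_pickle=False).shape
    return shape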

@darikoneil
Author

> Sure, the method just changes based on the npy file version. Could use something like this:
>
> shape, _, _ = np.lib.format.read_array_header_1_0(f) if version == (1, 0) else np.lib.format.read_array_header_2_0(f)
>
> I thought the _read_array_header(f, version) method was slightly cleaner, but up to you on what you think looks good. Could also use the np.load with mmap_mode and avoid using these read array header methods.

> @darikoneil Just checked out your PR. One small suggestion: using np.load loads the entire .npy file into memory just to extract the shape. This isn't a big deal for small files, but the worst-case time is very bad. Two options to avoid the unnecessary load:
>
> 1. Read only the file header:
>
>     with open(file_name, 'rb') as f:
>         # Read the header information
>         version = np.lib.format.read_magic(f)
>         shape, _, _ = np.lib.format._read_array_header(f, version)
>
> 2. Open the file in memory-mapped mode:
>
>     shape = np.load(file_name, allow_pickle=False, mmap_mode='r').shape

Oops, I 100% should have been using the mmap_mode keyword there. I think the PR went through already, so I'll make another, I guess. I vote that the keyword should be preferred over reading the header, since the private method presumably doesn't have a stability guarantee.

@darikoneil
Author

@kushalkolar Cool, I'll check #1391 out. I definitely don't mind doing it! It would also improve the documentation, since the constraints in a validator/parameter class would be explicit and self-documenting. I'm not sure how fast I could do it with my defense coming up, but I think I could implement it fairly quickly once I have some time for it.

@fdeguire03

@pgunn @kushalkolar any preference on opening the .npy file in memory-mapped mode vs. using the public numpy methods to read the header? I can submit a new pull request, just wanted to get your input on conventions.

Also, I wanted to check in on what you were thinking about adding .npy support to the load_memmap method.

@pgunn
Member

pgunn commented Dec 13, 2024

On the first part, I think reading the header probably will have the library do less work and should be preferred, even though the code will be less legible.

Still thinking about the second.

@fdeguire03

@pgunn Happy 2025!

I wanted to check back in on what you were thinking about adding .npy support to the load_memmap method.

Since .npy files are equivalent to .mmap files (except for the addition of a 128-byte header), it would only be a 2-3 line change in load_memmap.

Let me know what you're thinking!

@pgunn
Member

pgunn commented Jan 9, 2025

I've gone back and forth on this and, failing to reach a firm conclusion, I think it makes sense to accept the suggestion and implement the feature.

@fdeguire03

Great, I really appreciate it! I'll work on making the pull request this evening.
