Numpy support #1431

Open · darikoneil opened this issue Dec 3, 2024 · 31 comments

Comments

@darikoneil

Your setup:

  1. Operating System (Linux, MacOS, Windows): Windows 10
  2. Hardware type (x86, ARM..) and RAM: x86, 128 GB RAM, 24 GB VRAM
  3. Python Version (e.g. 3.9): 3.10.8
  4. Caiman version (e.g. 1.9.12): 1.11.3
  5. Which demo exhibits the problem (if applicable): N/A, but insert paths to numpy array or pass array directly
  6. How you installed Caiman (pure conda, conda + compile, colab, ..): conda
  7. Details:

Documentation implies support for numpy arrays (and numpy memmaps, considering they are essentially drop-in), but this does not seem to be the case (or at least it's not clear to me how to do it). If I don't just have reading comprehension issues, an inability to pass data directly is a pretty significant deterrent to using the package as-is, particularly for real-time / online usage where your data is likely to be flat binary.

Aside from the performance benefits of avoiding multiple memory swaps, accepting numpy data gives users a lot of flexibility in how they provide their data. From an end-user perspective, it seems a bit silly/convoluted to take some data, convert it to .tif or .hdf5, only to load and then convert it again. Plus, converting to .tif might give someone dozens of files to keep track of, and .hdf5 is notorious for corruption (or at least, it was). Obviously I can and have edited the functions to accept numpy arrays locally, but perhaps other users might not be as savvy or motivated.

@pgunn
Member

pgunn commented Dec 3, 2024

Hello,
The load() function in caiman.base.movies should be able to load a .npy file; right now I don't think we have docs or example code for loading in-memory numpy arrays, but I think calling the movie constructor around an ndarray will (probably) do the right thing.

We should document this better.
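
Something like this is the idea (an untested sketch; the array and frame rate below are placeholders, and casting to float32 mirrors what load() does when reading from disk):

import numpy as np
import caiman as cm

# wrap an in-memory (T x Y x X) ndarray in a caiman movie
arr = np.random.rand(100, 64, 64).astype(np.float32)
m = cm.movie(arr, fr=30)
print(m.shape, m.fr)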

@fdeguire03

One spot that I think could be better supported for numpy is the caiman.mmapping.load_memmap function.

Currently, the function restricts memmap loading to files with a .mmap extension. However, there's no technical limitation preventing the use of .npy files. By switching the memmap constructor on line 66 in mmapping.py from np.memmap to np.lib.format.open_memmap, .npy files can be loaded without additional code modifications. This change offers two key benefits:

  1. Increased file type flexibility
  2. Preservation of memmap shape and data type information directly in the file header, eliminating the need to decode these details from the filename

I'd be happy to submit a pull request implementing this change if the maintainers are interested.
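
For illustration, the swap amounts to something like this (a sketch; the file paths and shape below are made up):

import numpy as np

# current behaviour: shape, dtype, and order must be supplied by the caller,
# decoded in caiman from the .mmap filename
Yr_mmap = np.memmap('example.mmap', mode='r', dtype=np.float32,
                    shape=(512 * 512, 1000), order='C')

# proposed behaviour: the .npy header already stores shape, dtype, and order,
# so only the path and mode are needed
Yr_npy = np.lib.format.open_memmap('example.npy', mode='r')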

@fdeguire03

In general, I believe it is best practice to use the standard .npy extension instead of .mmap. While changing this throughout the codebase would be somewhat tedious, the most immediate improvement would be to modify the load_memmap function to support .npy files. This would prevent users from having to unnecessarily resave existing NumPy arrays just to work with the function.
The changes are straightforward—essentially just switching from np.memmap to np.lib.format.open_memmap—but would improve user convenience.

@pgunn
Member

pgunn commented Dec 12, 2024

To be clear, these are not the same format on-disk, and if we were to actually use the header information on shape and data type, we'd have to have separate code paths to do that versus the (headerless/raw) mmap format that np.memmap uses.

I don't think it's a bad idea to do this for intermediate files (I assume this is mostly what you're talking about) - we've occasionally thought about switching data formats before for intermediate files (I was thinking about zarr or n5 sometime back - both have a lot of appeal and also have cross-language support). It may also be worth thinking about making a set of standard outputs in a post-run directory (or even mid-run directory) to make gluing it into other software easier, although if we're doing that, neither npy nor mmap is particularly friendly while zarr/n5 both seem pretty compelling.

If you're more thinking about initial input files to the pipeline, adding support for more formats isn't a bad idea (but we already support npy files).

@fdeguire03

I believe that other than the 128-byte header, the .mmap and .npy formats are exactly the same on disk. But you are right that np.memmap expects no header, whereas np.load (or np.lib.format.open_memmap) extracts the header information.

In my use case, I am using only some parts of the pipeline. I have npy files that I need to feed into the CNMF step. Part of this step calls mmapping.load_memmap, which throws an error when I give it an npy file -- even though the npy file is in the exact same format as the .mmap file and could be used interchangeably.

@fdeguire03

This is how I think support for npy files could easily be added.

Adding support for npy files in this way would not affect any code outside of this function, as the object returned by np.memmap is identical to the object returned by np.load:

# Excerpt from caiman.mmapping.load_memmap(), with the proposed .npy branch added
extension = pathlib.Path(filename).suffix
if extension not in ('.mmap', '.npy'):
    logger.error(f"Unknown extension for file {filename}")
    raise ValueError(f'Unknown file extension for file {filename} (should be .mmap or .npy)')
# Strip path components and use CAIMAN_DATA/example_movies
# TODO: Eventually get the code to save these in a different dir
#fn_without_path = os.path.split(filename)[-1]
#fpart = fn_without_path.split('_')[1:]  # The filename encodes the structure of the map
decoded_fn = caiman.paths.decode_mmap_filename_dict(filename)
d1    = decoded_fn['d1']
d2    = decoded_fn['d2']
d3    = decoded_fn['d3']
T     = decoded_fn['T']
order = decoded_fn['order']

#d1, d2, d3, T, order = int(fpart[-9]), int(fpart[-7]), int(fpart[-5]), int(fpart[-1]), fpart[-3]

filename = caiman.paths.fn_relocated(filename)
if extension == '.mmap':
    Yr = np.memmap(filename, mode=mode, shape=prepare_shape((d1 * d2 * d3, T)), dtype=np.float32, order=order)
elif extension == '.npy':
    Yr = np.load(filename, mmap_mode=mode)  # returns a numpy.memmap when mmap_mode is set
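
As a quick sanity check of that claim (a sketch with a throwaway file):

import numpy as np

arr = np.arange(12, dtype=np.float32).reshape(3, 4)
np.save('check.npy', arr)
loaded = np.load('check.npy', mmap_mode='r')
# np.load with a mmap_mode returns a numpy.memmap, the same type np.memmap returns
assert isinstance(loaded, np.memmap)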

@pgunn
Member

pgunn commented Dec 12, 2024

Let me think about this for a bit. You have an example of a use case that might benefit (although another option would be for your calling code to save in the headerless format and name files the way caiman expects). Still, the code change isn't too intrusive (assuming you only want to provide input; it'd be more intrusive to change the saved formats over, and I'm still leaning towards a more language-neutral format if we get around to that).

I'll try to get back to you within the next few days on this - might also check in with Kushal and see how our general ideas of how to handle temporary files and output dirs might mesh.

Thanks for reaching out to us either way - we'll be in touch.

@kushalkolar
Collaborator

As far as I understand, as long as you have something that's sufficiently numpy-like in the right shape and order, you should be able to give it to CNMF.fit()?

https://github.com/flatironinstitute/CaImAn/blob/main/caiman/source_extraction/cnmf/cnmf.py#L408

IMO the better way to do this is to decouple file loading and array objects that get passed to algos.

@fdeguire03

fdeguire03 commented Dec 12, 2024

Agreed, there don't seem to be any technical barriers preventing numpy-like arrays in CNMF.fit(). However, the current parallelization framework relies on load_memmap() for efficient data loading across child processes, and this function raises an exception if the file type is not .mmap. This is likely a deliberate design choice to ensure consistent and efficient data handling in parallel processing. As Paul mentioned above, there is some discussion to be had about the best file type for this efficient loading in child processes.

Working within the current numpy array memmapping scheme, the core issue is that .npy files are essentially equivalent to .mmap files and should be acceptable in load_memmap() without raising an exception.

@darikoneil
Author

Thanks for the expedient response @pgunn. I can confirm that the load() & load_movie_chain() functions handle .npy files appropriately. The MotionCorrect and OnAcid classes handle this appropriately as well. I can also confirm that directly instantiating seems to work, although one has to take care to cast the data to a float (since loading files does that automatically for the user). Ultimately, I ended up just reformatting the data into pixels x frames mmaps since they were about to be put into that form anyway.

That being said, if you're following along with the examples like I was, the CNMFParams class will fail to validate when passed .npy files. Here's a minimal reproducible example:

import numpy as np
from caiman.source_extraction.cnmf.params import CNMFParams

params = CNMFParams(params_dict={"fnames": [
"H:\\E002\\Retrieval\\results\\testimages00.npy",
"H:\\E002\\Retrieval\\results\\testimages01.npy"
]})

I just reinstalled caiman in a fresh environment as I didn't remember what I changed to get it working last week. This specific failure can be blamed on the caiman.base.movies.get_file_size function not supporting .npy files. It's an easy fix, a 2-3 line change. I can submit a pull request, but since it's extremely trivial it's probably easier for you to just add it to whatever your current development branch is.

When I went looking for the failure point, I did notice the parameter validation is a bit dense. Have you considered modernizing your validation (e.g., with pydantic)? It's obviously a non-trivial refactor (albeit one not intrusive to the rest of the codebase) and another dependency, but it would make your validation much more straightforward, easier to debug, and easier to extend with additional constraints. You could also eventually decorate your classes so that parameters pass through the validator before instantiation, which would free them from a lot of duplicated validation logic.

For example,

from pydantic import BaseModel, Field, field_validator, model_validator

class CNMFParams(BaseModel):
    decay_time: float = Field(default=1.0, gt=0.0)  # validated as a non-zero, positive number
    dxy: tuple[float, float] | None = None
    dims: tuple[int, ...] | None = None
    is3D: bool = False
    gSig: tuple[int, ...] | None = None

    @field_validator("dxy", mode="before")
    @classmethod
    def validate_dxy(cls, dxy):
        # coerce a scalar or length-1 sequence to two dimensions
        if isinstance(dxy, (int, float)):
            return dxy, dxy
        elif isinstance(dxy, (list, tuple)) and len(dxy) == 1:
            return dxy[0], dxy[0]
        else:
            # will raise ValidationError if not tuple[float, float]
            return dxy

    @model_validator(mode="after")
    def validate_dimension_dependent_parameters(self):
        # validate that dimension-dependent parameters are consistent
        if self.is3D:
            assert len(self.gSig) == 3
        return self

# validate_with decorator implementation not shown for simplicity

class SomeCaimanClass:
    @validate_with(CNMFParams)
    def __init__(self, decay_time, dxy, dims, is3D, gSig):
        ...

class SomeOtherCaimanClass:
    @validate_with(CNMFParams)
    def __init__(self, dxy, dims):
        ...

If desired, this has the added benefit of providing all of your defaults from one location (the CNMFParams), so you wouldn't have to chase down inconsistencies there. If you like, we can discuss this in a separate issue.

@pgunn
Member

pgunn commented Dec 12, 2024

@darikoneil
Ah, that's right. get_file_size() should probably live with and share code with the two other functions that parse input, and it's probably worth going over all the formats (particularly those that don't have CI tests) and making sure they have everything they need in each function.

There's a lot of low-hanging fruit in terms of simplifying and improving code quality; better validation is just one of probably dozens of examples. Hopefully we'll either get to fixing this or some future more substantial overhaul will see us replace (rather than fix) a lot of the bad code.

The lack of support in get_file_size() is something I'll get to soon though (if you want to do a PR against the dev branch, that's fine - some people really like to be on a list of contributors for tons of projects and while that way of thinking is alien to me, I don't really mind it, plus it will save me a small amount of time).

@pgunn
Member

pgunn commented Dec 12, 2024

> Agreed, there don't seem to be any technical barriers preventing numpy-like arrays in CNMF.fit(). However, the current parallelization framework relies on load_memmap() for efficient data loading across child processes, and this function raises an exception if the file type is not .mmap. This is likely a deliberate design choice to ensure consistent and efficient data handling in parallel processing. As Paul mentioned above, there is some discussion to be had about the best file type for this efficient loading in child processes.
>
> Working within the current numpy array memmapping scheme, the core issue is that .npy files are essentially equivalent to .mmap files and should be acceptable in load_memmap() without raising an exception.

I'm not sure I agree with this argument; the fact that they're very similar formats doesn't matter a lot given how trivially convertible most file formats are to each other; if it'd take a few lines of code to convert an hdf5 file to the memmap format or a few lines of code to convert a npy file to the memmap format, does it really matter? In some ways, the hdf5 file is closer to the npy format in terms of practical handling because hdf5 files support introspection on what's inside and memmap does not.
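
For illustration, such a conversion might look roughly like this (a sketch; the paths are made up, and the shape/order information caiman expects in the filename isn't handled here):

import numpy as np

# strip the .npy header by writing the array back out as raw bytes;
# note that tofile() always writes in C order
arr = np.load('input.npy', mmap_mode='r')
arr.tofile('converted.mmap')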

That said, if information can be moved from filenames into internal headers, then whichever format we pick with internal headers feels like a better thing to be parsing (more structure, less likelihood of running into filename length or naming limits, plus the filenames are fairly ugly). Looking back, I think it probably was a mistake to encode information in filenames and stay headerless (although any long-lived software project will have a lot of hindsight-revealed regrets).

Still thinking this through though.

@darikoneil
Author

> In general, I believe it is best practice to use the standard .npy extension instead of .mmap. While changing this throughout the codebase would be somewhat tedious, the most immediate improvement would be to modify the load_memmap function to support .npy files. This would prevent users from having to unnecessarily resave existing NumPy arrays just to work with the function. The changes are straightforward—essentially just switching from np.memmap to np.lib.format.open_memmap—but would improve user convenience.

One caveat is memory contiguity & type coercion. Chances are the data will be provided as a frames x Y x X tensor of unsigned integers, as most imaging software records data this way and most libraries load imaging stacks / videos this way. If that's the case, the array will need to be rewritten anyway.

@pgunn
Member

pgunn commented Dec 12, 2024

Another thing I'm wary of is effectively promising to preserve the format of temporary files across a run; that's something I think it'd be good to change, and if we produce a documented, externally-friendly format before we're ready, we may be stuck supporting it long-term, which could hamper us when we decide to move to something better (whether that's zarr/hdf5, some dask format, or something else).

(maybe this ship has already sailed?)

@darikoneil
Author

> @darikoneil Ah, that's right. get_file_size() should probably live with and share code with the two other functions that parse input, and it's probably worth going over all the formats (particularly those that don't have CI tests) and making sure they have everything they need in each function.
>
> There's a lot of low-hanging fruit in terms of simplifying and improving code quality; better validation is just one of probably dozens of examples. Hopefully we'll either get to fixing this or some future more substantial overhaul will see us replace (rather than fix) a lot of the bad code.
>
> The lack of support in get_file_size() is something I'll get to soon though (if you want to do a PR against the dev branch, that's fine - some people really like to be on a list of contributors for tons of projects and while that way of thinking is alien to me, I don't really mind it, plus it will save me a small amount of time).

Sounds good, I'll make a pull request now. Re: validation, setting up a more developer-friendly validation class is pretty straightforward and something I wouldn't mind starting. I'm pretty busy in the immediate future with a thesis defense in ~3 months, but I'll check back in the spring and see if it's still needed.

@pgunn
Member

pgunn commented Dec 12, 2024

Let's hold this issue open just a little longer while we consider the concerns/ideas raised by @fdeguire03.

@fdeguire03

> I'm not sure I agree with this argument; the fact that they're very similar formats doesn't matter a lot given how trivially convertible most file formats are to each other; if it'd take a few lines of code to convert an hdf5 file to the memmap format or a few lines of code to convert a npy file to the memmap format, does it really matter? In some ways, the hdf5 file is closer to the npy format in terms of practical handling because hdf5 files support introspection on what's inside and memmap does not.

I think my argument is that .npy and .mmap are not just very similar formats - they are fundamentally the same format. .npy simply has a small header attached to hold some metadata. You can even load an .npy file with the np.memmap constructor if you add an offset:

mmap1 = np.memmap('<file_path>.mmap', shape=shape, dtype=np.float32, mode='r')
mmap2 = np.memmap('<file_path>.npy', shape=shape, dtype=np.float32, mode='r', offset=128)  # skip the .npy header

In my case, even though I have the data prepared in an .npy file in the correct format, I am forced to copy it into an identical .mmap file just to pass it through the load_memmap() function.

@kushalkolar
Collaborator

@darikoneil type validation is important and a separate issue. It would fit with a larger refactor, perhaps with #1391

Some params would be easy to do this for, but there are many that may require digging into the nested calls to determine all the possible types; if you want to do it, that'd be great. Caiman's lack of input and output validation makes it very hard to create reliable visualizations.

@fdeguire03

@darikoneil Just checked out your PR. One small suggestion: using np.load loads the entire .npy file into memory just to extract the shape. This isn't a big deal for small files, but the worst-case time is very bad. Two options to avoid the unnecessary load:

  1. Read only the file header:

     with open(file_name, 'rb') as f:
         # Read the header information
         version = np.lib.format.read_magic(f)
         shape, _, _ = np.lib.format._read_array_header(f, version)

  2. Open the file in memory-mapped mode:

     shape = np.load(file_name, allow_pickle=False, mmap_mode='r').shape

@pgunn
Member

pgunn commented Dec 12, 2024

In theory we should probably look over all file types we support in get_file_size() for this kind of thing.

@fdeguire03

Based on a quick look, the tiff method looks good. I haven't used the other file types much, so can't make a quick judgement on their methodology.

Would you want me to push that fix on the npy file size?

@kushalkolar
Collaborator

There's a public method; it's not good practice to use private methods for these types of things: https://numpy.org/doc/stable/reference/generated/numpy.lib.format.read_array_header_2_0.html

@fdeguire03

fdeguire03 commented Dec 12, 2024

Sure, the method just changes based on the npy file version. Could use something like this:

shape, _, _ = np.lib.format.read_array_header_1_0(f) if version == (1, 0) else np.lib.format.read_array_header_2_0(f)

I thought the _read_array_header(f, version) method was slightly cleaner, but up to you on what you think looks good. Could also use the np.load with mmap_mode and avoid using these read array header methods.

@kushalkolar
Collaborator

Yup, checking for versions is the right way to do it even if it takes more lines of code; libraries can and do change private methods anytime they need to, without warning.
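
For instance, a version-aware header read could look like this (a sketch; the helper name is made up):

import numpy as np

def npy_shape(file_name):
    # read only the .npy header to get the array shape
    with open(file_name, 'rb') as f:
        version = np.lib.format.read_magic(f)
        if version == (1, 0):
            shape, _, _ = np.lib.format.read_array_header_1_0(f)
        elif version == (2, 0):
            shape, _, _ = np.lib.format.read_array_header_2_0(f)
        else:
            # unknown/newer header version: fall back to a memory-mapped open
            shape = np.load(file_name, mmap_mode='r', allow_pickle=False).shape
    return shape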

@darikoneil
Author

> Sure, the method just changes based on the npy file version. Could use something like this:
>
> shape, _, _ = np.lib.format.read_array_header_1_0(f) if version == (1, 0) else np.lib.format.read_array_header_2_0(f)
>
> I thought the _read_array_header(f, version) method was slightly cleaner, but up to you on what you think looks good. Could also use the np.load with mmap_mode and avoid using these read array header methods.

> @darikoneil Just checked out your PR. One small suggestion: using np.load loads the entire .npy file into memory just to extract the shape. This isn't a big deal for small files, but the worst-case time is very bad. Two options to avoid the unnecessary load:
>
> 1. Read only the file header:
>
>     with open(file_name, 'rb') as f:
>         # Read the header information
>         version = np.lib.format.read_magic(f)
>         shape, _, _ = np.lib.format._read_array_header(f, version)
>
> 2. Open the file in memory-mapped mode:
>
>     shape = np.load(file_name, allow_pickle=False, mmap_mode='r').shape

Oops, I 100% should have been using the mmap_mode keyword there. I think the PR went through already, so I'll make another, I guess. I vote that the keyword should be preferred over reading the header, since the private method presumably doesn't have a stability guarantee.

@darikoneil
Author

@kushalkolar Cool, I'll check #1391 out. I definitely don't mind doing it! It would also improve the documentation, since the constraints in a validator/parameter class would be explicit and self-documenting. I'm not sure how fast I could do it with my defense coming up, but I think I could implement it fairly quickly once I have some time for it.

@fdeguire03

@pgunn @kushalkolar any preference on opening the .npy file in memory-mapped mode vs. using the public numpy methods to read the header? I can submit a new pull request, just wanted to get your input on conventions.

Also, I wanted to check in on what you were thinking about adding .npy support to the load_memmap method.

@pgunn
Member

pgunn commented Dec 13, 2024

On the first part, I think reading the header probably will have the library do less work and should be preferred, even though the code will be less legible.

Still thinking about the second.

@fdeguire03

@pgunn Happy 2025!

I wanted to check back in on what you were thinking about adding .npy support to the load_memmap method.

Since .npy files are equivalent to .mmap files (except for the addition of a 128-byte header), it would only be a 2-3 line change in load_memmap.

Let me know what you're thinking!

@pgunn
Member

pgunn commented Jan 9, 2025

I've gone back and forth on this and, failing to reach a firm conclusion, I think it makes sense to accept the suggestion and implement the feature.

@fdeguire03

Great, I really appreciate it! I'll work on making the pull request this evening.
