
Processing large datasets

Andrea Giovannucci edited this page May 3, 2016 · 5 revisions

Processing of large datasets

This page explains how to use the constrained NMF (CNMF) algorithm to process datasets that are too large to fit comfortably in memory. The main steps of this process, which is implemented in the file demo_patches.py, are explained below.

Reading the file and saving it in a memory mapped .mmap file, readable with numpy.memmap

This step has to be executed once for every new dataset. The data is loaded into memory in chunks and then saved incrementally in a .mmap file as a pixels x frames array. The name of the saved file encodes the original frame size, the array ordering, and the number of frames. For instance a file named

Yr_d1_483_d2_492_order_F_frames_1600_.mmap

has frame shape (483,492), has 1600 frames, and is stored in 'F' order. These files can be read with the utility function load_memmap in the utilities submodule.
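As an illustration of the naming convention above, the following is a minimal sketch of how such a file name could be decoded and the file opened directly with numpy.memmap. The helper names parse_mmap_name and open_mmap are hypothetical and are not part of the utilities submodule, and the float32 data type is an assumption; in practice load_memmap should be used.

```python
import os
import re

import numpy as np


def parse_mmap_name(filename):
    """Hypothetical helper: recover (d1, d2), order and number of frames
    from a name such as Yr_d1_483_d2_492_order_F_frames_1600_.mmap."""
    base = os.path.basename(filename)
    m = re.search(r"_d1_(\d+)_d2_(\d+)_order_([CF])_frames_(\d+)_", base)
    if m is None:
        raise ValueError("unrecognized file name: " + base)
    dims = (int(m.group(1)), int(m.group(2)))
    return dims, m.group(3), int(m.group(4))


def open_mmap(filename, dtype=np.float32):
    """Open the file read-only as a (pixels x frames) memory mapped array.
    The dtype is an assumption; it must match the one used when saving."""
    dims, order, n_frames = parse_mmap_name(filename)
    n_pixels = dims[0] * dims[1]
    Yr = np.memmap(filename, mode="r", dtype=dtype,
                   shape=(n_pixels, n_frames), order=order)
    return Yr, dims, n_frames


dims, order, n_frames = parse_mmap_name(
    "Yr_d1_483_d2_492_order_F_frames_1600_.mmap")
# dims == (483, 492), order == "F", n_frames == 1600
```

A single frame t can then be recovered as an image with Yr[:, t].reshape(dims, order=order), which reads only that column from disk.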

There are two functions for performing this memory mapping and retrieval procedure:

  • utilities.save_memmap: This function assumes that the whole dataset is stored in a list of .tif files. You need at least enough memory to open one such file. The function can also downsample each movie along any of the x, y, z directions, remove frames from the beginning of each movie, and select only a subset of pixels (see documentation).

  • utilities.load_memmap: This function returns a memory mapped version of the file passed as path. Many numpy operations can be applied to this array without loading the whole file into memory.
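The incremental saving described above can be sketched with plain numpy as follows. This is not the implementation of utilities.save_memmap: the function name save_chunks_to_mmap, its arguments, and the float32 data type are assumptions chosen for illustration, and the chunks are taken to be arrays of shape (frames, d1, d2).

```python
import numpy as np


def save_chunks_to_mmap(chunks, out_name, dims, n_frames, dtype=np.float32):
    """Hypothetical sketch: write movie chunks of shape (n, d1, d2) into a
    (pixels x frames) memory mapped file stored in 'F' order."""
    d1, d2 = dims
    Yr = np.memmap(out_name, mode="w+", dtype=dtype,
                   shape=(d1 * d2, n_frames), order="F")
    t = 0
    for chunk in chunks:
        n = chunk.shape[0]
        # Move time to the last axis, then flatten each frame in F order
        # so that pixel (x, y) lands in row x + d1 * y.
        block = np.reshape(np.moveaxis(chunk, 0, -1), (d1 * d2, n), order="F")
        Yr[:, t:t + n] = block
        t += n
    if t != n_frames:
        raise ValueError("chunks contain %d frames, expected %d" % (t, n_frames))
    Yr.flush()  # make sure everything is written to disk
    return Yr
```

Only one chunk is held in memory at a time, so the peak memory use is set by the chunk size rather than by the length of the movie.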

Run the CNMF algorithm on spatially overlapping patches

Once the dataset is saved in .mmap format, we can apply the CNMF algorithm on spatially overlapping patches in parallel. The logic for splitting the field of view into patches and for the parallel execution is contained in the map_reduce.run_CNMF_patches function. The function takes as input

  1. An options dictionary, exactly as in the demo.py example. In this case, however, the options refer to a single patch. Important parameters that should be specified are
      • K: the expected number of components per patch
      • gSig: the expected size of neurons (gSig=[5,5] means that the neurons are approximately 11 pixels in x and y)
      • the threshold for merging neurons
      • p: the order of the autoregressive model
      • memory_fact: the fraction of a patch to be processed in a single memory load (decrease this number to reduce memory usage)
  2. The geometric parameters describing the patches:
      • rf: the half-size of the receptive field (rf=10 means that the patches cover an area of 20x20 pixels)
      • stride: the amount of overlap among patches, in pixels
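To make the geometry concrete, here is a minimal sketch of how rf and stride could translate into a set of overlapping patches. It assumes that consecutive patches are shifted by 2*rf - stride pixels and that a final patch is placed flush with the border; the actual tiling inside run_CNMF_patches may handle borders differently. The function names patch_slices and coverage_count are hypothetical.

```python
import numpy as np


def patch_slices(dims, rf, stride):
    """Hypothetical sketch: tile a (d1, d2) field of view with patches of
    side 2*rf whose neighbours overlap by `stride` pixels."""
    size = 2 * rf
    step = size - stride
    if step <= 0:
        raise ValueError("stride must be smaller than 2*rf")

    def starts(n):
        if n <= size:
            return [0]
        s = list(range(0, n - size + 1, step))
        if s[-1] != n - size:
            s.append(n - size)  # extra patch so the border is covered
        return s

    return [(slice(x, min(x + size, dims[0])), slice(y, min(y + size, dims[1])))
            for x in starts(dims[0]) for y in starts(dims[1])]


def coverage_count(dims, patches):
    """Number of patches that contain each pixel."""
    cov = np.zeros(dims, dtype=int)
    for sx, sy in patches:
        cov[sx, sy] += 1
    return cov


patches = patch_slices((483, 492), rf=10, stride=2)
cov = coverage_count((483, 492), patches)
```

Pixels where cov is larger than 1 lie in the overlap between patches, so the same data is seen by more than one patch there.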

map_reduce.run_CNMF_patches

The parallel processing of the different patches is performed with the function map_reduce.run_CNMF_patches. The standard CNMF procedure is performed on each patch (preprocessing, initialization, update spatial, update temporal, merge, update spatial, update temporal), and the results for each patch are returned in the variables

  • A_tot: matrix of spatial filters including all the components found in all the patches and represented in the coordinate frame of the whole frame
  • C_tot: matrix of calcium traces corresponding to elements in A_tot
  • sn_tot: per pixel noise estimates
  • optional_outputs: dictionary containing the outputs of the algorithm per patch

saved in the struct array RESULTS.
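The phrase "represented in the coordinate frame of the whole frame" can be illustrated with a small sketch. It assumes that the footprints of a patch are stored as a (patch pixels x components) array flattened in 'F' order, and it uses a dense array for simplicity, whereas a sparse matrix would be the natural choice for real data. The function name embed_patch_footprints is hypothetical.

```python
import numpy as np


def embed_patch_footprints(A_patch, patch_slice, dims):
    """Hypothetical sketch: place footprints found in one patch into an
    array indexed by the pixels of the whole field of view.

    A_patch     : (patch_pixels, K) array, pixels flattened in F order
    patch_slice : (slice_x, slice_y) giving the patch position
    dims        : (d1, d2) size of the whole frame
    """
    sx, sy = patch_slice
    xx, yy = np.meshgrid(np.arange(sx.start, sx.stop),
                         np.arange(sy.start, sy.stop), indexing="ij")
    # Linear index (F order) of every patch pixel in the whole frame,
    # listed in the same F order used inside the patch.
    idx = np.ravel_multi_index((xx.ravel(order="F"), yy.ravel(order="F")),
                               dims, order="F")
    A_full = np.zeros((dims[0] * dims[1], A_patch.shape[1]),
                      dtype=A_patch.dtype)
    A_full[idx, :] = A_patch
    return A_full
```

Stacking the arrays obtained from all patches column by column (for example with np.hstack) gives a matrix with one column per component, which is the layout described for A_tot.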

Equalization: An added benefit of processing different spatial patches separately is that the algorithm looks for cells in an unbiased way throughout the field of view, and not only in the brightest areas, which is what the greedy initialization algorithm tends to do. However, this also creates a large number of false positive cells. To deal with this, we classify the components using a simple procedure explained below.

Merging: After all the patches have been processed, the results are combined and the components are merged using the standard merge_components.m function. Note that in this case merging can be done only with the fast version (chosen by setting options.fast_merge = 1), which is also the default option.

Classification: As mentioned above, applying the method on overlapping patches forces the algorithm to (initially) identify a specified number of components in each patch, regardless of the number of (active) cells that exist in that patch. In practice this can create a large number of false positive components. To deal with this problem we use a simple classification approach based on the power spectral density (PSD) of each voxel. After computing the PSD of each pixel (this computation takes place in preprocess_data.m, and only the values at relatively high frequencies are stored), the spectrum is whitened and k-means clustering is applied, with clusters corresponding to voxels that are active (i.e., at least one neuron is captured in that voxel) or not. This classification creates a binary mask of active/inactive voxels. Each component is then kept if it significantly overlaps with the mask of active pixels; otherwise it is discarded. This operation is performed in classify_components.m
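The last step of the classification, keeping a component only if it overlaps significantly with the active mask, could look like the sketch below. The text does not define "significantly", so the criterion used here (fraction of the footprint weight that falls on active pixels) and the threshold min_overlap=0.5 are assumptions, and the function name keep_components is hypothetical.

```python
import numpy as np


def keep_components(A, active_mask, min_overlap=0.5):
    """Hypothetical sketch: decide which components to keep.

    A           : (n_pixels, K) array of nonnegative spatial footprints
    active_mask : boolean array of length n_pixels (True = active pixel)
    min_overlap : assumed threshold on the fraction of each footprint's
                  weight that lies on active pixels
    """
    A = np.asarray(A, dtype=float)
    active_mask = np.asarray(active_mask, dtype=bool)
    total = A.sum(axis=0)
    on_active = A[active_mask, :].sum(axis=0)
    # Empty footprints get an overlap fraction of 0 and are discarded.
    frac = np.divide(on_active, total,
                     out=np.zeros_like(total), where=total > 0)
    return frac >= min_overlap
```

The returned boolean vector can be used to select columns of the footprint matrix and the matching rows of the trace matrix.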

It is important to note that this classification method is just one very simple example. Different methods will be incorporated in the future, and ideas for classifying identified components as true or false positives are welcome.

Further updating of the components

Once run_CNMF_patches.m is complete, we need to update the components once more, since merging the results using the fast option counts the data twice in the overlapping regions. To do this, update_spatial_components.m and update_temporal_components.m have been modified so that they can also handle memory mapped data.