Skip to content

Processing large datasets

Andrea Giovannucci edited this page May 2, 2016 · 5 revisions

Processing of large datasets

The purpose of this page is to explain how to use the constrained NMF algorithm to process large datasets that create memory issues. The main steps of this process which is implemented in the file demo_memmap.m are explained below:

Reading the file and saving it in a memory mapped .mat file

This (time consuming) process has to be executed once for every new dataset. The data is loaded into memory (either the whole dataset at once, or in chunks) and then saved in a .mat file that contains the following entries:

  • Y: The data in its native dimensions as a 3d array (or 4d array if 3d imaging is performed) where the last dimension is time and each entry corresponds to a different timestep.
  • Yr: The data reshaped as a 2d array with rows corresponding to the different voxels and columns corresponding to the different timesteps.
  • nY: The minimum value over the whole dataset (in case we want to subtract it prior to processing)
  • sizY: The dimensions of Y

Intuitively it doesn't make sense to store both Y and Yr since double the space is required. However we do so (at least for the time being), because empirically it leads to much faster loading and processing of the data in the subsequent steps.

There are two functions for performing this memory mapping procedure:

  • memmap_file.m: This function assumes that the whole dataset is stored in a single .tif. It reads the file and saves it in .mat format in the same folder and with the same name. If the file is too large to fit in memory then it can be read in chunks by setting the variable chunksize to an appropriate value. As an input the user needs to specify the path to the file.
  • memmap_file_sequence.m: This function assumes that the dataset is stored in a sequence of files, where different files correspond to different (set of) timesteps named sequentially. It reads the file and saves it in .mat format in the same folder and with the same name. As an input the user needs to specify the path to the folder that contains the files.

After it's saved the dataset is read using the matfile command that doesn't load the dataset in the memory but rather points to it on the hard drive.

Run the CNMF algorithm on spatially overlapping patches

Once the dataset is saved in a .mat format we can now apply the CNMF algorithm on spatially overlapping patches in parallel. First we specify the coordinates of the patches using the function construct_patches.m The function takes as input the dimensions of the dataset sizY (excluding the number of timesteps), the size of each patch along each dimension siz and the amount of overlap overlap in voxels, in each dimension. The output is a cell array where each entry contains the beginning and end points of a patch for each dimension.

run_CNMF_patches.m

The parallel processing of the different patches is performed with the function run_CNMF_patches.m The function takes as an input the memory mapped dataset data, the number of cells to be located in each patch K, the size of the Gaussian kernel tau to be used during initialization with the greedy method and the standard options structure options. Then the standard CNMF procedure is performed and the results for each patch are saved in the struct array RESULTS.

Equalization: An added benefit of processing different spatial patches separately is that the algorithm is looking for cells in an unbiased way throughout the field of view and not only in the brightest areas which is a property of the greedy initialization algorithm. However, this also creates a large number of false positive cells. To deal with this we classify the components using a simple procedure explained below.

Merging: After the processing of all the patches is finished the results are combined and the components are merged using the standard merge_components.m function. Note that in this case, merging can be done only using the fast version (chosen by setting options.fast_merge = 1) which is also the default option.

Classification: As mentioned above, applying the method on overlapping patches forces the algorithm to (initially) identify a specified number of components in each patch regardless of the number of (active) cells that exist in each patch. This in practice can create a large number of false positive components. To deal with problem we use a simple classification approach based on the power spectral density (PSD) of each voxel. After computing the PSD of each pixel (this computation take place in preprocess_data.m and only the values at relatively high frequencies are stored) the spectrum is whitened and k_means clustering is applied with clusters corresponding to voxels that are active (i.e., at least one neuron is captured in that voxel) or not. This classification approach creates a binary mask of active/inactive voxels. Then each component is kept if it significantly overlaps with the mask of active pixels, otherwise it is thrown away. This operation is performed from classify_components.m

It is important to note that this classification method is just one very simple example. Different methods will be incorporated in the future and ideas for classifying identified components to true/false are welcome.

Further updating of the components

Once run_CNMF_patches.m is complete we need to update the components once more, since merging the results using the fast option accounts for the data twice over the overlapping regions. To do this update_spatial_components.m and update_temporal_components.m have been modified so that they can handle memory mapped data as well.