Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rechunking large BCF files #241

Open
AbaPanu opened this issue Mar 12, 2024 · 13 comments
Open

Rechunking large BCF files #241

AbaPanu opened this issue Mar 12, 2024 · 13 comments

Comments

@AbaPanu
Copy link

AbaPanu commented Mar 12, 2024

Hello!

I work with µXRF BCF files that are around 35 GB large and therefore need to load them in "Lazy" mode. I discovered that hyperspy automatically only loads the files as one dask chunk, which is identical to the 3D dimensions of the "non-Lazy" numpy array (tested with smaller files that can be loaded non-lazy). As a result, I cannot do anything with files that are too large for non-lazy mode, because hyperspy crashes as soon as I .compute(). According to the documentation there are options to "rechunk" or "optimize" the chunks, but no hints on how to use them. Is it possible to rechunk during loading? If not, is it even possible to load large BCF files?

Thanks a lot!

@CSSFrancis
Copy link
Member

@AbaPanu From my understanding the Broker M4 files are compressed but not chunked? @ericpre is that correct?

Unfortunately that is the worst case scenario... Is there a computer with greater than 32GB of RAM (Probably the acquisition computer) that you can use?

In that case you should be able to do something like:

import hyperspy.api as hs

s = hs.load("large_b4_dataset.b4", lazy=True) # creates large 1 chunk dataset
s.rechunk() # Will just automatically chose chunks of about 100 mb
s.save("large_dataset.hspy") # or if you want things to go faster try .zspy although it is harder to move data around.

It's not a great solution but otherwise some substantial effort is probably going to be needed to rewrite the loading function.

@ericpre
Copy link
Member

ericpre commented Mar 12, 2024

@AbaPanu From my understanding the Broker M4 files are compressed but not chunked? @ericpre is that correct?

Sorry, I have no idea, the reader has been written by @sem-geologist.

With the bruker reader, it is also possible to downsampleor crop the high energy range: https://hyperspy.org/rosettasciio/user_guide/supported_formats/bruker.html#rsciio.bruker.file_reader which can help with reducing the dimensionality of the data.

In any case, it would be good to add an example in the rechunk docstring and possible some hyperlink from the user guide to make easier to find it.

@AbaPanu
Copy link
Author

AbaPanu commented Mar 12, 2024

Thanks for the fast reply.
About "the Broker M4 files are compressed but not chunked?", I think that is correct, at least it seems to be the explanation to why the default "Lazy" mode only creates one chunk out of the entire 3D file, which it would not do were there any predefined chunks(?).
The s[4].rechunk() command initially works (s = entire bcf file, being composed of a list of 5 subunits, the fifth of those subunits [4] beeing the EDS data; x , y, counts/energy). After telling hyperspy to "rechunk" the original EDS-numpy-array (e.g. 34 GB) is restructured in x/y into 300 chunks with 130 MB, while the EDS-signal dimension is unchanged (which is what I hoped would happen). However, when trying to extract only specific pixels from the file and work with the corresponding spectra, hyperspy "kills" itself. The same thing happens when trying to save the file as hspy or zspy. I don't understand why it still crashes, in the rechunked Lazy mode it should at max have to read the information from 4 chunks at a time, which is well manageable with my computer (64 GiB RAM).
I will probably have to think about croping the energy range as @ericpre suggested. Within a µXRF spectrum this potentially results in valuable data loss.
In order to give some more context and maybe expose some of the errors I make that could effect the procedure, here is what I try to do:

import hyperspy.api as hs
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

###############
## Functions ##
###############


def load_bcf_hypermap():
    for file_path in Path.cwd().glob('*.bcf'):
        print(file_path)
        bcf_hypermap = hs.load(file_path, lazy=True)
    return bcf_hypermap

def average_spectra_per_roi(file_path,bcf_hypermap,map_size_mu,map_size_pix):
    roi_array_pix = np.loadtxt(file_path, skiprows=1, dtype=int)              # loading txt files containing the xy coordinates as an array of predefined ROIs in the bcf map
    roi_array_mu = roi_array_pix*(map_size_mu/map_size_pix)                   # converting the pixel-ROI-coordinates into scale(µm)-ROI-coordinates in order to be able to use the Point2DROI attribute on the ROIs (this circumvents the problem that "Fancy indexing" is not jet implemented in Lazy-mode hyperspy/hyperspy#3324, but is not really elegant)
    sum_spectrum = 'empty'
    for i in range(0, len(roi_array_mu), 1):
        roi_pixel_i = hs.roi.Point2DROI(roi_array_mu[i,0], roi_array_mu[i,1])    # fancy indexing workaround
        print('point', i)
        spectrum_i = roi_pixel_i(bcf_hypermap[4])
        if sum_spectrum != 'empty':
            sum_spectrum = sum_spectrum + spectrum_i                             # summing up all spectra/pixel in the predefined ROI
        else:
            sum_spectrum = spectrum_i

    average_spectrum = sum_spectrum/(i+1)                                        #averaging the sum-spectrum in order to get a sort of "bulk-spectrum" of the corresponding ROI
    average_spectrum.compute()
#    plt.plot(average_spectrum)
#    plt.show()
    return sum_spectrum, average_spectrum


##################
## Main program ##
##################


bcf_hypermap = load_bcf_hypermap()
bcf_hypermap[4].rechunk()
print(bcf_hypermap[4].data)
map_size_mu = float(input('Enter the center x-value of the right most pixel:\n'))         # lenght of the x dimension in µm of the map-area, needed for the conversion pixel-->µm
print(bcf_hypermap)
map_size_pix = float(input('Enter the number of pixels on the x-axis:\n'))                # lenght of the x dimension in pixels of the map-area, needed for the conversion pixel-->µm,
dir_path = Path('./rois')                                                                 # reads all ROI.txt files in a specific folder
for file_path in dir_path.glob('*.txt'):
    print(Path(file_path).stem)
    sum_spectrum, average_spectrum = average_spectra_per_roi(file_path,bcf_hypermap,map_size_mu,map_size_pix)
    average_spectrum.save(str(Path(file_path).stem) + '.msa', encoding = 'utf8')          # saves the averaged sumspectrum to a msa file
´´´

@CSSFrancis
Copy link
Member

CSSFrancis commented Mar 13, 2024

@AbaPanu the reason why things are dying is partially because hyperspy is trying to process things in multiple threads so there might be some duplication. I would start with something like this:

bcf_hypermap = hs.load(file_path, lazy=True)
bcf_hypermap.rechunk()
bcf_hypermap.save("bcf_hypermap.zspy")

# Now run things normally:

bcf_hypermap = hs.load("bcf_hypermap.zspy")
# ....

I know that kind of seems stupid but it should (hopefully) stop the duplication which is killing you right now.

If that doesn't work you can try:

with dask.config.set(scheduler='single-threaded'):
    bcf_hypermap = hs.load(file_path, lazy=True)
    bcf_hypermap.rechunk()
    bcf_hypermap.save("bcf_hypermap.zspy")

# Now run things normally:

bcf_hypermap = hs.load("bcf_hypermap.zspy")
# ....

Which will force things to run serially which might reduce the amount of memory used.

Is 32 GB the size of the data compressed or uncompressed? It's possible that the data is larger than 64GB uncompressed which might be part of the problem.

Hope this helps!

@AbaPanu
Copy link
Author

AbaPanu commented Mar 13, 2024

@CSSFrancis I tried all of the above suggested:

bcf_hypermap = hs.load(file_path, lazy=True)
bcf_hypermap.rechunk()        #creates 289 chunks (shape = x:156, y:209, E:4096) see attached png
bcf_hypermap.save("bcf_hypermap.zspy")    #crashes after approximately 25 to 30 seconds of work, which creates a .zspy folder including several subfolders. If I reload this bcf_hypermap.zspy:
bcf_hypermap.save("bcf_hypermap.zspy")   #ERROR | Hyperspy | If this file format is supported, please report this error to the HyperSpy developers. (hyperspy.io:579), KeyError: 'original_metadata'

Screenshot_rechunk

with dask.config.set(scheduler='single-threaded'):
    bcf_hypermap = hs.load(file_path, lazy=True)
    bcf_hypermap.rechunk()       #works fine until here,
    bcf_hypermap.save("bcf_hypermap.zspy")    #and then crashes here after around 25 s of computing

with croping the energy range:

bcf_hypermap = hs.load(file_path, lazy=True, cutoff_at_kV=30) 
bcf_hypermap.rechunk()        #see attached png
#saving does crash too

Screenshot_rechunk_30kV

So the maximum bcf file size seems to be around 18 GB (for my home computer). If I load a file of that size, it takes as long to load it non-Lazy as it takes to calculate one sumspectrum in Lazy-mode (around 25 s). Is my assumption correct that although we are technically able to define chunks, when calculating hyperspy unpacks the entire bcf file and not only the chunks including the requested pixels?

@AbaPanu
Copy link
Author

AbaPanu commented Mar 13, 2024

It finally works! Halleluja!
For some reason it appears to be necessary to load the bcf file not only "lazy = True", but also "cutoff_at_kV=40" (in my files the energy range is anyway only 40 kV long, so the restriction is merely symbolical and causes no data-reduction). Then it is possible to save the file as a .zspy, reload that and sum/average the ROI spectra subsequently. It is even possible to use the fancy indexing previously established.
Thanks for your help @CSSFrancis and @ericpre!

## Functions ##
###############


def load_bcf_hypermap():
    for file_path in Path.cwd().glob('*.bcf'):
        print(file_path)
        bcf_hypermap = hs.load(file_path, lazy=True, cutoff_at_kV=40)
    return bcf_hypermap

def load_zspy_hypermap():
    for file_path in Path.cwd().glob('*.zspy'):
        print(file_path)
        zspy_hypermap = hs.load(file_path, lazy=False)
    return zspy_hypermap

def average_spectra_per_roi(file_path,zspy_hypermap):
    xy = np.loadtxt(file_path, skiprows=1, dtype=int)
    all_in_area = hs.signals.Signal1D(zspy_hypermap.data[xy[:,0], xy[:,1]])
    sum_spectrum = all_in_area.sum(axis=0)
    average_spectrum = sum_spectrum/len(xy)
    return sum_spectrum, average_spectrum


##################
## Main program ##
##################

bcf_hypermap = load_bcf_hypermap()
bcf_hypermap[4].rechunk()
bcf_hypermap[4].save('*.zspy')
zspy_hypermap = load_zspy_hypermap()
dir_path = Path('./rois')
for file_path in dir_path.glob('*.txt'):
    sum_spectrum, average_spectrum = average_spectra_per_roi(file_path,zspy_hypermap)
    plt.plot(average_spectrum)
    plt.show()
    average_spectrum.save(str(Path(file_path).stem) + '.msa', encoding = 'utf8')

@ericpre
Copy link
Member

ericpre commented Mar 13, 2024

Does it use a lot of memory when saving to zspy?

@sem-geologist
Copy link
Contributor

sem-geologist commented Mar 13, 2024

@AbaPanu From my understanding the Broker M4 files are compressed but not chunked? @ericpre is that correct?

Sorry, I have no idea, the reader has been written by @sem-geologist.

I think I should comment here a bit. The confusion could originate from some function/method naming in bcf parser code that it could be chunked. That is unfortunate as I was young, inexperienced and that was my first huge reverse engineering work. I had used wrongly the terminology in the code of "chunks", where actually as it is kind of virtual file system it should be called "blocks". bcf is built on virtual file system, it has such attributes as other file systems: table of content, address table of first file block, blocks have fixed size within single virtual file system (with single bcf). Every block contains header which contains pointer to next block. If bcf is compressed then files are compressed in zlib again in blocks, but zlib blocks are other size than filesystem blocks and few such blocks occupies the virtual file system block, and if it does not fit whole inside the virtual file block its overflowing part is in next block. As reader is made to work with version 1 and 2, and version 1 has no pixel address pointer table, thus random access of pixel was not implemented. The data of pixel is of dynamic length, thus if there is no such pixel table, the address position of of random pixel is unknown. This is why if You want to save whole file, you need enough of memory as whole hypercube needs to be parsed.

As I had limited resources (4GB RAM) on computer where I was testing writing and developing the code for bcf, I included some more tricks, like downsampling (or rather it should be called the pixel "down-binning", as it is summing counts) and cutoff at HV, and am happy that cutoff had solved your problem @AbaPanu . I was absolutely unaware Bruker uses it on XRF Tornado when writting the initial code, but by design It seemed to me possible that files could be much larger than those on SEM (It uses bcf for EBSD, where it can go to hundreds of GB, and single file system is also used for pan files). I will confess that I am not fluent in dusk, and that was stopper looking how to implement random access. The first point would be implementing pre-reader to make the pixel address table for version 1, when implementing random access to pixels would make sense.

Actually as I think now, such pre-reader could kick in the moment file is loaded with lazy=True. It would need to skim through whole Datacube initially (only if version 1) and gather addresses of pixel to make pixel pointer table. For version 2 that would not be needed as such table is ready made by Bruker as additional file inside the virtual file system of bcf. Such skimming would need just to jump from pixel header to next pixel header, as pixel headers has data length embedded. Albeit with Hard disk drive (not SSD) that would be slow as whole file would need to be block by block copied to memory. With SSD that should be faster than real parsing of the pixel data.

So the way how bcf are written is more like file system than chunked image. It contains potential of random access (which is needed for really lazy rechunk() to work, but it is not currently implemented.

@AbaPanu
Copy link
Author

AbaPanu commented Mar 14, 2024

"Does it use a lot of memory when saving to zspy?" @ericpre: yes, it uses almost all memory while saving the zspy-file and takes roughly 50 seconds to complete the task, but no crashing.

However, I stand corrected regarding the "Halleluja". On closer examination it becomes clear, that both versions of code have ISSUES:
Version 1:

def average_spectra_per_roi(file_path,bcf_hypermap,map_size_mu,map_size_pix):
    roi_array_pix = np.loadtxt(file_path, skiprows=1, dtype=int)              
    roi_array_mu = roi_array_pix*(map_size_mu/map_size_pix)             
    sum_spectrum = 'empty'
    for i in range(0, len(roi_array_mu), 1):
        roi_pixel_i = hs.roi.Point2DROI(roi_array_mu[i,0], roi_array_mu[i,1])    # fancy indexing workaround, FINDS THE PROPER AREAS from txt-files
        print('point', i)
        spectrum_i = roi_pixel_i(bcf_hypermap[4])
        if sum_spectrum != 'empty':
            sum_spectrum = sum_spectrum + spectrum_i                             # DOES NOT PRODUCE SUM-SPECTRA
            sum_spectrum = spectrum_i

    average_spectrum = sum_spectrum/(i+1)                                        # as a result: DOES NOT PRODUCE AV-SPECTRA
    average_spectrum.compute()
#    plt.plot(average_spectrum)
#    plt.show()
    return sum_spectrum, average_spectrum

I tried hs.stack(), but that did not work either so far.

Version 2:

def average_spectra_per_roi(file_path,zspy_hypermap):
    xy = np.loadtxt(file_path, skiprows=1, dtype=int)
    all_in_area = hs.signals.Signal1D(zspy_hypermap.data[xy[:,0], xy[:,1]])   #INDEXING MISSES THE ACTUAL COORDINATES from the txt.files
    sum_spectrum = all_in_area.sum(axis=0)
    average_spectrum = sum_spectrum/len(xy)               # IT PRODUCES PROPER SUM_SPECTRA, only from unknown regions of the map...
    return sum_spectrum, average_spectrum

I tried to transpose the [xy[:,0], xy[:,1]]-part (as suggested by @CSSFrancis ) which produces other sumspectra, but also not those that are assigned by the coordinates. If I tell it to: all_in_area = hs.signals.Signal1D(hypermap[812, 897]) and plt.plot(all_in_area) (where [812, 897] is a known composition and has been checked via navigator and signal plots of the zspy-file) it plots the spectrum of entirely different composition.
Can anybody tell me what goes wrong in the two code versions (or just one)?
Thanks!

@AbaPanu
Copy link
Author

AbaPanu commented Mar 23, 2024

Update on version 2:
Seems as if I just messed up the transposition the first time (no idea how I did that). It actually works like this:

    xy = np.loadtxt(file_path, skiprows=1, dtype=int)
    all_in_area = hs.signals.Signal1D(hypermap.data[xy[:,1], xy[:,0]])

Version 1 no progress so far, I am still very open for suggestions...

@ericpre
Copy link
Member

ericpre commented Mar 23, 2024

@AbaPanu, it is important to keep issue focused on what they are about, otherwise, when trying to understand them, it is confusing and not efficient, e.g. have to read about thing that are irrelevant to the issue at hand.

We will keep this thread focused on "lazy processing of large BCF file".

@sem-geologist
Copy link
Contributor

I started looking into this, and as far I had looked to Dask documentation, it should be possible to divide reading of BCF into smaller chunks for dask array assembly. I see two possibilities to achieve that: implement numpy like indexing for lazy retrieve of data, or use map_blocks from dask. It would really be handy to get a copy of such enormous dataset (tens of GB's) so I could experiment with.

@ericpre
Copy link
Member

ericpre commented Dec 2, 2024

I haven't used this format for a long time but I could try to dig into some of my old data! This would be ~1GB only but it would still be useful to specify chunks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants