Rechunking large BCF files #241
@AbaPanu From my understanding the Bruker M4 files are compressed but not chunked? @ericpre is that correct? Unfortunately that is the worst case scenario... Is there a computer with more than 32 GB of RAM (probably the acquisition computer) that you can use? In that case you should be able to do something like:
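Presumably something along the lines of the rechunk-and-save snippet spelled out later in this thread, i.e. (a sketch; `file_path` is a placeholder pointing at the BCF file):

```python
import hyperspy.api as hs

# On the machine with enough RAM: load lazily, rechunk, and write out a
# chunked zspy copy that can then be processed lazily on a smaller machine.
bcf_hypermap = hs.load(file_path, lazy=True)
bcf_hypermap.rechunk()
bcf_hypermap.save("bcf_hypermap.zspy")
```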
It's not a great solution, but otherwise some substantial effort is probably going to be needed to rewrite the loading function.
Sorry, I have no idea; the reader was written by @sem-geologist. With the bruker reader, it is also possible to downsample or crop the high energy range: https://hyperspy.org/rosettasciio/user_guide/supported_formats/bruker.html#rsciio.bruker.file_reader which can help with reducing the size of the data. In any case, it would be good to add an example in the documentation.
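For example (a sketch; the file name is a placeholder, while `downsample` and `cutoff_at_kV` are the reader options documented at the link above):

```python
import hyperspy.api as hs

# Sum counts over 2x2 pixel blocks and drop all channels above 20 kV while
# reading, which shrinks the dataset before it is ever assembled in memory.
s = hs.load("large_map.bcf", lazy=True, downsample=2, cutoff_at_kV=20)
```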
Thanks for the fast reply.
@AbaPanu the reason why things are dying is partially because of the data duplication that happens when rechunking in memory. You can try:

```python
bcf_hypermap = hs.load(file_path, lazy=True)
bcf_hypermap.rechunk()
bcf_hypermap.save("bcf_hypermap.zspy")

# Now run things normally:
bcf_hypermap = hs.load("bcf_hypermap.zspy")
# ....
```

I know that kind of seems stupid, but it should (hopefully) stop the duplication which is killing you right now. If that doesn't work you can try:

```python
with dask.config.set(scheduler='single-threaded'):
    bcf_hypermap = hs.load(file_path, lazy=True)
    bcf_hypermap.rechunk()
    bcf_hypermap.save("bcf_hypermap.zspy")

# Now run things normally:
bcf_hypermap = hs.load("bcf_hypermap.zspy")
# ....
```

which will force things to run serially and might reduce the amount of memory used. Is 32 GB the size of the data compressed or uncompressed? It's possible that the data is larger than 64 GB uncompressed, which might be part of the problem. Hope this helps!
@CSSFrancis I tried all of the above suggestions:
with cropping the energy range:
So the maximum bcf file size seems to be around 18 GB (for my home computer). If I load a file of that size, it takes as long to load it non-lazily as it takes to calculate one sum spectrum in lazy mode (around 25 s). Is my assumption correct that, although we are technically able to define chunks, hyperspy unpacks the entire bcf file when calculating, and not only the chunks containing the requested pixels?
It finally works! Halleluja!
Does it use a lot of memory when saving to zspy?
I think I should comment here a bit. The confusion could originate from some function/method naming in the bcf parser code which suggests that it could be chunked. That is unfortunate: I was young and inexperienced, and that was my first huge reverse engineering work. I wrongly used the term "chunks" in the code where, as it is a kind of virtual file system, it should have been "blocks".

bcf is built on a virtual file system and has the usual attributes of other file systems: a table of contents, an address table pointing to the first block of each file, and blocks of fixed size within a single virtual file system (a single bcf). Every block contains a header with a pointer to the next block. If the bcf is compressed, the files are additionally compressed with zlib in blocks, but the zlib blocks are of a different size than the file system blocks: a few of them occupy one virtual file system block, and if one does not fit wholly inside a virtual file block its overflowing part continues in the next block.

The reader is made to work with versions 1 and 2, and version 1 has no pixel address pointer table, so random access to pixels was not implemented. The data of a pixel is of dynamic length, so without such a pixel table the address of an arbitrary pixel is unknown. This is why, if you want to save the whole file, you need enough memory for the whole hypercube to be parsed.

As I had limited resources (4 GB RAM) on the computer where I was writing, testing and developing the bcf code, I included some more tricks, like downsampling (or rather it should be called pixel "down-binning", as it sums counts) and the cutoff at HV, and I am happy that the cutoff solved your problem @AbaPanu. I was absolutely unaware that Bruker uses this format on the XRF Tornado when writing the initial code, but by design it seemed possible to me that files could be much larger than those from SEM (Bruker uses bcf for EBSD, where it can go to hundreds of GB, and the same virtual file system is also used for pan files).

I will confess that I am not fluent in dask, and that was the stopper when looking into how to implement random access. The first step would be implementing a pre-reader to build the pixel address table for version 1; after that, implementing random access to pixels would make sense. Actually, as I think about it now, such a pre-reader could kick in the moment the file is loaded with lazy=True.

So the way bcf files are written is more like a file system than a chunked image, and it holds the potential for random access (which is needed for really lazy processing).
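To make the block layout described above concrete, here is a minimal sketch of walking such a chained-block virtual file; the header layout, field size and byte order are assumptions for illustration only, not the actual Bruker on-disk format:

```python
import struct

def iter_file_blocks(f, first_block_addr, block_size, header_size=8):
    """Yield the payload of each block of one virtual file by following
    the next-block pointers stored in the block headers.

    Assumed layout: each block starts with an 8-byte little-endian pointer
    to the next block, 0 meaning "last block". The real BCF headers differ.
    """
    addr = first_block_addr
    while addr:
        f.seek(addr)
        (next_addr,) = struct.unpack("<Q", f.read(header_size))
        yield f.read(block_size - header_size)
        addr = next_addr
```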
"Does it use a lot of memory when saving to zspy?" @ericpre: yes, it uses almost all memory while saving the zspy-file and takes roughly 50 seconds to complete the task, but no crashing. However, I stand corrected regarding the "Halleluja". On closer examination it becomes clear, that both versions of code have ISSUES:
I tried hs.stack(), but that did not work so far either. Version 2:
I tried to transpose the [xy[:,0], xy[:,1]] part (as suggested by @CSSFrancis), which produces different sum spectra, but still not those assigned by the coordinates. If I tell it to:
Update on version 2:
Version 1: no progress so far, I am still very open to suggestions...
@AbaPanu, it is important to keep issues focused on what they are about; otherwise, when trying to understand them, it is confusing and inefficient, e.g. one has to read about things that are irrelevant to the issue at hand. We will keep this thread focused on "lazy processing of large BCF files".
I started looking into this, and as far as I have looked at the Dask documentation, it should be possible to divide the reading of a BCF into smaller chunks for dask array assembly. I see two possibilities to achieve that: implement numpy-like indexing for lazy retrieval of the data, or use
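One way such chunk-by-chunk assembly can look with dask (a sketch only: `read_rows`, the shapes and the row-wise chunking are hypothetical stand-ins for a real block-level BCF parser, not the existing reader code):

```python
import numpy as np
import dask
import dask.array as da

def read_rows(path, y0, y1, nx, nch):
    """Stand-in for a per-block parser: a real implementation would seek to
    the pixels of rows y0..y1 (via a pixel address table) and decode only
    them. Here it just returns zeros of the right shape."""
    return np.zeros((y1 - y0, nx, nch), dtype=np.uint16)

def lazy_hypermap(path, shape, chunk_rows=64):
    """Assemble a lazily evaluated dask array one row-block at a time."""
    ny, nx, nch = shape
    blocks = []
    for y0 in range(0, ny, chunk_rows):
        y1 = min(y0 + chunk_rows, ny)
        blocks.append(
            da.from_delayed(
                dask.delayed(read_rows)(path, y0, y1, nx, nch),
                shape=(y1 - y0, nx, nch),
                dtype=np.uint16,
            )
        )
    return da.concatenate(blocks, axis=0)

# Each block is only read when its values are actually needed:
hypermap = lazy_hypermap("map.bcf", shape=(2048, 2048, 4096))
print(hypermap.chunks)
```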
I haven't used this format for a long time but I could try to dig into some of my old data! It would only be ~1 GB, but it would still be useful to specify chunks.
Hello!
I work with µXRF BCF files that are around 35 GB in size and therefore need to load them in lazy mode. I discovered that hyperspy automatically loads the files as a single dask chunk, which is identical to the 3D dimensions of the non-lazy numpy array (tested with smaller files that can be loaded non-lazily). As a result, I cannot do anything with files that are too large for non-lazy mode, because hyperspy crashes as soon as I call .compute(). According to the documentation there are options to "rechunk" or "optimize" the chunks, but no hints on how to use them. Is it possible to rechunk during loading? If not, is it even possible to load large BCF files?
Thanks a lot!
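For reference, a minimal way to see the single-chunk behaviour described above (the file name is a placeholder; as explained in the thread, `rechunk()` only changes the task graph, so materialising any chunk still requires decoding the whole BCF once):

```python
import hyperspy.api as hs

s = hs.load("large_map.bcf", lazy=True)
print(s.data.chunks)   # the whole hypermap comes back as one dask chunk

s.rechunk()            # smaller chunks, but computing them still parses
                       # the entire BCF file once
```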