Kernel crashing while saving lazy signal from ripple file #329
@jeinsle One thing to keep track of is the initial chunk configuration. I don't think we currently allow the chunk size to be passed for memmapped datasets, which can cause issues like this. Tracking through HyperSpy:

- https://github.com/hyperspy/hyperspy/blob/8bc57e1a809668d54da8d4f355ab0b18520fa4ad/hyperspy/io.py#L632
- https://github.com/hyperspy/hyperspy/blob/RELEASE_next_minor/hyperspy/_signals/lazy.py#L424
- https://github.com/hyperspy/hyperspy/blob/8bc57e1a809668d54da8d4f355ab0b18520fa4ad/hyperspy/_signals/lazy.py#L440

We should allow a "chunks" parameter to explicitly dictate how the chunks are formed for binary files. More than likely your data is being loaded in a less-than-ideal chunking pattern, and once you do the transpose HyperSpy will sometimes try to force your data into different chunks.

Maybe we can try to figure out a good workaround first and then think about the right thing to do.

**This is untested, so you might need to edit this workflow a bit...**

```python
from rsciio.ripple import file_reader
import dask.array as da
import hyperspy.api as hs

# file_reader returns a list of signal dictionaries; take the first one.
file_dict = file_reader('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl', lazy=True)[0]
lazy_data = da.from_array(file_dict.pop("data"), chunks=(-1, 100, 1))  # You might need to play around with this...
s = hs.signals.Signal1D(data=lazy_data, **file_dict).T
s.save('Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy')  # I like .zspy for larger files usually...
s = hs.load("Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy")
```

For future consideration, we could add the ability to do distributed loading so you can use the dask dashboard, which is helpful for debugging these kinds of things.

Is this a typical dataset size for you? Are you interested in taking larger datasets? If so, we can spend a bit of time streamlining/optimizing this workflow.
@CSSFrancis thanks for the reply. At the moment I have successfully run the save command, but it took about 1.5 hours, as I think it was running in a single-threaded state. It would be nice if this could be sped up; I think it should be possible given that we are using dask arrays. What I am trying to do with the rechunk here is preserve the energy axis of the data. This summer I was working with a different big dataset and did not have as much bother, and could actually run the chunk size closer to 1 GB (roughly 4x in the x and y directions) thanks to the RAM and number of processors available. For various reasons my current dataset presents new problems.
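As a hedged illustration of "preserve the energy axis": with dask you can pin the signal axis into a single chunk and let dask choose the navigation chunking. This sketch assumes the energy axis is the last axis of the loaded array (swap the -1 if it is not); the shape below is a stand-in, not the real dataset.

```python
import dask.array as da

# Stand-in for the lazily loaded ripple data: (nav_y, nav_x, energy).
data = da.zeros((512, 512, 2048), dtype="uint16", chunks="auto")

# Keep the full energy axis in one chunk; let dask pick the navigation chunks.
data = data.rechunk(("auto", "auto", -1))
print(data.chunks)
```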
Just another potential thought, but dask had a bug at the beginning of the year which would also affect the ripple file loader, so make sure you have the most recent version of dask. I have some other comments on the chunking, but I have a meeting in a bit, so I'll come back to this.
@CSSFrancis it might make sense to organise a call and chat about dask, as there are a few things I do not know the right terms for, and I think I have missed something in my slapdash reading of the docs. As noted, saving immediately on opening helps, but now I am running into problems when I try to scale the data and prepare it for a PCA/clustering pipeline.
This is not surprising and is most likely a limitation of h5py; with zspy it will be much faster. @CSSFrancis, you are right, it would be good to add the `chunks` parameter.
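A rough sketch of how such a `chunks` argument might be threaded through to a memmap-backed reader; the helper name and signature here are hypothetical, not the actual RosettaSciIO API.

```python
import numpy as np
import dask.array as da

def read_binary_lazily(path, shape, dtype, chunks="auto"):
    """Hypothetical helper: wrap a memory-mapped binary file in a dask array,
    forwarding a user-supplied chunk specification."""
    mm = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    return da.from_array(mm, chunks=chunks)

# e.g. read_binary_lazily("map.raw", (512, 512, 2048), "uint16", chunks=(64, 64, -1))
```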
@ericpre this is good to know, but really the time comment was more about the speed than the root cause, as I was working with some rather big datasets this summer and not running into the kernel crashing... it is just odd. I am going to do a test now of opening and immediately saving with zspy, then merging in some metadata, and see if that works better.
Are you saying the same process wasn't crashing this summer but it is now? As @CSSFrancis mentioned above, it could be due to a regression in dask. Reading #266, the dask versions with the bug are between 2024.2.0 and 2024.6.0. If you want to save some metadata without having to save the whole file, you can use the
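A quick way to check whether the installed dask falls in the affected range mentioned above (the bounds and their inclusivity are taken from this thread, so treat them as approximate):

```python
from packaging.version import Version
import dask

v = Version(dask.__version__)
# Range reported in the thread / #266: roughly 2024.2.0 up to 2024.6.0.
if Version("2024.2.0") <= v <= Version("2024.6.0"):
    print(f"dask {v} may be affected by the regression; consider upgrading")
else:
    print(f"dask {v} is outside the reported range")
```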
Sure, I'd be willing to set up a video call; it might be good to discuss a couple of things. I'm not overly surprised that the PCA/clustering doesn't perform ideally with 300+ GB. The dask code for running PCA is a bit less efficient than it could be; I think the dask-ml function might work faster. One thing to consider fairly seriously is whether you need to run PCA on the entire dataset, or whether you could run it on a subset and then apply the result to the entire dataset. As the dask-ml docs put it, not everyone needs scalable ML: tools like sampling can be effective.
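A minimal sketch of the "fit on a subset, apply to everything" idea, using scikit-learn's PCA on a random sample of spectra and then projecting the full lazy array; the shapes, sample size and number of components below are placeholders, not values from this dataset.

```python
import numpy as np
import dask.array as da
from sklearn.decomposition import PCA

# Stand-in for the lazy EDS data: (nav_y, nav_x, energy).
data = da.random.random((512, 512, 2048), chunks=(64, 64, -1))

# Flatten navigation, pull a random subset of spectra into memory and fit PCA.
flat = data.reshape((-1, data.shape[-1]))
idx = np.sort(np.random.choice(flat.shape[0], size=10_000, replace=False))
sample = flat[idx].compute()
pca = PCA(n_components=16).fit(sample)

# Lazily project the full dataset onto the fitted components
# (equivalent to pca.transform for whiten=False, expressed with dask ops).
scores = (flat - pca.mean_) @ pca.components_.T
```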
Hi @ericpre, yeah, I had a slightly larger map this summer that I managed to just load and save as an hspy file without any bother. I even managed to sum it along the signal axis and then save that output (which was now significantly smaller). That said, last night I tried resaving the data using zspy like this:
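The exact command was only shown in the screenshot; a minimal sketch of the resave described here (load the ripple file lazily, save straight to .zspy), assuming the filename used earlier in the thread, would be:

```python
import hyperspy.api as hs

s = hs.load("Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl", lazy=True)
s.save("Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy")
```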
As you can see from the screenshot, it did run and I can reload the file, but the kernel had still managed to crash (I cleared that message when I got in), so I am not really sure what is happening here. I am even more confused because I had not updated my environment until yesterday to address the possible dask regression (I was on a previously non-recommended dask version). Question: does the [
@CSSFrancis to clarify, the comment on my ML pipeline was more of a "this is where I am going". I agree that some sampling might be in order; the question becomes how best to sample when a dataset is super heterogeneous. For this one I have some ideas that will need to leverage what I built this summer. Regardless, the issue is still about how best to convert a large ripple file into some kind of dask array. Note that part of the reason this montage ripple file is so big is that Oxford Instruments uses a particularly nonsensical naming convention for the individual tiles, which makes it hard to export each tile of the dataset independently.
Almost 3h still sounds one order of magnitude too slow...
This should be clarified in the docstring: this only writes the numpy or dask array; anything else, including
@ericpre agreed, this is actually slower than what I got using the [.hspy] extension. So what I can get from the ripple file is essentially the dask array with no metadata. We have written a small script which rips the data out of the h5oina file and then maps it to the HyperSpy keys; it has been at this step that things have gone sideways.
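For context, a hedged sketch of what such a mapping script might look like: h5oina is an HDF5 container, so values can be read with h5py and pushed into the HyperSpy metadata tree. The dataset path below is a placeholder rather than the real h5oina layout, and `s` is assumed to be the signal loaded from the ripple file.

```python
import h5py

# Placeholder path: the real h5oina group/dataset names will differ.
with h5py.File("Montaged_Map_Data.h5oina", "r") as f:
    beam_energy = float(f["1/EDS/Header/Beam Voltage"][()])

# Map it onto the corresponding HyperSpy metadata key.
s.metadata.set_item("Acquisition_instrument.SEM.beam_energy", beam_energy)
```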
@ericpre yesterday I tried resaving the file with the automatic chunk size (i.e. essentially a stack of quasi energy-filtered images), which is how it finally did save successfully. However, this time I then added in all the metadata and axes manager information and saved it as a new file. This time it took over 5 hours, just for adding less than a megabyte's worth of information. Any thoughts? This seems to have slowed down significantly after updating dask as recommended above.
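If the slowdown comes from rewriting the whole array just to attach metadata, one thing worth trying (assuming a recent HyperSpy, and that I am remembering the keyword correctly) is the `write_dataset=False` option of the hspy/zspy writers, which overwrites only the metadata/axes in an existing file:

```python
# Overwrite metadata/axes only, without re-writing the already-saved dataset.
s.metadata.General.title = "Sand LA EDS 500x montage"  # example metadata edit
s.save("Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.zspy",
       overwrite=True, write_dataset=False)
```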
Describe the bug
I have loaded a ripple file, which is 366 GB in size, as a lazy signal. When I try saving using the command
data.save()
it starts the process, but usually after getting through about 20% of the data the kernel crashes. I have tried chunk sizes of 1.1 GB, 785 MB, 384 MB and 96 MB. With the 96 MB chunk size I have also tried using the command
dask.config.set(scheduler='single-threaded')
All of these result in the kernel crashing.

To Reproduce
Steps to reproduce the behavior:
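A minimal sketch of the workflow described in this report (filename taken from the thread; the single-threaded scheduler line is optional and was one of the things tried):

```python
import dask
import hyperspy.api as hs

# Optional: force single-threaded execution, as tried above.
dask.config.set(scheduler="single-threaded")

s = hs.load("Montaged_Map_Data-Sand_LA_EDS_500x_Montage_Data.rpl", lazy=True)
s.save()  # kernel crashes after roughly 20% of the data has been written
```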
Expected behavior
The file should save as an HSPY file.
Python environment:
Additional context
Note: immediately saving the data as an hspy file does not result in this crash. However, I still need to test the behavior of saving once I add in metadata etc. for this file.