Unexpected get_direct_beam_position behaviour with center_of_mass method #1079
Comments
I guess #1057 will change how this works.

```python
import hyperspy.api as hs  # `signal` is the 4D STEM dataset loaded earlier

# First, define the circular region of the data that should be used for centering
_max = signal.max(axis=[0, 1])  # Calculate the maximum through the stack for illustration
_max.plot()
circle = hs.roi.CircleROI(128, 128, 17.5)  # Circular ROI to use as the mask for get_direct_beam_position(method="center_of_mass")
circle.add_widget(_max)
s = circle(signal, axes=[2, 3])  # Get the part of the data that contains the circular ROI.
# This signal cannot be used directly: the missing data outside the circular ROI
# makes the center of mass calculation return NaNs.

# Next, find the corresponding square region of that data
extent = s.axes_manager.signal_extent  # Get the signal extent
rectangle = hs.roi.RectangularROI(extent[0], extent[2], extent[1], extent[3])  # Rectangular ROI based on the sliced signal extent
s = rectangle(signal, axes=[2, 3])  # Replace the sliced/masked signal to work around the missing data

# Finally, calculate the shifts of the data using the square signal and the circular mask
circle.add_widget(s, axes=[2, 3])  # Add the circular ROI to the new sliced signal
shifts = s.get_direct_beam_position(
    method="center_of_mass",
    mask=(circle.cx - s.axes_manager[2].offset,
          circle.cy - s.axes_manager[3].offset,
          circle.r),
)  # Calculate the shifts
shifts.make_linear_plane()  # Make the linear plane estimation

# [Optional] Center the dataset
signal.center_direct_beam(shifts=shifts)
```
@emichr I think the solution is that it shouldn't be possible to pass the mask in addition to a signal slice. As for the memory leak, it's fairly likely that dask isn't processing things optimally. What version of pyxem are you using? I changed how the COM was computed in #1005, which should remove a fair bit of the read amplification that was previously happening. In particular, the function was rechunking rather aggressively, which sometimes forces data to be held in memory longer than it should be. Another thing to consider is that your data should be saved in chunks only in the navigation dimensions. If that is not the case, the following code should automatically rechunk:

```python
s.rechunk()
s.save("rechunked.zspy")
```
@emichr The issue seems to be from a small bug in the code, at this line.
I have created a fix for this issue in #1080.
@CSSFrancis I am using pyxem 0.18.0, so I guess the changes you made to the COM computation should be there. Thanks for the tips; converting to .zspy and using appropriate chunking might work better, so I'll try that a bit! I might get a headache from working with .zspy files on an HPC cluster with restrictions on the number of files, but that's not really a pyxem headache :)

Just to be clear though, I was not working on lazy data. It seems that when working with lazy data on an HPC cluster, the memory requirements are higher than I would expect (not sure why, but there are memory spikes above the size of the dataset when, e.g., loading .mib data lazily and when calculating the COM). So I figured I should use the memory I ask for anyway and work non-lazily.

@sivborg Great, thanks for looking into this and squashing that bug! I guess the most optimal solution for this is to slice the signal down to the mask size when a mask is given, but that might be for the upcoming pull requests for 0.19.0 or 1.0.0? Maybe the unit tests for beam centering with COM should be looked into as well, to see if they can be improved to catch similar bugs?
To be a bit more quantitative here: when loading a 31.3 GB .mib dataset with ...
Part of this is fixed by #1080 |
I think the bigger issue is in how this is handled more generally. In my opinion, we should only have one, common way of doing this. I think it should be some slicing, followed by some kind of round mask, which I think makes the most sense for this kind of data. We should make sure that the data is sliced before any "masking", to avoid using too much memory.
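For illustration, a minimal sketch of that slice-then-mask order; the slice indices, mask radius, and the assumption of an uncalibrated (pixel-coordinate) signal axis are mine, not from the thread:

```python
half_width = 20
cx, cy = 128, 128  # approximate direct-beam position in the full pattern (assumed)

# Slice out a small square around the direct beam first, so the rest of the
# signal space never has to be processed
sliced = signal.isig[cx - half_width:cx + half_width, cy - half_width:cy + half_width]

# Then apply the round mask, expressed in the coordinates of the *sliced* signal
shifts = sliced.get_direct_beam_position(
    method="center_of_mass",
    mask=(half_width, half_width, 15),  # (x, y, r) in the sliced frame
)
```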
@CSSFrancis, I don't think that is always the optimal chunking. For example, if you want to just use the direct beam and ignore the rest of the signal space, having chunks in the signal dimension means you will not have to load the full dataset from the hard drive.
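A hedged sketch of what that could look like in practice; the chunk sizes, file names, and slice indices below are illustrative assumptions:

```python
import hyperspy.api as hs

s = hs.load("my_data.hspy", lazy=True)               # hypothetical input file
s.rechunk(nav_chunks=(32, 32), sig_chunks=(64, 64))  # chunk the signal space as well
s.save("sig_chunked.zspy", overwrite=True)

s2 = hs.load("sig_chunked.zspy", lazy=True)
beam = s2.isig[96:160, 96:160]  # only chunks overlapping this region need to be read
```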
@emichr, you could also try the zarr ZipStore. It seems to work pretty well, at least locally on my own computer: hyperspy/rosettasciio#249 (comment)
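For reference, a minimal sketch of that approach, following the pattern from the linked discussion (the file name is an assumption, and the ZipStore API shown is the zarr v2 one):

```python
import zarr

store = zarr.ZipStore("data.zspy", mode="w")  # one file on disk instead of a zarr directory tree
s.save(store)                                 # `s` is the signal from the snippets above
store.close()
```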
In case someone else bumps into the same issue in the future, we figured out the origins of this bug. It is due to a change in ... A temporary fix until things are fixed is to downgrade to ...
@magnunor I'm definitely in the camp of 1. Make everything run and 2. Make everything run fast. Chunking equally in all dimensions with the way that the ...

Disk IO as a bottleneck is a lot less of a problem now with something like zarr than it was previously. Spinning disk hard drives read at ~160 MB/sec, and you can pretty easily put 10 of them in an array to get over 1 GB/sec. Paired with something like a factor of 10 for good compression, most likely the CPU becomes the bottleneck, or things like passing data between workers in dask become an issue on rechunking. Of course, I say this and I think dask/distributed#7507 should solve some of the rechunking problems dask has.

There are a couple other fun things I've found over the years: ...

So while yes, for this case you are correct that slicing and then doing the DPC is faster, it comes with some additional considerations and potentially some workflows that no longer work. I've been starting to write up a 4D STEM hardware/software guide that I was going to publish, as I think some of these things are a bit lost. It's a little bit more confusing when you add a GPU to the mix, trying to figure out what the biggest issue is.
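A back-of-the-envelope version of the throughput arithmetic above, using the figures quoted in the comment rather than measured values:

```python
drive_read_mb_s = 160        # single spinning disk, approximate sequential read speed (MB/s)
n_drives = 10                # drives striped together in an array
compression_ratio = 10       # assumed "good" compression factor

raw_throughput = drive_read_mb_s * n_drives                # 1600 MB/s read off the disks
effective_throughput = raw_throughput * compression_ratio  # ~16000 MB/s of decompressed data
print(raw_throughput, effective_throughput)
```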
Describe the bug
When calling

```python
shift = signal.get_direct_beam_position(method="center_of_mass")
```

with the `half_square_width`/`signal_slice` and `mask` keyword arguments, the resulting shifts are all NaNs. I expect this issue is due to the mask coordinates not being correctly shifted/transformed after slicing the signal internally.

To Reproduce
The following code was run with Pyxem 0.18.0
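A minimal sketch of the kind of call that triggers the problem; the file name, signal type, and mask values here are illustrative assumptions rather than the original script:

```python
import hyperspy.api as hs

signal = hs.load("my_4dstem_data.hspy")  # assumed: loads as a pyxem Diffraction2D signal
signal.plot()

shifts = signal.get_direct_beam_position(
    method="center_of_mass",
    half_square_width=20,  # the signal is sliced internally around its centre
    mask=(128, 128, 15),   # (x, y, r) circular mask around the direct beam (values assumed)
)
shifts.plot()              # with the bug present, every shift comes out as NaN
```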
The last plot clearly shows the method failing.
I tried some short debugging by picking out the relevant parts of the pyxem code.
It looks like the last part works and gives the expected result, i.e. I expect the issue to be solved by a suitable offset/transformation of the mask coordinates into the new, internally sliced signal. This also looks to be working with scaled axes.
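A minimal sketch of that offset/transformation idea; the helper name is hypothetical, and the assumption is that the mask is given as (x, y, r) in the coordinates of the full, unsliced signal:

```python
def shift_mask_to_sliced_frame(mask, sliced_signal):
    """Translate an (x, y, r) mask from the full signal's coordinates into the
    frame of the internally sliced signal, using the same offset subtraction
    as the workaround earlier in the thread."""
    x, y, r = mask
    ax_x, ax_y = sliced_signal.axes_manager.signal_axes
    return (x - ax_x.offset, y - ax_y.offset, r)
```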