Using zspy with database-format? #249
Comments
Can you be more specific about the issue on Windows? Does it have to do with the number of files per directory, the nested structure of the directories, or the specific software being used on Windows? I usually copy the folder without zipping and it works fine when synchronising using Dropbox, OneDrive, Nextcloud, etc. What are you using to share the data?
@magnunor Another thing to consider is whether Windows is trying to compress the data even further. I think on Linux systems the zip tool checks whether the underlying data is already compressed and won't "double" compress it, but it's quite possible that Windows doesn't handle that case nearly as well.
Something that I've been meaning to try is using an S3-like file system and the FSStore class. People seem to really like that for partial reads over a network, which might be of interest. Another thing to consider is that the v3 specification includes support for "sharding", which should be quite interesting as well and, I think, improves the performance on Windows computers.
Internal sharing is fine, but for example Zenodo or our website-based filesender can't handle folder structures (at least not easily). I tested this a bit more, and the ZipStore seems to perform pretty well:
The code
Saving the data:
from time import time
import zarr
import dask.array as da
import hyperspy.api as hs
dask_data = da.zeros(shape=(400, 400, 200, 200), chunks=(50, 50, 50, 50))
dask_data[:, :, 80:120, 80:120] = da.random.random((400, 400, 40, 40))
s = hs.signals.Signal2D(dask_data).as_lazy()
###########################
t0 = time()
store = zarr.NestedDirectoryStore('001_test_save_nested_dir.zspy')
s.save(store)
print("NestedDirectory store, save-time: {0}".format(time() - t0))
##########################
t0 = time()
store = zarr.ZipStore('001_test_save_zipstore.zspy')
s.save(store)
print("ZIP store, save-time: {0}".format(time() - t0))
Loading the data:
from time import time
import zarr
import hyperspy.api as hs
##############################
t0 = time()
store = zarr.NestedDirectoryStore("001_test_save_nested_dir.zspy")
s = hs.load(store)
print("NestedDirectory {0}".format(time() - t0))
##############################
t0 = time()
store = zarr.ZipStore('001_test_save_zipstore.zspy')
s = hs.load(store)
print("ZIP {0}".format(time() - t0))
The go-to file format for saving large files in HyperSpy is currently .zspy. It uses the Zarr library to (by default) save each chunk of a dataset as an individual file, through zarr.NestedDirectoryStore. Since the data is stored in individual files, Python can both write and read the data in parallel. This makes it much faster than, for example, HDF5 files (.hspy).
However, one large downside of storing the data this way is that one can end up with several thousand individual files nested within a large number of folders. Sharing this with other people directly is tricky. While it is possible to zip the data, the default zip reader/writer in Windows seems to struggle if the number of files becomes too large. In addition, it is tedious if the receiver has to uncompress the data before they can visualize it.
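To get a feel for how quickly the file count grows, here is a small sketch that counts the chunks of a dataset from its shape and chunk size. It assumes one file per chunk, as in a directory-style store; a few metadata files come on top of that. The helper name count_chunks is made up for this illustration.

```python
from math import ceil, prod


def count_chunks(shape, chunks):
    """Return the number of chunks for a given array shape and chunk shape.

    Each axis is split into ceil(size / chunk_size) pieces, and the total
    is the product over all axes.
    """
    return prod(ceil(s / c) for s, c in zip(shape, chunks))


# Shape and chunking of the test dataset used in this thread.
n_chunks = count_chunks((400, 400, 200, 200), (50, 50, 50, 50))
print(n_chunks)  # 1024
```

Halving the chunk size along every axis of this 4D dataset would multiply the count by 16, which is how one ends up with many thousands of files.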
Zarr supports several database formats, some of which can handle parallel reading and/or writing. With these, it should be possible to keep the parallel read/write while ending up with only one or two files.
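The basic idea of a database-backed store can be sketched with the standard library alone: every chunk becomes a key-value pair in a single file, and a reader can fetch one chunk without unpacking anything. This is only an illustration of the concept, not Zarr's implementation. The table name, the helper functions and the chunk keys are all made up for the example. Zarr 2.x ships its own stores of this kind (one of them backed by SQLite), which would be the thing to benchmark.

```python
import os
import sqlite3
import tempfile


def open_chunk_db(path):
    """Open (or create) a single-file key-value table holding chunk bytes."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS chunks (k TEXT PRIMARY KEY, v BLOB)")
    return con


def put_chunk(con, key, data):
    """Store the bytes of one chunk under its key, replacing any old value."""
    con.execute("REPLACE INTO chunks (k, v) VALUES (?, ?)", (key, data))
    con.commit()


def get_chunk(con, key):
    """Fetch the bytes of a single chunk; raise KeyError if it is missing."""
    row = con.execute("SELECT v FROM chunks WHERE k = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return bytes(row[0])


# Hypothetical chunk keys and dummy payloads, written to one file on disk.
tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "chunks.sqlite")
con = open_chunk_db(db_path)
put_chunk(con, "0.0.0.0", b"\x00" * 16)
put_chunk(con, "0.0.0.1", b"\x01" * 16)

# Read back a single chunk without touching the others.
one_chunk = get_chunk(con, "0.0.0.1")
con.close()
```

Whether such a store keeps the parallel read/write speed of the directory layout depends on the database's locking behaviour, which is what the timings in this issue are meant to find out.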
I am not at all familiar with these types of database formats, so I wanted to see how they perform and whether they could be useful for working on and sharing large multidimensional datasets.
File saving
Making the dataset
Saving the datasets:
File loading
Then loading the same datasets
Note: run these separately, since the file is pretty large.
Results: