Opening speed and initial caching #17

Open
bnlawrence opened this issue Dec 30, 2024 · 5 comments
@bnlawrence
Collaborator

This supersedes #3 and is intended to address h5netcdf performance issues directly.

We want to make sure we don't spend too much time instantiating file instances, and that we get the timing right for reading properties, including b-trees. I will report results here.

@bnlawrence
Collaborator Author

Initial performance results with the current b-tree caching scheme (on variable instantiation in pyfive):

File Opening Time Comparison  da193o_25_day__grid_T_198807-198807.nc (ms)
h5py:    0.177002
pyfive:  0.150879
Variable instantiation for [tos]
h5py:    0.094238
pyfive:  0.434082
Access and calculation time for summation
h5py:   497.699707
pyfive: 520.595459
Total times
h5py:   497.970947
pyfive: 521.180420
File Opening Time Comparison  ch330a.pc19790301-def-short.nc (ms)
h5py:    0.095703
pyfive:  0.207275
Variable instantiation for [UM_m01s16i202_vn1106]
h5py:    0.073975
pyfive:  0.063965
Access and calculation time for summation
h5py:    8.051025
pyfive:  7.377930
Total times
h5py:    8.220703
pyfive:  7.649170

(This code deliberately runs the tests from memory rather than from disk, because the effect of file caching is hard to control for. Small differences in time will matter in practice.)
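For reference, the three-stage breakdown reported above (open, variable instantiation, access and calculation, then a total) could be collected with a small harness along these lines. This is only a sketch, not the actual `opening_speed.py`: the name `time_stages` and its three callable arguments are hypothetical, and stand-in callables are used so that it runs without h5py or pyfive installed. With the real libraries, the callables would wrap the respective open, variable lookup and summation calls.

```python
import time


def time_stages(open_file, get_variable, compute):
    """Time three stages separately and return (result, timings in ms).

    All three arguments are caller-supplied callables (hypothetical names):
    open_file() -> file object, get_variable(file) -> variable,
    compute(variable) -> result.
    """
    timings = {}

    t0 = time.perf_counter()
    f = open_file()
    timings["open"] = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    var = get_variable(f)
    timings["instantiate"] = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    result = compute(var)
    timings["compute"] = (time.perf_counter() - t0) * 1e3

    # The total is the sum of the three stages, as in the figures above.
    timings["total"] = (
        timings["open"] + timings["instantiate"] + timings["compute"]
    )
    return result, timings


# Stand-in callables: a dict plays the role of the file, a list the variable.
fake_file = {"tos": [1.0, 2.0, 3.0]}
result, timings = time_stages(
    open_file=lambda: fake_file,
    get_variable=lambda f: f["tos"],
    compute=sum,
)
print(result, timings)
```

Timing each stage separately matters here because the two libraries defer work to different stages, so only the per-stage figures show where the time goes.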

@bnlawrence
Collaborator Author

Some more figures, now including S3 access. Here too we will have to think about the influence of caching to be sure of what we are seeing, but for now:

File Opening Time Comparison  da193o_25_day__grid_T_198807-198807.nc  (ms, S3=False)
h5py:    0.155029
pyfive:  0.158936
Variable instantiation for [tos]
h5py:    0.093018
pyfive:  0.422119
Access and calculation time for summation
h5py:   504.198730
pyfive: 505.630127
Total times
h5py:   504.446777
pyfive: 506.211182
File Opening Time Comparison  ch330a.pc19790301-def-short.nc  (ms, S3=False)
h5py:    0.093994
pyfive:  0.191895
Variable instantiation for [UM_m01s16i202_vn1106]
h5py:    0.076172
pyfive:  0.067627
Access and calculation time for summation
h5py:   10.077148
pyfive:  7.379150
Total times
h5py:   10.247314
pyfive:  7.638672
File Opening Time Comparison  da193o_25_day__grid_T_198807-198807.nc  (ms, S3=True)
h5py:   16532.587891
pyfive:  0.408936
Variable instantiation for [tos]
h5py:    0.476074
pyfive:  2.509277
Access and calculation time for summation
h5py:   15575.990723
pyfive: 25531.353027
Total times
h5py:   32109.054688
pyfive: 25534.271240
File Opening Time Comparison  ch330a.pc19790301-def-short.nc  (ms, S3=True)
h5py:   16700.600098
pyfive: 68445.999023
Variable instantiation for [UM_m01s16i202_vn1106]
h5py:   66742.625977
pyfive:  0.409912
Access and calculation time for summation
h5py:   22.507080
pyfive: 20.614990
Total times
h5py:   83465.733154
pyfive: 68467.023926

@bnlawrence
Collaborator Author

Ah, yes, well, that really wasn't fair: pyfive ran second and got the benefit of caching. Here is some fairer data which avoids reusing cached data:

(h5play) bnl28@MX6H7D9YGP bnl % python opening_speed.py
File Opening Time Comparison  da193o_25_day__grid_T_198807-198807.nc  (ms, S3=False)
h5py:    0.201941
pyfive:  0.191927
Variable instantiation for [tos]
h5py:    0.108957
pyfive:  0.396967
Access and calculation time for summation
h5py:   526.242733
pyfive: 533.447742
Total times
h5py:   526.553631
pyfive: 534.036636
File Opening Time Comparison  ch330a.pc19790301-def-short.nc  (ms, S3=False)
h5py:    0.129938
pyfive:  0.221729
Variable instantiation for [UM_m01s16i202_vn1106]
h5py:    0.090837
pyfive:  0.066280
Access and calculation time for summation
h5py:    6.794214
pyfive:  6.704807
Total times
h5py:    7.014990
pyfive:  6.992817
File Opening Time Comparison  da193o_25_day__grid_T_198807-198807.nc  (ms, S3=True)
h5py:   18064.429760
pyfive: 17285.735846
Variable instantiation for [tos]
h5py:    0.346184
pyfive:  4.059792
Access and calculation time for summation
h5py:   15244.802713
pyfive: 15527.326107
Total times
h5py:   33309.578657
pyfive: 32817.121744
File Opening Time Comparison  ch330a.pc19790301-def-short.nc  (ms, S3=True)
h5py:   16712.314129
pyfive: 84538.033009
Variable instantiation for [UM_m01s16i202_vn1106]
h5py:   50832.113028
pyfive:  0.406981
Access and calculation time for summation
h5py:   22.121906
pyfive: 26.489019
Total times
h5py:   67566.549063
pyfive: 84564.929008
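
The thread does not say how cache reuse was avoided in this run. One possible approach, sketched here under that caveat, is to give each library its own freshly created source object so that whichever runs second cannot inherit a warm cache. Every name below (`run_isolated`, `CountingSource`, `bench`) is hypothetical, and the toy source only counts cache misses rather than touching S3.

```python
def run_isolated(benchmarks, make_source):
    """Run each benchmark against its own freshly created source."""
    results = {}
    for name, bench_fn in benchmarks.items():
        source = make_source()  # new object per run: no shared cache state
        results[name] = bench_fn(source)
    return results


class CountingSource:
    """Toy source whose cache records which keys were already fetched."""

    def __init__(self):
        self.cache = set()
        self.fetches = 0

    def read(self, key):
        if key not in self.cache:
            self.fetches += 1  # a miss stands in for a remote fetch
            self.cache.add(key)
        return key


def bench(source):
    """Return the number of cache misses incurred during this run."""
    before = source.fetches
    source.read("header")
    source.read("btree")
    return source.fetches - before


# Isolated: each label gets a new source, so both pay for their own fetches.
isolated = run_isolated({"h5py": bench, "pyfive": bench}, CountingSource)

# Shared: the second run finds everything cached and looks artificially fast.
shared = CountingSource()
shared_results = {name: bench(shared) for name in ("h5py", "pyfive")}

print(isolated)        # {'h5py': 2, 'pyfive': 2}
print(shared_results)  # {'h5py': 2, 'pyfive': 0}
```

The shared case reproduces the unfairness of the earlier run in miniature: the second benchmark does no fetching at all.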

@bnlawrence
Collaborator Author

These results are somewhat perplexing, though. Look at the S3 data, bearing in mind that the first file is simple and small while the second is complex and bigger, and that this calculation requires all the data to move across home broadband. The two libraries load lazily in different ways, and the cost shows up in either the opening or the variable instantiation. But the scattergun search for information within the HDF5 file has a strange effect in the complex file. We should try a nicely packed version of it as well!

@bnlawrence
Collaborator Author

The expected advantage here is that the pyfive library is completely thread-safe, so we can do what we like in parallel with it. The next step is to see whether that is a real advantage.
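
As a starting point for that test, a threaded version of the summation might look roughly like the sketch below. It assumes a caller-supplied `read_block(i)` that is safe to call from several threads at once, which is exactly the property to be verified. `parallel_sum` and `read_block` are hypothetical names, and an in-memory list stands in for the file.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(read_block, n_blocks, max_workers=4):
    """Sum n_blocks blocks, reading them concurrently in worker threads.

    read_block(i) must be safe to call from multiple threads (assumption).
    """
    def block_sum(i):
        return sum(read_block(i))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order.
        return sum(pool.map(block_sum, range(n_blocks)))


# Stand-in data: 8 blocks of 10 values each, covering 0.0 .. 79.0.
data = [[float(i * 10 + j) for j in range(10)] for i in range(8)]


def read_block(i):
    return data[i]


serial = sum(sum(read_block(i)) for i in range(8))
parallel = parallel_sum(read_block, 8)
print(serial, parallel)
```

Comparing the serial and threaded results checks correctness first. Whether threading actually reduces wall-clock time for S3 reads would then need measuring.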
