[DRAFT] Pyfive Masterpiece #189

Draft
wants to merge 101 commits into
base: main
Changes from 13 commits
e9f9ee6
A few wrinkles to sort out, some changes to pyfive needed, but this i…
Feb 29, 2024
eb9e853
bnl_test works with pyfive. The tests are borked because they're ridd…
Mar 3, 2024
0d7bf55
Gzip decompression working. Will document issues remaining in #188
Mar 4, 2024
d73fcd8
Split compression and filters
Mar 4, 2024
0c4272f
Slight doc improvement in hdf2numcodec
Mar 4, 2024
c20b259
Need to turn off decoding filter pipeline if there is none.
Mar 4, 2024
68e49d0
Support for contiguous data. Passing lots of tests now.
Mar 5, 2024
b745d5e
Remove the file context manager for now
Mar 5, 2024
74f8662
A couple more h5py lurkers and one more bug squashed.
Mar 5, 2024
b628a68
removed redundant test file
valeriupredoi Mar 5, 2024
1803e7f
removed redundant test cases
valeriupredoi Mar 5, 2024
f5a375c
making sure we know Pyfive is run haha
valeriupredoi Mar 5, 2024
d1c9282
fix bigger data test with native pyfive exception
valeriupredoi Mar 5, 2024
69c3dca
fix package test
valeriupredoi Mar 5, 2024
c50c3d2
mark lock test as xfail
valeriupredoi Mar 5, 2024
09b4b3f
toy for testing missing test
Mar 5, 2024
03254dc
Merge branches 'pyfive' and 'pyfive' of github.com:valeriupredoi/PyAc…
Mar 5, 2024
8a7e35b
turn off some printing for now
valeriupredoi Mar 5, 2024
cacb94e
starting to understand the test flakiness
valeriupredoi Mar 5, 2024
ce0479c
Addresses issues with missing data and with thread safety (basically …
Mar 5, 2024
39884bc
Merge branch 'pyfive' of github.com:valeriupredoi/PyActiveStorage int…
Mar 5, 2024
02d6120
Fixed the regression wrt files with no chunking introduced by thread …
Mar 6, 2024
2ffd644
Made some changes to Active just so we can use masked data as a stand…
Mar 6, 2024
408558f
fix test compression
valeriupredoi Mar 6, 2024
3294052
fix the other compression test
valeriupredoi Mar 6, 2024
ca30bb2
fix mocker tests
valeriupredoi Mar 6, 2024
c1b2823
minor fix to active
valeriupredoi Mar 6, 2024
3c1b420
ass test todo
valeriupredoi Mar 6, 2024
589c772
fix last test
valeriupredoi Mar 6, 2024
c022040
checkout and install pyfive in dev mode
valeriupredoi Mar 6, 2024
016242e
fix Bryans test
valeriupredoi Mar 6, 2024
9bfb099
install Pyfive in OSX GHA too
valeriupredoi Mar 6, 2024
68827eb
run the s3 tests too with Minio
valeriupredoi Mar 6, 2024
1d925dc
run the s3 tests too with remote reductionist
valeriupredoi Mar 6, 2024
400e230
Fixed incorrect handling of missing data in reductionist encode_missing
Mar 6, 2024
afe3546
it's not a test!
Mar 6, 2024
44980b7
Fixed HDF5 NetCDF attribute issue
Mar 6, 2024
98a74c5
A little more care was needed ...
Mar 6, 2024
4bea4bb
new pyfive File API
valeriupredoi Mar 6, 2024
cf1ff2d
print out the attrs
valeriupredoi Mar 6, 2024
2808dc9
print out file attrs
valeriupredoi Mar 6, 2024
38add11
V made me do it
Mar 7, 2024
277762d
use correct attributes compression and shuffle
valeriupredoi Mar 7, 2024
5bd0fe1
use issue60 branch
valeriupredoi Mar 7, 2024
74beccd
Unit test for creating reductionist json without s3
Mar 8, 2024
e91d02c
Reductionist was incorrectly encoding offset and size, and the test f…
Mar 8, 2024
3bb0b8b
Will this break actions?
Mar 8, 2024
12d0f0b
Merge branch 'pyfive' of github.com:valeriupredoi/PyActiveStorage int…
Mar 8, 2024
7a03fa4
ignore bnl for testing
valeriupredoi Mar 8, 2024
aa1a481
correct imports
valeriupredoi Mar 8, 2024
e5430fd
fix test with correct loaders and imports
valeriupredoi Mar 8, 2024
dd57664
same for this one
valeriupredoi Mar 8, 2024
f317149
clean test module
valeriupredoi Mar 8, 2024
4dafa3e
reduce test activity
valeriupredoi Mar 8, 2024
e9aff71
plop Pyfive in the PR test wkflow too
valeriupredoi Mar 8, 2024
08c8b87
Adding the ability for Active to optionally self report metrics
Mar 11, 2024
2475b18
Merge branch 'pyfive' of github.com:valeriupredoi/PyActiveStorage int…
Mar 11, 2024
20acfdd
Adding some experimental data
Mar 12, 2024
54967d2
Plotting and data code.
Mar 12, 2024
20b36d1
Added more organised metrics to Active, available after a getitem fro…
Mar 14, 2024
ddcff5c
Active _version1 and _version2 differ again
Mar 14, 2024
df6b5bb
Active v1 on S3 now works. Stopping default verbosity from reductionist
Mar 19, 2024
ef001ee
test runner
Mar 23, 2024
7c1e704
all-zero chunks
davidhassell Mar 25, 2024
209d3fa
new test file
davidhassell Mar 25, 2024
620b403
fix previously failing Anon bkt test
valeriupredoi Mar 25, 2024
b49ffa0
timeOut for v2, v1 does the pancakes now
valeriupredoi Mar 25, 2024
e99a509
timeOut for v2, v1 does the pancakes now
valeriupredoi Mar 25, 2024
d1dc658
and again v2
valeriupredoi Mar 25, 2024
b43dd21
another pancake
valeriupredoi Mar 25, 2024
4816902
Merge pull request #196 from valeriupredoi/zero-chunk
valeriupredoi Mar 25, 2024
b5d5925
zero count
davidhassell Mar 26, 2024
8b46b4b
fix test
valeriupredoi Mar 27, 2024
e4f2067
Merge pull request #197 from valeriupredoi/zero-chunk
valeriupredoi Apr 25, 2024
648684a
Merge branch 'main' into pyfive
valeriupredoi Jul 22, 2024
2152ac5
add a conda list cmd
valeriupredoi Jul 22, 2024
0a4297a
Merge branch 'main' into pyfive
valeriupredoi Aug 5, 2024
74868b7
Merge branch 'main' into pyfive
valeriupredoi Aug 12, 2024
4b18a17
Merge branch 'main' into pyfive
valeriupredoi Oct 24, 2024
5f67387
Merge branch 'main' into pyfive
valeriupredoi Jan 20, 2025
8b08743
start work to port new Pyfive
valeriupredoi Jan 20, 2025
ab64817
correct test for new Pyfive API
valeriupredoi Jan 20, 2025
5c8095e
Conforming to current index structure a bit better
Jan 21, 2025
fe88026
new GH repo and branch for Pyfive in GAs
valeriupredoi Jan 21, 2025
d3a8f92
Merge branch 'fix_pyfive_branch' of https://github.com/NCAS-CMS/PyAct…
valeriupredoi Jan 21, 2025
e68a0fe
Use the new pyfive method to get chunk position in file. Includes bad…
Jan 21, 2025
ca5022e
Merge remote-tracking branch 'refs/remotes/origin/fix_pyfive_branch' …
Jan 21, 2025
a728ef0
fix the mocker json test
valeriupredoi Jan 21, 2025
3df5b36
new fh method
valeriupredoi Jan 21, 2025
762d85a
small typo
valeriupredoi Jan 21, 2025
f3c35e0
Cleared up the compression problem
Jan 21, 2025
41c9ce9
Merge remote-tracking branch 'refs/remotes/origin/fix_pyfive_branch' …
Jan 21, 2025
bd1f06f
Merge pull request #234 from NCAS-CMS/fix_pyfive_branch
valeriupredoi Jan 21, 2025
2fabbe3
reinstate tests
valeriupredoi Jan 21, 2025
68851e4
remove bnl references
valeriupredoi Jan 22, 2025
1d13c6a
remove workflow that runs bnl bucket test
valeriupredoi Jan 22, 2025
29fd648
replace bnl with relevant info
valeriupredoi Jan 22, 2025
4792342
remove commented out Bryan server
valeriupredoi Jan 22, 2025
43ea853
rm ref Bryan
valeriupredoi Jan 22, 2025
549711c
Merge branch 'main' into pyfive
valeriupredoi Feb 11, 2025
e73712a
use wacasoft branch of Pyfive fork
valeriupredoi Feb 11, 2025
6 changes: 3 additions & 3 deletions .github/workflows/run-test-push.yml
@@ -29,12 +29,12 @@ jobs:
  use-mamba: true
  - run: conda --version
  - run: python -V
- - name: Install development version of bnlawrence/Pyfive:issue60
+ - name: Install development version of NCAS-CMS/Pyfive:h5netcdf
  run: |
  cd ..
- git clone https://github.com/bnlawrence/pyfive.git
+ git clone https://github.com/NCAS-CMS/pyfive.git
  cd pyfive
- git checkout issue60
+ git checkout h5netcdf
  pip install -e .
  - run: pip install -e .
  - run: conda list
12 changes: 6 additions & 6 deletions .github/workflows/run-tests.yml
@@ -34,12 +34,12 @@ jobs:
  use-mamba: true
  - run: conda --version
  - run: python -V
- - name: Install development version of bnlawrence/Pyfive:issue60
+ - name: Install development version of NCAS-CMS/Pyfive:h5netcdf
  run: |
  cd ..
- git clone https://github.com/bnlawrence/pyfive.git
+ git clone https://github.com/NCAS-CMS/pyfive.git
  cd pyfive
- git checkout issue60
+ git checkout h5netcdf
  pip install -e .
  - run: conda list
  - run: pip install -e .
@@ -66,12 +66,12 @@ jobs:
  use-mamba: true
  - run: conda --version
  - run: python -V
- - name: Install development version of bnlawrence/Pyfive:issue60
+ - name: Install development version of NCAS-CMS/Pyfive:h5netcdf
  run: |
  cd ..
- git clone https://github.com/bnlawrence/pyfive.git
+ git clone https://github.com/NCAS-CMS/pyfive.git
  cd pyfive
- git checkout issue60
+ git checkout h5netcdf
  pip install -e .
  - run: conda list
  - run: mamba install -c conda-forge git
6 changes: 3 additions & 3 deletions .github/workflows/test_s3_minio.yml
@@ -56,12 +56,12 @@ jobs:
  python-version: ${{ matrix.python-version }}
  miniforge-version: "latest"
  use-mamba: true
- - name: Install development version of bnlawrence/Pyfive:issue60
+ - name: Install development version of NCAS-CMS/Pyfive:h5netcdf
  run: |
  cd ..
- git clone https://github.com/bnlawrence/pyfive.git
+ git clone https://github.com/NCAS-CMS/pyfive.git
  cd pyfive
- git checkout issue60
+ git checkout h5netcdf
  pip install -e .
  - name: Install PyActiveStorage
  run: |
6 changes: 3 additions & 3 deletions .github/workflows/test_s3_remote_reductionist.yml
@@ -51,12 +51,12 @@ jobs:
  python-version: ${{ matrix.python-version }}
  miniforge-version: "latest"
  use-mamba: true
- - name: Install development version of bnlawrence/Pyfive:issue60
+ - name: Install development version of NCAS-CMS/Pyfive:h5netcdf
  run: |
  cd ..
- git clone https://github.com/bnlawrence/pyfive.git
+ git clone https://github.com/NCAS-CMS/pyfive.git
  cd pyfive
- git checkout issue60
+ git checkout h5netcdf
  pip install -e .
  - name: Install PyActiveStorage
  run: |
29 changes: 16 additions & 13 deletions activestorage/active.py
@@ -5,6 +5,7 @@
  import urllib
  import pyfive
  import time
+ from pyfive.h5d import StoreInfo

  import s3fs

@@ -307,8 +308,8 @@ def _get_selection(self, *args):
  name = self.ds.name
  dtype = np.dtype(self.ds.dtype)
  # hopefully fix pyfive to get a dtype directly
- array = pyfive.ZarrArrayStub(self.ds.shape, self.ds.chunks)
- ds = self.ds._dataobjects
+ array = pyfive.indexing.ZarrArrayStub(self.ds.shape, self.ds.chunks)
+ ds = self.ds.id

  self.metric_data['args'] = args
  self.metric_data['dataset shape'] = self.ds.shape
@@ -318,7 +319,7 @@ def _get_selection(self, *args):
  else:
  compressor, filters = decode_filters(ds.filter_pipeline , dtype.itemsize, name)

- indexer = pyfive.OrthogonalIndexer(*args, array)
+ indexer = pyfive.indexing.OrthogonalIndexer(*args, array)
  out_shape = indexer.shape
  #stripped_indexer = [(a, b, c) for a,b,c in indexer]
  drop_axes = indexer.drop_axes and keepdims
@@ -334,7 +335,7 @@ def _from_storage(self, ds, indexer, chunks, out_shape, out_dtype, compressor, f
  out = []
  counts = []
  else:
- out = np.empty(out_shape, dtype=out_dtype, order=ds.order)
+ out = np.empty(out_shape, dtype=out_dtype, order=ds._order)
  counts = None # should never get touched with no method!

  # Create a shared session object.
@@ -364,10 +365,10 @@ def _from_storage(self, ds, indexer, chunks, out_shape, out_dtype, compressor, f

  if ds.chunks is not None:
  t1 = time.time()
- ds._get_chunk_addresses()
+ # ds._get_chunk_addresses()
  t2 = time.time() - t1
  self.metric_data['indexing time (s)'] = t2
- self.metric_data['chunk number'] = len(ds._zchunk_index)
+ # self.metric_data['chunk number'] = len(ds._zchunk_index)
  chunk_count = 0
  t1 = time.time()
  with concurrent.futures.ThreadPoolExecutor(max_workers=self._max_threads) as executor:
@@ -464,15 +465,17 @@ def _process_chunk(self, session, ds, chunks, chunk_coords, chunk_selection, cou
  #FIXME: Do, we, it's not actually used?

  """
-
- offset, size, filter_mask = ds.get_chunk_details(chunk_coords)
+
+ # retrieve coordinates from chunk index
+ storeinfo = ds.get_chunk_info_from_chunk_coord(chunk_coords)
+ offset, size = storeinfo.byte_offset, storeinfo.size
  self.data_read += size

  if self.storage_type == 's3' and self._version == 1:

- tmp, count = reduce_opens3_chunk(ds.fh, offset, size, compressor, filters,
+ tmp, count = reduce_opens3_chunk(ds._fh, offset, size, compressor, filters,
  self.missing, ds.dtype,
- chunks, ds.order,
+ chunks, ds._order,
  chunk_selection, method=self.method
  )

@@ -499,7 +502,7 @@ def _process_chunk(self, session, ds, chunks, chunk_coords, chunk_selection, cou
  size, compressor, filters,
  self.missing, np.dtype(ds.dtype),
  chunks,
- ds.order,
+ ds._order,
  chunk_selection,
  operation=self._method)
  else:
@@ -518,7 +521,7 @@ def _process_chunk(self, session, ds, chunks, chunk_coords, chunk_selection, cou
  size, compressor, filters,
  self.missing, np.dtype(ds.dtype),
  chunks,
- ds.order,
+ ds._order,
  chunk_selection,
  operation=self._method)
  elif self.storage_type=='ActivePosix' and self.version==2:
@@ -531,7 +534,7 @@ def _process_chunk(self, session, ds, chunks, chunk_coords, chunk_selection, cou
  # although we will version changes.
  tmp, count = reduce_chunk(self.filename, offset, size, compressor, filters,
  self.missing, ds.dtype,
- chunks, ds.order,
+ chunks, ds._order,
  chunk_selection, method=self.method)

  if self.method is not None:
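
The switch from `get_chunk_details` to `get_chunk_info_from_chunk_coord` changes the return shape: the old call yields an `(offset, size, filter_mask)` tuple, while the new one yields an object read through `byte_offset` and `size`. Below is a minimal sketch of a shim that tolerates either form, assuming only the method and attribute names visible in this diff. `chunk_location`, `FakeStoreInfo`, and the two stand-in dataset classes are hypothetical; they exist only so the example runs without pyfive installed.

```python
from collections import namedtuple

# Hypothetical stand-in for pyfive.h5d.StoreInfo. Only the two fields read
# in this diff are modelled; the real class may carry more.
FakeStoreInfo = namedtuple("FakeStoreInfo", ["byte_offset", "size"])


def chunk_location(ds, chunk_coords):
    """Return (offset, size) for a chunk using whichever API `ds` offers."""
    if hasattr(ds, "get_chunk_info_from_chunk_coord"):
        # Newer style: an object exposing byte_offset and size.
        info = ds.get_chunk_info_from_chunk_coord(chunk_coords)
        return info.byte_offset, info.size
    # Older style: a three-tuple whose filter mask is not needed here.
    offset, size, _filter_mask = ds.get_chunk_details(chunk_coords)
    return offset, size


class OldStyleDataset:
    """Hypothetical dataset exposing only the older call."""

    def get_chunk_details(self, chunk_coords):
        return 4096, 512, 0


class NewStyleDataset:
    """Hypothetical dataset exposing only the newer call."""

    def get_chunk_info_from_chunk_coord(self, chunk_coords):
        return FakeStoreInfo(byte_offset=4096, size=512)


print(chunk_location(OldStyleDataset(), (0, 0)))
print(chunk_location(NewStyleDataset(), (0, 0)))
```

A shim like this would only matter while both pyfive branches are in use; once CI pins a single branch, the direct call in the diff is simpler.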
2 changes: 1 addition & 1 deletion activestorage/hdf2numcodec.py
Original file line number Diff line number Diff line change
@@ -28,7 +28,7 @@ def decode_filters(filter_pipeline, itemsize, name):
  for filter in filter_pipeline:

  filter_id=filter['filter_id']
- properties = filter['client_data_values']
+ properties = filter['client_data']


  # We suppor the following
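
The key holding each filter's parameters is renamed here from `client_data_values` to `client_data`. Below is a minimal sketch of a lookup that accepts either key, assuming filter pipeline entries are plain dicts as the indexing in this hunk suggests. `filter_properties` and the sample entries are hypothetical; the filter IDs 1 (deflate) and 2 (shuffle) are the standard HDF5 values.

```python
def filter_properties(filter_entry):
    """Return a filter's client data under either the new or the old key."""
    for key in ("client_data", "client_data_values"):
        if key in filter_entry:
            return filter_entry[key]
    raise KeyError("filter entry has neither 'client_data' nor 'client_data_values'")


# Hypothetical pipeline entries, one per key spelling.
new_style = {"filter_id": 1, "client_data": (4,)}           # deflate, level 4
old_style = {"filter_id": 2, "client_data_values": (8,)}    # shuffle, 8-byte items

for entry in (new_style, old_style):
    print(entry["filter_id"], filter_properties(entry))
```

The new key is tried first so that an entry carrying both spellings resolves to the current one.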
11 changes: 6 additions & 5 deletions tests/test_reductionist_json.py
@@ -18,9 +18,9 @@ def __init__(self, f, v):
  self.f = pyfive.File(f)
  ds = self.f[v]
  self.dtype = np.dtype(ds.dtype)
- self.array = pyfive.ZarrArrayStub(ds.shape, ds.chunks or ds.shape)
+ self.array = pyfive.indexing.ZarrArrayStub(ds.shape, ds.chunks or ds.shape)
  self.missing = get_missing_attributes(ds)
- ds = ds._dataobjects
+ ds = ds.id
  self.ds = ds
  def __getitem__(self, args):
  if self.ds.filter_pipeline is None:
@@ -30,12 +30,13 @@ def __getitem__(self, args):
  if self.ds.chunks is not None:
  self.ds._get_chunk_addresses()

- indexer = pyfive.OrthogonalIndexer(args, self.array)
+ indexer = pyfive.indexing.OrthogonalIndexer(args, self.array)
  for chunk_coords, chunk_selection, out_selection in indexer:
- offset, size, filter_mask = self.ds.get_chunk_details(chunk_coords)
+ storeinfo = self.ds.get_chunk_info_from_chunk_coord(chunk_coords)
+ offset, size = storeinfo.byte_offset, storeinfo.size
  jd = reductionist.build_request_data('a','b','c',
  offset, size, compressor, filters, self.missing, self.dtype,
- self.array._chunks,self.ds.order,chunk_selection)
+ self.array._chunks,self.ds._order,chunk_selection)
  js = json.dumps(jd)
  return None
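
Both `active.py` and this test now read the memory layout from `_order` rather than `order`. Below is a minimal sketch of a lookup that works against either attribute name, assuming nothing about pyfive beyond the two names seen in this diff. `storage_order` is hypothetical, and `SimpleNamespace` objects stand in for the dataset.

```python
from types import SimpleNamespace


def storage_order(ds):
    """Return the array order from `_order` if present, else from `order`."""
    for name in ("_order", "order"):
        if hasattr(ds, name):
            return getattr(ds, name)
    raise AttributeError("dataset exposes neither '_order' nor 'order'")


# Hypothetical stand-ins for the two dataset flavours.
new_ds = SimpleNamespace(_order="C")
old_ds = SimpleNamespace(order="F")

print(storage_order(new_ds), storage_order(old_ds))
```

Because `_order` is a private attribute, code that depends on it may break again on a later pyfive release; a helper like this keeps the dependency in one place.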