
Switch Index to be a frozenset #257

Merged (3 commits into pangeo-forge:master) on Jan 12, 2022

Conversation

@rabernat (Contributor) commented Jan 12, 2022

Simplification suggested by Stephan in google/xarray-beam#32 (comment).

Update:

I was amazed at how simple it was to make this change. I just swapped out our Index class for

Index = frozenset[DimIndex]

and mostly everything just worked, requiring no change to how indexes are created (e.g. idx = Index((some, stuff))). That was a nice surprise.
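
For illustration, a minimal sketch of what "no change" means here, with a hypothetical frozen dataclass standing in for the real DimIndex (Python 3.9+, where builtins are subscriptable):

from dataclasses import dataclass

@dataclass(frozen=True)
class DimIndex:  # hypothetical stand-in for the real DimIndex
    name: str
    index: int

Index = frozenset[DimIndex]  # the whole change, on Python 3.9+

# call sites keep working unchanged: frozenset accepts any iterable
idx = Index((DimIndex("time", 0), DimIndex("lat", 2)))
assert DimIndex("time", 0) in idx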

However, this PR has turned up some nasty mypy-related stuff, and also showed that this simple approach doesn't work on Python 3.8. On 3.8, we would need to do something like

Index = FrozenSet[DimIndex]

However, if we do that, we get the error TypeError: Type FrozenSet cannot be instantiated; use frozenset() instead when we try to instantiate the Index.

I would like to say we will just drop python 3.8. However, afaik, beam does not yet work with 3.9. So it looks like we will need some workarounds.

Does anyone know the best way to solve this?

    Index = FrozenSet[DimIndex]
else:
    # But if we just do this, it won't pass mypy 😖
    Index = frozenset[DimIndex]
@rabernat (Contributor, Author) commented on the snippet above:

Tests have shown that this does not work on python 3.8, even with from __future__ import annotations

@rabernat (Contributor, Author) commented:

Ok I think I found the solution:

class Index(FrozenSet[DimIndex]):
    pass
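
A quick sketch of why the subclass approach works on both 3.8 and 3.9 (again with a hypothetical DimIndex stand-in): typing.FrozenSet itself cannot be called, but a concrete class that inherits from FrozenSet[DimIndex] can be instantiated like any other frozenset subclass.

from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class DimIndex:  # hypothetical stand-in for the real DimIndex
    name: str
    index: int

class Index(FrozenSet[DimIndex]):
    pass

# a frozenset subclass is built from any iterable, so existing call sites still work
idx = Index((DimIndex("time", 0), DimIndex("time", 1)))
assert isinstance(idx, frozenset)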

@rabernat (Contributor, Author) commented:

The upstream-dev tests have surfaced a new issue:

        # https://github.com/pangeo-forge/pangeo-forge-recipes/issues/151
        # Requires >4 GiB of memory to run.
        pd = pytest.importorskip("pandas")
        distributed = pytest.importorskip("distributed")
    
        dates = pd.date_range("2020-05-31T00:00:00", "2021-05-31T23:59:59", freq="30min")
        time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
        pattern = FilePattern(_make_filename_for_memory_usage_test, time_concat_dim,)
    
        recipe = XarrayZarrRecipe(
            pattern, xarray_open_kwargs={"group": "Grid", "decode_coords": "all"}, inputs_per_chunk=1,
        )
    
>       delayed = recipe.to_dask()

tests/recipe_tests/test_graph_memory.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pangeo_forge_recipes/recipes/base.py:28: in to_dask
    return DaskPipelineExecutor.compile(self._compiler())
pangeo_forge_recipes/executors/dask.py:75: in compile
    delayed = Delayed(prev_token[0], hlg)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Delayed('finalize_target-2f31500925ec7e187d227c54ab925c75')
key = 'finalize_target-2f31500925ec7e187d227c54ab925c75'
dsk = HighLevelGraph with 5 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f63f0f6e2b0>
 0. config
 1. cache_input
 2. prepare_target
 3. store_chunk
 4. finalize_target

length = None, layer = None

    def __init__(self, key, dsk, length=None, layer=None):
        self._key = key
        self._dask = dsk
        self._length = length
    
        # NOTE: Layer is used by `to_delayed` in other collections, but not in normal Delayed use
        self._layer = layer or key
        if isinstance(dsk, HighLevelGraph) and self._layer not in dsk.layers:
>           raise ValueError(
                f"Layer {self._layer} not in the HighLevelGraph's layers: {list(dsk.layers)}"
            )
E           ValueError: Layer finalize_target-2f31500925ec7e187d227c54ab925c75 not in the HighLevelGraph's layers: ['config', 'cache_input', 'prepare_target', 'store_chunk', 'finalize_target']

Apparently the way we are constructing high-level graphs is not compatible with the latest dask release.
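
For anyone following along, a minimal sketch of the check that tripped here (the key and layer names are illustrative, not the recipe's real graph): recent dask requires the Delayed key's layer to exist in the HighLevelGraph unless a layer name is passed explicitly.

from dask.delayed import Delayed
from dask.highlevelgraph import HighLevelGraph

# The layer is named "finalize_target", but the key carries a token suffix,
# so the default `layer or key` lookup in Delayed.__init__ finds no match.
key = "finalize_target-2f31500925ec7e187d227c54ab925c75"
hlg = HighLevelGraph(
    layers={"finalize_target": {key: (sum, [1, 2, 3])}},
    dependencies={"finalize_target": set()},
)

try:
    Delayed(key, hlg)  # ValueError on recent dask releases
except ValueError as err:
    print(err)

# Passing the layer name explicitly satisfies the check
out = Delayed(key, hlg, layer="finalize_target")
print(out.compute())  # 6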

I'm inclined to merge this, let the master tests fail, and open a new issue for it.

@rabernat (Contributor, Author) commented:

/run-test-tutorials

@cisaacstern (Member) commented:

> I'm inclined to merge this, let the master tests fail, and open a new issue for it.

👍 upstream-dev tests ftw

@rabernat rabernat merged commit 6194c72 into pangeo-forge:master Jan 12, 2022
@cisaacstern (Member) commented:

Out of curiosity, is there a one-ish sentence way to encapsulate the advantage of a frozenset vs. a tuple here? I wasn't able to immediately grasp the motivation from the linked xarray-beam thread.

Comment on lines -95 to -115
class Index(tuple):
    """A tuple of ``DimIndex`` objects.
    The order of the indexes doesn't matter for comparison."""

    def __new__(self, args: Iterable[DimIndex]):
        # This validation really slows things down because we call Index a lot!
        # if not all((isinstance(a, DimIndex) for a in args)):
        #     raise ValueError("All arguments must be DimIndex.")
        # args_set = set(args)
        # if len(set(args_set)) < len(tuple(args)):
        #     raise ValueError("Duplicate argument detected.")
        return tuple.__new__(Index, args)

    def __str__(self):
        return ",".join(str(dim) for dim in self)

    def __eq__(self, other):
        return (set(self) == set(other)) and (len(self) == len(other))

    def __hash__(self):
        return hash(frozenset(self))
@rabernat (Contributor, Author) replied:

This class was already basically doing the same thing as a frozenset! So this change simplifies our Index type--a key data structure--to use a built-in type. Simpler and less code to maintain.
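
To make that concrete, a tiny sketch (with plain tuples standing in for DimIndex objects): a frozenset already provides the order-independent equality and hashing that the tuple subclass above re-implemented by hand.

a = frozenset([("time", 0), ("lat", 1)])
b = frozenset([("lat", 1), ("time", 0)])
assert a == b              # order doesn't matter for comparison
assert hash(a) == hash(b)  # hashable, so it can still key a dict
assert len({a, b}) == 1    # the two compare and hash as one value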

For context, this is a tiny part of the changes that will move us in the direction of #256.
