
Recipe hashes #349

Merged 24 commits on Apr 30, 2022

Conversation

@cisaacstern (Member)

WIP which aims to close #269 when complete.

Part of a solution for

pangeo-forge/cmip6-feedstock#2 is an example of a current, actual feedstock PR for which this would be useful.

@rabernat (Contributor)

Note that Python's built-in hash function is not deterministic across invocations. I would consider using a different hashing function. For example, you could pickle objects to bytes and compute the SHA-2 of those.

@cisaacstern (Member Author)

Note that Python's built-in hash function is not deterministic across invocations. I would consider using a different hashing function. For example, you could pickle objects to bytes and compute the SHA-2 of those.

I now see that this is noted in the official docs:

By default, the __hash__() values of str and bytes objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

But these same docs still recommend using the built-in hash to define custom __hash__ methods in this style:

def __hash__(self):
    return hash((self.name, self.nick, self.color))

My guess is that because __hash__ is meant for testing equality within a given Python session, the variability between sessions may not be important. This is how I've been imagining using this feature.
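As a quick illustration of that salting (a trivial sketch, not from this PR), running the following in two separate interpreter invocations prints different values unless PYTHONHASHSEED is pinned:

# illustrative only
print(hash("pangeo-forge"))  # changes between interpreter invocations
print(hash((1, "a")))        # tuples containing salted types inherit the variability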

But perhaps you're envisioning writing these hashes to the database, in which case consistency matters?

@cisaacstern (Member Author)

Based on https://eng.lyft.com/hashing-and-equality-in-python-2ea8c738fb9d, I'm now wondering whether this PR would be better titled Recipe equality, with the implementation based on __eq__ (logical equality) rather than __hash__, since implementing __hash__ on mutable objects is generally regarded as a bad idea.
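For reference, dataclasses already give us field-wise __eq__ for free, and by default they mark mutable (non-frozen) classes unhashable. A minimal sketch with an illustrative stand-in class (not the actual recipe class):

from dataclasses import dataclass

@dataclass
class ExampleRecipe:  # illustrative stand-in only
    target_chunks: dict

a = ExampleRecipe({"time": 1})
b = ExampleRecipe({"time": 1})
print(a == b)        # True: @dataclass generates a field-wise __eq__
print(a.__hash__)    # None: eq=True without frozen=True disables hashing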

@rabernat (Contributor)

But perhaps you're envisioning writing these hashes to the database, in which case consistency matters?

Yes, exactly. How else could we compare against previous runs of the recipe, which will occur not only in a different session but in a completely different VM, etc.?

Recipe equality sounds good to me, but I wonder if it relies on hashing under the hood. Having a deterministic SHA-2 hash for each recipe would be a cool feature to have regardless.

@cisaacstern (Member Author)

cisaacstern commented Apr 27, 2022

This now demonstrates a presumably rather naive implementation of what should be consistent SHA-2 hashes for both XarrayZarrRecipe (this should probably be moved into BaseRecipe) and FilePattern. Here, hash calls the newly-added custom __hash__ method on each of these objects, which itself calls hashlib.sha256:

In [1]: from pangeo_forge_recipes.patterns import pattern_from_file_sequence
   ...: from pangeo_forge_recipes.recipes import XarrayZarrRecipe
   ...: 
   ...: pattern = pattern_from_file_sequence(["a", "b"], "time")
   ...: recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 1})

In [2]: hash(pattern)
Out[2]: 1353826159998106392

In [3]: hash(recipe)
Out[3]: 803547776625927664

Python requires __hash__ to return an integer, so we convert the SHA-2 hexdigest to an int before returning it.
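A minimal sketch of that pattern on a toy class (the attribute names are illustrative, not the actual implementation in this PR):

import hashlib
from dataclasses import dataclass

@dataclass
class Example:
    name: str
    chunks: int

    def __hash__(self):
        digest = hashlib.sha256(repr((self.name, self.chunks)).encode("utf-8")).hexdigest()
        # __hash__ must return an int, so coerce the hex digest; built-in hash() then
        # reduces it to a machine-sized int, which is why the values above fit in 64 bits
        return int(digest, 16)

print(hash(Example("a", 1)))  # stable across sessions for the same attribute values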

I'm making some opinionated choices in these methods, for example:

  • XarrayZarrRecipe.__hash__ excludes self.storage_config from consideration
  • FilePattern.__hash__ doesn't consider self.format_function + self.combine_dims directly, but rather takes self.items() (plus the nitems_per_input arg for each of the self.concat_dims) as a proxy for these attributes

Beyond these opinionated choices, both the Python official docs and various blog posts seem to advise rather extreme caution when considering coercing a mutable object to be hashable. Everyone seems to agree that making the object itself frozen is a better path. So @rabernat, what do you think:

  1. Proceed with caution in making mutable objects hashable (I'm somewhat inclined to try this); or,
  2. Pause on this and explore a refactor to move self.storage_config out of the recipe classes entirely (which I believe is the main barrier to making recipe classes frozen, and therefore hashable "without risk")

While option 1 does come with some risk of downstream bugs, I'll note that the official docs don't say never to do this, just to be careful. And though option 2 is less risky in terms of downstream bugs, the opportunity cost of undertaking such a major refactor right now feels like a significant risk in itself, albeit of a different kind.

Regarding what bugs we are exposing ourselves to by hashing mutable objects, the main concern seems to be that frozenset, set, and dict rely on __hash__ for lookups: if a mutable object (with a custom __hash__) is used as a dict key, for example, and is then mutated, lookups in that dict may break. The same goes for membership checks in sets and frozensets.
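A minimal illustration of that failure mode (a hypothetical class, just to show the mechanics):

class Mutable:
    def __init__(self, values):
        self.values = values

    def __hash__(self):
        return hash(tuple(self.values))

    def __eq__(self, other):
        return isinstance(other, Mutable) and self.values == other.values

m = Mutable([1, 2])
d = {m: "found"}
m.values.append(3)              # mutate the key after it has been inserted

print(Mutable([1, 2, 3]) in d)  # False: equal to the key, but its hash differs from the stored one
print(Mutable([1, 2]) in d)     # False: matches the stored hash, but no longer compares equal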

@rabernat (Contributor)

Nice work exploring this complex issue, Charles.

I think your suggestion of excluding storage_config from the recipe's hash is a good one regardless of how we implement it. I think of that as something set at runtime, not part of the recipe itself. The code should reflect this somehow.

In retrospect it might have been good for us to sync up a bit before you got into implementing this. If we had, I would probably have clarified the following point: I don't think we need to use Python's built-in hash function here. I think our needs are a bit different from what that was intended for. So I think we can bypass a bit of boilerplate and find a simpler solution.

If I wanted to compute a hash of a recipe, I would do it like this:

import hashlib
import pickle

serialized = pickle.dumps(recipe)
sha256 = hashlib.sha256(serialized).hexdigest()

This is literally just taking the Python object, pickling it into bytes, and computing the hash of those bytes.

Let's link up tomorrow and discuss the options here.

@cisaacstern (Member Author)

cisaacstern commented Apr 27, 2022

A few observations from experimenting with this approach just now:

  1. Only functions defined at the top level of a module can be pickled, therefore pickle.dumps(recipe) throws an error if recipe was defined with pattern_from_file_sequence. But this is actually less consequential than...
  2. The output of hashlib.sha256(pickle.dumps(recipe)).hexdigest() is not consistent between interpreter sessions:
    # test_rec.py
    
    from pangeo_forge_recipes.patterns import FilePattern, ConcatDim
    from pangeo_forge_recipes.recipes import XarrayZarrRecipe
     
    def format_function(time):
        return f"{time}.nc"
    
    pattern = FilePattern(format_function, ConcatDim("time", [1, 2]))
    recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 1})
    # first interpreter session
    In [1]: import hashlib, pickle; from test_rec import recipe
       ...: hashlib.sha256(pickle.dumps(recipe)).hexdigest()
    Out[1]: '9052fbc0563fa94d916dcbcc332b821f76467b37462b0574719851002f7f3b84'
    # second interpreter session; recipe module, environment, and machine unchanged
    In [1]: import hashlib, pickle; from test_rec import recipe
       ...: hashlib.sha256(pickle.dumps(recipe)).hexdigest()
    Out[1]: 'e1dfbdf8627ae8e02791879e10038215606a159198adf261922dde97ba009cd7'

My guess is that pickle is not designed to produce outputs with deterministic hashes, but rather only to allow for recovery of functionally equivalent objects. I do agree that staying away from built-in hash()/__hash__ saves us a lot of potential problems and boilerplate. Perhaps instead we want to borrow the approach of __hash__ (hashing a tuple of instance attributes), but use it within some method with a novel name, even just recipe.sha()?

@rabernat (Contributor)

  1. Only functions defined at the top level of a module can be pickled, therefore pickle.dumps(recipe) throws an error if recipe was defined with pattern_from_file_sequence

This might be a case for using cloudpickle instead. (That's what Dask uses to send such functions over the wire.)
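For example (a sketch assuming cloudpickle is installed; this addresses picklability only, not hash determinism), a locally defined function that pickle rejects round-trips fine through cloudpickle, which serializes such functions by value:

import pickle
import cloudpickle

def make_format_function(prefix):
    def format_function(time):
        return f"{prefix}/{time}.nc"
    return format_function

fn = make_format_function("data")

try:
    pickle.dumps(fn)  # fails: pickle serializes functions by reference
except (AttributeError, pickle.PicklingError) as err:
    print("pickle:", err)

blob = cloudpickle.dumps(fn)       # cloudpickle captures the function by value
print(cloudpickle.loads(blob)(0))  # -> data/0.nc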

My guess is that pickle is not designed to produce outputs with deterministic hashes, but rather only to allow for recovery of functionally equivalent objects

Interesting discovery. I wonder what the source is of this variability. Let's discuss at our meeting.

@martindurant (Contributor)

I wonder what the source is of this variability.

Pickle definitely makes no such guarantee. My simplest guess is that the order of the objects in the output changes, but it could be any number of effects. You would end up defining your own pickle hooks (__getstate__ or __reduce__), which would look a lot like reimplementing the hash thing again. Dask has a similar problem, but consistency between sessions is not critical there either: it uses the string representation of the set of arguments to a task, plus any objects may define their own dask keys. fsspec's instance keys (possibly used in cache file names) have the same problem, and there is an ongoing discussion about whether we should have a strictly repeatable value or not.

I think it's worth saying that you can probably find a way to guarantee that when two recipe realisations have the same hash, they are equivalent, but not that recipes with different hashes are necessarily different.

@rabernat (Contributor)

This blog post is extremely useful: https://death.andgravity.com/stable-hashing. It explains why the approaches we have tried so far don't work, and it suggests serializing the class to JSON and then computing the hash of the JSON bytes.

That also won't work out of the box here, because our dataclasses contain other dataclasses and/or Python functions (preprocess, etc.) and FilePattern objects.

I think the way forward is something like this, in pseudocode:

d = recipe.to_dict()
# now traverse the dictionary and hash everything that is not json serializable
d_ser = {}
for k, v in d.items():
    if is_json_serializable(v):
        d_ser[k] = v
    elif hasattr(v, "sha256"):
        d_ser[k] = v.sha256()  # we have to define that method, e.g. for FilePattern
    else:
        raise ValueError("Don't know how to serialize this")

So if we can figure out how to compute the sha256 for a FilePattern in a deterministic way, I think we should be good.
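For completeness, the is_json_serializable helper assumed in the pseudocode above could be as simple as this sketch:

import json

def is_json_serializable(value) -> bool:
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False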

This will probably fail on recipes with preprocessing functions. Maybe that's okay for now...we punt and say that if a recipe has preprocessing, it can't be hashed and therefore has to always be rerun. Would that be an okay tradeoff?

If not, we need to solve the problem of how to deterministically serialize / hash a generic python function.
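If we did want to attempt that, one naive sketch (illustrative only) would hash the function's source text, though that misses closed-over values, globals, and default arguments:

import hashlib
import inspect

def function_sha256(func) -> str:
    # hashes only the source text; closures, globals, and defaults are invisible here
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()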

@rabernat (Contributor)

In terms of implementation:

  • Forget about hash() and .__hash__()
  • Implement a method .sha256() on the Recipe base class. It will be something very short like
     def sha256(self):
         from serialization import dataclass_sha256
         return dataclass_sha256(self)
  • Put the logic of hashing / serialization in a standalone module maybe called recipes/serialization.py so it can be unit tested.
  • That logic will look a lot like what we discussed in the comment above; a rough sketch follows below.
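As a rough sketch of what such a serialization module could contain (hypothetical code; the module actually added in this PR may differ in its details):

# recipes/serialization.py (sketch)
import hashlib
import json
from dataclasses import asdict
from enum import Enum

def either_encode_or_hash(obj):
    # JSON-encode enums by value; otherwise fall back to a nested object's own .sha256()
    if isinstance(obj, Enum):
        return obj.value
    if hasattr(obj, "sha256"):
        return obj.sha256().hex()
    raise TypeError(f"object of type {type(obj).__name__} is not serializable")

def dataclass_sha256(dclass, ignore_keys=()):
    d = {k: v for k, v in asdict(dclass).items() if k not in ignore_keys}
    b = json.dumps(d, sort_keys=True, separators=(",", ":"), default=either_encode_or_hash)
    return hashlib.sha256(b.encode("utf-8")).digest()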

@rabernat (Contributor)

rabernat commented Apr 27, 2022

Here is the idea we just discussed for encoding the FilePattern as a blockchain:

import pandas as pd
from hashlib import sha256
from dataclasses import asdict
from json import dumps
from datetime import datetime
from enum import Enum

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

URL_FORMAT = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{time:%Y%m}/oisst-avhrr-v02r01.{time:%Y%m%d}.nc"
)

def make_url(time):
    return URL_FORMAT.format(time=time)

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)

def json_default(thing):
    # custom serializer for FileType, CombineOp, etc.
    if isinstance(thing, Enum):
        return thing.value
    raise TypeError(f"object of type {type(thing).__name__} not serializable")

# https://death.andgravity.com/stable-hashing
def dict_to_sha256(thing):
    b = dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )
    return sha256(b.encode('utf-8')).digest()

def dimindex_to_sha256(di):
    d = asdict(di)
    del d['sequence_len']  # don't want this in the hash; could change
    return dict_to_sha256(d)

def pattern_blockchain(pattern):
    # this contains general aspects of the file pattern
    # we exclude the file pattern and concat dims because they are generated by iterating
    # if they are generated in a different way, we ultimately don't care
    # The only exception is nitems_per_file...need some way to include that
    root = {
        'fsspec_open_kwargs': pattern.fsspec_open_kwargs,
        'query_string_secrets': pattern.query_string_secrets,
        'file_type': pattern.file_type,
        'is_opendap': pattern.is_opendap,
    }

    root_sha256 = dict_to_sha256(root)

    blockchain = [root_sha256]
    for k, v in pattern.items():
        key_hash = b''.join(sorted([dimindex_to_sha256(dimindex) for dimindex in k]))
        value_hash = sha256(v.encode('utf-8')).digest()
        new_hash = key_hash + value_hash
        new_block = sha256(blockchain[-1] + new_hash).digest()
        blockchain.append(new_block)
        
    return blockchain

chain = pattern_blockchain(pattern)
last_hash = chain[-1]

# now make a new file pattern that extends the old one
new_dates = pd.date_range('1981-09-01', '2022-05-01', freq='D')
new_pattern = FilePattern(make_url, ConcatDim("time", new_dates, nitems_per_file=1))
new_chain = pattern_blockchain(new_pattern)

for k, h in zip(new_pattern, new_chain):
    if h == last_hash:
        print(f"Found a match for {h.hex()}!")
        break
        
print(k)

# Output:
# Found a match for e22bdc8a078cec796d6925a3f5dc1996daffe2a2ff4fac2383b3c01706c766d6!
# Index({DimIndex(name='time', index=14764, sequence_len=14853, operation=<CombineOp.CONCAT: 2>)})

This approach only makes sense for FilePatterns with a single concat dim. We would need to take care to make sure a merge dim does not break it.

@cisaacstern reopened this Apr 27, 2022
@cisaacstern (Member Author)

@rabernat, do you think recipe.sha256() should be independent of self.file_pattern, as it is of the storage config?

Adapting the blog post's approach, I'm excluding keys listed in BaseRecipe._hash_exclude_ from the recipe.sha256() calculation. Initially, I had

_hash_exclude_ = ["file_pattern", "storage_config"]

but in the last commit I walked this back to remove "file_pattern", because XarrayZarrRecipe.__post_init__ makes a number of self.file_pattern-based assignments:

self.concat_dim = self.file_pattern.concat_dims[0]
self.cache_metadata = any(
    [v is None for v in self.file_pattern.concat_sequence_lens.values()]
)
self.nitems_per_input = self.file_pattern.nitems_per_input[self.concat_dim]
if self.file_pattern.file_type == FileType.opendap:
    if self.cache_inputs:
        raise ValueError("Can't cache opendap inputs.")
    else:
        self.cache_inputs = False

so even if we exclude self.file_pattern from recipe.sha256(), the self.file_pattern-derived assignments in __post_init__ still mean that the recipe SHA will be influenced by the file pattern (via self.concat_dim, etc.). Some options:

  1. Include self.file_pattern.sha256().hex() in the calculation of recipe.sha256(). This is what I'm doing now, and is probably the easiest path.
  2. If we want recipe.sha256() to be independent of self.file_pattern, we could:
    • Replace all of these assignments with @propertys, which are not included when dataclasses are passed to asdict:

      from dataclasses import asdict, dataclass

      @dataclass
      class A:
          a: int = 1

          @property
          def b(self):
              return self.a + 1

      asdict(A())  # -> {'a': 1}
    • Append all of the derived attributes to self._hash_exclude_:

      # recipes/xarray_zarr.py
      
      @dataclass
      class XarrayZarrRecipe(  # ... 
      
        def __post_init__(self):
            # file pattern-derived assignments, etc.
            self._hash_exclude_ += ["concat_dim", "cache_metadata", "nitems_per_input",  "cache_inputs"]

Thoughts?

@cisaacstern (Member Author)

cisaacstern commented Apr 28, 2022

Personally, I think it actually makes a lot of sense to include self.file_pattern.sha256().hex() as part of the calculation of the recipe SHA. This is easier to implement and, in practical terms, it also means fewer checks at registration time.

When re-running recipes in a feedstock:

  • with independent pattern and recipe SHAs, we need to compare both of these hashes against the database every time, for every recipe in the feedstock, to determine whether to re-run. Only if both match can we consider skipping the re-run.
  • with recipe SHAs dependent on pattern SHAs, we can just check recipe SHAs against the database as a first step for all recipes. If the recipe SHA matches, we can skip re-running without ever checking the pattern SHA. If the recipe SHA is not equal to the prior database record and we are not appending, we can also probably skip checking pattern SHAs and just re-run. If recipe SHAs are mismatched and we are appending, then we can compare the new pattern SHA against the prior one to determine where to pick up with appending.

Is there any advantage to independent recipe and pattern SHAs that I'm missing?

Edit: I guess this is less important than the practical implications, but logically a recipe can't be instantiated without a file pattern, so it makes sense that this should be part of its unique SHA identifier.
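To make the second bullet above concrete, the registration-time check might look roughly like this (all names here are hypothetical, not an actual pangeo-forge API):

def should_rerun(recipe, prior_record, appending=False):
    # prior_record: hypothetical database row holding the SHAs from the last run
    if recipe.sha256() == prior_record["recipe_sha"]:
        return False  # recipe (and therefore its pattern) unchanged: skip the re-run
    if not appending:
        return True   # something changed and we are not appending: full re-run
    # appending case: compare the new pattern SHA (or its hash chain, as sketched
    # earlier) against prior_record["pattern_sha"] to decide where to pick up
    return True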

@rabernat (Contributor)

  • with recipe SHAs dependent on pattern SHAs, we can just check recipe SHAs against the database as a first step for all recipes.

I just reviewed this, and I am very fine with this ☝️ decision.

@cisaacstern (Member Author)

@rabernat, this is ready for review. I've opened the placeholders to indicate what remains to do before we can close pangeo-forge/user-stories#3.

Aside from the release notes, I haven't updated the docs because I haven't been thinking of this as a user-facing feature. If you see a good place to mention it in the docs, I'm happy to write something up.

@rabernat (Contributor) left a comment

This is looking great! Just a few minor comments.

Review comments (since resolved) were left on:

  • docs/pangeo_forge_recipes/development/release_notes.md
  • pangeo_forge_recipes/patterns.py
  • pangeo_forge_recipes/recipes/base.py
  • pangeo_forge_recipes/serialization.py
  • tests/test_serialization.py

@cisaacstern (Member Author)

cisaacstern commented Apr 30, 2022

@rabernat, codecov is working in my browser now. Based on

https://app.codecov.io/gh/pangeo-forge/pangeo-forge-recipes/compare/349/tree/pangeo_forge_recipes

it looks like we have near-100% line coverage on the additions in this PR, including:

  • FilePattern.sha256
  • BaseRecipe.sha256
  • new helpers in .serialization and .patterns

AFAICT, the only new line that's not covered is the TypeError in either_encode_or_hash.

@cisaacstern (Member Author)

the only new line that's not covered is the TypeError in either_encode_or_hash.

a5bcf1e should cover this


Successfully merging this pull request may close these issues: Make recipe class instances hashable?