This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Caching robustness and determinism #3259

Closed
leventov opened this issue Dec 20, 2024 · 8 comments

@leventov

Hashing strategy for "execution path" refs could lead to inconsistent notebook state

If automatic cell re-run is off and an upstream cell is non-deterministic, then after re-running the upstream cell the notebook is in an inconsistent state. If the notebook is closed at that moment, then upon restart there is no way for the marimo runtime to know that the cached results for the downstream cells are invalid.

Although BlockHasher's doc mentions this (and it isn't even part of the user-facing docs!), the phrase "sources of non-determinism are not accounted for in this implementation, and are left to the user" is not helpful. The core of marimo's caching logic is exactly the place to deal with this, at least in a good fraction of cases.

If the upstream cell is cached persistently (most likely because all cells are cached persistently, i.e., via #3054), then cached content hashing is vital for correctness and corruption prevention (see #3176). Then, if the upstream cell has been run in the current marimo notebook runtime (either forcefully, or because its own dependencies are invalidated), its results (i.e., "content") have also been serialised and cached, and thus the content hash for these latest results has also been recorded.

This means that we can cheaply include the upstream execution path's content hash along with its module hash in the calculation of the block hash, and at the same time handle non-determinism much better.
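The proposal can be sketched as follows (a toy using hashlib; `block_hash` and all names here are illustrative, not marimo's API): mix the recorded content hashes of upstream refs into the downstream block hash, so that an upstream re-run with new outputs changes the downstream cache key.

```python
import hashlib

def block_hash(module_hash: bytes, upstream_content: dict[str, bytes]) -> str:
    """Combine a cell's own module hash with the recorded content hashes
    of its upstream refs, so that re-running an upstream cell with a new
    output invalidates the downstream cache entry."""
    h = hashlib.sha256(module_hash)
    for name in sorted(upstream_content):   # sort for a stable order
        h.update(name.encode())
        h.update(upstream_content[name])
    return h.hexdigest()

stale = block_hash(b"cell-two-code", {"a": b"old-output-digest"})
fresh = block_hash(b"cell-two-code", {"a": b"new-output-digest"})
assert stale != fresh   # the upstream re-run changes the downstream key
```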

File inputs

For even better handling of non-determinism, "file inputs" to cells should be tracked and also made part of the block hash (#3258). Then, if a non-deterministic cell writes different contents to these non-Python files but has the same "content", the downstream cells would still record a cache miss and re-run.

@dmadisetti
Collaborator

dmadisetti commented Dec 20, 2024

For more determinism guarantees, see experimental strict mode: https://github.com/marimo-team/marimo/releases/tag/0.6.20


If automatic cell re-run is off, and an upstream cell is non-deterministic, then upon re-running the upstream cell the notebook is in inconsistent state and if at this moment the notebook is closed, then upon restart there would be no way to know for Marimo runtime that the cache result for the downstream cell are invalid.

Not really, because execution path is really the fallback.

cached content hashing is vital

Yep, that's the primary hashing check.

Caching does the best it can by relying on the contents of the variables it uses, and only falls back on execution mode if it is unable to hash the variable contents. It is possible to have an unhashable datatype that mutates between executions, but then your notebook is not deterministic regardless of persistent_cache.

As long as the variable data is hashable, the cache should be deterministic. If the execution path is utilized, then as a limitation of Python it is not possible to guarantee determinism, since file state and hidden memory can accumulate in ways that marimo cannot detect.
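As a toy model of this two-tier behavior (not marimo's code): content hashing detects mutations whenever the value can be serialized, and anything unpicklable forces the execution-path fallback.

```python
import hashlib
import pickle

def content_hash(value):
    """Digest of a value's pickled contents, or None when the value
    cannot be pickled (the execution-path fallback case)."""
    try:
        return hashlib.sha256(pickle.dumps(value)).hexdigest()
    except (pickle.PicklingError, TypeError):
        return None

data = [1, 2, 3]
before = content_hash(data)
data.append(4)                        # mutation between executions
assert content_hash(data) != before   # hashable: the change is detected

gen = (i for i in range(3))           # generators cannot be pickled
assert content_hash(gen) is None      # unhashable: falls back
```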

You can mitigate this by using strict mode, which requires a deep copy between each cell execution and also provides some other scheduling guarantees.

But this of course comes at a cost in performance.
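The strict-mode idea can be illustrated with a minimal sketch (a hypothetical `run_cell`, not marimo's implementation): deep-copying the scope before each cell runs keeps mutations from leaking into state other cells observe, at the cost of copying everything.

```python
import copy

def run_cell(cell, scope):
    """Run a cell against deep copies of its inputs, so that no
    mutation inside the cell can leak into the shared scope."""
    isolated = {k: copy.deepcopy(v) for k, v in scope.items()}  # costly
    return cell(isolated)

shared = {"xs": [1, 2, 3]}
run_cell(lambda s: s["xs"].append(99), shared)
assert shared["xs"] == [1, 2, 3]   # the mutation stayed local to the cell
```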


There's potential for a COW (copy-on-write) mode, which might be an intermediate here, but I think it might be a little tricky to implement properly.


albeit this isn't even a part of user-facing docs!

BlockHasher is more of an implementation detail than anything stable an end-user should rely on, but maybe a whitepaper is in order.

@dmadisetti
Collaborator

I do like your latter suggestion of tracking file contents.
I think that this would be a good feature when additional files are tracked.

@leventov
Author

leventov commented Dec 21, 2024

Caching does the best it can by relying on the contents of the variables it uses, and only falls back on execution mode if it is unable to hash the variable contents. It is possible to have an unhashable datatype that mutates between executions- but then your notebook is not deterministic regardless of persistent_cache.

Could you please point to the place in code that does this?

Between these lines in BlockHasher.__init__():

marimo/marimo/_save/hash.py

Lines 357 to 386 in 1a1db27

```python
# Hold on to each ref type
self.content_refs = set(refs)
self.execution_refs = set(refs)
self.context_refs = set(refs)
# Default type, means that there are no references at all.
cache_type: CacheType = "Pure"
# TODO: Consider memoizing the serialized contents and hashed cells,
# such that a parent cell's BlockHasher can be used to speed up the
# hashing of child.
# Collect references that will be utilized for a content hash.
content_serialization: dict[Name, bytes] = {}
if refs:
    cache_type = "ContentAddressed"
    refs, content_serialization, stateful_refs = (
        self.collect_for_content_hash(
            refs, scope, ctx, scoped_refs, apply_hash=False
        )
    )
    self.stateful_refs |= stateful_refs
    self.content_refs -= refs
# If there are still unaccounted for references, then fallback on
# execution hashing.
if refs:
    cache_type = "ExecutionPath"
    refs = self.hash_and_dequeue_execution_refs(refs)
    self.execution_refs -= refs | self.content_refs
```

And collect_for_content_hash() impl:

marimo/marimo/_save/hash.py

Lines 453 to 482 in 1a1db27

```python
def collect_for_content_hash(
    self,
    refs: set[Name],
    scope: dict[str, Any],
    ctx: Optional[RuntimeContext],
    scoped_refs: set[Name],
    apply_hash: bool = True,
) -> SerialRefs:
    self._hash = None
    refs, content_serialization, _ = (
        self.serialize_and_dequeue_content_refs(refs, scope)
    )
    # If scoped refs are present, then they are unhashable
    # and we should fallback to normal hash or fail.
    if unhashable := (refs & scoped_refs) - self.execution_refs:
        # pickle is a python default
        import pickle

        failed = []
        exceptions = []
        # By rights, could just fail here - but this final attempt should
        # provide better user experience.
        for ref in unhashable:
            try:
                _hashed = pickle.dumps(scope[ref])
                content_serialization[ref] = type_sign(_hashed, "pickle")
                refs.remove(ref)
            except (pickle.PicklingError, TypeError) as e:
                exceptions.append(e)
                failed.append(ref)
```

I don't see where what you are saying takes place. Moreover, self.execution_refs is used inside the collect_for_content_hash() implementation essentially before the set is "fully" initialised in __init__, i.e., while it still only holds the result of self.execution_refs = set(refs). Isn't this a bug?

@leventov
Author

It is possible to have an unhashable datatype that mutates between executions- but then your notebook is not deterministic regardless of persistent_cache.

A remark here: even if the notebook is knowingly non-deterministic, e.g., it involves LLM generations with non-zero temperature that are used as inputs in downstream cells, it's still good to know definitively at the marimo level (and in a way that survives sudden runtime shutdowns, restarts, etc.) whether these non-deterministic cells are consistent with each other, i.e., whether they resulted (or could have resulted) from a blank-slate notebook run top-to-bottom without using the cache.
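One way this could be made checkable (a sketch of the idea, not an existing marimo mechanism): persist each cell's output digest together with the digests of the inputs it was derived from, and verify the chain on reload.

```python
import hashlib
import json

def digest(x) -> str:
    """Stable digest of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(x, sort_keys=True).encode()).hexdigest()

# Each cache entry records which upstream outputs it was derived from.
entry_one = {"output": "llm says hi", "inputs": {}}
entry_two = {"output": "reply to: llm says hi",
             "inputs": {"one": digest(entry_one["output"])}}

def consistent(entries: dict) -> bool:
    """True iff every recorded input digest still matches the current
    output digest of the upstream cell it points at."""
    return all(
        recorded == digest(entries[name]["output"])
        for entry in entries.values()
        for name, recorded in entry["inputs"].items()
    )

entries = {"one": entry_one, "two": entry_two}
assert consistent(entries)
entries["one"]["output"] = "llm says something else"  # upstream re-ran
assert not consistent(entries)   # cell "two" is now detectably stale
```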

@leventov
Author

primitives.is_pure_function returns True when called with the builtin open() (for opening files).
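To illustrate why a structural purity check is prone to this (a hypothetical, simplified heuristic, NOT marimo's actual implementation): a builtin exposes no bytecode or closure to inspect, so a naive check has nothing to flag and defaults to "pure".

```python
import inspect

IMPURE_DENYLIST = {"print", "input"}   # denylists are easy to get wrong

def is_pure_enough(fn) -> bool:
    """Hypothetical surface-level purity check."""
    if getattr(fn, "__name__", "") in IMPURE_DENYLIST:
        return False
    if inspect.isbuiltin(fn):
        return True   # nothing to analyze: optimistic default
    return fn.__closure__ is None

assert is_pure_enough(open)   # side-effecting, yet classified as "pure"
```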

@dmadisetti
Collaborator

dmadisetti commented Dec 21, 2024

Could you please point to the place in code that does this?

Just a little further down:

if apply_content_hash:

You can refer to the unit tests for edge cases if you are concerned: https://github.com/marimo-team/marimo/blob/1a1db277543da001fe8f1b0c84b7a9632c4645bd/tests/_save/test_hash.py

Re the proposed bug: IIRC, execution_refs in collect_for_content_hash are for @cache, and that conditional is meant to be a no-op otherwise (@cache has somewhat more aggressive cache invalidation).

Excited if you can think of a non-deterministic case! We did spend a fair bit of time thinking through edge cases.

it's still good to definitely know at the marimo level

The hash mechanism is denoted as a prefix to the hash (e.g. E_ C_), and the "hash type" information is available on the resultant cache object. Did you have a different idea here?

primitives.is_pure_function returns True on calling with builtin open() (for opening files)

That is a good observation, I believe that is_pure_function also blindly considers library calls as pure as well. I think #3258 might be the best way forward here.

I'll read over the docs again, but they shouldn't sound like caching is totally bulletproof. I'm fairly certain we call out randomness, network requests, and file system concerns.

@leventov
Author

Re the proposed bug IIRC, execution_refs in collect_for_content_hash are for @cache and that conditional is meant to be a no op otherwise (@cache has a bit more aggressive cache invalidation)

How is it not a certainty that the if on line 467 is never taken?

  • execution_refs equals refs on line 362.
  • On line 462, refs can only shrink (or remain the same).
  • Then (refs & scoped_refs) - self.execution_refs shrinks it even further (taking the intersection with scoped_refs) and then removes execution_refs (still equal to the original refs); the result must be empty.
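The set algebra can be checked mechanically (a minimal model with made-up ref names):

```python
# Minimal model of the sets involved (mirroring the code quoted above).
refs_362 = {"a", "b", "c"}        # refs at the point of assignment
execution_refs = set(refs_362)    # self.execution_refs = set(refs)
scoped_refs = {"b", "c"}

# serialize_and_dequeue_content_refs can only remove elements:
refs = refs_362 - {"a"}

# A subset of refs_362, minus all of refs_362, must be empty:
unhashable = (refs & scoped_refs) - execution_refs
assert unhashable == set()   # the branch can never be taken in this model
```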


The hash mechanism is denoted as a prefix to the hash (e.g. E_ C_), and the "hash type" information is available on the resultant cache object. Did you have a different idea here?

I don't see how serialize_and_dequeue_content_refs handles any case where the variable is a simple data class that I've defined. For example:

```python
@dataclass
class A:
    text: str

@app.cell
def one():
    a = A(text=llm.prompt("blah", temperature=1))

@app.cell
def two():
    _ = llm.prompt("do something with " + a.text)
```

I don't see how the current code prevents the above app from falling into the trap I described in the original message of this issue: cell "one" is re-run (and its persistent_cache is updated), then the app crashes / the laptop powers off / whatever; then marimo reloads the notebook and happily loads the persistent cache for cell "two" without any indication that it's inconsistent with cell "one"'s output.
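The trap can be reproduced with a deliberately simplified cache model, keyed only by the downstream cell's code (an approximation of execution-path-style keying, not marimo's actual cache):

```python
import hashlib

store: dict[str, str] = {}   # "persistent": survives restarts in this sketch

def run_cached(cell_code: str, compute):
    """Return the cached result for this cell code, computing it once."""
    key = hashlib.sha256(cell_code.encode()).hexdigest()
    if key not in store:
        store[key] = compute()
    return store[key]

# First session: cell "one" produces text, cell "two" consumes it.
a_text = "llm output v1"                 # non-deterministic upstream value
out1 = run_cached("two", lambda: "do something with " + a_text)

# Cell "one" re-runs with a new result, then the app "crashes";
# on reload, cell "two" hits its old cache entry unchanged.
a_text = "llm output v2"
out2 = run_cached("two", lambda: "do something with " + a_text)
assert out1 == out2   # stale: inconsistent with the new upstream value
```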

is_pure_function

That is a good observation, I believe that is_pure_function also blindly considers library calls as pure as well. I think #3258 might be the best way forward here.

Ok, I'm now not sure at all what the purpose of that is_pure_function call inside serialize_and_dequeue_content_refs() is. Functions with side effects such as open are tolerated; in fact, the point of caching is often precisely to skip blocks that do something heavy and then write those heavy results out to disk, a database, etc. But non-deterministic functions should also be tolerated, because when I want a cell cached, I clearly don't want to immediately make the cache moot by treating every read of the current time, every random-generator call, etc. as invalidating it (that is, always invalidating). In the vast majority of cases, the user's intent is to invalidate the cell's cache only upon changes to explicit inputs from other cells, not things that change all the time in the background.

So I just don't understand what that call is supposed to "catch" there. (Let alone that, as we found out, is_pure_function is a misnomer, so what it actually does is also very unclear.)

@dmadisetti
Collaborator

is_pure_function is meant to determine whether, on a surface level, mutations can occur within a function.
The "point" of is_pure_function is to reduce the need to rely on execution-path hashes by making certain functions content-addressable. Maybe is_pure_enough_function would be a better name.

Re your simple example: Cell 2 will not run unless Cell 1 has, so I am unsure of the issue. In normal marimo, Cell 2 can be run manually if Cell 1 has failed (note that a cache-mechanism failure results in a rerun), but this behavior can be disabled with the aforementioned strict mode. Your example can also be written in a way such that there is no issue:

```python
@dataclass
class NotContentAddressable:
    text: str

@app.cell
def one():
    with mo.persistent_cache("llm_call") as cache:
        a = NotContentAddressable(text=llm.prompt("blah", temperature=1))

@app.cell
def two():
    text = a.text
    with mo.persistent_cache("llm_feed") as cache:
        # Will now be content hash vs execution hash since text is 'primitive'
        _ = llm.prompt("do something with " + text)
    assert cache.cache_type == "ContentAddressed"
```

If your LLM is non-deterministic, then maybe consider using a seed. Caching and non-determinism do not mix well. From your feature specs, it seems you would like something more like an archive with quick retrieval, which I am not certain falls under the current scope of the caching feature.


Re robustness: I appreciate your concern, but I do think it's a little misplaced unless you can provide concrete examples of where the hashing mechanism fails, barring cases with side effects.

I'm moving this over to a "discussion", since it seems that better documentation/communication is required for caching, but I just made an issue for incorporating tracked files into the cache (#3258).

But feel free to keep the discussion going!

@marimo-team marimo-team locked and limited conversation to collaborators Dec 21, 2024
@dmadisetti dmadisetti converted this issue into discussion #3270 Dec 21, 2024
