
feat(api): avoid caching physical tables #9976

Open · wants to merge 1 commit into base: main

Conversation

@jcrist (Member) commented Aug 30, 2024

Fixes #6195.

This makes table_expr.cache() a no-op for physical tables and other expressions not worth caching (defined here as simple column subsets of physical tables).

t = con.table("my_table")

t2 = t.cache()  # this is a no-op since t points to an existing table
assert t2.op() is t.op()

subset = t.select("a", "b")
subset2 = subset.cache()  # this is a no-op, since subset is a simple selection of a concrete table
assert subset2.op() is subset.op()

expr = t.filter(t.a > 10)
expr2 = expr.cache()  # this isn't a no-op, since expr does some actual work
assert expr2.op() is not expr.op()
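The "simple column subset of a physical table" rule above can be sketched in pure Python. The classes below are minimal stand-ins for ibis's ops.PhysicalTable, ops.Project, and ops.Field, not the real internals:

```python
# Minimal stand-ins for ibis's ops.PhysicalTable, ops.Project, and
# ops.Field -- NOT the real ibis internals, just enough to show the rule.

class PhysicalTable:          # a table that already exists in the backend
    pass

class Field:                  # a bare column reference
    pass

class Project:                # a selection over a parent relation
    def __init__(self, parent, values):
        self.parent = parent
        self.values = values  # mapping of column name -> value expression

class Filter:                 # stands in for any op that does real work
    def __init__(self, parent):
        self.parent = parent

def is_trivial(op) -> bool:
    """True if op is a physical table or a simple column subset of one."""
    while isinstance(op, Project):
        # A projection with any computed column is worth caching.
        if not all(isinstance(v, Field) for v in op.values.values()):
            return False
        op = op.parent
    return isinstance(op, PhysicalTable)

t = PhysicalTable()
assert is_trivial(t)                           # .cache() would be a no-op
assert is_trivial(Project(t, {"a": Field()}))  # simple subset: also a no-op
assert not is_trivial(Filter(t))               # does real work: cache it
```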

In the original issue (#6195) I wanted to only avoid caching things that were backed by an existing physical table (don't cache tables, do cache views, do cache expressions, ...). In practice writing _is_this_a_real_database_table for every backend would be annoying (maybe less annoying once the .ddl accessor lands). I'm also not sure if avoiding caching views is that bad.

We do make exceptions for two backends though: duckdb and pyspark. Both of these backends have lazy loading for file inputs (meaning creating the ibis Table doesn't load the data yet), and loading the data can sometimes be expensive. In both cases the output of read_parquet et al. is backed by a TEMP VIEW. For these two backends alone we will also cache temp views to work around this. This means that the pattern of:

t = con.read_parquet("test.parquet")  # this doesn't load data yet
t2 = t.cache()  # this does load data

will still work for both of these backends. Other backends where read_parquet reads the data immediately (not as a view) will now have .cache() be a no-op (where before it would duplicate the data).
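That per-backend exception can be sketched as an overridable hook. The hook name `_should_cache_physical_table` comes from this PR's diff, but the classes below are illustrative stand-ins, not real ibis types:

```python
# Stand-in sketch of the per-backend exception: a base default that never
# caches physical tables, with a duckdb/pyspark-like backend overriding
# the hook so TEMP VIEWs backing lazy file reads still get cached.

class PhysicalTable: ...
class DatabaseTable(PhysicalTable): ...   # a real stored table
class TempView(PhysicalTable): ...        # lazy view from read_parquet et al.

class BaseBackend:
    def _should_cache_physical_table(self, op) -> bool:
        return False  # default: physical tables are already concrete

class LazyFileBackend(BaseBackend):       # duckdb/pyspark-like behavior
    def _should_cache_physical_table(self, op) -> bool:
        # Temp views haven't loaded their data yet, so caching pays off.
        return isinstance(op, TempView)

assert not BaseBackend()._should_cache_physical_table(DatabaseTable())
assert LazyFileBackend()._should_cache_physical_table(TempView())
assert not LazyFileBackend()._should_cache_physical_table(DatabaseTable())
```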


All that said, I'm not happy with how squishy the definition of "do we cache this expression" is. I think the above is pragmatic and easy to do, but it's also hard to explain since there's so much "it depends" going on. Mostly pushing this up for discussion for now (I haven't added tests yet).

A few options that would make this simpler/more uniform/easier to explain:

  • Always cache anything that isn't backed by a true database TABLE (or a simple column subset of one). This would require more (simple) functions across our backends to check what kind of thing was backing an ops.DatabaseTable, but the uniformity there may be worth it.
  • Always skip caching anything backed by a PhysicalTable. This makes the choice much easier, since it requires no backend introspection. For duckdb/pyspark/others add an optional kwarg to control whether the data is loaded immediately or lazily (feat(io): support way to ensure read_* apis create physical tables in the backend #9931). I know I closed that one in favor of feat: make Table.cache() a no-op for tables that are already concrete in a backend #6195, but now I'm waffling. This moves the decision about how to load data to the caller rather than after the call, which feels better to me (more localized control). Could be a backend-specific kwarg (so not there for all our backends), or something more general. Possible names: cache=True, view=False, lazy=False, idk.
  • Maybe scrap all of this and leave the current behavior. I opened the original issue with some ML use cases in mind where I wanted to make use of .cache() uniformly to ensure I was working on a physical table for efficiency, but now I'm not sure if the complications here are worth it.

Currently leaning towards option 2, but could hear arguments for any of them.

@jcrist (Member, Author) commented Aug 30, 2024

cc @cpcloud and @gforsyth for thoughts on if any of this is sane.

Comment on lines +851 to +854
while isinstance(op, (ops.Project, ops.DropColumns)):
    if isinstance(op, ops.Project) and not all(
        isinstance(v, ops.Field) for v in op.values.values()
    ):
Member commented:

Most of this code is unrelated to the backend and is about some property of an expression.

Can we define this as an attribute on ops.Relation subclasses? Seems like it'd be much less squishy that way.
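For illustration, a pure-Python sketch of what that attribute-based approach could look like. These classes mirror the names discussed here but are NOT the real ops.Relation hierarchy:

```python
# Stand-in sketch: encode cacheability as a class attribute on the
# operation hierarchy, so the decision needs no backend introspection.

class Relation:
    is_concrete = False       # default: the op does work, worth caching

class PhysicalTable(Relation):
    is_concrete = True        # already backed by stored data

class DatabaseTable(PhysicalTable):
    pass

class Filter(Relation):
    pass

def cache(op):
    """No-op for concrete relations; otherwise materialize (stubbed)."""
    if op.is_concrete:
        return op
    return ("cached", op)     # stand-in for actual materialization

t = DatabaseTable()
f = Filter()
assert cache(t) is t          # no-op, matching the examples in the PR body
assert cache(f) == ("cached", f)
```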

Member Author replied:

Meaning the decision is consistent and only based on the operation type (no query in the backend needed)? That's option 2 listed above.

@@ -1589,6 +1589,24 @@ def _get_schema_using_query(self, query: str) -> sch.Schema:
            }
        )

    def _should_cache_physical_table(self, op: ops.PhysicalTable) -> bool:
        if isinstance(op, (ops.DatabaseTable, ops.UnboundTable)):
Member commented:

This feels like a property we should encode on the operation itself instead of leaving it up to the backend.

Is that not possible for some reason?

Member Author replied:

I covered this in the PR body above. I agree that that would be one potentially nicer option (that's bullet 2 above).

Development

Successfully merging this pull request may close these issues.

feat: make Table.cache() a no-op for tables that are already concrete in a backend