
feat(api): avoid caching physical tables #9976

Open · wants to merge 1 commit into base: main

Conversation

@jcrist (Member) commented Aug 30, 2024

Fixes #6195.

This makes table_expr.cache() a no-op for physical tables and other expressions not worth caching (defined here as simple column subsets of physical tables).

t = con.table("my_table")

t2 = t.cache()  # this is a no-op since t points to an existing table
assert t2.op() is t.op()

subset = t.select("a", "b")
subset2 = subset.cache()  # this is a no-op, since subset is a simple selection of a concrete table
assert subset2.op() is subset.op()

expr = t.filter(t.a > 10)
expr2 = expr.cache()  # this isn't a no-op, since expr does some actual work
assert expr2.op() is not expr.op()
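The "simple column subset of a physical table" rule above can be sketched in pure Python. The classes below are minimal stand-ins for ibis's ops.PhysicalTable, ops.Project, and ops.Field, not the real internals:

```python
# Minimal stand-ins for ibis's ops.PhysicalTable, ops.Project, and
# ops.Field -- NOT the real ibis internals, just enough to show the rule.

class PhysicalTable:          # a table that already exists in the backend
    pass

class Field:                  # a bare column reference
    pass

class Project:                # a selection over a parent relation
    def __init__(self, parent, values):
        self.parent = parent
        self.values = values  # mapping of column name -> value expression

class Filter:                 # stands in for any op that does real work
    def __init__(self, parent):
        self.parent = parent

def is_trivial(op) -> bool:
    """True if op is a physical table or a simple column subset of one."""
    while isinstance(op, Project):
        # A projection with any computed column is worth caching.
        if not all(isinstance(v, Field) for v in op.values.values()):
            return False
        op = op.parent
    return isinstance(op, PhysicalTable)

t = PhysicalTable()
assert is_trivial(t)                           # .cache() would be a no-op
assert is_trivial(Project(t, {"a": Field()}))  # simple subset: also a no-op
assert not is_trivial(Filter(t))               # does real work: cache it
```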

In the original issue (#6195) I wanted to only avoid caching things that were backed by an existing physical table (don't cache tables, do cache views, do cache expressions, ...). In practice writing _is_this_a_real_database_table for every backend would be annoying (maybe less annoying once the .ddl accessor lands). I'm also not sure if avoiding caching views is that bad.

We do make exceptions for two backends though: duckdb and pyspark. Both of these backends have lazy loading for file inputs (meaning creating the ibis Table doesn't load the data yet), and loading the data can sometimes be expensive. In both cases the output of read_parquet et al. is backed by a TEMP VIEW. For these two backends alone we will also cache temp views to work around this. This means that the pattern of:

t = con.read_parquet("test.parquet")  # this doesn't load data yet
t2 = t.cache()  # this does load data

will still work for both of these backends. Other backends where read_parquet reads the data immediately (not as a view) will now have .cache() be a no-op (where before it would duplicate the data).
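That per-backend exception can be sketched as an overridable hook. The hook name `_should_cache_physical_table` comes from this PR's diff, but the classes below are illustrative stand-ins, not real ibis types:

```python
# Stand-in sketch of the per-backend exception: a base default that never
# caches physical tables, with a duckdb/pyspark-like backend overriding
# the hook so TEMP VIEWs backing lazy file reads still get cached.

class PhysicalTable: ...
class DatabaseTable(PhysicalTable): ...   # a real stored table
class TempView(PhysicalTable): ...        # lazy view from read_parquet et al.

class BaseBackend:
    def _should_cache_physical_table(self, op) -> bool:
        return False  # default: physical tables are already concrete

class LazyFileBackend(BaseBackend):       # duckdb/pyspark-like behavior
    def _should_cache_physical_table(self, op) -> bool:
        # Temp views haven't loaded their data yet, so caching pays off.
        return isinstance(op, TempView)

assert not BaseBackend()._should_cache_physical_table(DatabaseTable())
assert LazyFileBackend()._should_cache_physical_table(TempView())
assert not LazyFileBackend()._should_cache_physical_table(DatabaseTable())
```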


All that said, I'm not happy with how squishy the definition of "do we cache this expression" is. I think the above is pragmatic and easy to do, but it's also hard to explain since there's so much "it depends" going on. Mostly pushing this up for discussion for now (I haven't added tests yet).

A few options that would make this simpler/more uniform/easier to explain:

  • Always cache anything that isn't backed by a true database TABLE (or a simple column subset of one). This would require more (simple) functions across our backends to check what kind of thing was backing an ops.DatabaseTable, but the uniformity there may be worth it.
  • Always skip caching anything backed by a PhysicalTable. This makes the choice much easier, since it requires no backend introspection. For duckdb/pyspark/others add an optional kwarg to control whether the data is loaded immediately or lazily (feat(io): support way to ensure read_* apis create physical tables in the backend #9931). I know I closed that one in favor of feat: make Table.cache() a no-op for tables that are already concrete in a backend #6195, but now I'm waffling. This moves the decision about how to load data to the caller rather than after the call, which feels better to me (more localized control). Could be a backend-specific kwarg (so not there for all our backends), or something more general. Possible names: cache=True, view=False, lazy=False, idk.
  • Maybe scrap all of this and leave the current behavior. I opened the original issue with some ML use cases in mind where I wanted to make use of .cache() uniformly to ensure I was working on a physical table for efficiency, but now I'm not sure if the complications here are worth it.

Currently leaning towards option 2, but could hear arguments for any of them.

@jcrist (Member, Author) commented Aug 30, 2024

cc @cpcloud and @gforsyth for thoughts on if any of this is sane.

Comment on lines +851 to +854
while isinstance(op, (ops.Project, ops.DropColumns)):
    if isinstance(op, ops.Project) and not all(
        isinstance(v, ops.Field) for v in op.values.values()
    ):
Member commented:

Most of this code is unrelated to the backend and is about some property of an expression.

Can we define this as an attribute on ops.Relation subclasses? Seems like it'd be much less squishy that way.
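For illustration, a pure-Python sketch of what that attribute-based approach could look like. These classes mirror the names discussed here but are NOT the real ops.Relation hierarchy:

```python
# Stand-in sketch: encode cacheability as a class attribute on the
# operation hierarchy, so the decision needs no backend introspection.

class Relation:
    is_concrete = False       # default: the op does work, worth caching

class PhysicalTable(Relation):
    is_concrete = True        # already backed by stored data

class DatabaseTable(PhysicalTable):
    pass

class Filter(Relation):
    pass

def cache(op):
    """No-op for concrete relations; otherwise materialize (stubbed)."""
    if op.is_concrete:
        return op
    return ("cached", op)     # stand-in for actual materialization

t = DatabaseTable()
f = Filter()
assert cache(t) is t          # no-op, matching the examples in the PR body
assert cache(f) == ("cached", f)
```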

Member Author replied:

Meaning the decision is consistent and only based on the operation type (no query in the backend needed)? That's option 2 listed above.

@@ -1589,6 +1589,24 @@ def _get_schema_using_query(self, query: str) -> sch.Schema:
            }
        )

    def _should_cache_physical_table(self, op: ops.PhysicalTable) -> bool:
        if isinstance(op, (ops.DatabaseTable, ops.UnboundTable)):
Member commented:

This feels like a property we should encode on the operation itself instead of leaving it up to the backend.

Is that not possible for some reason?

Member Author replied:

I covered this in the PR body above. I agree that that would be one potentially nicer option (that's bullet 2 above).

Development

Successfully merging this pull request may close these issues.

feat: make Table.cache() a no-op for tables that are already concrete in a backend