-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
leverage ibis expression for getting readablerelations #2046
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Another note to self, we probably need to run columns names through the normalizer. Or we assume the user will use normalized names as they are present in the schema when building these expressions. |
df = items_table.df() | ||
assert len(df.index) == total_records | ||
|
||
df = double_items_table.df() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regular dlt dataset execution methods (df, arrow, iter_arrow...) work everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really cool! IMO we should keep ibis as optional dependency (it only works with python 3.10+). so we have two options:
- separate relation
- enable the proxy behavior if ibis is found, if not we fallback to the current behavior
i'd probably go for the second option. I'm just a little bit worried about the typing
in both cases we should implement a few common expressions we already have in our existing relation (limit, head, column selection).
regarding the schema: column lineage you can do with sqlglot. it makes sense to invest a little bit of time to understand how it is done:
- https://github.com/tobymao/sqlglot/blob/main/posts/ast_primer.md (btw. it seems that we are not finding tables in expression in a correct way)
- https://sqlglot.com/sqlglot/lineage.html
we can add sqlglot as a regular dependency. and use it everywhere we have sql SELECT statements.
b636df4
to
b9cf262
Compare
390f9a0
to
936db9c
Compare
…transpiling sql via postgres in some cases and selecting the right dialect in others
7762611
to
b6850e8
Compare
ff330b6
to
0eb6f58
Compare
|
||
def select(self, *columns: str) -> "ReadableIbisRelation": | ||
"""set which columns will be selected""" | ||
return self._proxy_expression_method("select", *columns) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the other ibis methods are not defined here yet. we'd need to add them to the interface and raise an error if they are called when the regular relation is returned I think. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need to type dlt.dataset
and pipeline.dataset
properly (see my other comment). and create a Protocol for ibis expressions to type this class
1a8b80e
to
acd329f
Compare
@rudolfix the only thing still missing here is proper typing for this ibis wrapper. The "brute force" way of doing it would be to define all the methods of the ibis expression we can make use of on the SupportsReadableRelation Interface, forward those calls in the ibirelation like we do with limit etc, and raise on the regular relation. Alternatively I could do some kind of wildcard typing to make the linter shut up, but it would be less strict. Maybe you can give me your opinion, if you don't mind, then I'll just decide on one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very cool! tldr;>
- we can track schema changes in 90% of cases easily (or at least as good as in upstream object)
- we can do better with typing but maybe in separate PR
- tests need to be better
- can we do without dynamic installation for ibis?
@@ -119,3 +137,37 @@ def create_ibis_backend( | |||
con = ibis.duckdb.from_connection(duck) | |||
|
|||
return con | |||
|
|||
|
|||
def create_unbound_ibis_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move ibis
module to helpers? you are importing modules that are not in common. so we really need to refer to ibis in common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that (see the commit) but there is one place in common where the IbisBackend typing is imported for the readable relation.
"dlt.destinations.filesystem": "duckdb", # works | ||
"dlt.destinations.dremio": "presto", # works | ||
# NOTE: can we discover the current dialect in sqlalchemy? | ||
"dlt.destinations.sqlalchemy": "mysql", # may work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by running configure()
on factory with partial
option. OFC we need the dialect part somewhere configured ie in partial connection string (without password)
|
||
def select(self, *columns: str) -> "ReadableIbisRelation": | ||
"""set which columns will be selected""" | ||
return self._proxy_expression_method("select", *columns) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we just need to type dlt.dataset
and pipeline.dataset
properly (see my other comment). and create a Protocol for ibis expressions to type this class
|
||
|
||
# TODO: provide ibis expression typing for the readable relation | ||
class ReadableIbisRelation(BaseReadableDBAPIRelation): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any protocol or base class in ibis
that we can add as a base to get correct typing? otherwise we'd need to add all the methods ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there is nothing to mix in, I had already looked at many parts of the ibis definition. For now I have added a brute force typing so the warnings go away...
from dlt.destinations.dataset.dataset import ReadableDBAPIDataset | ||
|
||
|
||
def dataset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want to improve typing in this PR? if so we could
- make
SupportsReadableDataset
generic where it is parametrized by relation type (bound to SupportsReadableRelation) - use overload on the literal
TDatasetType
when in auto/ibis you return SupportsReadableDataset parametrized by ReadableIbisRelation, otherwise by just SupportsReadableRelation - we can make (2) dynamic dependent if ibis is present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you maybe explain how (3) works? I don't understand how mypy can lint something else if a certain library is installed..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway as mentioned above, I have now brute forced typing, but i'm happy to improve this if I understand how to do (3)
|
||
def __getitem__(self, columns: Union[str, Sequence[str]]) -> "ReadableIbisRelation": | ||
# casefold column-names | ||
columns = [self.sql_client.capabilities.casefold_identifier(col) for col in columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is column getter right? so you will receive ibis Column
object(s) right? then maybe select them from the schema like we do in regular relation?
this will cover many cases already at make ibis backward compatible with base expression
def compute_columns_schema(self) -> TTableSchemaColumns: | ||
"""provide schema columns for the cursor, may be filtered by selected columns""" | ||
# TODO: provide column lineage tracing with sqlglot lineage | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could try some simple implementation:
- we preserve table schema until first expression is made
- we preserve table schema for
head
limit
andselect
method - maybe we also recognize a few exceptions that do not modify schema ie
filter
and free form filter expressions as here: https://ibis-project.org/tutorials/ibis-for-pandas-users#filtering-rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean with "free form filter"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order_by and limit also are included, I have done this now.
if attr is None: | ||
raise AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'") | ||
|
||
if not callable(attr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ibis supports a column getter with a dot notation. my take would be to
- start supporting it upstream
- modify the schema accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean, the column getter dot notation is already working, see the tests, I'm using it there all the time for filtering and joining etc.
tests/load/test_read_interfaces.py
Outdated
# ensure ibis is installed for these tests | ||
import subprocess | ||
|
||
subprocess.check_call( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is hardcore. arent we able to use dependency group for ibis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I basically want to avoid running some subset of tests only with ibis, but I'll just add a group for ibis and make it available for all destination tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
assert len(df.index) == 5 | ||
|
||
# check chained expression with join, column selection, order by and limit | ||
joined_table = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you do not need to test all the expressions but you should test various expression forms:
- those that return Table (like here)
- those that select a column
- a single column getter with . notation
- a free form filters:
expr = penguins.bill_length_mm > 37.0
- materializations like their dataframe and arrow getters
- there are also expressions that add and remove columns to schema
- expressions that return Expr (not Table - those do some weird things, I'm not sure SQL can be generated for them)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added quite a few tests from the "ibis for sql users" page in the ibis docs. changed a few things in the ibisrelation implementation to make this work. I think most cases should actually work now. Let me know if you think something is missing there or you have an idea what to test next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ammended it a bit more, I think it should be quite good now.
# Conflicts: # .github/workflows/test_destinations.yml # .github/workflows/test_local_destinations.yml # dlt/destinations/dataset.py # poetry.lock # tests/load/test_read_interfaces.py
b10bd64
to
e50b5b5
Compare
6a132d4
to
3886638
Compare
a202fa8
to
289e289
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! we'll solve ibis typing somehow later
* add ibis dataset in own class for now * make error clearer * fix some linting and fix broken test * make most destinations work with selecting the right db and catalog, transpiling sql via postgres in some cases and selecting the right dialect in others * add missing motherduck and sqlalchemy mappings * casefold identifiers for ibis wrapper calss * re-organize existing dataset code to prepare ibis relation integration * integrate ibis relation into existing code * re-order tests * fall back to default dataset if table not in schema * make dataset type selectable * add dataset type selection test and fix bug in tests * update docs for ibis expressions use * ensure a bunch of ibis operations continue working * add some more tests and typings * fix typing (with brute force get_attr typing..) * move ibis to dependency group * move ibis stuff to helpers * post devel merge, put in change from dataset, update lockfile * add ibis to sqlalchemy tests * improve docs a bit * fix ibis dep group * fix dataset snippets * fix ibis version * add support for column schema in certion query cases --------- Co-authored-by: Marcin Rudolf <[email protected]>
* add ibis dataset in own class for now * make error clearer * fix some linting and fix broken test * make most destinations work with selecting the right db and catalog, transpiling sql via postgres in some cases and selecting the right dialect in others * add missing motherduck and sqlalchemy mappings * casefold identifiers for ibis wrapper calss * re-organize existing dataset code to prepare ibis relation integration * integrate ibis relation into existing code * re-order tests * fall back to default dataset if table not in schema * make dataset type selectable * add dataset type selection test and fix bug in tests * update docs for ibis expressions use * ensure a bunch of ibis operations continue working * add some more tests and typings * fix typing (with brute force get_attr typing..) * move ibis to dependency group * move ibis stuff to helpers * post devel merge, put in change from dataset, update lockfile * add ibis to sqlalchemy tests * improve docs a bit * fix ibis dep group * fix dataset snippets * fix ibis version * add support for column schema in certion query cases --------- Co-authored-by: Marcin Rudolf <[email protected]>
Description
Implements support for using ibis expressions to generate sql statements for relations. To this end this PR implements a new type of readable relation which gets used that wraps an ibis unboundtable expression, but still accesses data the old way.
Todos: