-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(datasets): add dataset to load/save with Ibis #560
Conversation
fd523cb
to
a52246b
Compare
🔥 |
c76fd22
to
98db223
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
6f05651
to
0c0bc6b
Compare
Before I start reviewing: I'm fine pushing this to |
I would go further @astrojuanlu and say this should be in a |
but also thank you for the push this is great! |
@@ -49,6 +49,28 @@ huggingface-hfdataset = ["datasets", "huggingface_hub"] | |||
huggingface-hftransformerpipelinedataset = ["transformers"] | |||
huggingface = ["kedro-datasets[huggingface-hfdataset,huggingface-hftransformerpipelinedataset]"] | |||
|
|||
ibis-bigquery = ["ibis-framework[bigquery]"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could do a CI script to keep this bit in sync
@astrojuanlu Intuitively, what would domain-specific subpackages look like?
For both @astrojuanlu and @datajoely (and anybody else on the TSC): If we do go down the route of |
I think |
|
||
.. code-block:: pycon | ||
|
||
>>> from kedro_datasets.ibis import TableDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add an local example here with some dummy dataframe / duckdb? I know there is a blog post, but just looking at the docs/docstring I think it's not that easy to get how an user should use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd have both a yaml
and python
example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just fyi, when connecting to ibis there are two major "paradigms", either a filebased connection or a db connection. From my limited knowledge I think the configuration arguments are quite different.
For reference, I am currently connecting to mssql dbs like this:
conn_table = TableDataset(
connection={
"host": "xxx.sql.azuresynapse.net",
"database": "xxx",
"query": {"driver": "ODBC Driver 17 for SQL Server"},
"backend": "mssql",
},
table_name="xxx",
load_args={"schema": "xxx"},
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the documentation here should probably clarify what the different arguments map to in the ibis connection object.
in the case of a filepath
being provided, it will map to the backend method read_{file_format}
, e.g. https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_parquet
otherwise it will get the table from ibis.{backend}.Backend.table
where the Backend object is obtained through ibis.{backend}.connect
e.g. https://ibis-project.org/backends/mssql
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot to add an example, it seems. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the documentation here should probably clarify what the different arguments map to in the ibis connection object.
Added much more detail on this!
reader = getattr(self.connection, f"read_{self._file_format}") | ||
return reader(self._filepath, self._table_name, **self._load_args) | ||
else: | ||
return self.connection.table(self._table_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently when loading tables thru the mssql
backend I am doing conn.table(table_name="XXX", schema="A").
an easy solution would be to add the load args into the conn.table
call and specify the schema as a load_arg
, but imo this is suboptimal, as the write and save schema should usually (in my experience, my assertion could be wrong) be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gforsyth not sure if you have any insight here wrt the upcoming changes in ibis around schema and db hierarchy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so starting in 9.0, schema
will be deprecated as a kwarg. We will be using the word "database" to refer to a collection of tables and the word "catalog" to refer to a collection of "database".
For specifying only the database
, you would do:
conn.table(table_name="XX", database="A")
if the database is located in a catalog (like, say dbo
or sys
) you would do either of:
conn.table(table_name="XX", database="dbo.A")
or
conn.table(table_name="XX", database=("dbo", "A"))
(schema
will still work in 9.0, it will just raise a FutureWarning)
@property | ||
def connection(self) -> BaseBackend: | ||
cls = type(self) | ||
key = tuple(sorted(self._connection_config.items())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get an error unhashable type: 'dict'
when using this connection method
TableDataset(
connection={
"host": "xxx.sql.azuresynapse.net",
"database": "xxx",
"query": {"driver": "ODBC Driver 17 for SQL Server"},
"backend": "mssql",
},
table_name="xxx",
load_args={"schema": "xxx"},
)
manually dropping the query
kwarg before doing the tuple
call works as a temporary workaround to get a working connection, but is obviously not ideal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inigohidalgo Good catch! I've done some digging, and opted to use json.dumps(self._connection_config, sort_keys=True)
instead. I think this should handle the case you brought up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, that doesn't work, for non- serialization keys...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fixed and added some test cases!
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
c085312
to
bbec217
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
@noklam I've updated it to use our canonical |
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
This reverts commit 1fcb01a. Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried to use the dataset yet, but the code looks all good to me! Thanks for adding the detailed examples and doc-strings, they're great! ⭐
connection: Configuration for connecting to an Ibis backend. | ||
load_args: Additional arguments passed to the Ibis backend's | ||
`read_{file_format}` method. | ||
save_args: Additional arguments passed to the Ibis backend's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any docs we can link to that specify which values a user can supply to materialized
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of anywhere it's documented centrally (i.e. not on a backend-specific basis) in Ibis. @lostmygithubaccount any chance you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not, this is on my TODO list to add API docs for alongside the other read_*
and to_*
methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! @merelcht if it's OK then, we'll improve the documentation on the Ibis side, and then whenever that happens, we can make the dataset docs reference that.
Signed-off-by: Elena Khaustova <[email protected]>
So happy to see this merged 🔥 let's make some noise! |
* feat(datasets): add dataset to load/save with Ibis Signed-off-by: Deepyaman Datta <[email protected]> * build(datasets): fix typos in definition of extras Signed-off-by: Deepyaman Datta <[email protected]> * build(datasets): include Ibis backend requirements Signed-off-by: Deepyaman Datta <[email protected]> * test(datasets): implement save and reload for Ibis Signed-off-by: Deepyaman Datta <[email protected]> * test(datasets): check `ibis.TableDataset.exists()` Signed-off-by: Deepyaman Datta <[email protected]> * test(datasets): test extra load and save args work Signed-off-by: Deepyaman Datta <[email protected]> * test(datasets): verify config and materializations Signed-off-by: Deepyaman Datta <[email protected]> --------- Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: tgoelles <[email protected]>
Description
Officially add Ibis
TableDataset
introduced in https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis.Development notes
Checklist
RELEASE.md
file