Description
In ETL pipelines, loading transformed data into various data warehouses is a critical requirement. Currently, the ibis.TableDataset connector in Kedro does not support data insertion into Ibis backends.
Context
Why is this change important to me?
We are developing ETL pipelines in our organization, and inserting records into data warehouses is an essential requirement. At present, without support for data insertion, we must bypass the Kedro DataCatalog and rely on external ORM tools, such as SQLAlchemy or dataset, to handle native data storage operations.
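To illustrate the workaround described above, here is a minimal sketch of a node that performs its own insert instead of letting the DataCatalog handle the save. The function name, table, and schema are hypothetical, and the standard-library sqlite3 connection stands in for a real warehouse connection managed by an ORM such as SQLAlchemy:

```python
import sqlite3


def load_to_warehouse(records: list[tuple[str, int]]) -> int:
    """Hypothetical Kedro node that bypasses the DataCatalog and
    writes records itself -- the pattern this issue wants to avoid."""
    # Stands in for a warehouse connection held outside the catalog.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount INTEGER)")
    # The node, not Kedro, owns the insert logic.
    con.executemany("INSERT INTO sales VALUES (?, ?)", records)
    con.commit()
    inserted = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    con.close()
    return inserted
```

Mixing this kind of I/O into node bodies is exactly what makes the pipelines harder to test and reason about.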
How would I use it?
Supporting data insertion in ibis.TableDataset would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.
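As a rough sketch of how this might surface to users, a catalog entry could opt into insertion via a save argument. Note that the mode option below does not exist today; it is purely a hypothetical illustration of the feature being requested, alongside the existing connection and table_name options of ibis.TableDataset:

```yaml
# Hypothetical catalog.yml entry -- `mode: append` is NOT a real
# save_arg yet; it sketches what insertion support could look like.
sales_table:
  type: ibis.TableDataset
  table_name: sales
  connection:
    backend: duckdb
    database: warehouse.db
  save_args:
    mode: append
```

With something like this, a node could simply return a table and Kedro would perform the insert on save, keeping all I/O in the catalog.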
How can it benefit other users?
By enabling this feature, users could avoid writing custom loading logic, thereby keeping their pipelines cleaner and more efficient. This would enhance Kedro's usability in scenarios where heavy I/O operations are involved, particularly for teams working with data warehouses or similar storage backends.
Sounds good! I'm going to assign you, since you've expressed interest in contributing to Kedro, and I think this is a great starting point. Happy to help provide guidance (and I think anybody on the Kedro team can also help answer questions, as this should be fairly standard to add).
Please feel free to further discuss how you want to implement it here, or raise a PR with an initial stab that we can discuss—whatever works best for you!