DataSetError when trying to save dataframe to ParquetDataSet #1286

meaningfromdata · 2022-02-24T18:21:59Z

I'm trying to save a pandas dataframe to a parquet file in local storage as the output of a node. However, when I run the pipeline I get a DataSetError. The error message seems to point to the keyword arguments in save_args: every keyword I've tried, including "file_scheme", "has_nulls" and "engine" (all shown in the catalog entry below), triggers the DataSetError.

I am following the example for putting parquet files in the catalog.yml from https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.ParquetDataSet.html#kedro.extras.datasets.pandas.ParquetDataSet

Here is what my catalog entry looks like:

cohort:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/cohort.parquet
  load_args:
    engine: pyarrow
    use_nullable_dtypes: True
  save_args:
    file_scheme: hive
    has_nulls: False
    engine: pyarrow

Any help getting this to work would be appreciated. I am using kedro 0.17.6 with Python 3.7.11, pandas 1.3.5 and pyarrow 6.0.1.

meaningfromdata · 2022-02-24T20:45:19Z

Update: when I remove load_args and save_args and specify only the type and filepath, saving to and loading from parquet both work.
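
For reference, the reduced catalog entry described in the update would look roughly like the sketch below. It reuses the dataset name and filepath from the original entry and simply drops both argument blocks; nothing else is changed.

```yaml
# Minimal entry: only type and filepath, no load_args or save_args.
cohort:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/cohort.parquet
```

One possible explanation, not confirmed in this thread: file_scheme and has_nulls are options of the fastparquet writer, so they may simply not be accepted when the data is written with pyarrow. Any save_args added back would need to be keywords that the writer used by this Kedro version actually supports.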
