Replies: 1 comment 4 replies
-
So I think you're on the right track - I'd recommend that you don't manipulate your catalog at runtime. You can do this via a hook, but it's funky. Could this be achieved with a runtime parameter that dynamically filters your data? You could do something like this:
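A minimal sketch of that idea (the partition-key layout, parameter names, and function name here are illustrative assumptions, not Kedro APIs): a `PartitionedDataSet` loads as a dict mapping partition keys to load functions, so a node can filter that dict by a date range passed in as run parameters.

```python
from datetime import datetime


def filter_partitions(partitions: dict, params: dict) -> dict:
    """Keep only partitions whose year/month/day/hour path falls in [start, end].

    `partitions` is what a PartitionedDataSet loads: a dict mapping partition
    keys (relative paths, assumed here to look like "2021/03/05/14/events.csv")
    to load functions. `params` is assumed to carry ISO-format "start"/"end".
    """
    start = datetime.fromisoformat(params["start"])  # e.g. "2021-03-01T00:00:00"
    end = datetime.fromisoformat(params["end"])

    selected = {}
    for key, load_func in partitions.items():
        # Assumption: the first four path segments encode year/month/day/hour.
        year, month, day, hour = key.split("/")[:4]
        ts = datetime(int(year), int(month), int(day), int(hour))
        if start <= ts <= end:
            selected[key] = load_func
    return selected
```

You could then wire this up as a node taking the partitioned dataset and `parameters` as inputs, and supply the range at run time with something like `kedro run --params "start=2021-03-01T00:00:00,end=2021-03-31T23:00:00"` - no catalog mutation needed.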
-
Hi,
just starting to pick up Kedro, and I'm super excited after my first test pipelines. It was very easy to fit my existing code into nodes!
But I have searched for a design pattern I could use as a template for my use case, and haven't found a good one.
I have a use case where data is stored in partitions (paths) based on year/month/day/hour, and every hour holds several files.
I would like to build a dataset that includes a selection of multiple years/months/days/hours,
where the "selection" can be different for every run of the pipeline (think: include data from date/time to date/time).
A use case that must exist elsewhere, I think... but I have not found a good reference for a Kedro implementation of how to create a dataset like this.
After reading a bit about PartitionedDataSet, my idea is to create a node that will generate a list of paths/files (based on datetime input, maybe as run params and hooks) and use this as the filepath argument for a PartitionedDataSet.
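For context, a `PartitionedDataSet` pointed at the root of such a layout would be declared in `catalog.yml` roughly like this (dataset name and paths are illustrative; partitions under `path` are discovered recursively, so each `year/month/day/hour/file.csv` becomes one partition key):

```yaml
events_by_hour:
  type: PartitionedDataSet
  path: data/01_raw/events        # root folder containing year/month/day/hour subfolders
  dataset: pandas.CSVDataSet      # underlying dataset used to load each file
  filename_suffix: ".csv"         # only pick up CSV files
```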
But I guess this is a "normal" use case, so there might be some existing solutions that you know about.
Any pointers are gratefully accepted!
Otherwise I "just" need to figure out whether I should update the catalog from a node... and test whether I should update an existing dataset in the catalog with the filepath argument, or create a new dataset in the catalog every time I run the pipeline.
There are so many options :-)