From 4b8d9b575ec851d6c1ff30a1cebe19bbc5486cc7 Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Tue, 7 Jan 2025 15:51:16 +0000
Subject: [PATCH 1/5] Updated Partitioned dataset lazy saving docs

Signed-off-by: Elena Khaustova
---
 docs/source/data/partitioned_and_incremental_datasets.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md
index 3ac91f83dc..57b0ad7727 100644
--- a/docs/source/data/partitioned_and_incremental_datasets.md
+++ b/docs/source/data/partitioned_and_incremental_datasets.md
@@ -213,7 +213,7 @@ Writing to an existing partition may result in its data being overwritten, if th
 ### Partitioned dataset lazy saving
 `PartitionedDataset` also supports lazy saving, where the partition's data is not materialised until it is time to write.
 
-To use this, simply return `Callable` types in the dictionary:
+To use this, simply wrap your object with `lambda` function in the dictionary before return:
 
 ```python
 from typing import Any, Dict, Callable
@@ -234,6 +234,10 @@ def create_partitions() -> Dict[str, Callable[[], Any]]:
     }
 ```
 
+```{note}
+Other `Callable` types but `lambda` provided will be ignored and processed as is without apllying lazy saving.
+```
+
 ```{note}
 When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
 ```

From b2a432bb786d51e70b70c28f8e7ecf4721042815 Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Tue, 7 Jan 2025 15:52:39 +0000
Subject: [PATCH 2/5] Updated release notes

Signed-off-by: Elena Khaustova
---
 RELEASE.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE.md b/RELEASE.md
index 552fa27f41..4c594ce14e 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -10,6 +10,7 @@
 * Added `node` import to the pipeline template.
 * Update error message when executing kedro run without pipeline.
 * Safeguard hooks when user incorrectly registers a hook class in settings.py.
+* Updated `Partitioned dataset lazy saving` docs page.
 
 ## Breaking changes to the API
 ## Documentation changes

From 0e759709341453fbce31f4d7b6a0e00384e8bca0 Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Tue, 7 Jan 2025 16:06:24 +0000
Subject: [PATCH 3/5] Fixed typo

Signed-off-by: Elena Khaustova
---
 docs/source/data/partitioned_and_incremental_datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md
index 57b0ad7727..d039be07fc 100644
--- a/docs/source/data/partitioned_and_incremental_datasets.md
+++ b/docs/source/data/partitioned_and_incremental_datasets.md
@@ -235,7 +235,7 @@ def create_partitions() -> Dict[str, Callable[[], Any]]:
 ```
 
 ```{note}
-Other `Callable` types but `lambda` provided will be ignored and processed as is without apllying lazy saving.
+Other `Callable` types but `lambda` provided will be ignored and processed as is without applying lazy saving.
 ```
 
 ```{note}

From 000934c062b7ccc539caa6a80db5931d76b23f21 Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Thu, 16 Jan 2025 11:57:46 +0000
Subject: [PATCH 4/5] Updated docs based on new solution

Signed-off-by: Elena Khaustova
---
 .../partitioned_and_incremental_datasets.md | 21 ++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md
index d039be07fc..b948f585cb 100644
--- a/docs/source/data/partitioned_and_incremental_datasets.md
+++ b/docs/source/data/partitioned_and_incremental_datasets.md
@@ -175,6 +175,7 @@ new_partitioned_dataset:
   path: s3://my-bucket-name
   dataset: pandas.CSVDataset
   filename_suffix: ".csv"
+  save_lazily: True
 ```
 
 Here is the node definition:
@@ -213,7 +214,7 @@ Writing to an existing partition may result in its data being overwritten, if th
 ### Partitioned dataset lazy saving
 `PartitionedDataset` also supports lazy saving, where the partition's data is not materialised until it is time to write.
 
-To use this, simply wrap your object with `lambda` function in the dictionary before return:
+To use this, simply return `Callable` types in the dictionary:
 
 ```python
 from typing import Any, Dict, Callable
@@ -235,11 +236,25 @@ def create_partitions() -> Dict[str, Callable[[], Any]]:
 ```
 
 ```{note}
-Other `Callable` types but `lambda` provided will be ignored and processed as is without applying lazy saving.
+When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
 ```
 
 ```{note}
-When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
+Lazy saving is a default behaviour, meaning that if `Callable` type provided the dataset will be written _after_ the `after_node_run` hook.
+```
+
+In some cases, it might be useful to disable such a behaviour, for example, when your object is already `Callable`, like a Tensorflow model, and you do not mean to save it lazily.
+To disable lazy saving set `save_lazily` parameter to `False`:
+
+```yaml
+# conf/base/catalog.yml
+
+new_partitioned_dataset:
+  type: partitions.PartitionedDataset
+  path: s3://my-bucket-name
+  dataset: pandas.CSVDataset
+  filename_suffix: ".csv"
+  save_lazily: False
 ```
 
 ## Incremental datasets

From 4e94ec01a637d18a03ea4fa03946eb6e347573e5 Mon Sep 17 00:00:00 2001
From: Elena Khaustova
Date: Fri, 17 Jan 2025 12:09:26 +0000
Subject: [PATCH 5/5] Applied review comments

Signed-off-by: Elena Khaustova
---
 docs/source/data/partitioned_and_incremental_datasets.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md
index b948f585cb..f5acf16a05 100644
--- a/docs/source/data/partitioned_and_incremental_datasets.md
+++ b/docs/source/data/partitioned_and_incremental_datasets.md
@@ -240,11 +240,11 @@ When using lazy saving, the dataset will be written _after_ the `after_node_run`
 ```
 
 ```{note}
-Lazy saving is a default behaviour, meaning that if `Callable` type provided the dataset will be written _after_ the `after_node_run` hook.
+Lazy saving is the default behaviour, meaning that if a `Callable` type is provided, the dataset will be written _after_ the `after_node_run` hook is executed.
 ```
 
-In some cases, it might be useful to disable such a behaviour, for example, when your object is already `Callable`, like a Tensorflow model, and you do not mean to save it lazily.
-To disable lazy saving set `save_lazily` parameter to `False`:
+In certain cases, it might be useful to disable lazy saving, such as when your object is already a `Callable` (e.g., a TensorFlow model) and you do not intend to save it lazily.
+To disable lazy saving, set the `save_lazily` parameter to `False`:
 
 ```yaml
 # conf/base/catalog.yml
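
For context on the behaviour these patches document, below is a minimal sketch of the two node return styles implied by the `save_lazily` flag. It is not part of the patches or of the Kedro docs themselves; the function names, partition keys, and sample data are illustrative assumptions only.

```python
from typing import Any, Callable, Dict

import pandas as pd


def create_lazy_partitions() -> Dict[str, Callable[[], Any]]:
    """With the default ``save_lazily: True``, each value is a callable that is
    invoked only when its partition is written, after the ``after_node_run`` hook."""
    # Bind ``i`` as a default argument so each lambda captures its own value.
    return {f"part/{i}": (lambda i=i: pd.DataFrame({"data": [i]})) for i in range(3)}


def create_eager_partitions() -> Dict[str, Any]:
    """Plain (non-callable) values are saved as-is. Set ``save_lazily: False`` in
    the catalog entry when the returned objects are themselves callables (for
    example a TensorFlow model) that should be saved rather than invoked."""
    return {f"part/{i}": pd.DataFrame({"data": [i]}) for i in range(3)}
```

Binding `i` as a default argument in the lazy variant avoids the late-binding pitfall where every lambda would otherwise capture the final loop value.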