Update partitioned dataset lazy saving docs #4402

Merged 7 commits, Jan 22, 2025

1 change: 1 addition & 0 deletions RELEASE.md
@@ -14,6 +14,7 @@
* Safeguard hooks when user incorrectly registers a hook class in settings.py.
* Fixed parsing paths with query and fragment.
* Remove lowercase transformation in regex validation.
* Updated `Partitioned dataset lazy saving` docs page.

## Breaking changes to the API
## Documentation changes
19 changes: 19 additions & 0 deletions docs/source/data/partitioned_and_incremental_datasets.md
@@ -175,6 +175,7 @@
  path: s3://my-bucket-name
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"
  save_lazily: True
```

Here is the node definition:
@@ -238,6 +239,24 @@
When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
```

```{note}
Lazy saving is the default behaviour, meaning that if a `Callable` type is provided, the dataset will be written _after_ the `after_node_run` hook is executed.
```

In certain cases, it might be useful to disable lazy saving, such as when your object is already a `Callable` (e.g., a TensorFlow model) and you do not intend to save it lazily.

vale annotations (GitHub Actions / vale) on docs/source/data/partitioned_and_incremental_datasets.md, line 246:

* INFO, column 1: [Kedro.sentencelength] Try to keep your sentence length to 30 words or fewer.
* WARNING, column 112: [Kedro.abbreviations] Use 'for example' instead of abbreviations like 'e.g.,'.
* WARNING, column 171: [Kedro.weaselwords] 'lazily' is a weasel word!

To disable lazy saving, set the `save_lazily` parameter to `False`:

```yaml
# conf/base/catalog.yml

new_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"
  save_lazily: False
```
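
As a rough illustration of what the `save_lazily` switch changes, the following sketch mimics the decision in plain Python. It is not Kedro's implementation: `resolve_partition` is a hypothetical helper introduced only for this example, and plain dictionaries stand in for the DataFrames that a `pandas.CSVDataset` would normally receive.

```python
from typing import Any, Callable, Dict


def create_partitions() -> Dict[str, Callable[[], Any]]:
    """Node output: partition ids mapped to zero-argument callables."""
    return {
        "part/foo": lambda: {"data": [1, 2]},
        "part/bar": lambda: {"data": [3, 4]},
    }


def resolve_partition(data: Any, save_lazily: bool = True) -> Any:
    """Hypothetical helper: pick the object handed to the underlying dataset.

    With lazy saving enabled, a callable is invoked and its result is saved.
    With lazy saving disabled, the object is passed through unchanged, which
    is what you want when the callable itself is the thing to persist.
    """
    if save_lazily and callable(data):
        return data()
    return data


partitions = create_partitions()

# save_lazily: True -> each callable is evaluated just before saving.
lazy = {pid: resolve_partition(obj, save_lazily=True) for pid, obj in partitions.items()}

# save_lazily: False -> the callables are treated as the data to save.
eager = {pid: resolve_partition(obj, save_lazily=False) for pid, obj in partitions.items()}
```

In this sketch `lazy` holds the evaluated dictionaries, while `eager` still holds the original callables. The second case corresponds to an object such as a model that happens to be callable but should be saved as-is.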

## Incremental datasets

{class}`IncrementalDataset<kedro-datasets:kedro_datasets.partitions.IncrementalDataset>` is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, that is, each subsequent pipeline run should process just the partitions which were not processed by the previous runs.