Commit

add factory docs
Signed-off-by: Nok <[email protected]>
noklam committed Nov 8, 2024
1 parent 6e3e4d1 commit 52840e2
Showing 2 changed files with 42 additions and 1 deletion.
1 change: 1 addition & 0 deletions RELEASE.md
@@ -14,6 +14,7 @@
## Documentation changes
* Updated CLI autocompletion docs with new Click syntax.
* Standardised `.parquet` suffix in docs and tests.
* Added an example to explain how dataset factories work.

## Community contributions
* [Hyewon Choi](https://github.com/hyew0nChoi)
42 changes: 41 additions & 1 deletion docs/source/data/kedro_dataset_factories.md
@@ -1,7 +1,47 @@
# Kedro dataset factories
You can load multiple datasets with similar configuration using dataset factories, introduced in Kedro `0.18.12`.

Dataset factories introduce a syntax that lets you generalise your configuration. The syntax reduces the number of similar catalog entries by matching the datasets used in your project's pipelines to dataset factory patterns.

For example:
```yaml
factory_data:
  type: pandas.CSVDataset
  filepath: data/01_raw/factory_data.csv
```
With a dataset factory, it can be rewritten as:
```yaml
"{placeholder}_data":
  type: pandas.CSVDataset
  filepath: data/01_raw/{placeholder}_data.csv
```
At runtime, the pattern is matched against the names of the datasets that the nodes use.
```python
...
node(
    func=process_factory,
    inputs="factory_data",
    outputs=None,
),
...
```
The matching works like a **regular expression**, or an `f-string` in reverse. In this case, the dataset name `factory_data` matches the pattern `{placeholder}_data` because it ends with the `_data` suffix, so `placeholder` resolves to `factory`.
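
The "`f-string` in reverse" idea can be sketched with the standard library alone. This is an illustration only, not Kedro's own implementation: the helper name `match_pattern` is hypothetical, and it assumes every placeholder is a plain `{name}` field that appears once in the pattern.

```python
import re
from typing import Optional


def match_pattern(pattern: str, dataset_name: str) -> Optional[dict]:
    """Return the placeholder values if dataset_name fits pattern, else None.

    Hypothetical helper for illustration; not part of the Kedro API.
    """
    # Splitting on a capturing group alternates literal text (even indices)
    # and "{name}" fields (odd indices).
    parts = re.split(r"(\{\w+\})", pattern)
    regex = "".join(
        f"(?P<{part[1:-1]}>.+)" if index % 2 == 1 else re.escape(part)
        for index, part in enumerate(parts)
    )
    found = re.fullmatch(regex, dataset_name)
    return found.groupdict() if found else None


print(match_pattern("{placeholder}_data", "factory_data"))  # {'placeholder': 'factory'}
print(match_pattern("{placeholder}_data", "model_input"))   # None
```

A name without the `_data` suffix does not match, so no catalog entry would be generated for it from this pattern.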

Similarly, if you update the name of the inputs:
```diff
- inputs="factory_data",
+ inputs="transaction_data",
```

It will be resolved as:
```yaml
transaction_data:
  type: pandas.CSVDataset
  filepath: data/01_raw/transaction_data.csv
```
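Once the placeholder values are known, producing the resolved entry amounts to filling them back into the template. A minimal sketch of that step, assuming a template whose `filepath` is `data/01_raw/{placeholder}_data.csv`; the helper name `resolve_entry` is hypothetical and this is not how Kedro is necessarily implemented:

```python
def resolve_entry(template: dict, values: dict) -> dict:
    """Fill each string value of a catalog entry template with placeholder values.

    Hypothetical helper for illustration; not part of the Kedro API.
    """
    return {
        key: value.format(**values) if isinstance(value, str) else value
        for key, value in template.items()
    }


# Assumed template, mirroring a factory pattern entry as a Python dict.
template = {
    "type": "pandas.CSVDataset",
    "filepath": "data/01_raw/{placeholder}_data.csv",
}

entry = resolve_entry(template, {"placeholder": "transaction"})
print(entry["filepath"])  # data/01_raw/transaction_data.csv
```

Values without any placeholder, such as `type`, pass through unchanged.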
```{warning}
Datasets are not included in the core Kedro package from Kedro version **`0.19.0`**. Import them from the [`kedro-datasets`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets) package instead.