Update partitioned dataset lazy saving docs #4402

Open · wants to merge 6 commits into main

Changes from 5 commits
1 change: 1 addition & 0 deletions RELEASE.md

@@ -14,6 +14,7 @@
* Safeguard hooks when user incorrectly registers a hook class in settings.py.
* Fixed parsing paths with query and fragment.
* Remove lowercase transformation in regex validation.
* Updated the partitioned dataset lazy saving docs page.

## Breaking changes to the API
## Documentation changes
19 changes: 19 additions & 0 deletions docs/source/data/partitioned_and_incremental_datasets.md

@@ -175,6 +175,7 @@
path: s3://my-bucket-name
dataset: pandas.CSVDataset
filename_suffix: ".csv"
save_lazily: True
```

Here is the node definition:
@@ -238,6 +239,24 @@
When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
```

```{note}
Lazy saving is the default behaviour, meaning that if a `Callable` type is provided, the dataset will be written _after_ the `after_node_run` hook.
```
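
For instance, a node whose output is a dictionary of callables is saved lazily by default. Below is a minimal sketch of such a node (the names here are illustrative, not taken from the docs page):

```python
from typing import Callable

import pandas as pd


def create_partitions() -> dict[str, Callable[[], pd.DataFrame]]:
    """Return one zero-argument callable per partition.

    With lazy saving enabled (the default), `PartitionedDataset` invokes
    each callable only at save time, after the `after_node_run` hook.
    """
    return {
        f"part-{i}": (lambda i=i: pd.DataFrame({"value": [i]}))
        for i in range(3)
    }
```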

In some cases, it might be useful to disable this behaviour, for example, when your object is already a `Callable`, such as a TensorFlow model, and you do not intend to save it lazily.

To disable lazy saving, set the `save_lazily` parameter to `False`:

```yaml
# conf/base/catalog.yml

new_partitioned_dataset:
type: partitions.PartitionedDataset
path: s3://my-bucket-name
dataset: pandas.CSVDataset
filename_suffix: ".csv"
save_lazily: False
```
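
With lazy saving disabled, each returned object is handed to the underlying dataset's save as-is, rather than being called to produce the data. A minimal sketch, using an illustrative stand-in class in place of a real model:

```python
from typing import Any


class CallableModel:
    """Illustrative stand-in for an object that is itself callable,
    such as a TensorFlow model."""

    def __call__(self, x: float) -> float:
        return 2.0 * x


def train_models() -> dict[str, Any]:
    # With `save_lazily: False`, `PartitionedDataset` saves each model
    # object directly instead of invoking it to obtain the data to write.
    return {"model-a": CallableModel(), "model-b": CallableModel()}
```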

## Incremental datasets

{class}`IncrementalDataset<kedro-datasets:kedro_datasets.partitions.IncrementalDataset>` is a subclass of `PartitionedDataset` that stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case where partitions have to be processed incrementally, that is, each subsequent pipeline run processes only the partitions that previous runs have not yet processed.
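
As a rough sketch of the consuming side (assuming a catalog entry of type `partitions.IncrementalDataset`; the node name is illustrative), a node receives only the partitions added since the stored checkpoint, loaded eagerly:

```python
import pandas as pd


def process_new_partitions(partitions: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Combine only the partitions not yet covered by the checkpoint.

    The input dictionary may be empty when no new partitions exist;
    confirming the dataset afterwards advances the checkpoint.
    """
    if not partitions:
        return pd.DataFrame()
    return pd.concat(partitions.values(), ignore_index=True)
```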