[feature] allow setting a default of execution caching disabled via a compiler CLI flag and env var #11092

Open
DharmitD opened this issue Aug 12, 2024 · 4 comments · May be fixed by #11222

@DharmitD
Contributor

DharmitD commented Aug 12, 2024

Feature Area

/area backend
/area sdk

What feature would you like to see?

Kubeflow Pipelines has a caching feature that allows users to avoid re-running pipeline components (steps in the pipeline) if the system detects that such a component has previously run and its outputs (artifacts) could be reused. The goal is to save time and computation.

The KFP compiler currently enables caching on every Component/Task unless the pipeline author calls

task.set_caching_options(False)

In other words:

from kfp import dsl

# create_dataset is assumed to be a component defined elsewhere.
@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_1.set_caching_options(False)
   # task 1 won't use the cache, but task 2 will,
   # even though the author didn't specify anything about task 2!

Caching disabled is a much more reasonable default.

DSL Example

Caching is controlled on each individual pipeline Component / Task.
Here is example KFP DSL code that disables caching for a single task:

@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   create_dataset_task = create_dataset()
   create_dataset_task.set_caching_options(False)
   # this task won't use the cache

Today, the KFP compiler enables caching on every Component/Task unless the pipeline author calls task.set_caching_options(False).

In other words:

@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_1.set_caching_options(False)
   # task 1 won’t try to use the cache, but task 2 will ...
   # even though the author didn’t specify anything about task 2!

When we are done with this feature, this will be true:

@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_3 = create_dataset()
   task_3.set_caching_options(True)
   # tasks 1 and 2 don’t try to use the cache. Task 3 does try to use the cache.
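
The difference between the two behaviors comes down to how an unset per-task option is resolved. Here is a minimal sketch of that rule using a toy stand-in for a task. ToyTask and effective_caching are hypothetical names for illustration only, not part of the KFP SDK, and the real compiler may resolve this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTask:
    """Stand-in for a pipeline task; this is NOT a KFP class."""
    name: str
    # None means the author never called set_caching_options().
    caching: Optional[bool] = None

    def set_caching_options(self, enable: bool) -> None:
        self.caching = enable


def effective_caching(task: ToyTask, default_enabled: bool) -> bool:
    """An explicit per-task choice wins; otherwise fall back to the default."""
    return default_enabled if task.caching is None else task.caching


tasks = [ToyTask("task_1"), ToyTask("task_2"), ToyTask("task_3")]
tasks[2].set_caching_options(True)

# With the default flipped to "disabled", only task_3 uses the cache.
proposed = {t.name: effective_caching(t, default_enabled=False) for t in tasks}

# With today's default of "enabled", every task uses the cache.
current = {t.name: effective_caching(t, default_enabled=True) for t in tasks}
```

The key design point is the three-state option (unset, True, False): the default only applies to tasks whose author expressed no preference.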

What is the use case or pain point?

We need to change the KFP compiler so that it stops enabling caching by default (in effect, calling task.set_caching_options(True) on every task) when the user didn't ask for it. As described above, the result is that every task tries to use the cache by default, even though caching is disabled by default in the backend.

This might be a significant change, so we would like to discuss it with the KFP community and reach consensus before making any changes. A related issue is #10839.



@DharmitD
Contributor Author

/assign @DharmitD

@boarder7395
Contributor

boarder7395 commented Aug 14, 2024

I see the pain here, but my org expects caching to be the default, and requiring every component in a pipeline to enable it would be just as much of a pain as disabling it for each component. Alternative suggestion: allow the default to be set at the pipeline level?

@dsl.pipeline(name='iris-training-pipeline', caching=False)
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_3 = create_dataset()
   task_3.set_caching_options(True)
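
One way to read this suggestion is as a layered lookup: the per-task setting, then the pipeline-level default, then the compiler's built-in default. The sketch below assumes that precedence order. resolve_caching and its parameters are hypothetical and are not an existing KFP API, and the caching= argument on @dsl.pipeline above is itself only a proposal.

```python
from typing import Optional


def resolve_caching(task_setting: Optional[bool],
                    pipeline_default: Optional[bool],
                    compiler_default: bool) -> bool:
    """Return the first explicit choice, from most to least specific."""
    for choice in (task_setting, pipeline_default):
        if choice is not None:
            return choice
    return compiler_default


# Pipeline declared with caching=False; only task_3 opts back in.
task_1 = resolve_caching(None, pipeline_default=False, compiler_default=True)
task_3 = resolve_caching(True, pipeline_default=False, compiler_default=True)

# A pipeline that sets nothing keeps the compiler's default.
untouched = resolve_caching(None, pipeline_default=None, compiler_default=True)
```

Under this ordering, organizations that rely on caching keep their current behavior unless a pipeline author opts out.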

@gregsheremeta
Contributor

Alternative suggestion: allow the default to be set at the pipeline level?

That's a good suggestion, and I think some day we'll get to implementing that. Ref: #10839

my org expects caching to be the default and requiring every component in a pipeline to enable it would be just as much of a pain as disabling it for each component

Yep, we brought this issue up at the August 14, 2024 KFP Community Meeting (agenda, recording), and that was the consensus there too. I suggested an additive change whereby a CLI flag or env var could set the default to disabled, and the meeting attendees were in favor of that. Hence #11142.
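
As a rough illustration of the additive approach, here is a sketch of how a compiler entry point might compute its default from a flag and an environment variable. The flag name --disable-caching-by-default and the variable EXAMPLE_DISABLE_CACHING_BY_DEFAULT are placeholders invented for this example, not the names used by KFP, and the precedence shown (flag, then env var, then the existing default of enabled) is an assumption.

```python
import argparse
from typing import Mapping, Sequence

# Hypothetical names, chosen only for this sketch.
ENV_VAR = "EXAMPLE_DISABLE_CACHING_BY_DEFAULT"
FLAG = "--disable-caching-by-default"
_TRUTHY = {"1", "true", "yes", "on"}


def caching_enabled_by_default(argv: Sequence[str],
                               env: Mapping[str, str]) -> bool:
    """Return the default used for tasks with no explicit caching setting."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(FLAG, action="store_true")
    # Ignore unrelated arguments so the sketch can sit beside other flags.
    args, _unknown = parser.parse_known_args(list(argv))

    if args.disable_caching_by_default:
        return False
    if env.get(ENV_VAR, "").strip().lower() in _TRUTHY:
        return False
    # Neither switch set: keep the existing behavior (caching enabled).
    return True
```

In a real entry point this would be called as caching_enabled_by_default(sys.argv[1:], os.environ). Because both switches only ever turn the default off, users who set neither see no change in behavior.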

@gregsheremeta
Contributor

@DharmitD, per the last couple of comments, can you edit the title of this issue?

[feature] Update DSL to have default set to caching disabled -> [feature] allow setting a default of execution caching disabled via a compiler CLI flag and env var

DharmitD changed the title from "[feature] Update DSL to have default set to caching disabled" to "[feature] allow setting a default of execution caching disabled via a compiler CLI flag and env var" on Sep 17, 2024