Merge pull request karmada-io#767 from zhzhuang-zju/fix
update application-failover.md
karmada-bot authored Dec 23, 2024
2 parents 4f98655 + 0e44119 commit 1867953
Showing 4 changed files with 100 additions and 28 deletions.
32 changes: 25 additions & 7 deletions docs/userguide/failover/application-failover.md
@@ -190,7 +190,7 @@ spec:
You can edit `suppressDeletion` to false in `gracefulEvictionTasks` to evict the application in the failed cluster after you confirm the failure.

## Stateful Application Failover Support
## Application State Preservation

Starting from v1.12, the application-level failover feature adds support for stateful application failover. It provides a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.

@@ -200,27 +200,45 @@ In releases prior to v1.12, Karmada’s scheduling logic runs on the assumption

`StatePreservation` is a field under `.spec.failover.application`. It defines the policy for preserving and restoring state data during failover events for stateful applications. When an application fails over from one cluster to another, this policy enables the extraction of critical data from the original resource configuration.

It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster. You can define the state preservation policy:
It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster.

As an example, in a Flink application, `jobID` is a unique identifier used to distinguish and manage different Flink jobs. Each Flink job is assigned a `jobID` when it is submitted to the Flink cluster. When a job fails, the Flink application can use the `jobID` to recover the state of the pre-failure job and continue execution from the point of failure. The configuration and steps are as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-propagation
  name: foo
spec:
  #...
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: foo
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: pre-updated-replicas
            jsonPath: "{ .updatedReplicas }"
          - aliasLabelName: application.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobID }"
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
```

The above configuration will parse the `updatedReplicas` field from the application `.status` before migration. Upon successful migration, the extracted data is then re-injected into the new resource, ensuring that the application can resume operation with its previous state intact.
1. Before migration, the Karmada controller extracts the job ID from the FlinkDeployment status, following the `jsonPath` configured by the user.
2. During migration, the Karmada controller injects the extracted job ID into the Flink application configuration as a label, e.g. `application.karmada.io/failover-jobid: <jobID>`.
3. Kyverno running on a member cluster intercepts the FlinkDeployment creation request, derives the checkpoint data storage path for the job from the `jobID`, e.g. `/<shared-path>/<job-namespace>/<jobId>/checkpoints/xxx`, and then configures `initialSavepointPath` so that the job starts from the savepoint (a sketch of such a policy follows this list).
4. The FlinkDeployment starts from the checkpoint data under `initialSavepointPath`, thus inheriting the final state saved before the migration.
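
The Kyverno mutation in step 3 could look roughly like the ClusterPolicy below. This is only a sketch under stated assumptions, not an implementation from Karmada or the Flink community: the policy name and precondition are made up for illustration, and the savepoint path is composed directly from the shared-storage placeholder and the injected label, whereas a real policy would typically look up the actual checkpoint directory instead of hard-coding the `xxx` segment.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: flink-failover-initial-savepoint  # hypothetical name, for illustration only
spec:
  rules:
    - name: set-initial-savepoint-path
      match:
        any:
          - resources:
              kinds:
                - FlinkDeployment
      preconditions:
        all:
          # Only mutate FlinkDeployments that carry the job ID injected by Karmada.
          - key: "{{ request.object.metadata.labels.\"application.karmada.io/failover-jobid\" || '' }}"
            operator: NotEquals
            value: ""
      mutate:
        patchStrategicMerge:
          spec:
            job:
              # Path layout mirrors the example above; the shared storage root and the
              # exact checkpoint directory remain placeholders to be resolved per setup.
              initialSavepointPath: "/<shared-path>/{{ request.namespace }}/{{ request.object.metadata.labels.\"application.karmada.io/failover-jobid\" }}/checkpoints/xxx"
```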

This capability requires enabling the `StatefulFailoverInjection` feature gate. `StatefulFailoverInjection` is currently in `Alpha` and is turned off by default.
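
As an illustration only (this fragment is not taken from the document), assuming a default installation where the gate is toggled on the `karmada-controller-manager` Deployment in the `karmada-system` namespace, enabling it amounts to extending that component's `--feature-gates` flag, roughly as in the fragment below; keep any gates your installation already sets in that flag.

```yaml
# Sketch of the relevant fragment of the karmada-controller-manager Deployment
# (names and binary path assume a default installation; other flags are omitted
# here but must be kept as they are in your cluster).
spec:
  template:
    spec:
      containers:
        - name: karmada-controller-manager
          command:
            - /bin/karmada-controller-manager
            - --feature-gates=StatefulFailoverInjection=true
            # ...existing flags such as --kubeconfig stay unchanged
```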

@@ -192,7 +192,7 @@ spec:
You can set `suppressDeletion` to false in `gracefulEvictionTasks` to evict the application in the failed cluster after you confirm the failure.

## Stateful Application Failover Support
## Application State Preservation

Starting from v1.12, the application-level failover feature adds support for stateful application failover. It provides a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.
@@ -202,27 +202,45 @@ spec:
`StatePreservation` is a field under `.spec.failover.application`. It defines the policy for preserving and restoring state data during failover events for stateful applications. When an application fails over from one cluster to another, this policy enables the extraction of critical data from the original resource configuration.

It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster. You can define the state preservation policy:
It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster.

Take a Flink application as an example: `jobID` is a unique identifier used to distinguish and manage different Flink jobs. Each Flink job is assigned a `jobID` when it is submitted to the Flink cluster. When a job fails, the Flink application can use the `jobID` to recover the state of the pre-failure job and continue execution from the point of failure. The configuration and steps are as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-propagation
  name: foo
spec:
  #...
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: foo
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: pre-updated-replicas
            jsonPath: "{ .updatedReplicas }"
          - aliasLabelName: application.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobID }"
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
```

The above configuration will parse the `updatedReplicas` field from the application `.status` before migration. Upon successful migration, the extracted data is re-injected into the new resource, ensuring that the application can resume operation with its previous state intact.
1. Before migration, the Karmada controller extracts the job ID following the path configured by the user.
2. During migration, the Karmada controller injects the extracted job ID into the Flink application configuration as a label, e.g. `application.karmada.io/failover-jobid: <jobID>`.
3. Kyverno running on the member cluster intercepts the Flink application creation request, obtains the checkpoint data storage path for the job based on the `jobID`, e.g. `/<shared-path>/<job-namespace>/<jobId>/checkpoints/xxx`, and then configures `initialSavepointPath` so that the job starts from the savepoint.
4. The Flink application starts from the checkpoint data under `initialSavepointPath`, thus inheriting the final state saved before the migration.

This capability requires enabling the `StatefulFailoverInjection` feature gate. `StatefulFailoverInjection` is currently in the `Alpha` stage and is turned off by default.

@@ -192,7 +192,7 @@ spec:
You can set `suppressDeletion` to false in `gracefulEvictionTasks` to evict the application in the failed cluster after you confirm the failure.

## Stateful Application Failover Support
## Application State Preservation

Starting from v1.12, the application-level failover feature adds support for stateful application failover. It provides a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.
@@ -202,27 +202,45 @@ spec:
`StatePreservation` is a field under `.spec.failover.application`. It defines the policy for preserving and restoring state data during failover events for stateful applications. When an application fails over from one cluster to another, this policy enables the extraction of critical data from the original resource configuration.

It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster. You can define the state preservation policy:
It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster.

Take a Flink application as an example: `jobID` is a unique identifier used to distinguish and manage different Flink jobs. Each Flink job is assigned a `jobID` when it is submitted to the Flink cluster. When a job fails, the Flink application can use the `jobID` to recover the state of the pre-failure job and continue execution from the point of failure. The configuration and steps are as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-propagation
  name: foo
spec:
  #...
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: foo
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: pre-updated-replicas
            jsonPath: "{ .updatedReplicas }"
          - aliasLabelName: application.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobID }"
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
```

The above configuration will parse the `updatedReplicas` field from the application `.status` before migration. Upon successful migration, the extracted data is re-injected into the new resource, ensuring that the application can resume operation with its previous state intact.
1. Before migration, the Karmada controller extracts the job ID following the path configured by the user.
2. During migration, the Karmada controller injects the extracted job ID into the Flink application configuration as a label, e.g. `application.karmada.io/failover-jobid: <jobID>`.
3. Kyverno running on the member cluster intercepts the Flink application creation request, obtains the checkpoint data storage path for the job based on the `jobID`, e.g. `/<shared-path>/<job-namespace>/<jobId>/checkpoints/xxx`, and then configures `initialSavepointPath` so that the job starts from the savepoint.
4. The Flink application starts from the checkpoint data under `initialSavepointPath`, thus inheriting the final state saved before the migration.

This capability requires enabling the `StatefulFailoverInjection` feature gate. `StatefulFailoverInjection` is currently in the `Alpha` stage and is turned off by default.

@@ -190,7 +190,7 @@ spec:
You can edit `suppressDeletion` to false in `gracefulEvictionTasks` to evict the application in the failed cluster after you confirm the failure.

## Stateful Application Failover Support
## Application State Preservation

Starting from v1.12, the application-level failover feature adds support for stateful application failover. It provides a generalized way for users to define application state preservation in the context of cluster-to-cluster failovers.

@@ -200,27 +200,45 @@ In releases prior to v1.12, Karmada’s scheduling logic runs on the assumption

`StatePreservation` is a field under `.spec.failover.application`. It defines the policy for preserving and restoring state data during failover events for stateful applications. When an application fails over from one cluster to another, this policy enables the extraction of critical data from the original resource configuration.

It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster. You can define the state preservation policy:
It contains a list of `StatePreservationRule` configurations. Each rule specifies a JSONPath expression targeting specific pieces of state data to be preserved during failover events. An `AliasLabelName` is associated with each rule, serving as a label key when the preserved data is passed to the new cluster.

As an example, in a Flink application, `jobID` is a unique identifier used to distinguish and manage different Flink jobs. Each Flink job is assigned a `jobID` when it is submitted to the Flink cluster. When a job fails, the Flink application can use the `jobID` to recover the state of the pre-failure job and continue execution from the point of failure. The configuration and steps are as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-propagation
  name: foo
spec:
  #...
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: foo
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Immediately
      statePreservation:
        rules:
          - aliasLabelName: pre-updated-replicas
            jsonPath: "{ .updatedReplicas }"
          - aliasLabelName: application.karmada.io/failover-jobid
            jsonPath: "{ .jobStatus.jobID }"
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
    spreadConstraints:
      - maxGroups: 1
        minGroups: 1
        spreadByField: cluster
```

The above configuration will parse the `updatedReplicas` field from the application `.status` before migration. Upon successful migration, the extracted data is then re-injected into the new resource, ensuring that the application can resume operation with its previous state intact.
1. Before migration, the Karmada controller extracts the job ID from the FlinkDeployment status, following the `jsonPath` configured by the user.
2. During migration, the Karmada controller injects the extracted job ID into the Flink application configuration as a label, e.g. `application.karmada.io/failover-jobid: <jobID>` (a sketch of the resulting object follows this list).
3. Kyverno running on a member cluster intercepts the FlinkDeployment creation request, derives the checkpoint data storage path for the job from the `jobID`, e.g. `/<shared-path>/<job-namespace>/<jobId>/checkpoints/xxx`, and then configures `initialSavepointPath` so that the job starts from the savepoint.
4. The FlinkDeployment starts from the checkpoint data under `initialSavepointPath`, thus inheriting the final state saved before the migration.
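
For illustration only (this snapshot is assumed, not taken from the document), after step 2 the FlinkDeployment propagated to the new cluster carries the extracted job ID as a label, which an admission policy such as the one described in step 3 can then match on:

```yaml
# Sketch of the propagated FlinkDeployment after label injection; <jobID> is a
# placeholder, and the spec is the user's original configuration, unchanged here.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: foo
  labels:
    application.karmada.io/failover-jobid: <jobID>
spec:
  # ...original FlinkDeployment spec propagated by Karmada
```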

This capability requires enabling the `StatefulFailoverInjection` feature gate. `StatefulFailoverInjection` is currently in `Alpha` and is turned off by default.
