chore: add json path doc (#1279)

Co-authored-by: Yiran <[email protected]> Co-authored-by: jeremyhi <[email protected]>
GreptimeTeam · Nov 13, 2024 · e23e36f
1 parent cb6a91a
commit e23e36f
Showing 2 changed files with 275 additions and 4 deletions.
diff --git a/docs/user-guide/logs/pipeline-config.md b/docs/user-guide/logs/pipeline-config.md
@@ -50,6 +50,23 @@ We currently provide the following built-in Processors:
 - `urlencoding`: performs URL encoding/decoding on log data fields.
 - `csv`: parses CSV data fields in logs.
 
+Most processors have a `field` or `fields` option that specifies the fields to process. Most processors overwrite the original field after processing. If you do not want to affect the corresponding field in the original data, you can write the result to another field to avoid overwriting.
+
+When a field name contains `,`, the target field is renamed. For example, `reqTimeSec, req_time_sec` means renaming the `reqTimeSec` field to `req_time_sec`: the processed data is written to the `req_time_sec` key in the intermediate state, and the original `reqTimeSec` field is not affected. If a processor does not support field renaming, the renamed field name is ignored, and this is noted in that processor's documentation.
+
+For example:
+
+```yaml
+processors:
+  - letter:
+      fields:
+        - message, message_upper
+      method: upper
+      ignore_missing: true
+```
+
+The `message` field is converted to uppercase and stored in the `message_upper` field.
+
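The renaming rule added in this hunk can be sketched in Python. This is only an illustration of the documented behavior, not the pipeline's implementation: `parse_field_spec` and `apply_upper` are hypothetical helpers, and `apply_upper` merely stands in for the `letter` processor with `method: upper`.

```python
def parse_field_spec(spec: str) -> tuple[str, str]:
    """Split "source, target" into (source, target).

    Without a comma, the target is the source itself, which models
    overwriting the original field.
    """
    source, sep, target = spec.partition(",")
    source = source.strip()
    target = target.strip() if sep else source
    return source, target or source


def apply_upper(record: dict, field_specs: list[str], ignore_missing: bool = False) -> dict:
    """Hypothetical stand-in for the `letter` processor with `method: upper`."""
    out = dict(record)  # work on a copy so the input record is left untouched
    for spec in field_specs:
        source, target = parse_field_spec(spec)
        if source not in record:
            if ignore_missing:
                continue
            raise KeyError(f"field {source!r} is missing")
        out[target] = str(record[source]).upper()
    return out


result = apply_upper({"message": "hello"}, ["message, message_upper"])
print(result)  # {'message': 'hello', 'message_upper': 'HELLO'}
```

With a plain `message` entry instead of `message, message_upper`, the same sketch overwrites `message` in place, matching the default behavior described above.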
 ### `date`
 
 The `date` processor is used to parse time fields. Here's an example configuration:
@@ -108,7 +125,7 @@ processors:
 
 In the above example, the configuration of the `dissect` processor includes the following fields:
 
-- `fields`: A list of field names to be split.
+- `fields`: A list of field names to be split. Field renaming is not supported.
 - `patterns`: The dissect pattern for splitting.
 - `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
 - `append_separator`: Specifies the separator for concatenating multiple fields with same field name together. Defaults to an empty string. See `+` modifier below.
@@ -260,7 +277,7 @@ processors:
 
 In the above example, the configuration of the `regex` processor includes the following fields:
 
-- `fields`: A list of field names to be matched.
+- `fields`: A list of field names to be matched. If a field is renamed, the new name is combined with the names of the capture groups in `pattern` to generate the result field names.
 - `pattern`: The regular expression pattern to match. Named capture groups are required to extract corresponding data from the respective field.
 - `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
 
@@ -326,6 +343,125 @@ In the above example, the configuration of the `csv` processor includes the foll
 - `trim`: Whether to trim whitespace. Defaults to `false`.
 - `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
 
+### `json_path` (experimental)
+
+Note: The `json_path` processor is currently in the experimental stage and may be subject to change.
+
+The `json_path` processor is used to extract fields from JSON data. Here's an example configuration:
+
+```yaml
+processors:
+  - json_path:
+      fields:
+        - complex_object
+      json_path: "$.shop.orders[?(@.active)].id"
+      ignore_missing: true
+      result_index: 1
+```
+
+In the above example, the configuration of the `json_path` processor includes the following fields:
+
+- `fields`: A list of field names to be extracted.
+- `json_path`: The JSON path to extract.
+- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this configuration is set to `false`, an exception will be thrown.
+- `result_index`: The index of the value in the extracted array to use as the result. The processor extracts an array containing all values that match the path. By default, the whole array is the result. If an index is specified, only the value at that index is used as the final result.
+
+#### JSON path syntax
+
+The JSON path syntax is based on the [jsonpath-rust](https://github.com/besok/jsonpath-rust) library.
+
+At this stage, we recommend using only simple field extraction operations, such as lifting nested fields to the top level.
+
+#### `json_path` example
+
+For example, given the following log data:
+
+```json
+{
+  "product_object": {
+    "hello": "world"
+  },
+  "product_array": [
+    "hello",
+    "world"
+  ],
+  "complex_object": {
+    "shop": {
+      "orders": [
+        {
+          "id": 1,
+          "active": true
+        },
+        {
+          "id": 2
+        },
+        {
+          "id": 3
+        },
+        {
+          "id": 4,
+          "active": true
+        }
+      ]
+    }
+  }
+}
+```
+
+Using the following configuration:
+
+```yaml
+processors:
+  - json_path:
+      fields:
+        - product_object, object_target
+      json_path: "$.hello"
+      result_index: 0
+  - json_path:
+      fields:
+        - product_array, array_target
+      json_path: "$.[1]"
+      result_index: 0
+  - json_path:
+      fields:
+        - complex_object, complex_target_1
+      json_path: "$.shop.orders[?(@.active)].id"
+  - json_path:
+      fields:
+        - complex_target_1, complex_target_2
+      json_path: "$.[1]"
+      result_index: 0
+  - json_path:
+      fields:
+        - complex_object, complex_target_3
+      json_path: "$.shop.orders[?(@.active)].id"
+      result_index: 1
+transform:
+  - fields:
+      - object_target
+      - array_target
+    type: string
+  - fields:
+      - complex_target_3
+      - complex_target_2
+    type: uint32
+  - fields:
+      - complex_target_1
+    type: json
+```
+
+The result will be:
+
+```json
+{
+  "object_target": "world",
+  "array_target": "world",
+  "complex_target_3": 4,
+  "complex_target_2": 4,
+  "complex_target_1": [1, 4]
+}
+```
+
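To make the effect of `result_index` in the example above easier to follow, here is a rough Python sketch. It does not use a JSONPath library: each path is hand-translated for this sample only, and `select` is a hypothetical helper. The filter `[?(@.active)]` is modeled as "the key exists", which is an assumption that happens to give the same result as a truthiness check for this data. Out-of-range indexes are not modeled.

```python
log = {
    "product_object": {"hello": "world"},
    "product_array": ["hello", "world"],
    "complex_object": {
        "shop": {
            "orders": [
                {"id": 1, "active": True},
                {"id": 2},
                {"id": 3},
                {"id": 4, "active": True},
            ]
        }
    },
}


def select(matches: list, result_index=None):
    """Return every match, or only the one at result_index when it is set."""
    if result_index is None:
        return matches
    return matches[result_index]


# Hand-written equivalent of "$.shop.orders[?(@.active)].id" for this sample.
orders = log["complex_object"]["shop"]["orders"]
active_ids = [order["id"] for order in orders if "active" in order]

object_target = select([log["product_object"]["hello"]], 0)  # "$.hello"
array_target = select([log["product_array"][1]], 0)          # "$.[1]"
complex_target_1 = select(active_ids)                        # no result_index
complex_target_2 = select([complex_target_1[1]], 0)          # "$.[1]" on the previous result
complex_target_3 = select(active_ids, 1)                     # result_index: 1

print(object_target, array_target, complex_target_1, complex_target_2, complex_target_3)
# world world [1, 4] 4 4
```

Note that `complex_target_2` and `complex_target_3` reach the same value by different routes: the first chains a second path over an already extracted array, the second picks the element directly with `result_index`.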
 
 ## Transform
 

diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/pipeline-config.md b/i18n/zh/docusaurus-plugin-content-docs/current/user-guide/logs/pipeline-config.md
@@ -52,6 +52,23 @@ A Processor consists of a name and multiple configurations; different types of Processor
 - `urlencoding`: performs URL encoding/decoding on log data fields.
 - `csv`: parses CSV data fields in logs.
 
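+Most Processors have a `field` or `fields` option that specifies the fields to process. Most Processors overwrite the original field after processing. If you do not want to affect the corresponding field in the original data, you can write the result to another field to avoid overwriting.
+
+When a field name contains `,`, the field is renamed. For example, `reqTimeSec, req_time_sec` means renaming the `reqTimeSec` field to `req_time_sec`: the processed data is written to the `req_time_sec` field in the intermediate state, and the original `reqTimeSec` field is not affected. If a Processor does not support field renaming, the renamed field name is ignored, and this is noted in the documentation.
+
+For example: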
+
+```yaml
+processors:
+  - letter:
+      fields:
+        - message, message_upper
+      method: upper
+      ignore_missing: true
+```
+
+The `message` field is converted to uppercase and stored in the `message_upper` field.
+
 ### `date`
 
 The `date` Processor is used to parse time fields. Here is an example configuration:
@@ -110,7 +127,7 @@ processors:
 
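 As shown above, the configuration of the `dissect` Processor includes the following fields:
 
-- `fields`: A list of field names to be split.
+- `fields`: A list of field names to be split. Field renaming is not supported.
 - `patterns`: The dissect pattern for splitting.
 - `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this option is `false`, an exception is thrown.
 - `append_separator`: Specifies the separator for fields that are appended together. Defaults to an empty string.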
@@ -262,7 +279,7 @@ processors:
 
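 As shown above, the configuration of the `regex` Processor includes the following fields:
 
-- `fields`: A list of field names to be matched.
+- `fields`: A list of field names to be matched. If a field is renamed, the new field name is concatenated with the names of the named capture groups in `pattern`.
 - `pattern`: The regular expression to match. Named capture groups are required to extract the corresponding data from the field.
 - `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this option is `false`, an exception is thrown.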
 
@@ -329,6 +346,124 @@ processors:
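 - `trim`: Whether to trim whitespace. Defaults to `false`.
 - `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this option is `false`, an exception is thrown.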
 
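+### `json_path` (experimental)
+
+Note: The `json_path` processor is currently experimental and may change.
+
+The `json_path` processor is used to extract fields from JSON data. Here is an example configuration: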
+
+```yaml
+processors:
+  - json_path:
+      fields:
+        - complex_object
+      json_path: "$.shop.orders[?(@.active)].id"
+      ignore_missing: true
+      result_index: 1
+```
+
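+In the above example, the configuration of the `json_path` processor includes the following fields:
+
+- `fields`: A list of field names to be extracted.
+- `json_path`: The JSON path to extract.
+- `ignore_missing`: Ignores the case when the field is missing. Defaults to `false`. If the field is missing and this option is set to `false`, an exception is thrown.
+- `result_index`: The index of the value in the extracted array to use as the result. The Processor extracts an array containing all values that match the path. By default, the whole array is the result. If an index is specified, only the value at that index is used as the final result.
+
+#### JSON path syntax
+
+The JSON path syntax is based on the [jsonpath-rust](https://github.com/besok/jsonpath-rust) library.
+
+At this stage, we recommend using only simple field extraction operations, such as lifting nested fields to the top level.
+
+#### `json_path` example
+
+For example, given the following log data: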
+
+```json
+{
+  "product_object": {
+    "hello": "world"
+  },
+  "product_array": [
+    "hello",
+    "world"
+  ],
+  "complex_object": {
+    "shop": {
+      "orders": [
+        {
+          "id": 1,
+          "active": true
+        },
+        {
+          "id": 2
+        },
+        {
+          "id": 3
+        },
+        {
+          "id": 4,
+          "active": true
+        }
+      ]
+    }
+  }
+}
+```
+
+Using the following configuration:
+
+```yaml
+processors:
+  - json_path:
+      fields:
+        - product_object, object_target
+      json_path: "$.hello"
+      result_index: 0
+  - json_path:
+      fields:
+        - product_array, array_target
+      json_path: "$.[1]"
+      result_index: 0
+  - json_path:
+      fields:
+        - complex_object, complex_target_1
+      json_path: "$.shop.orders[?(@.active)].id"
+  - json_path:
+      fields:
+        - complex_target_1, complex_target_2
+      json_path: "$.[1]"
+      result_index: 0
+  - json_path:
+      fields:
+        - complex_object, complex_target_3
+      json_path: "$.shop.orders[?(@.active)].id"
+      result_index: 1
+transform:
+  - fields:
+      - object_target
+      - array_target
+    type: string
+  - fields:
+      - complex_target_3
+      - complex_target_2
+    type: uint32
+  - fields:
+      - complex_target_1
+    type: json
+```
+
+The result will be:
+
+```json
+{
+  "object_target": "world",
+  "array_target": "world",
+  "complex_target_3": 4,
+  "complex_target_2": 4,
+  "complex_target_1": [1, 4]
+}
+```
 
 ## Transform