XPath Parser Plugin

The XPath data format parser parses different formats into metric fields using XPath expressions.

For supported XPath functions check the underlying XPath library.

NOTE: The type of a field is specified using XPath functions. The only exception is integer fields, which must be specified in a fields_int section.

Supported data formats

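| name                                 | data_format setting | comment                   |
| ------------------------------------ | ------------------- | ------------------------- |
| Extensible Markup Language (XML)     | "xml"               |                           |
| Concise Binary Object Representation | "xpath_cbor"        | see additional notes      |
| JSON                                 | "xpath_json"        |                           |
| MessagePack                          | "xpath_msgpack"     |                           |
| Protocol-buffers                     | "xpath_protobuf"    | see additional parameters |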

Protocol-buffers additional settings

To use the protocol-buffer format, you must specify additional (mandatory) properties for the parser. These options are described below.

xpath_protobuf_file (mandatory)

Use this option to specify the name of the protocol-buffer definition file (.proto).

xpath_protobuf_type (mandatory)

This option contains the top-level message type to use for deserializing the data to be parsed. Usually, this is constructed from the package name in the protocol-buffer definition file and the message name as <package name>.<message name>.

xpath_protobuf_import_paths (optional)

If you import other protocol-buffer definitions within your .proto file (i.e. you use the import statement), you can use this option to specify paths to search for the imported definition file(s). By default, imports are searched only in ., the current working directory, which is usually the directory you are in when starting telegraf.

Suppose you have multiple protocol-buffer definitions (e.g. A.proto, B.proto and C.proto) in a directory (e.g. /data/my_proto_files), and your top-level file (e.g. A.proto) imports at least one other definition:

syntax = "proto3";

package foo;

import "B.proto";

message Measurement {
    ...
}

In that case, use the following settings:

[[inputs.file]]
  files = ["example.dat"]

  data_format = "xpath_protobuf"
  xpath_protobuf_file = "A.proto"
  xpath_protobuf_type = "foo.Measurement"
  xpath_protobuf_import_paths = [".", "/data/my_proto_files"]

  ...

xpath_protobuf_skip_bytes (optional)

This option allows skipping a number of bytes before trying to parse the protocol-buffer message. This is useful when the raw data has a header, e.g. for the message length, or in the case of GRPC messages.

This is a list of known headers and the corresponding values for xpath_protobuf_skip_bytes

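| name             | setting | comment                                                             |
| ---------------- | ------- | ------------------------------------------------------------------- |
| GRPC protocol    | 5       | GRPC adds a 5-byte header for Length-Prefixed-Messages              |
| PowerDNS logging | 2       | Sent messages contain a 2-byte header containing the message length |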
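
As a minimal sketch, the GRPC row of the list above could translate into a configuration like the following. The file name grpc_message.dat and the field query are hypothetical; the .proto file and message type are reused from the import-path example above, and the node names available for queries depend on your message definition.

```toml
[[inputs.file]]
  ## Hypothetical file containing a single GRPC Length-Prefixed-Message
  files = ["grpc_message.dat"]

  data_format = "xpath_protobuf"
  xpath_protobuf_file = "A.proto"
  xpath_protobuf_type = "foo.Measurement"

  ## Ignore the 5-byte GRPC header before decoding the message
  xpath_protobuf_skip_bytes = 5

  ## Print the internal document in debug logging mode to find suitable queries
  xpath_print_document = true

  [[inputs.file.xpath]]
    [inputs.file.xpath.fields]
      ## Hypothetical node name, replace with a node from your own message
      value = "number(//value)"
```

For PowerDNS logging the same layout would apply with xpath_protobuf_skip_bytes = 2.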

Concise Binary Object Representation notes

Concise Binary Object Representation (CBOR) supports numeric keys in the data. However, XML (and this parser) expects node names to be strings starting with a letter. To be compatible with these requirements, numeric keys are converted to strings and prefixed with a lower-case n. For example, if you have a node with the key 123 in CBOR, you need to query n123 in your XPath expressions.
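
A short sketch of querying such a prefixed key, assuming a hypothetical file example.cbor whose data contains the numeric key 123. The field name reading is made up for illustration, and the query uses // because the position of the node in the internal document is not known here.

```toml
[[inputs.file]]
  ## Hypothetical CBOR input file
  files = ["example.cbor"]
  data_format = "xpath_cbor"

  ## Print the internal document in debug logging mode to check the node names
  xpath_print_document = true

  [[inputs.file.xpath]]
    [inputs.file.xpath.fields]
      ## The CBOR key 123 is addressed as node "n123"
      reading = "number(//n123)"
```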

Configuration

[[inputs.file]]
  files = ["example.xml"]

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "xml"

  ## PROTOCOL-BUFFER definitions
  ## Protocol-buffer definition file
  # xpath_protobuf_file = "sparkplug_b.proto"
  ## Name of the protocol-buffer message type to use in a fully qualified form.
  # xpath_protobuf_type = "org.eclipse.tahu.protobuf.Payload"
  ## List of paths to use when looking up imported protocol-buffer definition files.
  # xpath_protobuf_import_paths = ["."]
  ## Number of (header) bytes to ignore before parsing the message.
  # xpath_protobuf_skip_bytes = 0

  ## Print the internal XML document when in debug logging mode.
  ## This is especially useful when using the parser with non-XML formats like protocol-buffers
  ## to get an idea on the expression necessary to derive fields etc.
  # xpath_print_document = false

  ## Allow the results of one of the parsing sections to be empty.
  ## Useful when not all selected files have the exact same structure.
  # xpath_allow_empty_selection = false

  ## Get native data-types for all data-format that contain type information.
  ## Currently, CBOR, protobuf, msgpack and JSON support native data-types.
  # xpath_native_types = false

  ## Trace empty node selections for debugging
  ## This will only produce output in debugging mode.
  # xpath_trace = false

  ## Multiple parsing sections are allowed
  [[inputs.file.xpath]]
    ## Optional: XPath-query to select a subset of nodes from the XML document.
    # metric_selection = "/Bus/child::Sensor"

    ## Optional: XPath-query to set the metric (measurement) name.
    # metric_name = "string('example')"

    ## Optional: Query to extract metric timestamp.
    ## If not specified the time of execution is used.
    # timestamp = "/Gateway/Timestamp"
    ## Optional: Format of the timestamp determined by the query above.
    ## This can be any of "unix", "unix_ms", "unix_us", "unix_ns" or a valid Golang
    ## time format. If not specified, a "unix" timestamp (in seconds) is expected.
    # timestamp_format = "2006-01-02T15:04:05Z"
    ## Optional: Timezone of the parsed time
    ## This will locate the parsed time to the given timezone. Please note that
    ## for times with timezone-offsets (e.g. RFC3339) the timestamp is unchanged.
    ## This is ignored for all (unix) timestamp formats.
    # timezone = "UTC"

    ## Optional: List of fields to convert to hex-strings if they are
    ## containing byte-arrays. This might be the case for e.g. protocol-buffer
    ## messages encoding data as byte-arrays. Wildcard patterns are allowed.
    ## By default, all byte-array-fields are converted to string.
    # fields_bytes_as_hex = []

    ## Optional: List of fields to convert to base64-strings if they
    ## contain byte-arrays. Resulting string will generally be shorter
    ## than using hex encoding. Base64 encoding is RFC4648 compliant.
    # fields_bytes_as_base64 = []

    ## Tag definitions using the given XPath queries.
    [inputs.file.xpath.tags]
      name   = "substring-after(Sensor/@name, ' ')"
      device = "string('the ultimate sensor')"

    ## Integer field definitions using XPath queries.
    [inputs.file.xpath.fields_int]
      consumers = "Variable/@consumers"

    ## Non-integer field definitions using XPath queries.
    ## The field type is defined using XPath expressions such as number(), boolean() or string(). If no conversion is performed the field will be of type string.
    [inputs.file.xpath.fields]
      temperature = "number(Variable/@temperature)"
      power       = "number(Variable/@power)"
      frequency   = "number(Variable/@frequency)"
      ok          = "Mode != 'error'"

In this configuration mode, you explicitly specify the fields and tags you want to extract from your data.

A configuration can contain multiple xpath subsections, e.g. for the file plugin to process the XML string multiple times. Consult the XPath syntax and the underlying library's functions for details and help regarding XPath queries. Consider using an XPath tester such as xpather.com or Code Beautify's XPath Tester to help develop and debug your queries.
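
As a sketch of using several xpath subsections, the following configuration processes the same document twice: once for the gateway and once per sensor. It assumes the example.xml file shown in the Examples section; the metric names gateway and sensors are chosen for illustration only.

```toml
[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  ## First parsing section: a single metric describing the gateway
  [[inputs.file.xpath]]
    metric_name = "string('gateway')"

    [inputs.file.xpath.fields_int]
      seqnr = "/Gateway/Sequence"

  ## Second parsing section: one metric per Sensor node
  [[inputs.file.xpath]]
    metric_selection = "/Bus/child::Sensor"
    metric_name = "string('sensors')"

    [inputs.file.xpath.fields]
      temperature = "number(Variable/@temperature)"
```

Each [[inputs.file.xpath]] header starts a new parsing section, and the sub-tables that follow it (tags, fields, fields_int) belong to that section.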

Configuration (batch)

As an alternative to the configuration above, fields can also be specified in batch. Instead of listing each field in a fields section, you define a selector for the field nodes plus name and value queries that determine the name and value of each field in the metric.

[[inputs.file]]
  files = ["example.xml"]

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "xml"

  ## PROTOCOL-BUFFER definitions
  ## Protocol-buffer definition file
  # xpath_protobuf_file = "sparkplug_b.proto"
  ## Name of the protocol-buffer message type to use in a fully qualified form.
  # xpath_protobuf_type = "org.eclipse.tahu.protobuf.Payload"
  ## List of paths to use when looking up imported protocol-buffer definition files.
  # xpath_protobuf_import_paths = ["."]

  ## Print the internal XML document when in debug logging mode.
  ## This is especially useful when using the parser with non-XML formats like protocol-buffers
  ## to get an idea on the expression necessary to derive fields etc.
  # xpath_print_document = false

  ## Allow the results of one of the parsing sections to be empty.
  ## Useful when not all selected files have the exact same structure.
  # xpath_allow_empty_selection = false

  ## Get native data-types for all data-format that contain type information.
  ## Currently, CBOR, protobuf, msgpack and JSON support native data-types.
  # xpath_native_types = false

  ## Multiple parsing sections are allowed
  [[inputs.file.xpath]]
    ## Optional: XPath-query to select a subset of nodes from the XML document.
    metric_selection = "/Bus/child::Sensor"

    ## Optional: XPath-query to set the metric (measurement) name.
    # metric_name = "string('example')"

    ## Optional: Query to extract metric timestamp.
    ## If not specified the time of execution is used.
    # timestamp = "/Gateway/Timestamp"
    ## Optional: Format of the timestamp determined by the query above.
    ## This can be any of "unix", "unix_ms", "unix_us", "unix_ns" or a valid Golang
    ## time format. If not specified, a "unix" timestamp (in seconds) is expected.
    # timestamp_format = "2006-01-02T15:04:05Z"

    ## Field specifications using a selector.
    field_selection = "child::*"
    ## Optional: Queries to specify field name and value.
    ## These options are only to be used in combination with 'field_selection'!
    ## By default the node name and node content is used if a field-selection
    ## is specified.
    # field_name  = "name()"
    # field_value = "."

    ## Optional: Expand field names relative to the selected node
    ## This allows to flatten out nodes with non-unique names in the subtree
    # field_name_expansion = false

    ## Tag specifications using a selector.
    # tag_selection = "child::*"
    ## Optional: Queries to specify tag name and value.
    ## These options are only to be used in combination with 'tag_selection'!
    ## By default the node name and node content is used if a tag-selection
    ## is specified.
    # tag_name  = "name()"
    # tag_value = "."

    ## Optional: Expand tag names relative to the selected node
    ## This allows to flatten out nodes with non-unique names in the subtree
    # tag_name_expansion = false

    ## Tag definitions using the given XPath queries.
    [inputs.file.xpath.tags]
      name   = "substring-after(Sensor/@name, ' ')"
      device = "string('the ultimate sensor')"

Please note: Unless field_value applies a type conversion such as number(), the resulting fields are of type string!

It is also possible to specify a mixture of the two alternative ways of specifying fields. In this case explicitly defined tags and fields take precedence over the batch instances if both use the same tag/field name.
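
A sketch of such a mixture, assuming the example.xml file from the Examples section: all Variable nodes are collected in batch, while consumers is additionally defined explicitly as an integer field. According to the precedence rule above, the explicit definition should win for that field name.

```toml
[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    metric_selection = "/Bus/child::Sensor"

    ## Batch definition: one field per Variable node, named after its first attribute
    field_selection = "child::Variable"
    field_name  = "name(@*[1])"
    field_value = "number(@*[1])"

    ## Explicit definition: uses the same field name as one of the batch fields
    [inputs.file.xpath.fields_int]
      consumers = "Variable/@consumers"
```

Note that the batch options must appear before the first sub-table header, since in TOML all keys following [inputs.file.xpath.fields_int] would belong to that sub-table.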

metric_selection (optional)

You can specify an XPath query to select a subset of nodes from the XML document, each of which is used to generate a new metric with the specified fields, tags etc.

Relative queries in the subsequent options are evaluated relative to the nodes selected by metric_selection. To specify absolute paths, start the query with a slash (/).

Specifying metric_selection is optional. If not specified all relative queries are relative to the root node of the XML document.
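
To illustrate the difference between relative and absolute queries, here is a small sketch based on the example.xml file from the Examples section. The tag and field names gateway, sensor and mode are chosen for illustration.

```toml
[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    metric_selection = "/Bus/child::Sensor"

    [inputs.file.xpath.tags]
      ## Absolute query: evaluated from the document root for every metric
      gateway = "/Gateway/Name"
      ## Relative query: evaluated against the currently selected Sensor node
      sensor = "string(@name)"

    [inputs.file.xpath.fields]
      ## Relative query: the Mode child of the currently selected Sensor node
      mode = "string(Mode)"
```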

metric_name (optional)

By specifying metric_name you can override the metric/measurement name with the result of the given XPath query. If not specified, the default metric name is used.

timestamp, timestamp_format, timezone (optional)

By default the current time will be used for all created metrics. To set the time from values in the XML document you can specify an XPath query in timestamp and set the format in timestamp_format.

The timestamp_format can be set to unix, unix_ms, unix_us, unix_ns, or an accepted Go "reference time". Consult the Go time package for details and additional examples on how to set the time format. If timestamp_format is omitted, the result of the timestamp query is assumed to be in unix format.

The timezone setting will be used to locate the parsed time in the given timezone. This is helpful for cases where the time does not contain timezone information, e.g. 2023-03-09 14:04:40, and is not located in UTC, which is the default setting. It is also possible to set the timezone to Local, which uses the configured host timezone.

For time formats with timezone information, e.g. RFC3339, the resulting timestamp is unchanged. The timezone setting is ignored for all unix timestamp formats.
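
A sketch of parsing a timestamp without timezone information such as 2023-03-09 14:04:40. The node /Gateway/LocalTime is hypothetical (it does not exist in example.xml), and the timezone name Europe/Berlin assumes that IANA timezone names are accepted in addition to UTC and Local.

```toml
[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    ## Hypothetical node containing e.g. "2023-03-09 14:04:40"
    timestamp = "/Gateway/LocalTime"
    ## Go reference-time layout matching "YYYY-MM-DD hh:mm:ss"
    timestamp_format = "2006-01-02 15:04:05"
    ## Interpret the parsed time as local time of this zone (assumed IANA name)
    timezone = "Europe/Berlin"

    [inputs.file.xpath.fields]
      ok = "/Gateway/Status = 'ok'"
```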

tags sub-section

XPath queries in the tag name = query format to add tags to the metrics. The specified path can be absolute (starting with /) or relative. Relative paths use the currently selected node as reference.

NOTE: Results of tag-queries will always be converted to strings.

fields_int sub-section

XPath queries in the field name = query format to add integer typed fields to the metrics. The specified path can be absolute (starting with /) or relative. Relative paths use the currently selected node as reference.

NOTE: Results of fields_int queries will always be converted to int64. The conversion will fail if the query result is not convertible!

fields sub-section

XPath queries in the field name = query format to add non-integer fields to the metrics. The specified path can be absolute (starting with /) or relative. Relative paths use the currently selected node as reference.

The type of the field is specified in the XPath query using the type conversion functions of XPath such as number(), boolean() or string(). If no conversion is performed in the query, the field will be of type string.

NOTE: XPath conversion functions will always succeed, even if you convert text to a float!

field_selection, field_name, field_value (optional)

You can specify an XPath query to select a set of nodes forming the fields of the metric. The specified path can be absolute (starting with /) or relative to the currently selected node. Each node selected by field_selection forms a new field within the metric.

The name and the value of each field can be specified using the optional field_name and field_value queries. The queries are relative to the selected field if not starting with /. If not specified the field's name defaults to the node name and the field's value defaults to the content of the selected field node.

NOTE: field_name and field_value queries are only evaluated if a field_selection is specified.

Specifying field_selection is optional. This is an alternative way to specify fields especially for documents where the node names are not known a priori or if there is a large number of fields to be specified. These options can also be combined with the field specifications above.

NOTE: XPath conversion functions will always succeed, even if you convert text to a float!

field_name_expansion (optional)

When true, field names selected with field_selection are expanded to a path relative to the selected node. This is necessary if you, for example, select all leaf nodes as fields and those leaf nodes do not have unique names. In other words, if the fields you select contain duplicate names, you should set this to true.

tag_selection, tag_name, tag_value (optional)

You can specify an XPath query to select a set of nodes forming the tags of the metric. The specified path can be absolute (starting with /) or relative to the currently selected node. Each node selected by tag_selection forms a new tag within the metric.

The name and the value of each tag can be specified using the optional tag_name and tag_value queries. The queries are relative to the selected tag if not starting with /. If not specified the tag's name defaults to the node name and the tag's value defaults to the content of the selected tag node.

NOTE: tag_name and tag_value queries are only evaluated if a tag_selection is specified.

Specifying tag_selection is optional. This is an alternative way to specify tags especially for documents where the node names are not known a priori or if there is a large number of tags to be specified. These options can also be combined with the tag specifications above.
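
A sketch of batch tag selection, assuming the example.xml file from the Examples section. The Mode child of each Sensor is selected as a tag; with the defaults for tag_name and tag_value this should produce a tag named after the node (Mode) holding the node content.

```toml
[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    metric_selection = "/Bus/child::Sensor"

    ## Batch tags: every Mode child of the selected Sensor becomes a tag
    tag_selection = "child::Mode"
    ## The defaults are written out here for clarity
    tag_name  = "name()"
    tag_value = "."

    ## Batch fields: one field per Variable node
    field_selection = "child::Variable"
    field_name  = "name(@*[1])"
    field_value = "number(@*[1])"
```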

tag_name_expansion (optional)

When true, tag names selected with tag_selection are expanded to a path relative to the selected node. This is necessary if you, for example, select all leaf nodes as tags and those leaf nodes do not have unique names. In other words, if the tags you select contain duplicate names, you should set this to true.

Examples

This example.xml file is used in the configuration examples below:

<?xml version="1.0"?>
<Gateway>
  <Name>Main Gateway</Name>
  <Timestamp>2020-08-01T15:04:03Z</Timestamp>
  <Sequence>12</Sequence>
  <Status>ok</Status>
</Gateway>

<Bus>
  <Sensor name="Sensor Facility A">
    <Variable temperature="20.0"/>
    <Variable power="123.4"/>
    <Variable frequency="49.78"/>
    <Variable consumers="3"/>
    <Mode>busy</Mode>
  </Sensor>
  <Sensor name="Sensor Facility B">
    <Variable temperature="23.1"/>
    <Variable power="14.3"/>
    <Variable frequency="49.78"/>
    <Variable consumers="1"/>
    <Mode>standby</Mode>
  </Sensor>
  <Sensor name="Sensor Facility C">
    <Variable temperature="19.7"/>
    <Variable power="0.02"/>
    <Variable frequency="49.78"/>
    <Variable consumers="0"/>
    <Mode>error</Mode>
  </Sensor>
</Bus>

Basic Parsing

This example shows the basic usage of the xml parser.

Config:

[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    [inputs.file.xpath.tags]
      gateway = "substring-before(/Gateway/Name, ' ')"

    [inputs.file.xpath.fields_int]
      seqnr = "/Gateway/Sequence"

    [inputs.file.xpath.fields]
      ok = "/Gateway/Status = 'ok'"

Output:

file,gateway=Main,host=Hugin seqnr=12i,ok=true 1598610830000000000

In the tags definition the XPath function substring-before() is used to only extract the sub-string before the space. To get the integer value of /Gateway/Sequence we have to use the fields_int section as there is no XPath expression to convert node values to integers (only float).

The ok field is filled with a boolean by specifying a query comparing the query result of /Gateway/Status with the string ok. Use the type conversions available in the XPath syntax to specify field types.

Time and metric names

This example takes both the timestamp and the name of the metric from the XML document itself.

Config:

[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    metric_name = "name(/Gateway/Status)"

    timestamp = "/Gateway/Timestamp"
    timestamp_format = "2006-01-02T15:04:05Z"

    [inputs.file.xpath.tags]
      gateway = "substring-before(/Gateway/Name, ' ')"

    [inputs.file.xpath.fields]
      ok = "/Gateway/Status = 'ok'"

Output:

Status,gateway=Main,host=Hugin ok=true 1596294243000000000

In addition to the basic parsing example, the metric name is defined as the name of the /Gateway/Status node and the timestamp is derived from the XML document instead of using the execution time.

Multi-node selection

For XML documents containing metrics for e.g. multiple devices (like Sensors in the example.xml), multiple metrics can be generated using node selection. This example shows how to generate a metric for each Sensor in the example.

Config:

[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    metric_selection = "/Bus/child::Sensor"

    metric_name = "string('sensors')"

    timestamp = "/Gateway/Timestamp"
    timestamp_format = "2006-01-02T15:04:05Z"

    [inputs.file.xpath.tags]
      name = "substring-after(@name, ' ')"

    [inputs.file.xpath.fields_int]
      consumers = "Variable/@consumers"

    [inputs.file.xpath.fields]
      temperature = "number(Variable/@temperature)"
      power       = "number(Variable/@power)"
      frequency   = "number(Variable/@frequency)"
      ok          = "Mode != 'error'"

Output:

sensors,host=Hugin,name=Facility\ A consumers=3i,frequency=49.78,ok=true,power=123.4,temperature=20 1596294243000000000
sensors,host=Hugin,name=Facility\ B consumers=1i,frequency=49.78,ok=true,power=14.3,temperature=23.1 1596294243000000000
sensors,host=Hugin,name=Facility\ C consumers=0i,frequency=49.78,ok=false,power=0.02,temperature=19.7 1596294243000000000

Using the metric_selection option we select all Sensor nodes in the XML document. Please note that all field and tag definitions are relative to these selected nodes. An exception is the timestamp definition which is relative to the root node of the XML document.

Batch field processing with multi-node selection

For XML documents containing metrics with a large number of fields or where the fields are not known in advance (e.g. an unknown set of Variable nodes in the example.xml), field selectors can be used. This example shows how to generate a metric for each Sensor in the example with fields derived from the Variable nodes.

Config:

[[inputs.file]]
  files = ["example.xml"]
  data_format = "xml"

  [[inputs.file.xpath]]
    metric_selection = "/Bus/child::Sensor"
    metric_name = "string('sensors')"

    timestamp = "/Gateway/Timestamp"
    timestamp_format = "2006-01-02T15:04:05Z"

    field_selection = "child::Variable"
    field_name = "name(@*[1])"
    field_value = "number(@*[1])"

    [inputs.file.xpath.tags]
      name = "substring-after(@name, ' ')"

Output:

sensors,host=Hugin,name=Facility\ A consumers=3,frequency=49.78,power=123.4,temperature=20 1596294243000000000
sensors,host=Hugin,name=Facility\ B consumers=1,frequency=49.78,power=14.3,temperature=23.1 1596294243000000000
sensors,host=Hugin,name=Facility\ C consumers=0,frequency=49.78,power=0.02,temperature=19.7 1596294243000000000

Using the metric_selection option we select all Sensor nodes in the XML document. For each Sensor we then use field_selection to select all Variable child nodes of the sensor as field-nodes. Please note that the field selection is relative to the selected nodes. For each selected field-node we use field_name and field_value to determine the field's name and value, respectively. The field_name query derives the name of the first attribute of the node, while field_value derives the value of the first attribute and converts the result to a number.