Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UTF-8 support in metric and label names #689

Merged
merged 6 commits into from
Oct 2, 2024

Conversation

fedetorres93
Copy link
Contributor

@fedetorres93 fedetorres93 commented Sep 12, 2024

As part of the initiative for adding UTF-8 support to Prometheus, this PR brings UTF-8 support for metric and label names as described in this proposal (for example: {"metric.name", "label.name"="value"}.

For UTF-8 label names defined in the URL path, an escaping syntax is used. For example, the label "label.name"="value" would be escaped like: /metrics/job/example/U__label_2e_name/value. This escaping is compatible with the base64 encoding for label values: /metrics/job/example/U__label_2e_name@base64/dmFsdWU=

UTF-8 support can be enabled by running the Pushgateway with the flag --push.enable-utf8-names=true

Addresses #623

@fedetorres93 fedetorres93 marked this pull request as ready for review September 12, 2024 16:01
@beorn7 beorn7 self-assigned this Sep 17, 2024
@beorn7
Copy link
Member

beorn7 commented Sep 18, 2024

Sorry, still fighting my backlog. I hope I get to review this tomorrow.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this.
A few ideas about how to precisely control whether to escape or not. I hope I got it right, but happy to learn where I'm wrong.

In different news, this needs documentation (which currently is just in the ever growing README.md – feel free to add it there, breaking down the documentation into "proper" documentation files would be a task for another PR).

@@ -18,6 +18,7 @@ import (
"context"
"errors"
"fmt"
"github.com/prometheus/common/model"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to go into the next block (line 29).

main.go Outdated
@@ -17,6 +17,7 @@ import (
"compress/gzip"
"context"
"fmt"
"github.com/prometheus/common/model"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, has to go to line 39 or so.

main.go Outdated
@@ -73,6 +74,7 @@ func main() {
persistenceFile = app.Flag("persistence.file", "File to persist metrics. If empty, metrics are only kept in memory.").Default("").String()
persistenceInterval = app.Flag("persistence.interval", "The minimum interval at which to write out the persistence file.").Default("5m").Duration()
pushUnchecked = app.Flag("push.disable-consistency-check", "Do not check consistency of pushed metrics. DANGEROUS.").Default("false").Bool()
pushEscaped = app.Flag("push.enable-escaped-labels", "Allow escaped characters in label names in URLs.").Default("false").Bool()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this push.enable-utf8-names to mirror the feature flag in Prometheus, and also to correctly describe what we are doing. (Without the flag, the PGW will reject non-legacy characters in names even in the payload of a push.)
Variable name and doc string has to be adjusted accordingly.

main.go Show resolved Hide resolved
handler/push.go Outdated
@@ -180,19 +180,21 @@ func splitLabels(labels string) (map[string]string, error) {
for i := 0; i < len(components)-1; i += 2 {
name, value := components[i], components[i+1]
trimmedName := strings.TrimSuffix(name, Base64Suffix)
unescapedName := model.UnescapeName(trimmedName, model.ValueEncodingEscaping)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't even want to try to unescape if the corresponding flag is not set. To not refer to the flag value directly here, maybe set a package-level variable of type model.EscapingScheme to either model.ValueEncodingEscaping or model.NoEscaping and then use that variable here.

handler/push.go Outdated
if !model.LabelNameRE.MatchString(trimmedName) ||
!model.LabelName(unescapedName).IsValid() ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this validity test, wo don't need the one in the previous line anymore, do wo?

@fedetorres93 fedetorres93 force-pushed the ftorres/utf-8 branch 2 times, most recently from ab2490d to 83c45eb Compare September 23, 2024 21:48
@fedetorres93
Copy link
Contributor Author

@beorn7 Thanks for your feedback, the requested changes were addressed.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good (modulu doc comment on push.EscapingScheme. But we also need documentation. (As suggested previously, I believe it is OK for now to add a section to the ever growing README.md, keeping in mind that eventually we would like to break it up into "proper" documentation.)

handler/push.go Outdated
@@ -41,6 +41,8 @@ const (
Base64Suffix = "@base64"
)

var EscapingScheme = model.NoEscaping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could benefit from a doc comment.

Signed-off-by: Federico Torres <[email protected]>
@fedetorres93
Copy link
Contributor Author

@beorn7 Sorry about the documentation, thanks for pointing it out, now it's added. Also added EscapingScheme comment.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I just couldn't resist and applied a lot of wordsmithing. Please see my suggestions.

README.md Outdated
@@ -320,6 +320,28 @@ Examples:

/metrics/job/titan/name@base64/zqDPgc6_zrzOt864zrXPjc-C

### UTF-8 support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be a bit more specific here and call this "UTF-8 support for metric and label names".
(UTF-8 is supported for label values already.)

README.md Outdated
Comment on lines 325 to 330
UTF-8 characters in metric and label names are supported by enabling
the `--push.enable-utf8-names` flag. The syntax is as follows:

{"some.metric", "foo.bar"="baz"}

For UTF-8 label names defined in the URL path, [an escaping syntax](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#text-escaping) is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not explain the syntax of the text exposition format here. (It also doesn't cover protobuf.)

With some wordsmithing included, I would propose the following:

Suggested change
UTF-8 characters in metric and label names are supported by enabling
the `--push.enable-utf8-names` flag. The syntax is as follows:
{"some.metric", "foo.bar"="baz"}
For UTF-8 label names defined in the URL path, [an escaping syntax](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#text-escaping) is used.
Newer versions of the Prometheus exposition formats (text and protobuf)
support the full UTF-8 character set in metric and label names. The
Pushgateway only accepts special characters in names if the command line
flag `--push.enable-utf8-names` is set.
To allow special characters in label names that are part of the URL path, the flag also enables a
[specific encoding mechanism](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md#text-escaping). This is similar to the base64 encoding for label _values_ described above,
but works differently in detail for technical and historical reasons. As before, client libraries
should usually take care of the encoding. It works as follows:

README.md Outdated
Comment on lines 332 to 335
* Prefix the string with `U__`.
* All non-valid characters (i.e. characters other than letters, numbers, and underscores) will be encoded as underscores surrounding the Unicode value, like `_1F60A_`.
* All pre-existing underscores will become doubled: `__`.
* If a string should start with "U__" already, it will need to be escaped: `U___55_____`. (That's `U__` + `_55_` (for `U`) + `__` + `__`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stay consistent with regard to the form. You start with an imperative form and then use future tense.

Here my suggestion with a few things added:

Suggested change
* Prefix the string with `U__`.
* All non-valid characters (i.e. characters other than letters, numbers, and underscores) will be encoded as underscores surrounding the Unicode value, like `_1F60A_`.
* All pre-existing underscores will become doubled: `__`.
* If a string should start with "U__" already, it will need to be escaped: `U___55_____`. (That's `U__` + `_55_` (for `U`) + `__` + `__`).
* A label name containing encoded characters starts with `U__`.
* All non-standard characters (i.e. characters other than letters, numbers, and underscores) are encoded as underscores surrounding their Unicode value, like `_1F60A_`.
* All pre-existing underscores are encoded as a double-underscore: `__`.
* If a label name starts with `U__` already, these characters have to be encoded as well, resulting in `U___55_____`. (That's `U__` + `_55_` (for `U`) + `__` + `__`).
* A label name starting with `U__` in it's encoded form, but containing invalid sequences (e.g. `U__in_xxx_valid`) is left unchanged.

README.md Outdated
* All pre-existing underscores will become doubled: `__`.
* If a string should start with "U__" already, it will need to be escaped: `U___55_____`. (That's `U__` + `_55_` (for `U`) + `__` + `__`).

For example, the label `"foo.bar"="baz"` would be escaped like:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's better use "encoded" rather than "escaped" to only use one word throughout this file.

README.md Outdated

/metrics/job/example/U__foo_2e_bar/baz

This escaping is compatible with the base64 encoding for label values:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escaping → encoding, see above.

This escaping is compatible with the base64 encoding for label values:

/metrics/job/example/U__foo_2e_bar@base64/YmF6

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add something about the one edge case, e.g.

Note that this method has an unlikely edge case that is not handled properly: A pusher unaware of the encoding mechanism might use a label name that is also a valid encoded version of another label name. E.g. if a pusher intends to use the label name U__foo_2e_bar, but doesn't encode it as U___55_____foo__2e__bar, the Pushgateway will decode U__foo_2e_bar to foo.bar. This is the main reason why the decoding is opt-in via the --push.enable-utf8-names flag.

Signed-off-by: Federico Torres <[email protected]>
@fedetorres93
Copy link
Contributor Author

@beorn7 Thanks for the docs suggestions, done.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@beorn7 beorn7 merged commit c408b8e into prometheus:master Oct 2, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants