Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support UTF-8 in metric and label names #623

Open
ywwg opened this issue Mar 5, 2024 · 8 comments
Open

Support UTF-8 in metric and label names #623

ywwg opened this issue Mar 5, 2024 · 8 comments

Comments

@ywwg
Copy link
Member

ywwg commented Mar 5, 2024

As raised in prometheus/client_ruby#306, we will need some way to encode UTF-8 metric and label names in pushgateway URLs. This may be tricky because the current method does allow for encodings of job names and label values, and any solution would have to be backwards compatible with the current syntax. Thoughtful design work is needed.

See: https://github.com/prometheus/pushgateway#url

Feature request

Use case. Why is this important?

UTF-8 Support is coming to prometheus and we need to cover this usecase.

@fedetorres93
Copy link

The path for pushing metrics into the Pushgateway looks like

/metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}

Currently, job names and label values can be encoded with base64url, in which case job or the label name must be suffixed with @base64. For example, using the grouping key job="example",first_label="foo",second_label="bar":

/metrics/job@base64/ZXhhbXBsZQ==/first_label@base64/Zm9v/second_label@base64/YmFy

Some ideas I considered for encoding UTF-8 label names in URLs, using the same base64url approach:

  • Encode the label name using base64url and add a new suffix (for example, @base64name) so that label names are decoded accordingly when the URL is parsed. We can have another suffix for the case where both the label name and the label value are base64url encoded, and keep the original @base64 for when only the value is encoded. For example, using the grouping key job="example",first.label="foo":
/metrics/job/example/Zmlyc3QubGFiZWw=@base64name/foo
  • Use a query parameter (like base64names=true) to indicate label names are base64url encoded. This can also coexist with the current @base64 suffix used for label values. Disadvantage: it will be all or nothing.
/metrics/job/example/Zmlyc3QubGFiZWw=/foo?base64names=true

Metric and label names in the request body are validated using functions from the common library that already support UTF-8.

@ywwg
Copy link
Member Author

ywwg commented Aug 28, 2024

is there content negotiation with calls to pushgateway, or do people just hit the endpoint? (What if someone sends either of these formats to an older endpoint that doesn't understand the new grammar?)

@fedetorres93
Copy link

is there content negotiation with calls to pushgateway, or do people just hit the endpoint? (What if someone sends either of these formats to an older endpoint that doesn't understand the new grammar?)

There's no content negotiation while pushing metrics into the Pushgateway. The URL is parsed as is, splitting the labels with / as the delimiter, then checking if the name contains the @base64 suffix and, once the suffix is trimmed, checking if the label name itself is valid. So if someone sends either of these formats to an older endpoint, the push would be rejected

@beorn7
Copy link
Member

beorn7 commented Aug 29, 2024

So yeah, the base64 trick was introduced prior to 1.0.0. So we never had the situation where a PGW of v1+ would be confronted with a newer version of the push protocol it couldn't understand.

Content negotiation with push is hard, because you need multiple round trips for it.

Ideally, we find a solution that still works for older v1+ PGWs transparently. But I have currently no idea how to do that.

Or we do something that newer pushers have to retry with a legacy version of the names if they get an error back…

@beorn7
Copy link
Member

beorn7 commented Aug 29, 2024

Idea: Let's just use the already proposed bespoke text escaping instead of using base64 escaping. A PGW with UTF-8 support enabled will simply unescape.

Pros:

  • It's just another escaping mechanism.
  • Better readable than base64 (only special characters appear "weird").
  • Backwards compatible (as was the intention when designing this escaping mechanism).
  • No new syntax required in the URL.

Cons:

  • Now we could have two different escaping mechanisms in the same URL (if a name and a value contains special characters). But this only matters if a human looks at the URL, or a URL has to be hand-crafted.
  • In the unlikely case of an existing name that looks like an escaped name (i.e. the original name starts with U__ and happens to have a valid escape sequences), we would falsely unescape.

@beorn7
Copy link
Member

beorn7 commented Aug 29, 2024

To not introduce a (technically) breaking change, we can make the UTF-8 support opt-in for now (finally feature flags in PGW! ;).

@ywwg
Copy link
Member Author

ywwg commented Aug 29, 2024

I think it is fine to look for "U__" and unescape automatically if we see it. We decided in the design that it is unlikely people will name metrics like that, and if they do and the unescaping fails, we just return the name as-is anyway

@fedetorres93
Copy link

Yeah, I think the escaping alternative makes more sense. +1 on the opt-in. I'll start looking into it, thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

No branches or pull requests

3 participants