Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent encoding spec and : and / #39

Open
Tracked by #376
jdillon opened this issue Apr 6, 2018 · 33 comments
Open
Tracked by #376

Percent encoding spec and : and / #39

jdillon opened this issue Apr 6, 2018 · 33 comments
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL encoding
Milestone

Comments

@jdillon
Copy link

jdillon commented Apr 6, 2018

The notes in the specification about percent encoding of ":" are a bit confusing:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere
the ':' type separator does not need to and must NOT be encoded. It is unambiguous unencoded everywhere

It seems like these 2 contradict either other. The former indicates that ":" may need to be encoded elsewhere. The later indicates that ":" is "unambiguous unencoded everywhere".

Similarly the qualifier component documentation says:

A value must be must be a percent-encoded string

... and does not mention anything about "/". But the test-suite-data.json references canonical_purl representations like repository_url=repo.spring.io/release.

And the percent-encoding docs state:

the '/' used as namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous unencoded everywhere

My interpretation of this boils down to... "/" is never encoded, but the language in the parts of the specification are unclear. The name, namespace and subpath parts are clear wrt to "/", which leaves the qualifier and version bits as vague as to if "/" is supposed to be percent-encoded or not.

@stevespringett
Copy link
Member

stevespringett commented Aug 16, 2018

With reference to:

pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io

It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written).

Spec states:

  • The version is prefixed by a '@' separator when not empty
  • A version must be a percent-encoded string
  • the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere

Therefore, the canonicalized purl should be changed to:

pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io

We need clarification on this.

@stevespringett
Copy link
Member

stevespringett commented Aug 16, 2018

With reference to:

pkg:Maven/org.apache.xmlgraphics/[email protected]?repositorY_url=repo.spring.io/release&classifier=sources

It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written).

Spec states:

  • the '=' qualifiers key/value separator must NOT be encoded
  • the '/' used as type/namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous unencoded everywhere
  • A value must be must be a percent-encoded string

Therefore, the canonicalized purl should be changed to:

pkg:maven/org.apache.xmlgraphics/[email protected]?classifier=sources&repository_url=repo.spring.io%2Frelease

We need clarification on this.

@stevespringett
Copy link
Member

Based on my understanding of the spec, there are a total of three test cases that violate the spec as its currently written. The Rust implementation has went ahead and changed its test cases, which I'm in agreement with. But we need clarification as I've run into a situation with the Java implementation which requires additional encoding, but the additional encoding breaks the above test cases. However, adopting the testcase changes made by the Rust impl makes the Java impl happy as well.

stevespringett added a commit to package-url/packageurl-java that referenced this issue Aug 17, 2018
…ing decoding to version, qualifier values, and subpath when parsing a purl string. Added missing encoding to name, version, qualifier value when canonicalizing. These corrections break the testsuites. Corrected testsuites (now varies from spec testsuites). package-url/purl-spec#39 . Added additional validation as part of  package-url/purl-spec#46
@stevespringett
Copy link
Member

@pombredanne this ticket should be resolved prior to 1.0 #50

@gernot-h
Copy link
Contributor

gernot-h commented Apr 3, 2019

While I also stumbled hard about this and also would welcome a clearer wording, I see no contradiction here and would consider current testcases correct.

First statement summarizes where the #', '?', '@' and ':' characters MUST NOT be encoded, but it makes no definite statement about where they MUST be encoded. This is further clarified in the following statement which clearly says that it is "unambiguous unencoded everywhere".

To keep purls readable for humans, I think encoding shall be avoided where possible. Think about Debian versions - something like pkg:deb/debian/file@1%3A5.30-1%2Bdeb9u2?arch=amd64 looks quite unreadable to me.

@pombredanne
Copy link
Member

@stevespringett
re

Therefore, the canonicalized purl should be changed to:
pkg:maven/org.apache.xmlgraphics/[email protected]?classifier=sources&repository_url=repo.spring.io%2Frelease
We need clarification on this.

let's update the tests to match the spec and ...

@gernot-h very good point. In fact on common packages such as scoped npms that have a namespace that has an @ sign prefix the encoding was added because it was encoded in the public npm registry. But this is no longer the case and this is not needed as there are never any ambiguities there.
And your debian example is also telling.
So let's devise the bare absolute minimum encoding that would be needed such that Package URLs stay non ambiguous and are the most readable in the majority of the cases.

@gernot-h
Copy link
Contributor

gernot-h commented Apr 9, 2019

@pombredanne Great, anything I can do to help sorting this out? Prepare some pull requests? (I will be on vacation for the next 1.5 weeks, but be interested to help afterwards)

Regarding the confusion about the two statements regarding ":", what about rephrasing the first statement to:

From:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere

To:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. Some of them need to be encoded elsewhere as specified in the rules below.

@stevespringett stevespringett added the PURL core specification Format and syntax that define PURL (excludes PURL type definitions) label Apr 1, 2020
@stevespringett stevespringett added this to the 1.0 milestone Apr 1, 2020
@matt-phylum
Copy link
Contributor

The WHATWG URL spec has a great section on percent encoding that defines a set of character classes and their exact encoding rules. It would be great if the PURL spec defined its percent encoding rules in relation to those. For example, "the PURL path percent-encode set is the WHATWG URL path percent-encode set and U+0040 (@)".

@michaelbprice
Copy link

My interpretation of the rules when putting together a PR (#245) for a vcpkg purl type, was that any value in a key=value qualifier should be percent-encoded, and the tests I've added there reflect that decision. It tests that after parsing, the resulting values are not percent-encoded. IMO, readability of purl strings themselves are secondary to unambiguous and fast parsing by tools (which then might have a presentation layer that helps users understand the data).

@matt-phylum
Copy link
Contributor

This comment is about qualifiers only. The PURL spec is fairly clear (IMO) that colons should not be encoded in version numbers, but could be more explicit.

This ticket was originally about qualifiers and I've found a potentially serious problem with the way qualifiers are encoded and decoded.

Neither PURL nor the WHATWG or RFC URL specs say that : or / need to be encoded when used in a query string. Any character is allowed to be encoded, so technically it's allowable, even if not canonical, and the qualifier should parse correctly either way.

However, PURL qualifiers look suspiciously like x-www-form-urlencoded parameters, and people may be tempted to use the x-www-form-urlencoded rules for encoding and decoding qualifiers. The WHATWG x-www-form-url-encoded spec does not require them to be encoded when parsing, but says that : and / do need to be encoded when serializing (as well as many other characters that the PURL spec never mentions like ().

If the set of encoded characters were the only difference, it would be fine if some implementations were using x-www-form-urlencoded and some were not, but there is one critical difference with x-www-form-urlencoded: + is replaced with during parsing, so it must be encoded as %2B when serializing. The current PURL spec never says anything about special handling of +. If some PURL implementations implement qualifiers using x-www-form-urlencoded and some implement qualifiers exactly as written in the current PURL spec, the implementations will interoperate just fine most of the time, but for certain cases the meaning of the qualifier changes when passing from one implementation to the other.

  • Serializing git+https:// using a PURL qualifier serializer and then parsing using an x-www-form-urlencoded parser changes the value to git https://
  • Serializing my project-1.0.0.tar.gz using a x-www-form-urlencoded serializer that replaces spaces with + (not all of them do) and then parsing using a PURL qualifier parser changes the value to my+project-1.0.0.tar.gz.

Besides needing to be more explicit about what characters need to be encoded in what parts of the PURL, to avoid PURLs that have different meanings to different implementations, the spec needs to clearly specify whether the qualifiers are x-www-form-urlencoded or if they are a simpler format that is like x-www-form-urlencoded but not.

@jdalton
Copy link

jdalton commented Aug 20, 2024

FYI package-url/packageurl-js has taken the following approach in their v2.0.0 release.
(From comment package-url/packageurl-js#43 (comment)):

This is handled in package-url/packageurl-js#73 by using URLSearchParams to encode and then turning + into %20 for better portability. I sided with the Rust implementation.

Also leveraging standard URLSearchParams. Deferring to standard encoders like URLSearchParams and encodeURIComponent for base encoding and then applying tweaks allows for less chances of mistakes (I trust standard implementations over myself).

@matt-phylum
Copy link
Contributor

It's inconsistent in packageurl-js 2.0.0. Plus signs parse as plus signs, except for in qualifier values where they parse as spaces.

@jdalton
Copy link

jdalton commented Aug 20, 2024

@matt-phylum

Plus signs parse as plus signs, except for in qualifier values where they parse as spaces.

URLSearchParams (docs) are qualifier specific. I was referring to how qualifiers are encoded. I'm saying that in v2+ we encode the slashes in the qualifier as Rust does. We also follow Rust in that we do NOT encode spaces as +. The JS implementation is aligned with Rust on encoding of slashes as %2F and spaces as %20 in encoded qualifiers.

johnmhoran added a commit that referenced this issue Jan 29, 2025
Reference: #39 (comment)
Reference: #376
Signed-off-by: John M. Horan [email protected]
@jkugler
Copy link

jkugler commented Feb 5, 2025

We are also hitting this issue, and I'd like to help move it forward. Please see package-url/packageurl-python#152 (comment) and package-url/packageurl-python#152 (comment)

The spec says namespace (segments), names, versions, qualifier values, and subpaths must be percent-encoded. But, the test data contradict this (see the linked comments).

How can I help drive this to completion?

@jdalton
Copy link

jdalton commented Feb 5, 2025

@jkugler To help please review and provide any feedback on #361.

@jkugler
Copy link

jkugler commented Feb 5, 2025

@jdalton: I took a look. Looks good to me, but in my case, I'm dealing with : in name and namespace. I don't know if #361 is the right place, but a test case a : in the name or namespace fields would be a good addition.

johnmhoran added a commit that referenced this issue Feb 6, 2025
Reference: #39
Reference: #376
Signed-off-by: John M. Horan [email protected]
johnmhoran added a commit that referenced this issue Feb 6, 2025
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <[email protected]>
@jdalton
Copy link

jdalton commented Feb 6, 2025

@jkugler

but in my case, I'm dealing with : in name and namespace

Which purl type are you working with?

@jkugler
Copy link

jkugler commented Feb 6, 2025

@jdalton In this case, it's a nuget package, but it could be anything. It's not really determined by type.

@matt-phylum
Copy link
Contributor

In this case it's a bogus NuGet package because of a Syft bug.

@jkugler
Copy link

jkugler commented Feb 6, 2025

PR submitted: package-url/packageurl-python#178

Just waiting for a review and release. :)

@jdalton
Copy link

jdalton commented Feb 7, 2025

I don’t necessarily think the spec should be changed for a Syft bug but there could be a nuget specific purl type normalization around it. Can you link to the Syft/nuget issue. I know in the JS ecosystem namespaces begin with @ which is represented as %40 but we’ll accept either and convert to %40. I could see : similarly escaped in the case of nuget

@matt-phylum
Copy link
Contributor

I don't think a spec change is required, but there was some related confusion about what the spec says about encoding.

There is a bug in packageurl-python where if the name component of a PURL contains a colon, an error occurs.

The Syft bug is described here: package-url/packageurl-python#152 (comment)

@jdalton
Copy link

jdalton commented Feb 7, 2025

Ah, okay! Thanks @jkugler @matt-phylum! Looks like the JS implementation gracefully handles the :'s and encodes them when toString()'ing the purl object.

@matt-phylum
Copy link
Contributor

matt-phylum commented Feb 7, 2025

and encodes them when toString()'ing the purl object.

That was actually the spec confusion. By my interpretation of the spec it is incorrect to encode them when building a PURL in canonical form because they are unambiguous and the spec says to encode characters that are not ascii or when they would be ambiguous.

Edit: Because of the cross-implementation testing I've done, I know a lot of implementations differ here, even compared to just the limited examples in the test suite in this repository, and I have avoided creating issues for these variances as long as the required characters are encoded and resultant PURL can be decoded correctly. The idea of having exactly one canonical form, while convenient for SBOM tools etc so not everything needs to understand PURL to determine PURL equality, may be too much. I'd rather say that the implementation needs to be able to handle either : or %3A in this position because that's how the parsing spec says it should work than say that every implementation needs to put : and not %3A in this position.

@jdalton
Copy link

jdalton commented Feb 7, 2025

@matt-phylum

That was actually the spec confusion.

Minor correction on my part, that was encoding the "name" component. For the "namespace" component the JS implementation converts "%3A" to ":". That also feeds into your point of inconsistency.

I'd rather say that the implementation needs to be able to handle either : or %3A in this position because that's how the parsing spec says it should work than say that every implementation needs to put : and not %3A in this position.

Sounds reasonable.

johnmhoran added a commit that referenced this issue Feb 9, 2025
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Feb 9, 2025
- Also updated the canonical_purl value for the new colon ':' test object
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Feb 9, 2025
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Feb 9, 2025
Reference: #376
Reference: #39
Signed-off-by: John M. Horan <[email protected]>
@jkugler
Copy link

jkugler commented Feb 9, 2025

Correct, I wasn't advocating for changing the spec, although some clarification does seem needed, so glad to see that is happening. But, packageurl-python does need a fix to properly decode names and namespaces with colons in them, so I opened package-url/packageurl-python#178

@johnmhoran
Copy link
Member

Looks to me like the core spec requires the entirety of a name, and the contents of all namespace segments, to be percent-encoded, colons included.

  • Each namespace segment must be a percent-encoded string
  • A name must be a percent-encoded string

If so, tests that don't reflect this (ditto for some tests with unencoded ':' and/or '/' in the qualifiers values) need to be corrected. I anticipate we'll address all of these as we work our way through the issue/PR backlog. Please check the recently-opened issues and PRs for a few component-related entries and share your thoughts!

@matt-phylum
Copy link
Contributor

Looks to me like the core spec requires the entirety of a name, and the contents of all namespace segments, to be percent-encoded, colons included.

* `Each namespace segment must be a percent-encoded string`

* `A name must be a percent-encoded string`

: is a percent-encoded string representing :. The tests are correct.

althonos/packageurl.rs, anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, maennchen/purl, package-url/packageurl-dotnet, package-url/packageurl-php, package-url/packageurl-python, package-url/packageurl-swift, phylum-dev/purl, sonatype/package-url-java (10/14) agree that : is the correct encoding of :.

Only package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-ruby (4/14) unnecessarily encode the colon as %3A.

pombredanne added a commit that referenced this issue Feb 12, 2025
Streamline the definition of the scheme,
including encoding of colon between scheme and type.

Add test for colon between scheme and type

Move non-core scheme realted discussion to the new FAQ

Suggested-by: @gernot-h
Reference: #376
Reference: #39
See-also: #389
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: John M. Horan <[email protected]>
@pombredanne
Copy link
Member

pombredanne commented Feb 12, 2025

althonos/packageurl.rs, anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, maennchen/purl, package-url/packageurl-dotnet, package-url/packageurl-php, package-url/packageurl-python, package-url/packageurl-swift, phylum-dev/purl, sonatype/package-url-java (10/14) agree that : is the correct encoding of :.
Only package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-ruby (4/14) unnecessarily encode the colon as %3A.

@matt-phylum Thanks. Just curious: do you have a script to automate all this? I think you may have mentioned it in the past.

@pombredanne
Copy link
Member

I am sorry for the confusion here. I still need to think deep about it, but we may have also some restrictions with using unencoded colons in names and namespaces, and possibly more places.

@matt-phylum
Copy link
Contributor

@matt-phylum Thanks. Just curious: do you have a script to automate all this? I think you may have mentioned it in the past.

Yes. https://github.com/phylum-dev/purl-survey has functions for parsing PURLs into their parts and formatting the parts into PURLs for all of these implementations.

@jdalton
Copy link

jdalton commented Feb 13, 2025

Thanks @matt-phylum! I'll patch packageurl-js to not encode the :.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ecma specification Work on the core specification PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL encoding
Projects
None yet
Development

No branches or pull requests

9 participants