-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Percent encoding spec and : and / #39
Comments
With reference to:
It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written). Spec states:
Therefore, the canonicalized purl should be changed to:
We need clarification on this. |
With reference to:
It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written). Spec states:
Therefore, the canonicalized purl should be changed to:
We need clarification on this. |
Based on my understanding of the spec, there are a total of three test cases that violate the spec as its currently written. The Rust implementation has went ahead and changed its test cases, which I'm in agreement with. But we need clarification as I've run into a situation with the Java implementation which requires additional encoding, but the additional encoding breaks the above test cases. However, adopting the testcase changes made by the Rust impl makes the Java impl happy as well. |
…ing decoding to version, qualifier values, and subpath when parsing a purl string. Added missing encoding to name, version, qualifier value when canonicalizing. These corrections break the testsuites. Corrected testsuites (now varies from spec testsuites). package-url/purl-spec#39 . Added additional validation as part of package-url/purl-spec#46
@pombredanne this ticket should be resolved prior to 1.0 #50 |
While I also stumbled hard about this and also would welcome a clearer wording, I see no contradiction here and would consider current testcases correct. First statement summarizes where the #', '?', '@' and ':' characters MUST NOT be encoded, but it makes no definite statement about where they MUST be encoded. This is further clarified in the following statement which clearly says that it is "unambiguous unencoded everywhere". To keep purls readable for humans, I think encoding shall be avoided where possible. Think about Debian versions - something like pkg:deb/debian/file@1%3A5.30-1%2Bdeb9u2?arch=amd64 looks quite unreadable to me. |
let's update the tests to match the spec and ... @gernot-h very good point. In fact on common packages such as scoped npms that have a namespace that has an |
@pombredanne Great, anything I can do to help sorting this out? Prepare some pull requests? (I will be on vacation for the next 1.5 weeks, but be interested to help afterwards) Regarding the confusion about the two statements regarding ":", what about rephrasing the first statement to: From: the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere To: the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. Some of them need to be encoded elsewhere as specified in the rules below. |
The WHATWG URL spec has a great section on percent encoding that defines a set of character classes and their exact encoding rules. It would be great if the PURL spec defined its percent encoding rules in relation to those. For example, "the PURL path percent-encode set is the WHATWG URL path percent-encode set and U+0040 (@)". |
My interpretation of the rules when putting together a PR (#245) for a vcpkg purl type, was that any |
This comment is about qualifiers only. The PURL spec is fairly clear (IMO) that colons should not be encoded in version numbers, but could be more explicit. This ticket was originally about qualifiers and I've found a potentially serious problem with the way qualifiers are encoded and decoded. Neither PURL nor the WHATWG or RFC URL specs say that However, PURL qualifiers look suspiciously like x-www-form-urlencoded parameters, and people may be tempted to use the x-www-form-urlencoded rules for encoding and decoding qualifiers. The WHATWG x-www-form-url-encoded spec does not require them to be encoded when parsing, but says that If the set of encoded characters were the only difference, it would be fine if some implementations were using x-www-form-urlencoded and some were not, but there is one critical difference with x-www-form-urlencoded:
Besides needing to be more explicit about what characters need to be encoded in what parts of the PURL, to avoid PURLs that have different meanings to different implementations, the spec needs to clearly specify whether the qualifiers are x-www-form-urlencoded or if they are a simpler format that is like x-www-form-urlencoded but not. |
* Add MLflow and Hugging Face model registries * clean up * Address PR comments * Ensure repository_url is always present
FYI
|
It's inconsistent in packageurl-js 2.0.0. Plus signs parse as plus signs, except for in qualifier values where they parse as spaces. |
|
Reference: #39 (comment) Reference: #376 Signed-off-by: John M. Horan [email protected]
We are also hitting this issue, and I'd like to help move it forward. Please see package-url/packageurl-python#152 (comment) and package-url/packageurl-python#152 (comment) The spec says namespace (segments), names, versions, qualifier values, and subpaths must be percent-encoded. But, the test data contradict this (see the linked comments). How can I help drive this to completion? |
Reference: #39 Reference: #376 Signed-off-by: John M. Horan [email protected]
Reference: #376 Reference: #39 Signed-off-by: John M. Horan <[email protected]>
@jdalton In this case, it's a nuget package, but it could be anything. It's not really determined by type. |
In this case it's a bogus NuGet package because of a Syft bug. |
PR submitted: package-url/packageurl-python#178 Just waiting for a review and release. :) |
I don’t necessarily think the spec should be changed for a Syft bug but there could be a nuget specific purl type normalization around it. Can you link to the Syft/nuget issue. I know in the JS ecosystem namespaces begin with @ which is represented as %40 but we’ll accept either and convert to %40. I could see : similarly escaped in the case of nuget |
I don't think a spec change is required, but there was some related confusion about what the spec says about encoding. There is a bug in packageurl-python where if the name component of a PURL contains a colon, an error occurs. The Syft bug is described here: package-url/packageurl-python#152 (comment) |
Ah, okay! Thanks @jkugler @matt-phylum! Looks like the JS implementation gracefully handles the :'s and encodes them when |
That was actually the spec confusion. By my interpretation of the spec it is incorrect to encode them when building a PURL in canonical form because they are unambiguous and the spec says to encode characters that are not ascii or when they would be ambiguous. Edit: Because of the cross-implementation testing I've done, I know a lot of implementations differ here, even compared to just the limited examples in the test suite in this repository, and I have avoided creating issues for these variances as long as the required characters are encoded and resultant PURL can be decoded correctly. The idea of having exactly one canonical form, while convenient for SBOM tools etc so not everything needs to understand PURL to determine PURL equality, may be too much. I'd rather say that the implementation needs to be able to handle either |
Minor correction on my part, that was encoding the "name" component. For the "namespace" component the JS implementation converts "%3A" to ":". That also feeds into your point of inconsistency.
Sounds reasonable. |
Reference: #376 Reference: #39 Signed-off-by: John M. Horan <[email protected]>
- Also updated the canonical_purl value for the new colon ':' test object Reference: #376 Reference: #39 Signed-off-by: John M. Horan <[email protected]>
Reference: #376 Reference: #39 Signed-off-by: John M. Horan <[email protected]>
Reference: #376 Reference: #39 Signed-off-by: John M. Horan <[email protected]>
Correct, I wasn't advocating for changing the spec, although some clarification does seem needed, so glad to see that is happening. But, packageurl-python does need a fix to properly decode names and namespaces with colons in them, so I opened package-url/packageurl-python#178 |
Looks to me like the core spec requires the entirety of a
If so, tests that don't reflect this (ditto for some tests with unencoded ':' and/or '/' in the |
althonos/packageurl.rs, anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, maennchen/purl, package-url/packageurl-dotnet, package-url/packageurl-php, package-url/packageurl-python, package-url/packageurl-swift, phylum-dev/purl, sonatype/package-url-java (10/14) agree that Only package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-ruby (4/14) unnecessarily encode the colon as |
Streamline the definition of the scheme, including encoding of colon between scheme and type. Add test for colon between scheme and type Move non-core scheme realted discussion to the new FAQ Suggested-by: @gernot-h Reference: #376 Reference: #39 See-also: #389 Signed-off-by: Philippe Ombredanne <[email protected]> Signed-off-by: John M. Horan <[email protected]>
@matt-phylum Thanks. Just curious: do you have a script to automate all this? I think you may have mentioned it in the past. |
I am sorry for the confusion here. I still need to think deep about it, but we may have also some restrictions with using unencoded colons in names and namespaces, and possibly more places. |
Yes. https://github.com/phylum-dev/purl-survey has functions for parsing PURLs into their parts and formatting the parts into PURLs for all of these implementations. |
Thanks @matt-phylum! I'll patch packageurl-js to not encode the |
The notes in the specification about percent encoding of ":" are a bit confusing:
It seems like these 2 contradict either other. The former indicates that ":" may need to be encoded elsewhere. The later indicates that ":" is "unambiguous unencoded everywhere".
Similarly the qualifier component documentation says:
... and does not mention anything about "/". But the test-suite-data.json references canonical_purl representations like
repository_url=repo.spring.io/release
.And the percent-encoding docs state:
My interpretation of this boils down to... "/" is never encoded, but the language in the parts of the specification are unclear. The name, namespace and subpath parts are clear wrt to "/", which leaves the qualifier and version bits as vague as to if "/" is supposed to be percent-encoded or not.
The text was updated successfully, but these errors were encountered: