Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-46776: Propagate fragments with ResourcePath.join #98

Merged
merged 4 commits into from
Oct 30, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions doc/changes/DM-46776.feature.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
* Modified ``ResourcePath.join()`` to propagate fragments from the given path to the joined path.
This now means that if the ``ResourcePath`` constructor finds a fragment that fragment will be used.
Previously the fragment was dropped if a ``ResourcePath`` was given that had a fragment, or the fragment was treated as part of the filename if a plain string was given.
This change means that filenames can no longer include ``#`` characters.
* Added new ``ResourcePath.unquoted_fragment`` property to get the unquoted fragment.
21 changes: 15 additions & 6 deletions python/lsst/resources/_resourcePath.py
Original file line number Diff line number Diff line change
Expand Up @@ -393,9 +393,14 @@ def is_root(self) -> bool:

@property
def fragment(self) -> str:
"""Return the fragment component of the URI."""
"""Return the fragment component of the URI. May be quoted."""
return self._uri.fragment

@property
def unquoted_fragment(self) -> str:
"""Return unquoted fragment."""
return urllib.parse.unquote(self.fragment)

@property
def params(self) -> str:
"""Return any parameters included in the URI."""
Expand Down Expand Up @@ -693,7 +698,7 @@ def join(
quoted depending on the associated URI scheme. If the path looks
like a URI referring to an absolute location, it will be returned
directly (matching the behavior of `os.path.join`). It can
also be a `ResourcePath`.
also be a `ResourcePath`. Fragments are propagated.
isTemporary : `bool`, optional
Indicate that the resulting URI represents a temporary resource.
Default is ``self.isTemporary``.
Expand Down Expand Up @@ -723,6 +728,9 @@ def join(
refer to a file. Use `updatedFile` if the file is to be
replaced.

If an unquoted ``#`` is included in the path it is assumed to be
referring to a fragment and not part of the file name.

Raises
------
ValueError
Expand Down Expand Up @@ -757,10 +765,10 @@ def join(
# Absolute URI so return it directly.
return path_uri

# If this was originally a ResourcePath extract the unquoted path from
# it. Otherwise we use the string we were given to allow "#" to appear
# in the filename if given as a plain string.
if not isinstance(path, str):
# We want to propagate fragments to the joined path and we rely on
# the ResourcePath parser to find these fragments for us even in plain
# strings. Must assume there are no `#` characters in filenames.
if not isinstance(path, str) or path_uri.fragment:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little sketched out by this -- it seems surprising that if you give join() a string filename it doesn't use that verbatim as the filename. It works within the specific context of how the datastore is set up, but breaks more general usage. I also wonder if there are some (probably slight) security implications here, where someone could smuggle in a ..#something or similar

It seems to me that callers could use .replace(fragment=...) if they want to add a fragment, or explicitly convert the string to a ResourcePath before calling join() if their intention is to treat it as a URI and not a path fragment.

Copy link
Member Author

@timj timj Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the difficulty here is that before this change:

>>> from lsst.resources import ResourcePath
>>> ResourcePath("abc#def.txt")._uri
ParseResult(scheme='file', netloc='', path='/Users/timj/work/lsstsw/build/resources/abc', params='', query='', fragment='def.txt')
>>> ResourcePath("").join("abc#def.txt")._uri
ParseResult(scheme='file', netloc='', path='/Users/timj/work/lsstsw/build/resources/abc%23def.txt', params='', query='', fragment='')

and with this change:

>>> ResourcePath("abc#def.txt")._uri
ParseResult(scheme='file', netloc='', path='/Users/timj/work/lsstsw/build/resources/abc', params='', query='', fragment='def.txt')
>>> ResourcePath("").join("abc#def.txt")._uri
ParseResult(scheme='file', netloc='', path='/Users/timj/work/lsstsw/build/resources/abc', params='', query='', fragment='def.txt')

You get a consistent answer. These relative file paths can come from users specifying subsets of pipelines or from relative path entries in datastore.

We are assuming in other places that a # in the file part of a schemeless URI means fragment and that no-one using this is going to be putting a # in the filename itself. The question is whether we break that assumption elsewhere or try to make that assumption consistent.

Copy link
Member Author

@timj timj Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, I think there is one other potential problem with "# in filename" at the moment.

>>> from lsst.resources import ResourcePath
>>> x = ResourcePath("a/bc#de.txt", forceAbsolute=False)
>>> y = ResourcePath("a", forceAbsolute=False)
>>> str(x)
'a/bc#de.txt'
>>> str(y.join("bc#de.txt"))
'a/bc#de.txt'

And for before and after this branch they stringify the same even though before one of those was a filename in the internal _uri representation and the other was a fragment. With this branch they are both fragments.

path = path_uri.unquoted_path

# Might need to quote the path.
Expand All @@ -780,6 +788,7 @@ def join(
path=newpath,
forceDirectory=forceDirectory,
isTemporary=isTemporary,
fragment=path_uri.fragment,
)

def relative_to(self, other: ResourcePath) -> str | None:
Expand Down
2 changes: 1 addition & 1 deletion python/lsst/resources/http.py
Original file line number Diff line number Diff line change
Expand Up @@ -1390,7 +1390,7 @@
try:
response_body = json.loads(resp.text)
if "macaroon" in response_body:
return f"{self}?authz={response_body['macaroon']}"
return str(self.replace(query=f"authz={response_body['macaroon']}"))

Check warning on line 1393 in python/lsst/resources/http.py

View check run for this annotation

Codecov / codecov/patch

python/lsst/resources/http.py#L1393

Added line #L1393 was not covered by tests
else:
raise ValueError(f"could not retrieve macaroon for URL {self}")
except json.JSONDecodeError:
Expand Down
6 changes: 5 additions & 1 deletion python/lsst/resources/s3.py
Original file line number Diff line number Diff line change
Expand Up @@ -582,8 +582,12 @@ def generate_presigned_put_url(self, *, expiration_time_seconds: int) -> str:
return self._generate_presigned_url("put_object", expiration_time_seconds)

def _generate_presigned_url(self, method: str, expiration_time_seconds: int) -> str:
return self.client.generate_presigned_url(
url = self.client.generate_presigned_url(
method,
Params={"Bucket": self._bucket, "Key": self.relativeToPathRoot},
ExpiresIn=expiration_time_seconds,
)
if self.fragment:
resource = ResourcePath(url)
url = str(resource.replace(fragment=self.fragment))
return url
21 changes: 20 additions & 1 deletion python/lsst/resources/tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -444,6 +444,12 @@ def test_escapes(self) -> None:
uri = ResourcePath(hash_path)
self.assertEqual(uri.ospath, hash_path[:hpos])
self.assertEqual(uri.fragment, hash_path[hpos + 1 :])
self.assertEqual(uri.unquoted_fragment, uri.fragment)

# Fragments can be quoted, although this is not enforced anywhere.
with_frag = ResourcePath(self._make_uri("a/b.txt#" + urllib.parse.quote("zip-path=ingést")))
self.assertEqual(with_frag.fragment, "zip-path%3Ding%C3%A9st")
self.assertEqual(with_frag.unquoted_fragment, "zip-path=ingést")

def test_hash(self) -> None:
"""Test that we can store URIs in sets and as keys."""
Expand Down Expand Up @@ -474,9 +480,22 @@ def test_join(self) -> None:
self.assertFalse(up_relative.isdir())
self.assertEqual(up_relative.geturl(), f"{root_str}b/c.txt")

# Check that fragment is passed through join (simple unquoted case).
fnew3 = root.join("a/b.txt#fragment")
self.assertEqual(fnew3.fragment, "fragment")
self.assertEqual(fnew3.basename(), "b.txt", msg=f"Got: {fnew3._uri}")

# Join a resource path.
subpath = ResourcePath("a/b.txt#fragment2", forceAbsolute=False, forceDirectory=False)
fnew3 = root.join(subpath)
self.assertEqual(fnew3.fragment, "fragment2")
self.assertEqual(fnew3.basename(), "b.txt", msg=f"Got: {fnew3._uri}")

# Quoted string with fragment.
quote_example = "hsc/payload/b&c.t@x#t"
needs_quote = root.join(quote_example)
self.assertEqual(needs_quote.unquoted_path, "/" + quote_example)
self.assertEqual(needs_quote.unquoted_path, "/" + quote_example[:-2])
self.assertEqual(needs_quote.fragment, "t")

other = ResourcePath(f"{self.root}test.txt")
self.assertEqual(root.join(other), other)
Expand Down
4 changes: 2 additions & 2 deletions tests/test_http.py
Original file line number Diff line number Diff line change
Expand Up @@ -469,9 +469,9 @@ def test_plain_http_url_signing(self):

# Plain HTTP URLs are already readable without authentication, so
# generating a pre-signed URL is a no-op.
path = ResourcePath("http://nonwebdav.test/file")
path = ResourcePath("http://nonwebdav.test/file#frag")
self.assertEqual(
path.generate_presigned_get_url(expiration_time_seconds=300), "http://nonwebdav.test/file"
path.generate_presigned_get_url(expiration_time_seconds=300), "http://nonwebdav.test/file#frag"
)

# Writing to an arbitrary plain HTTP URL is unlikely to work, so we
Expand Down
9 changes: 9 additions & 0 deletions tests/test_s3.py
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,15 @@ def _test_url_signing_case(self, filename: str, test_data: bytes):
get_url = s3_path.generate_presigned_get_url(expiration_time_seconds=3600)
self._check_presigned_url(get_url, 3600)

# Check that fragments are retained.
s3_path = s3_path.replace(fragment="zip-path=X")
put_url = s3_path.generate_presigned_put_url(expiration_time_seconds=1800)
self.assertEqual(ResourcePath(put_url).fragment, "zip-path=X")
self._check_presigned_url(put_url, 1800)
get_url = s3_path.generate_presigned_get_url(expiration_time_seconds=3600)
self.assertEqual(ResourcePath(get_url).fragment, "zip-path=X")
self._check_presigned_url(get_url, 3600)

# Moto monkeypatches the 'requests' library to mock access to presigned
# URLs, so we are able to use HttpResourcePath to access the URLs in
# this test.
Expand Down
Loading