-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid destination directory creation when using glob double asterisk(**) #1321
Comments
Can you please check against https://filesystem-spec.readthedocs.io/en/latest/copying.html and see if your cases violate any of the stated desired behaviours? I would note that it's always possible to go your own way, by performing the glob first and then a list comprehension to make the set of paths you wish to copy to, since Is this the same as #1315 ? |
@martindurant Thank you for you response! I checked the expected behaviors you mentioned but I can't find a use case with the double asterisk (and with or without file extension suffix). After some digging, it seems related to the Here are some examples: from fsspec.implementations.memory import MemoryFileSystem
fs = MemoryFileSystem()
fs.touch(path="/test/data/a.txt")
fs.touch(path="/test/data/b/c.txt")
fs.touch(path="/test/data/d/e/f.txt")
# Double asterisk without trailing slashes and without file extension suffix
fs.copy(path1="/test/data/**", path2="/test/out/")
# Output: ['/test/out/data/a.txt', '/test/out/data/b/c.txt', '/test/out/data/d/e/f.txt'] (data folder not expected)
# Double asterisk with trailing slashes and without file extension suffix
fs.copy(path1="/test/data/**/", path2="/test/out/")
# Output: ['/test/out/a.txt', '/test/out/b/c.txt', '/test/out/d/e/f.txt'] (expected)
# Double asterisk without trailing slashes and with file extension suffix
fs.copy(path1="/test/data/**.txt", path2="/test/out/")
# Output: ['/test/out/data/a.txt', '/test/out/data/b/c.txt', '/test/out/data/d/e/f.txt'] (data folder not expected)
# Double asterisk with trailing slashes and with file extension suffix
fs.copy(path1="/test/data/**/*.txt", path2="/test/out/")
# Output: ['/test/out/data/b/c.txt', '/test/out/data/d/e/f.txt'] (data folder not expected and file a.txt missing, possibly related to the issue you linked)
# Double asterisk without trailing slashes, with file extension suffix and with `exists` in `other_paths` forced to False
fs.copy(path1="/test/data/**.txt", path2="/test/out/")
# Output: ['/test/out/a.txt', '/test/out/b/c.txt', '/test/out/d/e/f.txt'] (expected) Do you know which use case requires to have I would be glad to contribute if this is something you would like to support! |
@john-jam Thanks for looking at this. I think adding support for For testing, I would be inclined to modify for target_slash in [False, True]:
...
fs.cp(fs_join(source, "subdir", "*"), t) to something like from itertools import product
for glob, target_slash in product(["*", "**"], [False, True]):
....
fs.cp(fs_join(source, "subdir", glob), t) and then run the test suite as normal to see what happens. If you'd like to try something like this then that would be great. If you get into trouble then I'll happily help out. |
@ianthomas23 @martindurant I am currently preparing a PR to fix those behaviors and I have something working currently but I have some question for you to clarify the expectations with double asterisks What are the relations between
|
You could imagine it as a two-step process, that first you get a list of paths from the glob, and then all of the files within each of those when recursive.
Yes
We very likely haven't tested this case. Glob with ** works by listing all paths within the highest common root. For "/path/part/**/more/st*uff" this would be "/path/part". Maxdepth becomes the number of dir levels past that point which are allowed; so in this example, the minimum maxdepth that returns anything is 2, e.g., some file like "/path/path/more/st88uff". I think any other meaning for maxdepth is ambiguous as you point out. |
I see. So when using double asterisks
I'll add some test cases for this one because I can see some unexpected behaviors when having suffixes.
I don't get why it would be 2 in your example? Is it 2 because you forgot to mention that you select On a side note, applying only the |
No, this is not necessarily true if ** isn't the end of your input path
Exactly, I think it should match zero or more. It should be checked against posix docs that this is the actual convention. |
Oh right, I was talking about the case where
Okay then this needs to be supported as well. |
@martindurant After some digging online, it looks like posix does not support double asterisks I created a draft PR for now, with what I believe, a better support for the "extended-posix" rules. The added tests run several paths against the Some questions/comments originated from my tests and I would like your thoughts before I'll try supporting
Please let me know if there are some tests I forgot! I'll add tests for |
I would like to avoid creating the last folder of the source path at the destination when glob double asterisk(**) to select files with specific extensions are used. When using them in the
copy
method (also happens onget
orput
) on theLocalFileSystem
(and other remote fs implementation as well), a destination folder (the last folder of the source path) is created in several unwanted situations.Example
Code:
Output:
When a trailing slash is added to the destination (Test1), the folder
data
is created. When the trailing slash is not added and we copy only once (Test2), the folderdata
is not created (expected results). But when we do twice the copy operation, the folderdata
is created once again.Expect:
['/tmp/out/a/b/f3.txt', '/tmp/out/a/f2.txt', '/tmp/out/f1.txt']
even when we call copy twice.The text was updated successfully, but these errors were encountered: