Recursive Globbing at GCS bucket top level results in TypeError #431

Open · FullMetalMeowchemist opened this issue May 3, 2024 · 6 comments

@FullMetalMeowchemist

FullMetalMeowchemist commented May 3, 2024

The following works

path = CloudPath("gs://top-level-bucket-name/second-level/")
filepaths = path.rglob("*")
assert list(filepaths)

However, when I perform this action at the bucket root, for example

path = CloudPath("gs://top-level-bucket-name/")
filepaths = path.rglob("*")
assert list(filepaths)

I get the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[75], line 1
----> 1 list(paths)

File ~/.pyenv/versions/3.10.12/envs/dai/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:497, in CloudPath.rglob(self, pattern, case_sensitive)
    492 pattern_parts = PurePosixPath(pattern).parts
    493 selector = _make_selector(
    494     ("**",) + tuple(pattern_parts), _posix_flavour, case_sensitive=case_sensitive
    495 )
--> 497 yield from self._glob(selector, True)

File ~/.pyenv/versions/3.10.12/envs/dai/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:458, in CloudPath._glob(self, selector, recursive)
    457 def _glob(self, selector, recursive: bool) -> Generator[Self, None, None]:
--> 458     file_tree = self._build_subtree(recursive)
    460     root = _CloudPathSelectable(
    461         self.name,
    462         [],  # nothing above self will be returned, so initial parents is empty
    463         file_tree,
    464     )
    466     for p in selector.select_from(root):
    467         # select_from returns self.name/... so strip before joining

File ~/.pyenv/versions/3.10.12/envs/dai/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:453, in CloudPath._build_subtree(self, recursive)
    450         continue
    452     nodes = (p for p in parts)
--> 453     _build_tree(file_tree, next(nodes, None), nodes, is_dir)
    455 return dict(file_tree)

File ~/.pyenv/versions/3.10.12/envs/dai/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:441, in CloudPath._build_subtree.<locals>._build_tree(trunk, branch, nodes, is_dir)
    438     trunk[branch] = Tree() if is_dir else None  # leaf node
    440 else:
--> 441     _build_tree(trunk[branch], next_branch, nodes, is_dir)

File ~/.pyenv/versions/3.10.12/envs/dai/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:441, in CloudPath._build_subtree.<locals>._build_tree(trunk, branch, nodes, is_dir)
    438     trunk[branch] = Tree() if is_dir else None  # leaf node
    440 else:
--> 441     _build_tree(trunk[branch], next_branch, nodes, is_dir)

TypeError: 'NoneType' object is not subscriptable

This only happens with a recursive glob. A plain glob("*") retrieves the root-level blob paths without any error.
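
That is, the non-recursive version runs fine (same placeholder bucket name as above):

path = CloudPath("gs://top-level-bucket-name/")
filepaths = path.glob("*")  # non-recursive glob at the bucket root works
assert list(filepaths)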

@pjbull
Member

pjbull commented May 4, 2024

I tried this on the same version of Python that you are on (3.10.12), and I can't repro it:

Python 3.10.12 (main, Jul  5 2023, 15:02:25) [Clang 14.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from dotenv import load_dotenv, find_dotenv; load_dotenv(find_dotenv())
Out[1]: True

In [2]: from cloudpathlib import CloudPath

In [3]: list(CloudPath("gs://cloudpathlib-test-bucket/").rglob("*"))
Out[3]:
[GSPath('gs://cloudpathlib-test-bucket/test_client-test_content_type_setting'),
 GSPath('gs://cloudpathlib-test-bucket/test_caching-test_persistent_mode'),
 GSPath('gs://cloudpathlib-test-bucket/test_cloudpath_file_io-test_file_discovery'),

...

]

Is there anything else peculiar about the bucket or the data in it? For example, does it have a file that starts with a / character in it? Does it have files with other unusual characters, or does it have no data in it at all?

@FullMetalMeowchemist
Author

FullMetalMeowchemist commented May 6, 2024

If there's anything peculiar, it's that the bucket root mixes "directories" (with more files below them) and standalone files.

[screenshot of the bucket's top-level contents]

This is not the exact state my bucket was in, but it's pretty close and still yields the same issue. There are no "directories" that are empty (which was my original suspicion about what went wrong).

Without digging into the exact reason, I'm guessing that the recursive glob is treating top-level "directories" as flat files or something, which ends up yielding None for the path.

EDIT:
I removed the PDFs from my root and still get the same thing, so we can probably rule out mixed top-level blobs as the cause.

There are multi-layered directories below the root level, which might be the cause as well.
[Screenshot from 2024-05-05 22-53-52: nested directory structure]

Again, none of the directories are empty. The ones without dropdown arrows have flat files directly under them.

@pjbull
Member

pjbull commented May 6, 2024

Hm, I still can't find a repro on buckets that we own (which include many folders and subfolders).

If possible, could you provide the local variable values at the different levels of the call stack by stopping there with pdb?
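
Something like this (assuming you hit the same traceback in an IPython session; the bucket name is just a placeholder) should drop you into the failing frame:

In [1]: from cloudpathlib import CloudPath

In [2]: list(CloudPath("gs://top-level-bucket-name/").rglob("*"))
...
TypeError: 'NoneType' object is not subscriptable

In [3]: %debug
ipdb> p branch, is_dir    # locals in the innermost _build_tree frame
ipdb> up                  # repeat until you reach _build_subtree
ipdb> p parts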

@FullMetalMeowchemist
Author

Sorry for the delay. I did a little bit of debugging today, and this is what I found.

The following layering/file is what's causing the issue:

[Screenshot from 2024-05-14 15-12-18]

What's very weird is that we have plenty of other files with the exact same directory structure, but this is the first one that triggers this issue.

rglobbing from the output directory does not trigger the same NoneType error.

Here are the args at the time of failure (that is, cloudpath.py:441). I took the liberty of removing some of the files that did not cause errors from the args.

trunk = defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'demo': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>,

...ETC...

<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'21': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file1.json': None}), '22': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file2.json': None}), '23': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file3.json': None, 'file4.json': None}), '24': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file5.json': None}), '25': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file6.json': None}), '26': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file7.json': None}), '27': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file8.json': None}), '28': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file9.json': None}), '29': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file10.json': None, 'file11.json': None}), '3': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file12.json': None}), '30': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file13.json': None}), '31': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file14.json': None}), '32': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file15.json': None}), '4': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file16.json': None}), '5': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file17.json': None}), '6': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file18.json': None}), '7': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file19.json': None}), '8': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file20.json': None}), '9': defaultdict(<function CloudPath._build_subtree.<locals>.<lambda> at 0x7f2fa3fad5a0>, {'file21.json': None})})}), 'output': None})
branch = 'output'
nodes = <generator object CloudPath._build_subtree.<locals>.<genexpr> at 0x7f2f9ea236f0>
is_dir = True

Up the stack I see

ipdb> parts
['output', '13655195358140545425', '0']

output and 13655195358140545425 do not have files directly in them. Only 0 has a JSON file.

Hopefully this helps.

@pjbull
Member

pjbull commented May 23, 2024

Could you also try listing the contents of the bucket with the Google Cloud SDK list_blobs command? What does it show for output?

Is there any chance that there is also a blob (file) called output in addition to the folder?

Unlike on a normal file system, blob storage can have both a file and a "folder" with the same name in the same location. We don't always handle this gracefully.
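
Something along these lines with google-cloud-storage would show that (bucket name and prefix are placeholders; adjust the prefix to wherever output lives in your bucket):

from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("top-level-bucket-name", prefix="output"):
    # a blob named exactly "output" listed alongside "output/..." blobs
    # would confirm the file-plus-"folder" collision described above
    print(blob.name)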

@nyoungstudios

I experienced this same error as well, although it was not at the top level of a bucket for me. For what it's worth, I was using Python 3.8.16 with these pip versions:

cloudpathlib==0.18.1
google-cloud-core==2.3.2
google-cloud-storage==2.7.0

There were over a hundred thousand files within this folder prefix. I know it isn't much help, but I couldn't narrow it down to anything useful without sharing our folder structure. While we still use cloudpathlib for other things, we decided to just write this part using the Google Cloud APIs.
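
Roughly along these lines, as a sketch rather than our actual code (bucket, prefix, and pattern are placeholders):

import fnmatch
from google.cloud import storage

client = storage.Client()
matches = [
    f"gs://{blob.bucket.name}/{blob.name}"
    # list everything under the prefix and filter client-side
    # instead of relying on rglob
    for blob in client.list_blobs("our-bucket", prefix="our/folder/")
    if fnmatch.fnmatch(blob.name, "*.json")
]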
