Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ls #260

Merged
merged 1 commit into from
Feb 7, 2024
Merged

Fix ls #260

merged 1 commit into from
Feb 7, 2024

Conversation

renesat
Copy link
Contributor

@renesat renesat commented Jan 27, 2024

  • Fix bag: return empty list if contain one directory
  • Fix bag: return empty list when dir is not cached but have key in cache

I found it when working with DuckDB. To reproduce the second bug, you can try calling this:

import duckdb
import fsspec

# Working
fsspec.filesystem("lakefs").glob("lakefs://test/main/test_dir/*.parquet")

# Not working
duckdb.register_filesystem(fsspec.filesystem("lakefs"))
duckdb.sql("FROM 'lakefs://test/main/test_dir/*.parquet'")

@renesat renesat force-pushed the main branch 2 times, most recently from fe8fc1a to a687365 Compare January 27, 2024 18:52
@nicholasjng
Copy link
Collaborator

nicholasjng commented Jan 29, 2024

Thanks for the PR. What exactly is the issue you are seeing?

FWIW, for me, both of these statements work (although they do different things).

EDIT: The above repro on a local lakeFS quickstart instance, slightly edited (requires a .lakectl.yaml):

import duckdb
import fsspec
import lakefs

lakefs.Repository("quickstart").create("local://quickstart", include_samples=True, exist_ok=True)

# Working
res_a = fsspec.filesystem("lakefs").glob("lakefs://quickstart/main/*.parquet")
print(res_a)

# Not working
duckdb.register_filesystem(fsspec.filesystem("lakefs"))
res_b = duckdb.sql("FROM 'lakefs://quickstart/main/*.parquet'")
print(res_b)

@@ -542,7 +542,7 @@ def ls(
# Retry the API call with appended slash if the current result
# is just a single directory entry only (not its contents).
# This is useful to allow `ls("repo/branch/dir")` calls without a trailing slash.
if len(info) == 1 and info[0]["type"] == "directory":
if len(info) == 1 and info[0]["type"] == "directory" and info[0]["name"] == path + "/":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, this should always be true, since path was requested in the original reference.objects() call (L520 above).

Copy link
Contributor Author

@renesat renesat Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not always. Only when we request path without "/" at the end. If we have only one directory in the requested path, it would be false.

Like call for path lakefs://repo/main/ with this structure:

lakefs://
└── repo/
    └── main/
        └── test_dir/
            ├── a.parquet
            └── b.parquet

@renesat
Copy link
Contributor Author

renesat commented Jan 29, 2024

@nicholasjng this is more complex example with sample repo:

import duckdb
import fsspec
import lakefs

lakefs.Repository("quickstart").create("local://quickstart", include_samples=True, exist_ok=True)

fs = fsspec.filesystem("lakefs")
fs.mkdir("lakefs://quickstart/main/test")
fs.cp("lakefs://quickstart/main/lakes.parquet", "lakefs://quickstart/main/test/lakes.parquet")
fs.cp("lakefs://quickstart/main/lakes.parquet", "lakefs://quickstart/main/test/lakes2.parquet")
fs.rm("lakefs://quickstart/main/README.md")
fs.rm("lakefs://quickstart/main/data/",  recursive=True)
fs.rm("lakefs://quickstart/main/images/",  recursive=True)
fs.rm("lakefs://quickstart/main/lakes.parquet")

# Bug 1: return empty list but have "test" dir
try:
    test1 = fsspec.filesystem("lakefs").ls("lakefs://quickstart/main/")
    print(test1)
    assert test1 != []
except:
    print("test1 not working")


# Bug 2: return empty list but have "test" dir

# Not working
try:
    test2 = fsspec.filesystem("lakefs").glob("lakefs://quickstart/main/test/*.parquet")
    print(test2)
    assert test2 != []
except:
    print("test2 not working")


duckdb.register_filesystem(fsspec.filesystem("lakefs"))

# Not working
try:
    test3 = duckdb.sql("FROM read_parquet('lakefs://quickstart/main/test/*.parquet')")
    print(test3)
except:
    print("test3 not working")

# Working
test4 = duckdb.sql("FROM read_parquet(['lakefs://quickstart/main/test/lakes2.parquet', 'lakefs://quickstart/main/test/lakes2.parquet'])")
print(test4)

I create repo structure like this:

lakefs://
└── quickstart/
    └── main/
        └── test/
            ├── lakes.parquet
            └── lakes2.parquet

I think bag 2 related with bag 1.

@nicholasjng
Copy link
Collaborator

Sorry for the delay, and thanks for that new repro.
I ran this locally and was able to reproduce test case 1. We should therefore add it as a regression test. Would you like to do that?

Otherwise, looks good!

@nicholasjng
Copy link
Collaborator

The test failure happens because the path is not stripped of its leading protocol in test_ls_on_commit.

Changing L488 in fs.ls() to path = self._strip_protocol(path) solves the issue.

@renesat
Copy link
Contributor Author

renesat commented Feb 6, 2024

@nicholasjng I add test and find out why I have problem with line 430. It's because of incorrect cache update for directory with only directories in it. I fix it by removing "/" from the end of dirs names.

Copy link

codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1586b02) 94.24% compared to head (e76dc9f) 94.24%.
Report is 2 commits behind head on main.

❗ Current head e76dc9f differs from pull request most recent head 3ed1f7a. Consider uploading reports for the commit 3ed1f7a to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #260   +/-   ##
=======================================
  Coverage   94.24%   94.24%           
=======================================
  Files           5        5           
  Lines         382      382           
  Branches       70       70           
=======================================
  Hits          360      360           
  Misses         13       13           
  Partials        9        9           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@nicholasjng nicholasjng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, almost there!

- Fix bag: return empty list if contain one directory
- Fix bag: return empty list if dir is not cached but have key in cache
- Fix bag: not correct update cache id directory contains another directory
@renesat
Copy link
Contributor Author

renesat commented Feb 6, 2024

@AdrianoKF Done👌

@nicholasjng nicholasjng merged commit f05c5b6 into aai-institute:main Feb 7, 2024
3 checks passed
@nicholasjng
Copy link
Collaborator

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants