Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvcfs: optimize get() by reducing index.info calls() #10540

Merged
merged 2 commits into from
Aug 27, 2024
Merged

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Aug 26, 2024

This PR improves performance for dvc import/get when all files are cached by as much as 90%.

This is done by using fs.walk(detail=True) inside dvcfs.get(), which will gather all the info from index at the same time (using DataIndex.traverse). Doing this will avoid calling SQLtrie.info() millions of times.

We cannot use fs.info() calls to gather state information at the end of RepoDependency.download(), so I had to create dvcfs._get internal function that gives us the information that we need.

Closes #10537.

Name (time in s)
test_get-get[import-opt] 48.7513
test_get-get[main] 136.2494
test-import-import[import-opt] 76.9326
test-import-import[main] 265.0083

@skshetry skshetry added the performance improvement over resource / time consuming tasks label Aug 26, 2024
Comment on lines 754 to 765
expected_info_calls = {(name,)}
if isinstance(files[name], dict):
dirs = {
(name, *d.relative_to(tmp_dir / "out").parts)
for d in (tmp_dir / "out").rglob("*")
if d.is_dir()
}
expected_info_calls.update(dirs)
assert {
call.args[1] for call in index_info_spy.call_args_list
} == expected_info_calls
Copy link
Member Author

@skshetry skshetry Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This asserts that Index.info() calls only happen for directories during fs.walk(). If the output is a file, we'll call Index.info().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that strong an opinion, but would it be better to have a separate test for files and dirs? Maybe we need to leave some comments? This isn't the easiest test to understand on its own.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simplified the tests. PTAL.

tests/func/test_import.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Really like all the dvcfs tests you added.

@skshetry
Copy link
Member Author

skshetry commented Aug 27, 2024

I was investigating why it was taking a long time (79s is still a lot of time for imports).

Looking at the cprofile results, half the time was spent on hashing files. Turns out we were using 2.0 hashes in dvc-bench.

So, imports might drop to <40s on the benchmarks.

@dberenbaum
Copy link
Collaborator

2.0 hash cross-compatibility performance was the last item in #10059 (comment) that we never got to since I never saw anyone ask about it and it should becomes less of an issue over time.

@skshetry
Copy link
Member Author

skshetry commented Aug 27, 2024

that we never got to since I never saw anyone ask about it

That's difficult for anyone else to say what the expected behaviour here is. Before your PR #10531, we were hashing them anyway.

And imports have always been slower.

@skshetry skshetry merged commit 4f3fb15 into main Aug 27, 2024
26 checks passed
@skshetry skshetry deleted the import-opt branch August 27, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants