Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvcfs subrepos example #4740

Merged
merged 1 commit into from
Aug 11, 2023
Merged

dvcfs subrepos example #4740

merged 1 commit into from
Aug 11, 2023

Conversation

dberenbaum
Copy link
Collaborator

@dberenbaum dberenbaum requested a review from a team August 4, 2023 11:23
@shcheklein shcheklein temporarily deployed to dvc-org-dvcfs-subrepos-wszpkd3 August 4, 2023 11:27 Inactive
@skshetry
Copy link
Member

skshetry commented Aug 4, 2023

Documenting subrepos=True in the current state means also supporting nested repos and multiple repositories. I'm against documenting this.

It should be on the user to pass a path to the subrepo that they want, eg:

fs = DVCFileSystem(url="[email protected]:iterative/example-get-started.git", repo="/subrepo")

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2023

Link Check Report

There were no links to check!

@dberenbaum
Copy link
Collaborator Author

@skshetry Am I misunderstanding or is that not implemented today?

>>> from dvc.api import DVCFileSystem
>>> url = "https://github.com/iterative/monorepo-example.git"
>>> fs = DVCFileSystem(url, repo="/nlp", rev="develop")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 79, in __call__
    obj = super().__call__(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/fs/dvc.py", line 150, in __init__
    self.hash_jobs = repo.fs.hash_jobs
                     ^^^^^^^
AttributeError: 'str' object has no attribute 'fs'

@skshetry
Copy link
Member

skshetry commented Aug 4, 2023

Sorry, I was proposing an API, that would open a filesystem to a subrepo directly. It's not implemented at the moment (and keyword argument could be different, repo is already taken).

The subrepos=True gives you a filesystem at the root of your url, and is more magic-y, as it supports multiple repositories inside a monorepo, and even nested dvc repositories.

I'd prefer it to be simpler — simpler in interface and simpler in implementation, and more transparent on what's happening.

@daavoo
Copy link
Contributor

daavoo commented Aug 4, 2023

The subrepos=True gives you a filesystem at the root of your url, and is more magic-y, as it supports multiple repositories inside a monorepo, and even nested dvc repositories.

I'd prefer it to be simpler — simpler in interface and simpler in implementation, and more transparent on what's happening.

Is there something that doesn't work?

I am missing what is the worry, there are cases where people explicitly wanted it

@efiop
Copy link
Contributor

efiop commented Aug 4, 2023

subrepos are just tricky and messy, hence the hesitation. Maybe @pmrowla has some comments on this, since he was dealing with that recently.

@daavoo
Copy link
Contributor

daavoo commented Aug 4, 2023

subrepos are just tricky and messy, hence the hesitation.

so even more in favor of documenting😄 we will get people trying and reporting issues we are not currently catching?

@skshetry
Copy link
Member

skshetry commented Aug 4, 2023

subrepos are just tricky and messy, hence the hesitation.

so even more in favor of documenting😄 we will get people trying and reporting issues we are not currently catching?

It's not only about implementation. The feature as a whole, is questionable (i.e. autodiscovering subrepos, and providing an overlay of multiple subrepos and nested repos). Nobody asked for that.

As far as dvc.api.DVCFileSystem is concerned, it should be a filesystem for a "DVC" repo, whether it's at a subdir or at the root.

@pmrowla
Copy link
Contributor

pmrowla commented Aug 4, 2023

It should be on the user to pass a path to the subrepo that they want, eg:

fs = DVCFileSystem(url="[email protected]:iterative/example-get-started.git", repo="/subrepo")

This is what I would prefer, but when I looked into implementing it last month for the get_url issue, I ran into the problem of existing dvc.api convention for users to provide paths relative to the git repo root, and not the dvc repo root. When we instantiate a a dvcfs instance for a subrepo instance, all of the fs methods for that instance expect paths relative to the DVC root.

So whereas now the user can do

>>> dvc.api.open("subrepo/path/to/file", repo="https://.../some-repo.git")

A subrepo specific dvcfs instance needs something like

>>> with DVCFileSystem(url="https://.../some-repo.git", repo="subrepo") as fs:
>>>     fs.open("path/to/file")

I think for a user it would be unintuitive and confusing as to why I can use the root git URL and paths relative to that git root in the top level dvc.api calls (and in CLI commands like dvc import) but not in DVCFileSystem


It's not only about implementation. The feature as a whole, is questionable (i.e. autodiscovering subrepos, and providing an overlay of multiple subrepos and nested repos). Nobody asked for that.

And just to be clear, I agree with this. But the problem is that we already support doing this for everything else in the DVC API and import/get

@efiop
Copy link
Contributor

efiop commented Aug 4, 2023

Thanks guys! So looks like overall we would much rather implement repo=(or something like that) for dvcfs (I think @pmrowla already implemented or planned to something similar) and expose that. We could mimic our current behaviour where we use it (i think only in dvc list) if we set some flag in info for subrepos (similar how git sets a special submodule flag), so that the user can discover it himself and then create corresponding filesystems.

I understand the hesitation to document this as is, since it might become a problem in the current state.

@efiop
Copy link
Contributor

efiop commented Aug 7, 2023

@dberenbaum Really sorry for blocking you here. Let's see if we can quickly make repo work.

@dberenbaum
Copy link
Collaborator Author

@efiop It's not urgent, but I didn't get the same idea as you from what @pmrowla wrote.

This is what I would prefer, but when I looked into implementing it last month for the get_url issue, I ran into the problem of existing dvc.api convention for users to provide paths relative to the git repo root, and not the dvc repo root.

I interpreted this as saying that we should stick to the git root for consistency with existing api calls.

@efiop
Copy link
Contributor

efiop commented Aug 7, 2023

@dberenbaum I think that's more about get_url's current conventions, but it doesn't mean that we can't have repo/subrepo arg in dvcfs.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looks like no one was willing to jump and implement, so let's unblock and document this as is. If/when we get to that subrepo option - we can obsolete this.

Thank you and sorry for blocking.

@efiop efiop merged commit 79ff64e into main Aug 11, 2023
4 checks passed
@efiop efiop deleted the dvcfs-subrepos branch August 11, 2023 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants