Support pulling named subsets of data, or excluding files from pull #2825

Closed
r-zip opened this issue Nov 20, 2019 · 16 comments
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint

Comments

@r-zip

r-zip commented Nov 20, 2019

I've been working on a large project with multiple datasets. One of these datasets is large (>100 GB). If I simply run dvc pull, then it will pull the huge dataset, which takes up most available disk space on my machine.

The only way around this appears to be passing the name of every data file I want to download. This is inconvenient, however, because there are many files I do want and only one that I don't.

I see two solutions to this:

  1. Allow named file groups. The user could specify groups of files in some sort of config and pull them individually by name, e.g., dvc pull mnist. The user would also be able to exclude them: dvc pull all --exclude mnist.
  2. Allow exclusion of certain files from the command line, e.g., dvc pull --exclude data/mnist.dvc.
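
One way to approximate the first proposal without any change to DVC is a small lookup script. The sketch below assumes a hypothetical plain-text groups file; the file format and the group_targets function are illustrative only and are not part of DVC. It prints the targets of one named group so that they can be handed to dvc pull.

```shell
#!/bin/bash
# Hypothetical groups file format (not a DVC feature), one group per line:
#   mnist: data/mnist.dvc
#   small: data/a.dvc data/b.dvc
#
# Usage: group_targets GROUPS_FILE GROUP_NAME
# Prints the targets of GROUP_NAME, one per line.
group_targets() {
    local file="$1" group="$2" name targets
    # Split each line at the first colon: group name, then target list.
    while IFS=: read -r name targets; do
        if [ "$name" = "$group" ]; then
            # Unquoted on purpose: split the target list on whitespace.
            # This assumes paths without spaces or glob characters.
            printf '%s\n' $targets
            return 0
        fi
    done < "$file"
    echo "unknown group: $group" >&2
    return 1
}

# Example (not run here): pull only the targets of the "small" group.
# group_targets pull-groups.txt small | xargs dvc pull
```
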
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Nov 20, 2019
@shcheklein
Member

To some extent this is related to #2095, since one possible solution is a per-output setting in the DVC-file, for example whether the output is pulled/pushed by default.

@shcheklein shcheklein added the feature request Requesting a new feature label Nov 20, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Nov 20, 2019
@Suor
Contributor

Suor commented Nov 20, 2019

@shcheklein I don't think this is related, this is not about different remotes.

@r-zip There are two tricks you can employ:

  1. Recursive dvc-file collection:

    dvc pull --recursive some_dir

    This will collect all dvc-files in some_dir and pull all their outs.

  2. Use scripting, say bash files:

    In pull_123.sh:

    #!/bin/bash
    
    dvc pull stage1.dvc stage2.dvc stage3.dvc

    Then run:

    chmod +x pull_123.sh  # make it executable
    ./pull_123.sh         # pull stages 1,2,3

    You may also use makefiles or anything else you like.
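
The scripting trick can also be turned around to cover the exclusion case from the original request. The following is a minimal sketch, not a DVC feature: list_dvc_targets is a hypothetical helper that prints every .dvc file under a directory except the ones named as exclusions. It only sees standalone .dvc files, so outputs declared in dvc.yaml are not covered.

```shell
#!/bin/bash
# Hypothetical helper: print every .dvc file under DIR, one per line,
# skipping the files listed as exclusions.
# Usage: list_dvc_targets DIR [EXCLUDED_FILE...]
# Exclusions must be spelled the way find prints them
# (e.g. ./data/mnist.dvc when DIR is ".").
list_dvc_targets() {
    local root="$1"
    shift
    local path excluded skip
    # "! -path '*/.dvc/*'" skips DVC's internal .dvc directory.
    find "$root" -type f -name '*.dvc' ! -path '*/.dvc/*' | sort |
    while IFS= read -r path; do
        skip=0
        for excluded in "$@"; do
            if [ "$path" = "$excluded" ]; then
                skip=1
                break
            fi
        done
        if [ "$skip" -eq 0 ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Example (not run here): pull everything except data/mnist.dvc.
# Assumes paths without whitespace.
# list_dvc_targets . ./data/mnist.dvc | xargs dvc pull
```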

@shcheklein
Member

@shcheklein I don't think this is related, this is not about different remotes.

@Suor yep, I think we are on the same page. It's very different. The only relation to the ticket is that potentially we can add an option per output that specifies the default behavior for push/pull. Or it can even be a Null remote for some outputs? What do you think? Can it be a solution for this one?

@Suor
Contributor

Suor commented Nov 20, 2019

@shcheklein I was thinking about a remote: null option, or not mentioning some out in any of the remotes. This might not be what is expected for this issue.

@dberenbaum
Collaborator

I think this can be closed now that there is --glob support in dvc pull. Feel free to reopen if you think it's still a valid feature request.

@pbailey-hf

Hello. I'm looking for a solution to a similar problem. I don't see the --glob option in dvc pull for version 3.51.2. Has it been superseded?

@dberenbaum
Collaborator

@pbailey-hf You can pull any subpath without --glob, like dvc pull data/subpath.

@pbailey-hf

I see, thanks for the response! What I'm hoping to achieve is that a user can clone a git repo and dvc pull to grab everything required for the build. As part of that repo, though, I have a large amount of test data that is not usually required, so I'd like it to be excluded from dvc pull by default. I'm glad to work on a PR that would implement @shcheklein's original suggestion if that's the move. Thanks!

@dberenbaum
Collaborator

Do you want to push that test data, or do you only need it locally for your own purposes? If you don't ever need to push it, you can mark that data with push: false (see https://dvc.org/doc/user-guide/project-structure/dvc-files#output-entries and #8578).

If you still need to push that test data, you could set up different remotes for you and the downstream users. You can set remote: your_private_remote for the test data and remote: some_public_remote for all the other data. Then, users could pull with dvc pull -r some_public_remote to only pull the data you want them to have (see #6486).

@pbailey-hf

pbailey-hf commented Jun 6, 2024

OK I've tried out your suggestion, putting the required files in the default remote and the test data in a separate one. It seems to work:

$ dvc pull
Collecting
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: b3f6e9a3596bfa5d03700769afb1642f.dir
Fetching
Building workspace index
Comparing indexes
17 files added and 443 files fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
test/data
Is your cache up to date?
<https://error.dvc.org/missing-files>

$ echo $?
1

There seem to be a couple of drawbacks:

  1. There's an error message insisting that the cache is somehow out of date and test/data couldn't be pulled, even though that was my intention.
  2. DVC exits with a non-zero status (1), indicating an error.

I appreciate your responsiveness on this issue, and I'll keep digging for solutions.

@dberenbaum
Collaborator

You still need to pass dvc pull -r default_remote or dvc has no way to know you only want to pull from that remote.

@pbailey-hf

pbailey-hf commented Jun 6, 2024

Thanks I noticed my error after I posted that, but had similar output when I added --remote. Would something like this #10451 be within the realm of acceptable (assuming I did the test and doc updates)?

@dberenbaum
Collaborator

Thanks I noticed my error after I posted that, but had similar output when I added --remote.

Sorry, I also forgot to add that you need to pass --allow-missing if you want to silence those errors. You might want to look at #8298 for more background.
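
Putting the two flags from this exchange together, a thin wrapper might look like the sketch below. The function name pull_from_remote and the DRY_RUN variable are hypothetical conveniences; -r and --allow-missing are the dvc pull options mentioned in the comments above, and the remote name is whatever the project has configured.

```shell
#!/bin/bash
# Hypothetical wrapper: pull only from one named remote and tolerate
# outputs that are missing there.
# Usage: pull_from_remote REMOTE [TARGET...]
# Set DRY_RUN=1 to print the command instead of running it.
pull_from_remote() {
    local remote="${1:?usage: pull_from_remote REMOTE [TARGET...]}"
    shift
    # Rebuild the positional parameters as the full command line;
    # any explicit targets given by the caller are appended at the end.
    set -- dvc pull -r "$remote" --allow-missing "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (not run here): pull everything stored on some_public_remote.
# pull_from_remote some_public_remote
```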

Would something like this #10451 be within the realm of acceptable (assuming I did the test and doc updates)?

Can you explain what it is doing and how it solves the problem for you?

@pbailey-hf

It marks certain Outputs as explicit, so that a bare dvc pull does not fetch or check out those files. They must be explicitly referenced in order to be included (e.g. dvc pull test/data).

@dberenbaum
Collaborator

I'm fine to consider that option if you want to try to implement it!

@pbailey-hf

Hey @dberenbaum, my PR is updated with tests and docs. Feedback is welcome. Thanks for your consideration.

7 participants
@Suor @dberenbaum @shcheklein @efiop @r-zip @pbailey-hf and others