dvc push in a project with multiple users #7510

Closed
emilijapur opened this issue Mar 28, 2022 · 5 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push

Comments

@emilijapur

Bug Report

dvc push: failed to push data to the cloud

Description

Hello, I ran into a problem: when multiple people work in the same project and run different experiments on their own computers, DVC sometimes generates a directory in .dvc/cache whose name already exists on the DVC remote server. When that happens, the user cannot push data after dvc run. For example, the push fails when the local directory .dvc/cache/16 already exists on the remote as /path/to/remote/server/16. The following error is shown:

ERROR: failed to transfer 'md5: 24dd737c0642bf1ff8eee74eb121fbb6' - Permission denied
ERROR: failed to transfer 'md5: 233c0d5895672b19e2428dae2ead5447' - Permission denied
ERROR: failed to push data to the cloud - 2 files failed to upload

This happens even though the user has full permissions on the directory /path/to/remote/server/.

This mostly happens when multiple people are working in the same project at the same time, or when a user deletes their entire local cache.

I believe this problem could be solved if every user downloaded the whole cache from the remote server. However, this is not possible in my case, because there are terabytes of data.

Reproduce

Repeat this multiple times:

  1. dvc add dataset.csv
  2. dvc run -n first_step -d dataset.csv -o output.csv Rscript first_step.R
  3. dvc push

Delete all DVC cache files from the computer and repeat the steps several more times. After a while, DVC generates folders with the same names as folders on the remote server.

Expected

I expect to be able to push files from any computer without errors and without having to download the whole cache from the remote server.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.9.5 (deb)
---------------------------------
Platform: Python 3.8.3 on Linux-5.4.0-81-generic-x86_64-with-glibc2.14
Supports:
        azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.0),
        gs (gcsfs = 2022.1.0),
        hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
        ssh (sshfs = 2021.11.2),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb2
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/sdb2
Repo: dvc, git

OS - Ubuntu 20.04.4 LTS

@daavoo daavoo added the A: data-sync Related to dvc get/fetch/import/pull/push label Apr 6, 2022
@dberenbaum
Collaborator

What type of remote are you using?

@karajan1001
Contributor

Remotes: ssh

I think it is ssh remote.

@dberenbaum
Collaborator

@emilijapur If you write/copy a file to the same directory on the ssh server without dvc, do others have permission to write to that file?

@emilijapur
Author

Yes, I am using ssh remote.

@dberenbaum I believe the problem is that the owner and group of the DVC remote directory on Linux is "user_group", which consists of user_A, user_B, and user_C. When user_A pushes to the DVC remote directory over ssh, DVC creates, for example, a folder "24" containing the file "dd737c0642bf1ff8eee74eb121fbb6" (because the hash value was "24dd737c0642bf1ff8eee74eb121fbb6"). If user_B then generates a hash value "24xxxxxxxxxxxxxxxxxxx", DVC tries to create and push a file named "xxxxxxxxxxxxxxxxxxx" into directory "24". Since user_A has already created folder "24", user_B runs into the problem I described above.

I believe this happens because the owner of directory "24" is user_A and user_B does not have permission to write to it, since the owner and group of that directory are not "user_group" but "user_A".
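
To illustrate the layout described in this comment, here is a minimal shell sketch. It assumes the remote stores each object as the first two hash characters (directory) followed by the remaining characters (file name), as described above. `md5_to_remote_path` is a hypothetical helper name, not a DVC command.

```shell
# Hypothetical helper (not part of DVC): split an MD5 into the
# "<2-char directory>/<remaining characters as file name>" layout
# described in the comment above.
md5_to_remote_path() {
    prefix=$(printf '%s' "$1" | cut -c1-2)
    rest=$(printf '%s' "$1" | cut -c3-)
    printf '%s/%s\n' "$prefix" "$rest"
}

# The two hashes from the error output in this issue.
md5_to_remote_path 24dd737c0642bf1ff8eee74eb121fbb6
md5_to_remote_path 233c0d5895672b19e2428dae2ead5447
```

Two hex characters give only 256 possible directory names, so different users will routinely end up writing into directories that someone else created first.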

@pmrowla
Contributor

pmrowla commented Apr 13, 2022

This looks like a duplicate of iterative/dvc-ssh#15 (we need a way to configure the default dir permissions in SSH remotes, the same way we have it in cache.shared).

@emilijapur there are suggested workarounds in the linked issue: either have your users set umask so that dirs default to 0775, or use setfacl to enforce the proper directory permissions server-side.
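
The effect of the umask workaround can be tried locally with a sketch like the one below. It runs in a scratch directory rather than on the real remote, and `demo_dir` is a hypothetical name. How a umask is actually applied to SSH/SFTP sessions depends on the server configuration, which this thread does not cover. The setfacl lines are left as comments because they must be run on the server; the group name "user_group" and the path are the placeholders used earlier in this issue.

```shell
# Scratch directory standing in for the remote (assumption: this is
# NOT the real /path/to/remote/server).
demo_dir=$(mktemp -d)

# With umask 022, a new directory is created as 0755:
# group members cannot write into it.
(umask 022 && mkdir "$demo_dir/16")

# With umask 002, a new directory is created as 0775:
# group members can add files to it.
(umask 002 && mkdir "$demo_dir/24")

# Show the permission string of each directory (first 10 chars of ls -ld).
perm_16=$(ls -ld "$demo_dir/16" | cut -c1-10)
perm_24=$(ls -ld "$demo_dir/24" | cut -c1-10)
echo "16: $perm_16"
echo "24: $perm_24"

rm -rf "$demo_dir"

# Server-side alternative (commented out; needs the real path and group):
#   setfacl -R -m g:user_group:rwx /path/to/remote/server     # existing dirs
#   setfacl -R -d -m g:user_group:rwx /path/to/remote/server  # default ACL for new dirs
```

The umask approach relies on every user's session being configured correctly, while a default ACL is set once on the server and inherited by directories created later.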

@pmrowla pmrowla closed this as completed Apr 13, 2022

5 participants