enumerate chunks for a given pattern #44

Open
eshagh opened this issue Mar 5, 2017 · 4 comments


eshagh commented Mar 5, 2017

Hello,

I'm evaluating duplicacy and am narrowing in on a particular mode of operation:

  • for a variety of reasons (performance, reliability, an unsupported backend), I relay chunks to cloud storage outside of duplicacy (rclone with 32 parallel uploads, etc.)
  • in this mode I cache chunks locally, upload them to cloud storage, then truncate them locally (sketched below)
  • this preserves each chunk's file name and location so deduplication still works, but does not occupy local storage
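
Roughly, the cycle looks like this (a simplified Python sketch of my setup; the paths and the rclone remote are placeholders, and the real thing runs outside of duplicacy):

    # cache / upload / truncate cycle: push local chunk files with rclone,
    # then truncate them in place so the name/path survives for dedupe
    # but the bytes no longer occupy local storage.
    import os
    import subprocess

    CHUNK_DIR = "/backup/storage/chunks"       # local duplicacy storage (placeholder path)
    REMOTE = "remote:bucket/duplicacy/chunks"  # rclone remote (placeholder)

    # upload everything that still has content, 32 transfers in parallel
    subprocess.run(
        ["rclone", "copy", CHUNK_DIR, REMOTE, "--transfers", "32"],
        check=True,
    )

    # truncate the local copies to zero bytes, keeping names and layout
    for root, _, files in os.walk(CHUNK_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > 0:
                os.truncate(path, 0)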

In this mode of operation, I can download chunks quickly in parallel. When the chunks (and snapshot files) are downloaded and available locally, I can recover quickly. This does not work, however, if I do not have enough local disk space to bring the entire collection of chunks down.

If duplicacy could enumerate the chunks needed for a specific restore pattern (say, only my cat pictures), I could prime the restore by first fetching just those chunks before running the restore job. At present, I can only enumerate the chunks for an entire snapshot. I could cut my dataset into smaller logical repos, each backed up and snapshotted independently, so that a full restore could be performed repo by repo, but that is laborious.

So the feature request is to add chunk enumeration for a given restore pattern.


eshagh commented Mar 5, 2017

found a flaw in my theory: metadata chunks

by truncating all chunks, I effectively broke duplicacy's ability to operate

instead, I am achieving my goal with unionfs-fuse and an rclone mount (gcsfuse did not behave in a way that let me succeed)

thanks for the great software, hopefully my evaluation proves successful and I can buy a few hundred licenses

eshagh closed this as completed Mar 5, 2017

eshagh commented Mar 5, 2017

Actually, the truncation portion of my scheme may not be workable, but priming the pump may still be valuable for a discrete set of data (given by an include filter). That would make enumerating relevant chunks a useful function.

eshagh reopened this Mar 5, 2017
gilbertchen (Owner) commented:

I agree enumerating relevant chunks would be useful, but I think you can just parse the output of the cat command to find them. The output is the snapshot file in JSON format, so parsing it won't be too hard. For each file, the 'content' field specifies the chunks referenced by that file. For example:

    {
      "content": "1333:5630902:1336:972724",
      "gid": 20,
      "hash": "d80c6f7ff3c5774ef22d3ce81fe471e250ca665c2e0bb87f1e64fd9b465f7485",
      "mode": 493,
      "path": "go/src/github.com/gilbertchen/duplicacy/releases/1.1.7/duplicacy_linux_i386_1.1.7",
      "size": 10598718,
      "time": 1481856019,
      "uid": 501
    },

means the content of this file starts at chunk 1333, offset 5630902, and ends at chunk 1336, offset 972724, so chunks 1333-1336 are needed for this file.
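
Something like this rough Python sketch could work (untested; it assumes the file entries appear under a top-level "files" array in the cat output, uses a plain fnmatch glob rather than duplicacy's own include/exclude patterns, and prints chunk indices into the snapshot's chunk sequence):

    # enumerate_chunks.py -- hypothetical helper, not part of duplicacy
    import fnmatch
    import json
    import sys

    def chunks_for_pattern(snapshot_json_path, pattern):
        with open(snapshot_json_path) as f:
            snapshot = json.load(f)

        needed = set()
        for entry in snapshot.get("files", []):  # assumed key; adjust to the actual cat output
            content = entry.get("content")
            if not content:
                continue  # directories and empty files carry no chunk range
            if not fnmatch.fnmatch(entry["path"], pattern):
                continue
            # "content" is "startChunk:startOffset:endChunk:endOffset"
            start_chunk, _, end_chunk, _ = (int(x) for x in content.split(":"))
            needed.update(range(start_chunk, end_chunk + 1))
        return sorted(needed)

    if __name__ == "__main__":
        # usage: python enumerate_chunks.py snapshot.json 'pictures/cats/*'
        for index in chunks_for_pattern(sys.argv[1], sys.argv[2]):
            print(index)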

gilbertchen (Owner) commented:

I can work on a simple script to parse the snapshot file output by the cat command and print all relevant chunks.
