enumerate chunks for a given pattern #44

Open
eshagh opened this issue Mar 5, 2017 · 4 comments


eshagh commented Mar 5, 2017

Hello,

I'm evaluating duplicacy and am narrowing in on a particular mode of operation:

  • for a variety of reasons (performance, reliability, an unsupported backend), I relay chunks to cloud storage outside of duplicacy (rclone with 32 parallel uploads, etc.)
  • in this mode I cache chunks locally, upload them to cloud storage, then truncate them locally (sketched below)
  • this preserves each chunk's file name and location so deduplication still works, but does not occupy local storage
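
Roughly, the cycle looks like this (a simplified Python sketch of my setup; the paths and the rclone remote are placeholders, and the real thing runs outside of duplicacy):

    # cache / upload / truncate cycle: push local chunk files with rclone,
    # then truncate them in place so the name/path survives for dedupe
    # but the bytes no longer occupy local storage.
    import os
    import subprocess

    CHUNK_DIR = "/backup/storage/chunks"       # local duplicacy storage (placeholder path)
    REMOTE = "remote:bucket/duplicacy/chunks"  # rclone remote (placeholder)

    # upload everything that still has content, 32 transfers in parallel
    subprocess.run(
        ["rclone", "copy", CHUNK_DIR, REMOTE, "--transfers", "32"],
        check=True,
    )

    # truncate the local copies to zero bytes, keeping names and layout
    for root, _, files in os.walk(CHUNK_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > 0:
                os.truncate(path, 0)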

In this mode of operation, I can download chunks quickly in parallel. When the chunks (and snapshot files) are downloaded and available locally, I can recover quickly. This does not work, however, if I do not have enough local disk space to bring the entire collection of chunks down.

If duplicacy could enumerate the chunks needed for a specific restore pattern (say, only my cat pictures), I could prime the restore by first fetching just those chunks before running the restore job. At present, I can only enumerate the chunks for an entire snapshot. I could cut my dataset into smaller logical repos, each backed up and snapshotted independently, so that a full restore could be performed repo by repo, but that is laborious.

So the feature request is to add chunk enumeration for a given restore pattern.


eshagh commented Mar 5, 2017

found a flaw in my theory: metadata chunks

by truncating all chunks, I effectively broke duplicacy's ability to operate

instead, I am achieving my goal with unionfs-fuse and an rclone mount (gcsfuse did not behave in a way that let me succeed)

thanks for the great software, hopefully my evaluation proves successful and I can buy a few hundred licenses

eshagh closed this as completed Mar 5, 2017

eshagh commented Mar 5, 2017

Actually, the truncation portion of my scheme may not be workable, but priming the pump may still be valuable for a discrete set of data (given by an include filter). That would make enumerating relevant chunks a useful function.

eshagh reopened this Mar 5, 2017
gilbertchen (Owner) commented:

I agree enumerating relevant chunks would be useful, but I think you can just parse the output of the cat command to find them. The output is the snapshot file in JSON format, so parsing it won't be too hard. For each file, the 'content' field specifies the chunks referenced by that file. For example:

    {
      "content": "1333:5630902:1336:972724",
      "gid": 20,
      "hash": "d80c6f7ff3c5774ef22d3ce81fe471e250ca665c2e0bb87f1e64fd9b465f7485",
      "mode": 493,
      "path": "go/src/github.com/gilbertchen/duplicacy/releases/1.1.7/duplicacy_linux_i386_1.1.7",
      "size": 10598718,
      "time": 1481856019,
      "uid": 501
    },

means the content of this file starts at chunk 1333, offset 5630902, and ends at chunk 1336, offset 972724, so chunks 1333-1336 are needed for this file.
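
Something like this rough Python sketch could work (untested; it assumes the file entries appear under a top-level "files" array in the cat output, uses a plain fnmatch glob rather than duplicacy's own include/exclude patterns, and prints chunk indices into the snapshot's chunk sequence):

    # enumerate_chunks.py -- hypothetical helper, not part of duplicacy
    import fnmatch
    import json
    import sys

    def chunks_for_pattern(snapshot_json_path, pattern):
        with open(snapshot_json_path) as f:
            snapshot = json.load(f)

        needed = set()
        for entry in snapshot.get("files", []):  # assumed key; adjust to the actual cat output
            content = entry.get("content")
            if not content:
                continue  # directories and empty files carry no chunk range
            if not fnmatch.fnmatch(entry["path"], pattern):
                continue
            # "content" is "startChunk:startOffset:endChunk:endOffset"
            start_chunk, _, end_chunk, _ = (int(x) for x in content.split(":"))
            needed.update(range(start_chunk, end_chunk + 1))
        return sorted(needed)

    if __name__ == "__main__":
        # usage: python enumerate_chunks.py snapshot.json 'pictures/cats/*'
        for index in chunks_for_pattern(sys.argv[1], sys.argv[2]):
            print(index)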

gilbertchen (Owner) commented:

I can work on a simple script to parse the snapshot file output by the cat command and print all relevant chunks.
