Add support for 'cloud' friendly API - which one? #21

Open
Tracked by #197
tasket opened this issue Feb 23, 2019 · 18 comments
Labels: enhancement (New feature or request), research


tasket commented Feb 23, 2019

Although ssh with Linux shell is currently supported, this is not commonly offered by large 'cloud' storage services.

Some protocols that have been already suggested:

sftp

amazon S3

swift

webDAV

...or using FUSE to access one of the above or another storage type such as @cryfs.

@jpouellet

If arbitrary object storage backends could be supported, that'd be awesome. Supporting "s3-compatible" semantics seems to be a widely supported least common denominator.


tasket commented Feb 24, 2019

I'm glad you mentioned semantics because having looked at some Swift docs from that perspective I'm not sure it can quite do the job (and wasn't even able to find a storage provider that used it). One of the legs holding up sparsebak's speed & efficiency is POSIX fs semantics. Which is why I think little old sftp may cut it if the others don't. Interestingly, amazon s3 offers sftp which gives me hope others do as well.

What I need at a minimum in addition to put, get, delete is a very efficient mv/rename that works in some kind of hierarchical directory (or tags holding a 'directory' string). Oddly enough, when I looked at Swift I did not notice a way to delete. Additionally, if I am to implement deduplication without some tedious and gross code, then I am going to need a link equivalent as well which sftp has.

Finally, there needs to be some easily accessible API on my client end to allow me to stream files out like sausage links. I may have to add a dependency on a non-core library (or a tool like sshfs) in order to do that.
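
A minimal sketch of the remote-operations interface described above, as an abstract Python class; the class and method names are illustrative placeholders, not Wyng's actual code:

# Hypothetical sketch of the minimal remote-storage operations listed above.
import abc
import typing

class RemoteDest(abc.ABC):
    """Operations a remote destination would need to offer for Wyng-style archives."""

    @abc.abstractmethod
    def put(self, path: str, data: typing.BinaryIO): ...      # upload a chunk or manifest

    @abc.abstractmethod
    def get(self, path: str) -> typing.BinaryIO: ...           # stream a chunk back

    @abc.abstractmethod
    def delete(self, path: str): ...                           # remove an object

    @abc.abstractmethod
    def rename(self, src: str, dest: str): ...                 # must be cheap, like a POSIX mv

    @abc.abstractmethod
    def link(self, src: str, dest: str): ...                   # hardlink equivalent, for dedup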


tasket commented Mar 27, 2019

It appears that the Amazon s3 protocol is open and used by a number of other cloud storage providers, more than sftp, and the API is easy to use from python. So I'm listing it as a candidate for now.

The doubt I have about s3 protocol is whether their mv is as efficient as sftp (which is essentially posix), and whether they have anything that can take the place of ln. The latter will be necessary to implement deduplication without having to create and manage a separate abstraction layer of my own. If s3 can't link or cow-copy, then I'm inclined to stay with sftp.

Amazon themselves have been gradually moving s3 tools in the direction of posix, and now offer actual sftp access, so I'd just as soon use the real thing. The irony here would be that I have to listen to people whine about no s3 protocol support because they have to resort to the service provided by Amazon.
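
For illustration, a hedged sketch (assuming boto3 and an existing bucket) of the mv concern above: S3 has no rename primitive, so a "move" is a server-side copy followed by a delete, which scales with object size rather than being a constant-time metadata operation like a POSIX rename.

import boto3

s3 = boto3.client("s3")

def s3_rename(bucket: str, src_key: str, dst_key: str):
    # CopySource is read server-side, so the data is not downloaded,
    # but the copy still costs time proportional to the object size.
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={"Bucket": bucket, "Key": src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)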


tasket commented Sep 12, 2021

This looks really useful: https://github.com/s3fs-fuse/s3fs-fuse

@darthShadow

Perhaps rclone?

https://github.com/rclone/rclone


tasket commented Oct 3, 2021

Perhaps rclone?

https://github.com/rclone/rclone

Interesting, but rather heavy (14MB compressed). FUSE is a much better deal, IMO, and I get the impression that tools like @rclone exist because of Windows usage patterns. OTOH, rclone can mount remote storage as a local fs so "there ya go"... like FUSE you can already use it with Wyng. :)

With FUSE and rclone available, the question about protocol support becomes more about whether Wyng will integrate the process of connecting remotely or leave it to the user or a GUI shell to make the connection.


darthShadow commented Oct 3, 2021

Honestly, I am not sure if it's worth adding all the complexity that comes with adding such functionality natively. Perhaps you could just link to rclone for that or use its python wrapper? https://github.com/rclone/rclone/tree/master/librclone#python
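
As a rough sketch of the "just link to rclone" route, the rclone CLI could be driven from Python with subprocess (the librclone Python wrapper linked above is an alternative, not shown here); the remote name "mycloud:" is a placeholder for whatever the user has configured.

import subprocess

REMOTE = "mycloud:wyng-archive"  # placeholder rclone remote and path

def rclone_put(path: str, data: bytes):
    # `rclone rcat` streams stdin to the remote object
    subprocess.run(["rclone", "rcat", f"{REMOTE}/{path}"], input=data, check=True)

def rclone_get(path: str) -> bytes:
    # `rclone cat` streams the remote object to stdout
    return subprocess.run(["rclone", "cat", f"{REMOTE}/{path}"],
                          capture_output=True, check=True).stdout

def rclone_rename(src: str, dst: str):
    subprocess.run(["rclone", "moveto", f"{REMOTE}/{src}", f"{REMOTE}/{dst}"], check=True)

def rclone_delete(path: str):
    subprocess.run(["rclone", "deletefile", f"{REMOTE}/{path}"], check=True)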

@tlaurion
Contributor

This looks really useful: https://github.com/s3fs-fuse/s3fs-fuse
@tasket any progress?

I have skimmed the docs, and it seems it would resolve my stalled PoC with rsync.net, since buckets seem to be configurable as read-only with other access keys?


tasket commented Jun 22, 2022

See issue #101 about sshfs performance.

@DemiMarie

The doubt I have about s3 protocol is whether their mv is as efficient as sftp (which is essentially posix), and whether they have anything that can take the place of ln. The latter will be necessary to implement deduplication without having to create and manage a separate abstraction layer of my own. If s3 can't link or cow-copy, then I'm inclined to stay with sftp.

You were right to be skeptical. Object stores like S3 are flat key/value stores with no hierarchy, so ln is impossible and mv is implemented as a copy. S3 does support atomic PUT operations, but that’s about it. Range requests are also supported, so one can get only the part of an object one is actually interested in.

Generally, object stores expect one to perform large, independent accesses and be able to tolerate substantial per-access latency. Individual objects can be very large, though, allowing high throughput. It is also possible to efficiently operate on multiple objects in parallel, but this requires that one can submit multiple requests before knowing any of the results.

Looking at https://github.com/tasket/wyng-backup/blob/5f153e4c155cd4e400ad85e8b1d6fa08a1508300/doc/Wyng_Archive_Format_V3.md, it seems that the current design is much better suited to a file system than to object storage. For object storage, I would go with something like this:

/root_metadata # contains list of volumes and basic metadata about each of them
/volume1
/volume1/session1_meta
/volume1/session1_data
/volume1/session2_meta
/volume1/session2_data
/volume2
/volume2/session1_meta
/volume2/session1_data

Here, the session*_meta entries contain the (encrypted and authenticated) metadata, represented as something like the following (in JSON):

{
    "version": 1,
    "keys": [
        { "start": 1234, "size": 5678, "name": "/volume1/session1_data", "hash": "000000000000000000" }
        { "start": 1234, "size": 5678, "name": "/volume1/session2_data", "hash": "000000000000000000" }
    ]
}

The key differences are:

  1. Far shorter critical path for accessing an object. At most, I need to read 4 objects in order:
    1. /root_metadata
    2. /volume1
    3. /volume1/session2_meta
    4. The objects (or portions of objects) /volume1/session2_meta refers to.
  2. Deduplication is implemented manually. Instead of requiring that a read from /volume1/session2_data return data from other objects, /volume1/session2_meta specifies where the data is. Wyng itself makes parallel requests to obtain all of the data chunks and returns the result as a single stream.

This is more work on Wyng’s part, but allows Wyng to use cheap, scalable object storage, rather than a file system that is much harder to scale horizontally.
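
A hedged sketch (assuming boto3) of the parallel-fetch idea: read the session metadata, issue ranged GETs for each referenced span concurrently, and yield the chunks in order as one stream. The bucket name, key layout and metadata shape follow the illustrative JSON above, not an actual Wyng format.

import json
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # placeholder

def fetch_span(entry):
    # ranged GET for just the span this session references
    start, end = entry["start"], entry["start"] + entry["size"] - 1
    resp = s3.get_object(Bucket=BUCKET, Key=entry["name"].lstrip("/"),
                         Range=f"bytes={start}-{end}")
    return resp["Body"].read()

def read_session(meta_key: str):
    meta = json.loads(s3.get_object(Bucket=BUCKET, Key=meta_key)["Body"].read())
    with ThreadPoolExecutor(max_workers=8) as pool:
        # submit all ranged requests up front, then yield results in order
        for data in pool.map(fetch_span, meta["keys"]):
            yield data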

@tlaurion
Contributor

tlaurion commented May 18, 2024

Interesting. I guess others exist; I never tried this Microsoft OneDrive one: https://github.com/oxalica/orb

This creates a usable block device that can be formatted as a mountable btrfs device.

tasket added the enhancement (New feature or request) label on May 20, 2024

tasket commented May 20, 2024

I'm moving this up to milestone v0.9.

tasket modified the milestones: v1.0, v0.9 on May 20, 2024
tasket mentioned this issue on May 20, 2024
@DemiMarie

Interesting. I guess others exist; I never tried this Microsoft OneDrive one: https://github.com/oxalica/orb

This creates a usable block device that can be formatted as a mountable btrfs device.

This will have vastly inferior performance to a real block device, and it poses a major security risk if one considers OneDrive to be untrusted, because btrfs considers the block device to be trusted.

I strongly recommend implementing native support for object storage in wyng-backup instead.

@tlaurion
Contributor

@DemiMarie The alternative here is to self-host the backup archives, which needs work in parallel. That is not to dismiss cloud-based storage, because it is needed; in my past PoC attempts, ssh servers proved rare outside of self-hosting and self-managing a VPS.

We are talking about the QubesOS user base here, and I can already see some pushback against hosting private backups with any cloud provider.

The solution to this, which I'm working on in parallel, is to have an easy recipe for self-hosting said Wyng archives on a self-made NAS built on OpenWRT-supported models.

I have a working PoC that I'm using daily; the fixes needed to make this work have already been made, and traces of the discussions are under #195

@kocmo

kocmo commented Jun 8, 2024

@DemiMarie

Generally, object stores expect one to perform large, independent accesses and be able to tolerate substantial per-access latency

If using the existing Wyng storage model verbatim, storing individual chunks as individual units of data, that would probably suggest larger Wyng chunk sizes (1-16M chunks?) to decrease the overhead of API calls per chunk. Since deduplication efficiency goes down with increased chunk size, I'm wondering whether the resulting dedup efficiency will still be acceptable.

For example, Duplicacy uses variable-length chunks in that range; they target 4M chunks on average.

2. Deduplication is implemented manually

In the case of S3, pruning will be somewhat more complicated than with filesystem-based storage:

  • Maintain a global map from the keys (i.e., truncated hashes) of referenced chunks to their reference counts (i.e., via a hashtable or B-something tree)

  • Or alternatively a linear scan through all the manifests, this will be much slower
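
A minimal sketch of the first option above, a global map from chunk hash to reference count built from session manifests; the manifest representation here is a placeholder, not Wyng's actual format.

from collections import Counter

def build_refcounts(manifests):
    # manifests: iterable of lists of chunk hashes referenced by each session
    refs = Counter()
    for chunk_hashes in manifests:
        refs.update(chunk_hashes)
    return refs

def prune_session(refs, chunk_hashes, delete_chunk):
    # decrement counts for a pruned session; delete chunks whose count hits zero
    for h in chunk_hashes:
        refs[h] -= 1
        if refs[h] <= 0:
            delete_chunk(h)
            del refs[h]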

On the other hand, for those cloud storage options that support garbage collection / reference counting for blobs - Wyng could offload much deduplication complexity to them.

P.S.:

See issue #101 about sshfs performance.

Google returns interesting performance differences between sftp vs. sshfs vs. rclone sftp mount vs. rsync over ssh - may be worthwhile to benchmark.


tasket commented Jun 9, 2024

FWIW, the current max chunk size in Wyng is 2MB. Big chunks aren't good for deduplication, though.

I've thought about the content-only addressing angle for some time (Wyng V3 format is a hybrid of offset and content addressing). Probably the most effective way to reclaim space from unused chunks, without scanning the whole archive directory on every prune or delete, is to create a differential run, something like what receive --use-snapshot does when restoring backward in time vs the snapshot; instead of a single snapshot's manifest being the baseline, a pan-archive merge of all manifests would serve that role. In this case, uniq or diff tells us which of the 'deleted' chunks for this prune/delete op are actually no longer referenced. This could be done at the end of each prune or delete call, or as its own batch operation.

I'm guessing the overhead for this would be noticeable (filesystems handling hardlinks are still more efficient), but still much less of a time and complexity burden than other methods like variable-sized blocks, which sometimes have to be broken up and possibly re-encoded.

The problem with keeping a separate chunk map of any kind is you now have the logistical problem of cache coherency (Wyng already has a 1-layer cache coherency challenge, adding another persistent layer is something to avoid if possible).
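
A rough sketch of that pan-archive merge, under the assumption that each manifest can be reduced to a list of chunk IDs: merge every remaining session's chunk list into one baseline set, and whatever the pruned sessions reference beyond that set is no longer referenced and safe to delete.

def unreferenced_after_prune(remaining_manifests, pruned_manifests):
    still_referenced = set()
    for manifest in remaining_manifests:      # pan-archive merge of kept sessions
        still_referenced.update(manifest)
    candidates = set()
    for manifest in pruned_manifests:         # chunks the prune/delete op would drop
        candidates.update(manifest)
    return candidates - still_referenced      # the uniq/diff step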

@DemiMarie

@DemiMarie

Generally, object stores expect one to perform large, independent accesses and be able to tolerate substantial per-access latency

If using the existing Wyng storage model verbatim, storing individual chunks as individual units of data, that would probably suggest larger Wyng chunk sizes (1-16M chunks?) to decrease the overhead of API calls per chunk. Since deduplication efficiency goes down with increased chunk size, I'm wondering whether the resulting dedup efficiency will still be acceptable.

I think it would be best to look at cloud pricing to see what the cost per request is compared to the cost of metadata access.

For example, Duplicacy uses variable-length chunks in that range; they target 4M chunks on average.

  1. Deduplication is implemented manually

In the case of S3, pruning will be somewhat more complicated than with filesystem-based storage:

  • Maintain a global map from the keys (i.e., truncated hashes) of referenced chunks to their reference counts (i.e., via a hashtable or B-something tree)
  • Or alternatively a linear scan through all the manifests, this will be much slower

I suggest doing benchmarks to determine the relative costs of different operations.

@DemiMarie

I talked with @Laikulo and they suggested that a local index be used to reduce the number of requests that must be made to object storage.
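
For illustration only, a sketch of such a local index as a small SQLite table mapping chunk hash to remote object key, so lookups don't require listing the bucket; keeping this cache consistent with the remote is exactly the coherency concern raised earlier, and the filename is a placeholder.

import sqlite3

db = sqlite3.connect("wyng-remote-index.db")  # placeholder path
db.execute("CREATE TABLE IF NOT EXISTS chunks (hash TEXT PRIMARY KEY, key TEXT)")

def record_chunk(chunk_hash: str, remote_key: str):
    db.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?)", (chunk_hash, remote_key))
    db.commit()

def lookup_chunk(chunk_hash: str):
    row = db.execute("SELECT key FROM chunks WHERE hash = ?", (chunk_hash,)).fetchone()
    return row[0] if row else None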
