Replies: 26 comments
-
Someone who knows more than me can tell me if this is right, but it sounds like a work-around (when applied from the start) would be:
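Something along these lines, perhaps (an untested sketch; pool and dataset names are only examples):

```sh
# Untested sketch: create one empty encrypted dataset up front, snapshot it,
# and create every further dataset as a clone of that snapshot so they all
# share the same master key (and can therefore dedup against each other).
zfs create -o encryption=on -o keyformat=passphrase -o dedup=on tank/origin
zfs snapshot tank/origin@empty
zfs clone -o dedup=on tank/origin@empty tank/data1
zfs clone -o dedup=on tank/origin@empty tank/data2
```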
And now all those datasets should dedup against each other just fine. Right? Not the cleanest solution, but not too bad I think....
-
It's not really a matter of 'fail-secure' or not. Deduplication across unrelated datasets is simply not possible due to the cryptography involved with encrypted datasets. Basically, each newly created encrypted dataset gets its own master encryption key. Because of this, any data that might logically be the same as data from another dataset will still end up encrypted differently on-disk, preventing dedup from working. The only exception to this rule is clones and snapshots of an existing dataset, which need to share the existing key in order to read the existing data. As for whether or not this is a bug, I'm not sure I agree. Without encryption, you will end up with the same inability to dedup if you use different compression settings on different datasets in a pool, so I'm not sure how this is different. @DeHackEd is right: creating all datasets as clones of some other dataset will work around this if you really need this capability.
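To illustrate the point (a rough sketch; pool and dataset names are made up): the same file written into two independently created encrypted datasets will not dedup, because each dataset encrypts it with its own master key, so the on-disk blocks differ.

```sh
# Rough sketch (names made up): identical data in two independently created
# encrypted datasets produces different ciphertext, so the DDT never sees a match.
zfs create -o encryption=on -o keyformat=passphrase -o dedup=on tank/a
zfs create -o encryption=on -o keyformat=passphrase -o dedup=on tank/b
dd if=/dev/urandom of=/tank/a/blob bs=1M count=64
cp /tank/a/blob /tank/b/blob
sync
zpool list -o name,dedupratio tank   # stays at ~1.00x despite the identical file contents
```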
-
While dedup might not be the best thing since sliced bread (given the attached performance cost), there are times when it can be a solution to a problem. It being limited by encryption to 'one' dataset removes the ability to deduplicate datasets against the pool that are recv'd from other machines. If dedup can work across encrypted clones... there should be a way to make it work for a whole encrypted dataset hierarchy (without having to deal with a clone/origin relationship), should the admin have a use case for that. Possibly a flag on
-
No metadata, just bytes. Dedupe shouldn't ever be used in a security-sensitive environment. I can certainly see why SAAS/hosted providers prefer using as little resources as possible per user, but from a security standpoint that isn't a very good thing. It opens a can of worms once you cross authority domains. Not surprisingly, this is exactly why cloud providers who restrict content and require their own clients to be used when uploading should never be trusted. Might be better off with more traditional shared volumes with encryption on your storage node(s).
-
Sorry? What did you want to say with that?
I don't see why that should be the case. Could you please share the reasoning behind that statement?
-
Because the dedup metadata - which blocks are the same - is visible without the decryption key. You can tell because the leaf nodes in the filesystem point at the same physical disk block. Any information or metadata leak in encryption schemes is considered undesirable by default.
-
That's not necessarily avoidable or all that bad. It depends more on the kinds of attacks you are worried about. To give a contrived example, let's say that I have an app that sends 1 of 2 messages once per day, encrypted with 256-bit AES-GCM. However, the first message is 1 byte and the second is 1MB. In this case, an attacker who can view the ciphertext can easily determine which message is being sent just by the size of the payload. This doesn't mean AES-GCM is a bad encryption solution, it just has some limitations that the user / app developer needs to keep in mind. In the case of dedup with encryption, anyone with access to the raw disk CAN determine whether or not any blocks have dedup'd against each other, but it is up to the user to determine how big of a deal this attack vector is to them. The user should also be aware of CRIME-style attacks (which are very situational in practice here) when using encryption + compression + dedup, as I believe is stated in the man page.
-
Shouldn't matter. For that scenario to happen the attacker needs access to the raw disks / zdb - then he could do a dedup scan anyway, so in the worst case having dedup active only saves him the time a full data walk (as a scrub does) would take to detect any identical leaf blocks.
Undesirable is different from reducing the complexity of an attack (or breaking the crypto). The zfs filesystem/pool metadata cannot be encrypted anyway, or scrubbing without the keys being loaded couldn't work (which, I hope, isn't the case), hence I still see no reason not to give the ability to dedup across datasets.
-
That no metadata should be visible outside an environment where security matters. That includes names, serial numbers, sizes, etc. - things that can be used to mount an attack on the data. If I know it comes from Finance, for example, there's no point attacking servers from other departments. Likewise, if I can see - or, in the case of dedupe, infer - the presence of a pattern, I need not waste time attacking other targets. Though the attacks against AES and the two current modes are difficult, some other side-channel attacks are more generic against encryption and compression itself. Archive bombs are certainly a thing, for example. See attacks on dm_crypt if you're curious. I'm aware zfs does not entirely follow these practices, which is why I replied. The fact that someone could implement this (or worse) outside ZFS without a user knowing is another reason to be very careful where your data resides. tl;dr - Because. There is a compelling case to be made for zero knowledge.
-
I don't know if there is ever really a way to have zero knowledge without simply burning all the data so it's not stored in the first place. Unfortunately, security is always a compromise between safety and usability; there is no way to have 100% of both. Encrypted data will always be vulnerable to some (hopefully very limited) extent against side-channel attacks, such as the one I stated in the comment above. In ZFS we decided to strike a balance between the two. Nobody can read your data and nobody can modify it without you knowing, but you can still back up the data, scrub the pool, take snapshots, and perform other general administration tasks. We decided to allow dedup to work for snapshots and clones because these datasets all need to have access to a common encryption key to share data anyway, so there was nothing preventing it from working in these cases. For the vast majority of users the encryption scheme implemented here will meet the security requirements they have. For everyone else, we documented the limitations of the encryption implementation in the man page so that users can decide whether this solution will be sufficient for their needs. In my personal opinion, if you need more security than what is provided here then you will definitely need to look into the implementation details of ANY encryption solution you are considering, so I believe this approach should suffice.
-
Any checksum would disagree with that.
The timing example you gave is not the kind of side channel attack to worry about; at least it's fairly easy to handle (random iv). The second is closer to a real risk: data at rest can and SHOULD be encrypted without metadata.
And that works so long as data at rest is encrypted. Dedupe against encrypted data should never really be worth the effort since that would effectively be collisions. Which leaves dedupe before encryption, either on the source, in flight, or at the destination.
The vast majority of users will not look at the data itself nor understand what data isn't being encrypted.
For the most part I'd agree. Where I think we don't is that it should not be possible to tell a blob of random data from a zfs volume. A raw send, for example, is not entirely raw, nor are the vdevs.
-
Well, historically that IS the exact scenario encryption on disk is meant to protect against.
-
I'm assuming you mean a MAC. Even in that case you haven't actually stored the data. All you can do with a MAC is confirm some bit of data is correct. So a solution built only using MACs would effectively be incredibly secure but almost completely unusable.
I'm not sure if we're talking about the same thing. The example I gave was about the size of the output ciphertext. A random IV will not prevent that kind of attack or any kind of timing attack that I'm aware of. I would argue that both of these are in fact "real risks". It is simply up to the user to decide what risks they are ok with.
I'm not talking about what the users understand or don't understand. I am talking about what they actually need. I think that in general we just simply disagree on where the balance of security vs usability should be, and that's fine. If you want more security, you can decide to use zfs encryption without dedup and without compression. Or you can use something completely different if you want something more secure, such as dm-crypt.
-
To expand on what @DeHackEd wrote above, the core purpose of Encrypted Data at Rest is to protect against someone stealing the system/disks, or someone improperly disposing of disks. A dedupe scan isn't useful if the keys or IVs for the dedupe domains (usually file-systems) are different. Whether or not a single dedupe domain is acceptable is completely dependent on the risk. If lives/billion$/jail is at risk, multiple encryption domains might make sense.
-
Checksums are typically used for that, yes. Not sure why you believe it is unusable given that is the basis behind several projects including Freenet, i2p, etc. Perhaps you're confusing zero knowledge in terms of public key vs symmetric crypto. There is an argument to be made that all cryptography ultimately requires some key transfer and is thus vulnerable. I'm referring specifically to symmetric crypto with static keys, given that is the mode zfs operates in. Though adding public key would be interesting, as it would allow verification / scrubbing without compromising any key.
In what context? The random iv ensures the output is unique. You stipulated "that's not necessarily avoidable or all that bad." It is avoidable and it is bad; I never said it wasn't, rather that of the two examples given, the second was more of a concern. The context is that of metadata in zfs, not, say, observing an ssh session where timing can correlate to keystrokes.
That isn't the full issue here though. The metadata itself is stored unencrypted regardless of dedupe or compression.
o_O dm_crypt has its uses but that isn't really a fair option when compared to zfs. Regarding the MAC, I'm more confused at this point because zfs suffers the same fate.
Indeed, that isn't really helping your case either though. It's exactly why ZFS needs to be locked down. You can store keys in hardware devices or not at all - the Secure Enclave in iOS and smartcards, for example. AMD has its encrypted memory, or even simply using UEFI. The point is the key not being stored with the data, thus rendering it unusable should someone walk off with a drive. There are absolutely different levels of risk given the context. I'm more inclined to lean toward ZFS being used in shared environments like VMs and multiuser systems than single use, which does open it up to a whole other level of attacks. I think many people would be surprised to learn the volume can be accessed at all by zfs without being decrypted / "opened".
-
Certainly. That's why it works with clones/snapshots (as they use the same key/IV) while it doesn't across datasets where these differ.
That's why I asked if it would be possible to explicitly instruct ZFS to copy the encryption key from the parent (instead of generating a new one) into a newly created (or recv'd - in case it's possible to encrypt recv'd datasets, which I haven't checked) dataset, to allow a whole hierarchy to dedup across itself. Yes, there certainly are security implications (like having to break only one key to get access to the multiple datasets that use it), but it would allow combining native encryption (to protect against loss of drives) with dedup (to be able to fit the raw data into the pool at all) for certain setups - without suffering from the downsides of clone/origin relationships.
As long as the data being read is white noise without the key... it shouldn't be a problem for many scenarios.
-
That isn't the preferred mode of operation in any secure environment, nor does it fit what anyone would expect from FDE. It doesn't appear the keys need to be re-entered anyway; at least I was able to export and reimport an encrypted volume +without+ needing to re-enter any password. That should never, ever happen.
I'd agree except that isn't the case. It's unfortunate we live in a world where the mere presence of any metadata can have deadly consequences.
-
@GregorKopka I haven't tested, but to make sure we are on the same page: you are saying that receiving an unencrypted send (i.e. not a raw send where the data is encrypted, the key is thus already set, and the receiver doesn't even need the key) into an existing encrypted filesystem creates a new encryption root? I am skeptical of that, but again, I haven't tested yet.
-
@rlaager by default, performing a non-raw send into a dataset below an encrypted one will make that dataset an encrypted child of the parent. You can override this yourself by receiving with
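Presumably the override in question is the encryption property set at receive time; a rough sketch of the behaviours described (pool and dataset names are illustrative):

```sh
# Sketch (names illustrative): a non-raw send received below an encrypted
# parent becomes an encrypted child of that parent.
zfs send tank/plain@snap | zfs recv pool/encrypted/backup
# A raw send (-w) keeps the sender's ciphertext and keys exactly as they are.
zfs send -w tank/secret@snap | zfs recv pool/backups/secret
# Overriding the property at receive time keeps the received dataset
# unencrypted even though its parent is encrypted.
zfs send tank/plain@snap | zfs recv -o encryption=off pool/encrypted/plain
```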
-
That's what I expected. So if you are receiving non-raw sends, data will dedup across datasets, assuming they are all parented under the same encryption root. If you are receiving raw sends, the data is already encrypted and the receiver can't possibly do anything about that. If you are creating the data on the system, then things dedup within a given encryption root, which is an intentional design decision and is unlikely to change. I'm going to close this. If I've missed something, we can re-open.
-
One small correction. Only encrypted clone-families dedup. So you can dedup within a given dataset, its snapshots, and clones of those snapshots. This is because dedup works based on which datasets are sharing the master key, not the user's wrapping key. If all datasets under an encryption root shared a master key it would be impossible for you to separate those datasets from the encryption root in the future (at least cryptographically).
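As an illustration (a sketch; names are made up): a snapshot/clone shares the origin's master key and so can dedup against it, while an ordinary child under the same encryption root only shares the wrapping key, which is also what allows that child to be split off into its own encryption root later.

```sh
# Sketch (names made up): the clone family shares the master key...
zfs create -o encryption=on -o keyformat=passphrase -o dedup=on tank/enc
zfs snapshot tank/enc@base
zfs clone -o dedup=on tank/enc@base tank/enc-clone   # dedups against tank/enc
# ...while a plain child only shares the wrapping key, so it can later be
# turned into its own encryption root, which a shared master key would prevent.
zfs create tank/enc/child
zfs change-key -o keyformat=passphrase tank/enc/child
zfs get encryptionroot tank/enc tank/enc-clone tank/enc/child
```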
-
@h1z1 ZFS native encryption isn't FDE - in case that acronym stands for 'Full Disk Encryption'.
@rlaager No, I don't think that's what I'm saying. I understand ZFS native encryption such that receiving into (or creating) a child below an encryption root creates an encrypted dataset, unlockable through the user-supplied (and changeable) key specified for the encryption root, which in turn unlocks the distinct internal (and static) key material of each dataset below the encryption root. This gives the limitation listed in the zfs(8) man page, stemming not from an arbitrary restriction but from the distinct internal key material of the datasets producing different ciphertexts for identical data, so the hashes of the on-disk blocks differ and "things dedup within a given encryption root" does not actually happen.
@tcaputi Could you please clarify whether that view is correct? Because if Richard is right, then at least the documentation needs to be updated (the note about the dedup limitation would need to go, plus a warning would need to be added about the massive security implications of handing control over a former encryption-root child dataset to someone else). But I hope I'm not wrong about each dataset having distinct internal key material. In case that's the case: please reopen.
My main use case for zfs dedup is serving clone-unfriendly OS images on zvols, which I dedup both to save pool space and to get enough of all the clients into ARC to move the bottleneck far enough from disk throughput toward network bandwidth to make booting more than a very few of them in parallel an acceptable experience. Thus I agree with the OP that we need a switch to disable the dedup limitation, to make native encryption feasible for this scenario.
PS: beaten by a few seconds, in parts obsoleted by the post of @tcaputi a moment prior.
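For reference, the setup described above is roughly the following (names, sizes and the image path are placeholders); per-dataset master keys would currently break the dedup between these zvols once native encryption is enabled:

```sh
# Placeholder sketch: one zvol per client, all written from the same master
# image; dedup collapses the mostly identical blocks, which also lets the
# clients share ARC space instead of caching the same data many times over.
zfs create -o dedup=on tank/images
zfs create -V 32G tank/images/client01
zfs create -V 32G tank/images/client02
dd if=/srv/master.img of=/dev/zvol/tank/images/client01 bs=1M
dd if=/srv/master.img of=/dev/zvol/tank/images/client02 bs=1M
```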
-
@tcaputi is there a reasonable likelihood you would implement such a thing? If so, let's reopen and assign this to you. Otherwise, let's leave it closed as "wontfix", given the recent discussions about the issue tracker. There's likely no point in keeping feature requests open that nobody intends to implement.
-
I don't have any plans to implement it at this moment, but I'm not sure we should just close feature requests because we don't have anyone to work on them immediately. Most features require at least a bit of lead time before anyone is able to get started on them. We can probably talk about how to deal with feature requests at the leadership meeting today.
-
This was discussed at the OpenZFS leadership meeting, starting around here: The use case makes a lot more sense to me now. I apologize for closing this prematurely. @tcaputi my intention wasn't to close a feature request solely because nobody was working on it at the moment, but (initially, though I was incorrect) it seemed like a feature request that went significantly against a critical piece of the encryption design, you had no plans to work on it, and the likelihood of someone else picking up a big encryption redesign seems quite low. Had I been correct, closing would probably have been appropriate because there's no point in A) giving false hope, and B) cluttering up the tracker. That might be a good topic for an OpenZFS meeting.
-
@GregorKopka Indeed, I'm aware of that; it's part of the problem.
-
System information
I've just been bitten by unexpected storage bloat when doing a test migration of a backup storage box from ZoL 0.7 using LUKS-encrypted disks to ZoL 0.8 with native encryption. Different backups are kept in different datasets, but tend to have quite a bit of data in common, and so deduping between datasets is an important feature for me.
The zfs(8) manpage states that "Deduplication is still possible with encryption enabled but for security, datasets will only dedup against themselves, their snapshots, and their clones." While fail-secure is obviously a sane default, adding this new policy to ZFS (with the cop-out excuse of "for security") without any means of turning it off is a bug.