Skip to content

Use a "SecureString" like type to store Parquet encryption keys #7373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #7278
adamreeve opened this issue Apr 1, 2025 · 5 comments
Open
Tracked by #7278

Use a "SecureString" like type to store Parquet encryption keys #7373

adamreeve opened this issue Apr 1, 2025 · 5 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@adamreeve
Copy link
Contributor

#6637 and #7111 added support for reading and writing Parquet files with encryption. These add FileDecryptionProperties and FileEncryptionProperties types that hold encryption keys as a Vec<u8>. Precaution should be taken to prevent accidentally exposing these keys and allowing unauthorised access to encrypted data.

In the C++ Parquet implementation for example, these keys are "wiped" after a file is read or written, which is intended to prevent any memory access bugs from being able to expose these keys. But it's known that this wiping isn't very secure as only the first byte of the key is usually overwritten. See apache/arrow#31603 and some of the discussion in apache/arrow#44990.

Ideally these keys should be stored in a type that automatically clears the whole key from memory when it is dropped, eg. something like https://crates.io/crates/secure-string, or a custom abstraction built on top of https://crates.io/crates/zeroize.

We might also want to have a Debug implementation that doesn't show the key contents to avoid accidental logging of keys.

@adamreeve adamreeve added enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate labels Apr 1, 2025
@alamb
Copy link
Contributor

alamb commented Apr 5, 2025

This seems reasonable to me, though the memory safety concern is likely less a problem with rust

You can probably make something like this:

struct SecureKey {
  inner: Vec<u8>
}

impl Drop for SecureKey { 
...
// reset memory here
}

/// allow Secure key to be coerced to &[u8]
impl AsRef<[u8]> for SecureKey {
  fn as_ref(&self) -> &[u8] {
    &inner
  }
}

@tustvold
Copy link
Contributor

tustvold commented Apr 9, 2025

This might be unwanted complication, but having done a quick scan through the parquet encryption code I was a little surprised that the interfaces are in terms of explicit keys at all, I would have perhaps expected an interface where the manner in which ciphertext gets generated/decrypted was opaque to the internal parquet machinery beyond propagating key metadata. This would allow using mechanisms like HSMs or external key stores to handle the key material, without that material ever being visible.

TBC if we do handle key material we should probably do something like SecureString to reduce the risk of compromise, but I also think we ideally should be designing to allow for more secure mechanisms that don't expose the private key material to the process at all.

TBC my knowledge of parquet's modular encryption is limited and it may be some aspect of its design prevent doing this

@adamreeve
Copy link
Contributor Author

I don't think this is exactly what you have in mind, but I expect most users would want to use the key management tools API that is in progress and integrates with a KMS to encrypt and decrypt data encryption keys, rather than directly using the existing low level APIs. I have a PR open for that at #7387

When that is used, the master encryption keys aren't exposed to the process but are managed by a KMS which will usually perform the encryption and decryption remotely. It encrypts and decrypts data encryption keys that are randomly generated per file, rather than decrypting the Parquet data directly.

So the only keys that are kept in memory are keys that are limited in scope to a single Parquet file, limiting the damage that can be done if a key is exposed.

cc @ggershinsky you might have some thoughts on this topic.

@tustvold
Copy link
Contributor

tustvold commented Apr 9, 2025

Right, my suggestion is that we keep the interfaces within the parquet crate as unopinionated as possible, with a view to enabling people to handle keys as appropriate for their use-case.

@ggershinsky
Copy link
Contributor

I don't think this is exactly what you have in mind, but I expect most users would want to use the key management tools API that is in progress and integrates with a KMS to encrypt and decrypt data encryption keys, rather than directly using the existing low level APIs. I have a PR open for that at #7387

When that is used, the master encryption keys aren't exposed to the process but are managed by a KMS which will usually perform the encryption and decryption remotely. It encrypts and decrypts data encryption keys that are randomly generated per file, rather than decrypting the Parquet data directly.

So the only keys that are kept in memory are keys that are limited in scope to a single Parquet file, limiting the damage that can be done if a key is exposed.

cc @ggershinsky you might have some thoughts on this topic.

This is an accurate description of the security model, which follows the standard "envelope encryption" practice. The master keys never leave the KMS (typically HSM-based).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

4 participants