Use a "SecureString" like type to store Parquet encryption keys #7373

adamreeve · 2025-04-01T22:12:01Z

#6637 and #7111 added support for reading and writing Parquet files with encryption. These add FileDecryptionProperties and FileEncryptionProperties types that hold encryption keys as a Vec<u8>. Precaution should be taken to prevent accidentally exposing these keys and allowing unauthorised access to encrypted data.

In the C++ Parquet implementation for example, these keys are "wiped" after a file is read or written, which is intended to prevent any memory access bugs from being able to expose these keys. But it's known that this wiping isn't very secure as only the first byte of the key is usually overwritten. See apache/arrow#31603 and some of the discussion in apache/arrow#44990.

Ideally these keys should be stored in a type that automatically clears the whole key from memory when it is dropped, eg. something like https://crates.io/crates/secure-string, or a custom abstraction built on top of https://crates.io/crates/zeroize.

We might also want to have a Debug implementation that doesn't show the key contents to avoid accidental logging of keys.

The text was updated successfully, but these errors were encountered:

alamb · 2025-04-05T11:01:03Z

This seems reasonable to me, though the memory safety concern is likely less a problem with rust

You can probably make something like this:

struct SecureKey {
  inner: Vec<u8>
}

impl Drop for SecureKey { 
...
// reset memory here
}

/// allow Secure key to be coerced to &[u8]
impl AsRef<[u8]> for SecureKey {
  fn as_ref(&self) -> &[u8] {
    &inner
  }
}

tustvold · 2025-04-09T11:22:43Z

This might be unwanted complication, but having done a quick scan through the parquet encryption code I was a little surprised that the interfaces are in terms of explicit keys at all, I would have perhaps expected an interface where the manner in which ciphertext gets generated/decrypted was opaque to the internal parquet machinery beyond propagating key metadata. This would allow using mechanisms like HSMs or external key stores to handle the key material, without that material ever being visible.

TBC if we do handle key material we should probably do something like SecureString to reduce the risk of compromise, but I also think we ideally should be designing to allow for more secure mechanisms that don't expose the private key material to the process at all.

TBC my knowledge of parquet's modular encryption is limited and it may be some aspect of its design prevent doing this

adamreeve · 2025-04-09T11:32:25Z

I don't think this is exactly what you have in mind, but I expect most users would want to use the key management tools API that is in progress and integrates with a KMS to encrypt and decrypt data encryption keys, rather than directly using the existing low level APIs. I have a PR open for that at #7387

When that is used, the master encryption keys aren't exposed to the process but are managed by a KMS which will usually perform the encryption and decryption remotely. It encrypts and decrypts data encryption keys that are randomly generated per file, rather than decrypting the Parquet data directly.

So the only keys that are kept in memory are keys that are limited in scope to a single Parquet file, limiting the damage that can be done if a key is exposed.

cc @ggershinsky you might have some thoughts on this topic.

tustvold · 2025-04-09T11:45:49Z

Right, my suggestion is that we keep the interfaces within the parquet crate as unopinionated as possible, with a view to enabling people to handle keys as appropriate for their use-case.

ggershinsky · 2025-04-09T12:15:00Z

I don't think this is exactly what you have in mind, but I expect most users would want to use the key management tools API that is in progress and integrates with a KMS to encrypt and decrypt data encryption keys, rather than directly using the existing low level APIs. I have a PR open for that at #7387

When that is used, the master encryption keys aren't exposed to the process but are managed by a KMS which will usually perform the encryption and decryption remotely. It encrypts and decrypts data encryption keys that are randomly generated per file, rather than decrypting the Parquet data directly.

So the only keys that are kept in memory are keys that are limited in scope to a single Parquet file, limiting the damage that can be done if a key is exposed.

cc @ggershinsky you might have some thoughts on this topic.

This is an accurate description of the security model, which follows the standard "envelope encryption" practice. The master keys never leave the KMS (typically HSM-based).

adamreeve added enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate labels Apr 1, 2025

adamreeve mentioned this issue Apr 1, 2025

[Epic] Parquet Modular Encryption Support #7278

Open

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a "SecureString" like type to store Parquet encryption keys #7373

Use a "SecureString" like type to store Parquet encryption keys #7373

adamreeve commented Apr 1, 2025

alamb commented Apr 5, 2025

tustvold commented Apr 9, 2025 •

edited

Loading

adamreeve commented Apr 9, 2025

tustvold commented Apr 9, 2025

ggershinsky commented Apr 9, 2025

Use a "SecureString" like type to store Parquet encryption keys #7373

Use a "SecureString" like type to store Parquet encryption keys #7373

Comments

adamreeve commented Apr 1, 2025

alamb commented Apr 5, 2025

tustvold commented Apr 9, 2025 • edited Loading

adamreeve commented Apr 9, 2025

tustvold commented Apr 9, 2025

ggershinsky commented Apr 9, 2025

tustvold commented Apr 9, 2025 •

edited

Loading