Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming decryption #36

Closed
rectalogic opened this issue Jul 25, 2023 · 10 comments · Fixed by #77
Closed

Streaming decryption #36

rectalogic opened this issue Jul 25, 2023 · 10 comments · Fixed by #77
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@rectalogic
Copy link

Any plans to support streaming decryption from a file-like object? We want to stream large encrypted files from an sftp server and decrypt them on the fly as we write them somewhere else.

It looks like decrypt_file is basically doing that, we just need an API that takes the actual file-like fileobj.

fn decrypt_file(

Maybe using something like this https://github.com/omerbenamram/pyo3-file

@woodruffw
Copy link
Owner

woodruffw commented Jul 25, 2023

Thanks for opening an issue!

Could you say a bit more about your requirements here? We added the decrypt_file API with #32; that originally included support for file-like objects, but I held off on it due to a lack of clarity on end-user needs. Is the problem here that your SFTP streaming operation produces a Python fileobj, and not something with a file descriptor or path that could otherwise be streamed from?

@rectalogic
Copy link
Author

Correct, we have no fd or path. In this case it's actually a custom object that implements file-like methods (it inherits io.BufferedIOBase)

But if you are doing buffering yourself then I would expect probably any python object that just implements read(size=-1) to work.

@woodruffw
Copy link
Owner

Got it, thanks for explaining!

I'm open to supporting this, but I need to think a bit more about how I want to do it: I'd generally prefer to keep this package's dependencies to a bare minimum, and I want to make sure the downsides of using a file-like object API are communicated well to users (specifically, there's much more "critical" time spent in the CPython interpreter, so fileobj I/O is a lot slower than acquiring a file on just the Rust side.)

@woodruffw
Copy link
Owner

I gave this some more thought:

  1. If at all possible with your use case, I'd prefer not to have a dedicated file-like API here (for the reasons mentioned above). In particular: since you're streaming from a server, have you looked into using memfd or a FIFO for this? If your use case enables system-level file descriptors as an abstraction, then I'd be open to supporting an API that takes raw file descriptors and does I/O over them. This is arguably less Pythonic, but IMO it avoids a lot of the abstraction leakage and interpreter overhead problems that I've seen with fileobjs in the past.
  2. If the above is absolutely not an option for you, then I'd consider adding fileobj support through pyo3-file. Ideally this would be in the form of an extra (i.e. pip install pyrage[fileobj]), but I'm not sure if pyO3 or Maturin supports that.

Let me know if this sounds reasonable to you! TL;DR: I'd be willing to accept a PR for fileobj support, but only if your other options here are exhausted 🙂

@rectalogic
Copy link
Author

We would prefer not to route it thru the filesystem if possible.

Another approach could be chunked decryption, where we would get bytes however we want in python and repeatedly pass an incremental byte array into some decryption API that would return the decrypted bytes.

It looks like the rust age library has an API for this https://github.com/str4d/rage/blob/bccc4bc9497e0a4542219787ec8183414d1bd9a5/age/src/primitives/stream.rs#L197

I guess you would need to expose age::Decryptor via some python wrapper since it will need to maintain state across chunks.

@woodruffw
Copy link
Owner

We would prefer not to route it thru the filesystem if possible.

That's understandable; does memfd not suffice for your use case? When used correctly a memfd descriptor is similar to a mmap descriptor, and should never touch the filesystem.

Re: chunked decryption: this would also be reasonable!

@bendem
Copy link

bendem commented Oct 16, 2023

Hello, I'm looking for this too. We have a decryption service that reads file from remote storage and writes the decrypted bytes to the caller. This service has no disk available to store the encrypted or decrypted state (plus we pass that stream through other filters like decompression and checksumming).

We have an API that basically looks like this on our end and all decryptors implement it:

def filter(input: io.BufferedReader, output: io.BufferedWriter):
    ...

The simplest implementation of this is what we have for zstd decompression (also rust bindings):

def zstd_decompress_filter(input: io.IOBase, output: io.IOBase) -> str:
    dctx = zstandard.ZstdDecompressor()
    dctx.copy_stream(input, output)

https://github.com/indygreg/python-zstandard/blob/0063333790a853360c816101511635865405834f/rust-ext/src/compressor.rs#L219

@woodruffw
Copy link
Owner

This service has no disk available to store the encrypted or decrypted state (plus we pass that stream through other filters like decompression and checksumming).

Could you check and see if memfd would work for your use case, as mentioned in #36 (comment)? That would be my preferred approach to implementing this, and wouldn't require disk access or usage.

@bendem
Copy link

bendem commented Oct 17, 2023

I don't know memfd, never used it. Unless I'm missing something, I will create an input and output anonymous files and I will have to write the whole file into that anonymous file, basically loading it fully in memory. That's not streaming.

Here is what I came up with, take it as pseudocode.

def age_decrypt_filter(path: str, request_output: io.BytesIO) -> str:
    s3_input = s3.get_stream(path)
    inputfd = os.memfd_create("s3-input")
    outputfd = os.memfd_create("age-output")

    recipient = pyrage.x25519.Recipient.from_str("age1z...")

    with open(inputfd, "rb") as age_input:
        shutil.copyfileobj(s3_input, age_input)

        with open(outputfd, "wb") as age_output:
            age_input.seek(0, os.SEEK_SET)
            pyrage.decrypt_file(age_input, age_output, [recipient])

            age_output.seek(0, os.SEEK_SET)
            zstd_decompress_filter(age_output, request_output)

def zstd_decompress_filter(input: io.IOBase, output: io.IOBase) -> str:
    dctx = zstandard.ZstdDecompressor()
    dctx.copy_stream(input, output)

@woodruffw
Copy link
Owner

I don't know memfd, never used it. Unless I'm missing something, I will create an input and output anonymous files and I will have to write the whole file into that anonymous file, basically loading it fully in memory. That's not streaming.

The trick here would be to use demand-paging with mmap + memfd, which would be true streaming. This is traditionally not possible with network FDs because they aren't arbitrarily seekable, but Linux actually does support it with TCP as of a few years back. The constraints there (such as having to configure the MTU on your interface, and needing to guarantee that inbound data is page-sized and page-aligned) may not be worth it for your case, though.

TL;DR this is do-able, but probably not practically. Sorry for taking you down a rabbit hole 🙂

Looks like a fileobj-based interface is probably for the best, after all. Would you be interested in submitting a patch for that, potentially based on the work in #32?

@woodruffw woodruffw added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants