Streaming decryption #36

rectalogic · 2023-07-25T21:24:04Z

Any plans to support streaming decryption from a file-like object? We want to stream large encrypted files from an sftp server and decrypt them on the fly as we write them somewhere else.

It looks like decrypt_file is basically doing that, we just need an API that takes the actual file-like fileobj.

pyrage/src/lib.rs

Line 224 in aab0f86

fn decrypt_file(

Maybe using something like this https://github.com/omerbenamram/pyo3-file

The text was updated successfully, but these errors were encountered:

woodruffw · 2023-07-25T21:55:42Z

Thanks for opening an issue!

Could you say a bit more about your requirements here? We added the decrypt_file API with #32; that originally included support for file-like objects, but I held off on it due to a lack of clarity on end-user needs. Is the problem here that your SFTP streaming operation produces a Python fileobj, and not something with a file descriptor or path that could otherwise be streamed from?

rectalogic · 2023-07-25T22:13:21Z

Correct, we have no fd or path. In this case it's actually a custom object that implements file-like methods (it inherits io.BufferedIOBase)

But if you are doing buffering yourself then I would expect probably any python object that just implements read(size=-1) to work.

woodruffw · 2023-07-25T22:50:55Z

Got it, thanks for explaining!

I'm open to supporting this, but I need to think a bit more about how I want to do it: I'd generally prefer to keep this package's dependencies to a bare minimum, and I want to make sure the downsides of using a file-like object API are communicated well to users (specifically, there's much more "critical" time spent in the CPython interpreter, so fileobj I/O is a lot slower than acquiring a file on just the Rust side.)

woodruffw · 2023-08-01T03:45:47Z

I gave this some more thought:

If at all possible with your use case, I'd prefer not to have a dedicated file-like API here (for the reasons mentioned above). In particular: since you're streaming from a server, have you looked into using memfd or a FIFO for this? If your use case enables system-level file descriptors as an abstraction, then I'd be open to supporting an API that takes raw file descriptors and does I/O over them. This is arguably less Pythonic, but IMO it avoids a lot of the abstraction leakage and interpreter overhead problems that I've seen with fileobjs in the past.
If the above is absolutely not an option for you, then I'd consider adding fileobj support through pyo3-file. Ideally this would be in the form of an extra (i.e. pip install pyrage[fileobj]), but I'm not sure if pyO3 or Maturin supports that.

Let me know if this sounds reasonable to you! TL;DR: I'd be willing to accept a PR for fileobj support, but only if your other options here are exhausted 🙂

rectalogic · 2023-08-01T22:21:20Z

We would prefer not to route it thru the filesystem if possible.

Another approach could be chunked decryption, where we would get bytes however we want in python and repeatedly pass an incremental byte array into some decryption API that would return the decrypted bytes.

It looks like the rust age library has an API for this https://github.com/str4d/rage/blob/bccc4bc9497e0a4542219787ec8183414d1bd9a5/age/src/primitives/stream.rs#L197

I guess you would need to expose age::Decryptor via some python wrapper since it will need to maintain state across chunks.

woodruffw · 2023-08-01T23:59:10Z

We would prefer not to route it thru the filesystem if possible.

That's understandable; does memfd not suffice for your use case? When used correctly a memfd descriptor is similar to a mmap descriptor, and should never touch the filesystem.

Re: chunked decryption: this would also be reasonable!

bendem · 2023-10-16T09:18:53Z

Hello, I'm looking for this too. We have a decryption service that reads file from remote storage and writes the decrypted bytes to the caller. This service has no disk available to store the encrypted or decrypted state (plus we pass that stream through other filters like decompression and checksumming).

We have an API that basically looks like this on our end and all decryptors implement it:

def filter(input: io.BufferedReader, output: io.BufferedWriter):
    ...

The simplest implementation of this is what we have for zstd decompression (also rust bindings):

def zstd_decompress_filter(input: io.IOBase, output: io.IOBase) -> str:
    dctx = zstandard.ZstdDecompressor()
    dctx.copy_stream(input, output)

https://github.com/indygreg/python-zstandard/blob/0063333790a853360c816101511635865405834f/rust-ext/src/compressor.rs#L219

woodruffw · 2023-10-16T14:28:43Z

This service has no disk available to store the encrypted or decrypted state (plus we pass that stream through other filters like decompression and checksumming).

Could you check and see if memfd would work for your use case, as mentioned in #36 (comment)? That would be my preferred approach to implementing this, and wouldn't require disk access or usage.

bendem · 2023-10-17T14:19:34Z

I don't know memfd, never used it. Unless I'm missing something, I will create an input and output anonymous files and I will have to write the whole file into that anonymous file, basically loading it fully in memory. That's not streaming.

Here is what I came up with, take it as pseudocode.

def age_decrypt_filter(path: str, request_output: io.BytesIO) -> str:
    s3_input = s3.get_stream(path)
    inputfd = os.memfd_create("s3-input")
    outputfd = os.memfd_create("age-output")

    recipient = pyrage.x25519.Recipient.from_str("age1z...")

    with open(inputfd, "rb") as age_input:
        shutil.copyfileobj(s3_input, age_input)

        with open(outputfd, "wb") as age_output:
            age_input.seek(0, os.SEEK_SET)
            pyrage.decrypt_file(age_input, age_output, [recipient])

            age_output.seek(0, os.SEEK_SET)
            zstd_decompress_filter(age_output, request_output)

def zstd_decompress_filter(input: io.IOBase, output: io.IOBase) -> str:
    dctx = zstandard.ZstdDecompressor()
    dctx.copy_stream(input, output)

woodruffw · 2023-10-17T15:03:53Z

I don't know memfd, never used it. Unless I'm missing something, I will create an input and output anonymous files and I will have to write the whole file into that anonymous file, basically loading it fully in memory. That's not streaming.

The trick here would be to use demand-paging with mmap + memfd, which would be true streaming. This is traditionally not possible with network FDs because they aren't arbitrarily seekable, but Linux actually does support it with TCP as of a few years back. The constraints there (such as having to configure the MTU on your interface, and needing to guarantee that inbound data is page-sized and page-aligned) may not be worth it for your case, though.

TL;DR this is do-able, but probably not practically. Sorry for taking you down a rabbit hole 🙂

Looks like a fileobj-based interface is probably for the best, after all. Would you be interested in submitting a patch for that, potentially based on the work in #32?

woodruffw added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Nov 24, 2023

julianhille mentioned this issue Sep 22, 2024

Re-add fh streaming possibility #77

Merged

woodruffw closed this as completed in #77 Sep 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streaming decryption #36

Streaming decryption #36

rectalogic commented Jul 25, 2023

woodruffw commented Jul 25, 2023 •

edited

Loading

rectalogic commented Jul 25, 2023

woodruffw commented Jul 25, 2023

woodruffw commented Aug 1, 2023

rectalogic commented Aug 1, 2023

woodruffw commented Aug 1, 2023

bendem commented Oct 16, 2023

woodruffw commented Oct 16, 2023

bendem commented Oct 17, 2023

woodruffw commented Oct 17, 2023

Streaming decryption #36

Streaming decryption #36

Comments

rectalogic commented Jul 25, 2023

woodruffw commented Jul 25, 2023 • edited Loading

rectalogic commented Jul 25, 2023

woodruffw commented Jul 25, 2023

woodruffw commented Aug 1, 2023

rectalogic commented Aug 1, 2023

woodruffw commented Aug 1, 2023

bendem commented Oct 16, 2023

woodruffw commented Oct 16, 2023

bendem commented Oct 17, 2023

woodruffw commented Oct 17, 2023

woodruffw commented Jul 25, 2023 •

edited

Loading