ChaCha20-Poly1305 Benchmark #6679

hbiyik · 2021-12-23T13:23:09Z

hbiyik
Dec 23, 2021

Hello

Out of personal interest, I was benchmarking the tunnel crypto performance and focused on ChaCha20-Poly1305.

I noticed that several Python libraries focus on safety and flexibility rather than speed, so their performance is poor.

I have benchmarked several options for encryption throughput:

Cryptography library with OpenSSL backend (the yellow guy)
Libnacl with the libsodium library (the green guy)
A very unsafe version of the Cryptography library specifically tuned for Tribler (the red guy)
A custom C-based Python library that interfaces directly with OpenSSL, based on this: https://github.com/shaktee/aeadpy/blob/2cbdfca4fb41ba6794e0dc487b2302c975c0028c/aead_python.c#L83 (the blue guy)

The graph below shows the input size (bytes) on the X axis and the encryption throughput (megabytes per second) on the Y axis, measured on a single core of a 64-bit AMD FX-8320 CPU on a Linux machine.

I find these results interesting. The main takeaway is that the less interaction there is between Python objects and the C code, the better the performance. For example, aeadpy creates the encryption context directly in C, encrypts the input, destroys the context and returns the encrypted data. The context is never exposed to Python, so there is less overhead. This also shows how well optimized the actual encryption code is, since converting the context to Python takes nearly as much time as the encryption itself.

Even though the synthetic benchmarks show gains of about 2x, I am not sure this will translate directly to tunnel performance. Therefore I have 2 questions:

Is there any test case to benchmark encrypted tunnel performance? I also want to verify whether it really improves.
What would be the average input size to be encrypted in the tunnel backend? You can see that the performance of the libraries depends greatly on the input size, so further optimizations can be targeted at that input profile.

I can share my code as well, but currently it is very junky.

qstokkink · 2021-12-23T13:55:06Z

qstokkink
Dec 23, 2021
Maintainer

I'll start with the answers to your questions:

Yes, we have three tests for tunnel performance, which you can find here (click me). These tests are run periodically on Jenkins, but they are unfortunately broken, untouched and unloved since October.
The message sizes vary, but for a ballpark estimate you should think in the order of several hundreds to under 2k bytes.
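
The ballpark sizes above can be fed into a small throughput sweep. The following is only a sketch of such a harness, not the script used in this thread: `throughput_mb_per_s`, `sweep` and `stand_in_cipher` are hypothetical names, and the stand-in is a keyed BLAKE2b hash from the standard library rather than a real ChaCha20-Poly1305 call, so that the example runs without third-party packages. Swap in any `encrypt(data)` callable to compare backends.

```python
import hashlib
import time


def throughput_mb_per_s(encrypt, size, min_seconds=0.05):
    """Return the MB/s achieved by calling encrypt() on `size`-byte inputs."""
    data = bytes(size)
    calls = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_seconds:
        encrypt(data)
        calls += 1
        elapsed = time.perf_counter() - start
    return (calls * size) / elapsed / 1e6


def stand_in_cipher(data):
    # NOT encryption: a keyed BLAKE2b hash used only as a placeholder workload.
    return hashlib.blake2b(data, key=b"k" * 32).digest()


def sweep(encrypt, sizes=(256, 512, 1024, 2048)):
    """Measure throughput for each input size in the assumed tunnel range."""
    return {size: throughput_mb_per_s(encrypt, size) for size in sizes}


if __name__ == "__main__":
    for size, rate in sweep(stand_in_cipher).items():
        print(f"{size:>5} B  {rate:8.1f} MB/s")
```

Because each call carries a fixed Python-to-C overhead, small inputs in this range tend to be dominated by call overhead rather than by the cipher itself.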

At this point, I think the team consensus is to no longer invest in trying to optimize the Python part of the tunnels and to make a dedicated C (/Rust?) implementation. This is mostly because the "non-optimal" performance is mostly due to Python itself. As you also mentioned, handing off data between C and Python absolutely kills performance.

hbiyik · 2021-12-23T15:21:30Z

hbiyik
Dec 23, 2021
Author

Thanks for the tests, that would help me to check it.

I am picking this up out of personal interest. Let's see where I end up; maybe the findings will be useful for a future C/Rust implementation as well.

One thing I am sure of: at the C level, no general-purpose crypto library is faster than OpenSSL.
The input size is definitely a factor for the speed as well. See the chart below, which shows throughput up to 1 MB input size. After about 200 KB the speed drops drastically. My guess (an assumption) is that the L1 data cache cannot keep up, so even in the C interface a streaming method should be implemented, at least for OpenSSL. Considering low-end devices, the chunk size should be even lower, maybe in the 32K-64K range.
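
The chunking idea can be sketched as follows. This is a minimal illustration that only shows how a large input could be split into 64 KiB views; `iter_chunks` and `process_in_chunks` are hypothetical helpers, not part of any library mentioned here. Note that with an AEAD cipher such as ChaCha20-Poly1305 each chunk would need its own unique nonce and would carry its own tag, which this sketch does not handle.

```python
def iter_chunks(data, chunk_size=64 * 1024):
    """Yield zero-copy views of `data`, each at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    view = memoryview(data)
    for offset in range(0, len(view), chunk_size):
        yield view[offset:offset + chunk_size]


def process_in_chunks(data, transform, chunk_size=64 * 1024):
    """Apply `transform` to each chunk and collect the results in order."""
    return [transform(chunk) for chunk in iter_chunks(data, chunk_size)]


if __name__ == "__main__":
    payload = bytes(150_000)
    # Using len() as the transform simply reports the chunk sizes.
    print(process_in_chunks(payload, len))
```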

hbiyik · 2021-12-23T15:23:57Z

hbiyik
Dec 23, 2021
Author

Another interesting thing: by Python standards, achieving 1 GB/s throughput per core is not bad at all.

hbiyik · 2021-12-29T22:32:54Z

hbiyik
Dec 29, 2021
Author

I have tested the overall tunnel performance of the crypto backends with various options.
Here is the executive report:

A crypto backend that is 100% faster on pure crypto operations improves hidden tunnel performance by only about 3%. As said, the performance is lost in the reactor doing normal Python stuff, so it never has the chance to push the crypto backend to performant levels. Therefore the 3% improvement is basically nothing.
Using uvloop definitely improves tunnel performance, by about 70%-80% regardless of whether the tunnel is hidden or not, which is very nice for a one-liner in the codebase. This 70%-80% only holds when the reactor is free; when background tasks fill the event loop, the gain drops to 30%-40%, which is still good.
I have noticed that a lot of type checking in ipv8 (isinstance calls and other kinds of type checks, limit checks, etc.) really wastes a lot of reactor juice. It is going to be rewritten in C, but just for reference.
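
For reference, switching the event loop can be sketched like this. It is a hedged example, not the change made in the test script: `new_event_loop`, `ping` and `run` are hypothetical names, and the fallback to the stock asyncio loop is an assumption added so the code also works where uvloop is not installed.

```python
import asyncio

try:
    import uvloop  # optional third-party dependency
except ImportError:
    uvloop = None


def new_event_loop():
    """Create a uvloop event loop when available, else a stock asyncio one."""
    if uvloop is not None:
        return uvloop.new_event_loop()
    return asyncio.new_event_loop()


async def ping():
    # Trivial coroutine that just proves the loop is running.
    await asyncio.sleep(0)
    return "pong"


def run(coro):
    """Run one coroutine to completion on a freshly created loop."""
    loop = new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


if __name__ == "__main__":
    print(run(ping()))
```

Keeping the fallback in one place means the rest of the code does not need to know which loop implementation is in use.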

Finally,

Here is my test script, which runs n hops + seeder + exit nodes + bootstrap tracker locally without going to the WAN. This makes it possible to test tunnel performance on the loopback adapter, free from network conditions. You can also pin each peer to specific CPU cores and profile the code line by line. It also includes the AEADPY version of chacha20 for testing, but you have to build it if you want to use it.

https://github.com/hbiyik/junkyard/blob/master/testtunnel.py

Here is the stuff you can configure in the script:

TOTALMB = 25 # total MB to ping-pong between entry and exit
STARTPORT = 7090 # each peer increments this ipv8 port
PROFILE = False # if enabled, runs the line profiler on the entry node; results are saved in profile.txt
UVLOOP = False # change the reactor to uvloop instead of asyncio
AEADPY = False # use a faster symmetric encryption library; the idea is implemented here, a little bit ugly: https://github.com/shaktee/aeadpy
USEAFFINITY = True # each peer runs as a separate process; when affinities are enabled, each peer runs strictly on the core specified in CPU_AFFINITIES. This is very useful for comparing the performance of different pyipv8 implementations
HOPS = [1] # test for each node count
E2E = [True] # test for each e2e criteria
SCALE = 1 # scale the transmitted packet size in the test; bigger is faster because it reduces the reactor overhead, but be careful not to exceed the UDP packet size
LOGLEVEL = 20
TRACKERPORT = 7089

And a pure crypto benchmark script: https://github.com/hbiyik/junkyard/blob/master/cryptobench.py with abused versions of the Cryptography library as a bonus.

In case you find it useful.

qstokkink · 2021-12-30T08:06:32Z

qstokkink
Dec 30, 2021
Maintainer

Thanks for sharing your code; interesting insights. I think you'll find any throughput increase completely nullified once you start sending across long distances over the Internet. However, not wasting a user's CPU cycles is also a nice improvement to strive for i.m.o.

AFAIK uvloop still only supports Linux, while the majority of our users use Windows (which is why we opted-out before). Maintaining and shipping OS-specific optimizations is usually not worth it, but it's hard to argue with even 30%-40% increase. Perhaps we should reconsider using it.

As the C implementation is still pretty far off, if you have found any easy performance optimizations in the Python code I'd love to hear about them. We can definitely carefully examine any particularly bad calls and see if they are still necessary (some parts of the codebase are 14 years old by now and still very Java-esque and defensively programmed).

hbiyik · 2022-01-01T22:03:20Z

hbiyik
Jan 1, 2022
Author

I think libuv is already cross-platform, but for some reason uvloop is still Linux only.

Coming to Python optimizations: TaskManager calls iscoroutinefunction on each task registration. This function is costly because it inspects the function's code object each time it is executed. I benchmarked the code and I think there is a 2% to 3% optimization margin there. I think the best way to handle this is to use the typing module and not check any data types at runtime, but that is a big refactor and how worthwhile it is is questionable. You can find the benchmark results in the ODS file below. I didn't bother to put them here because I think there is better news.

if not isinstance(task, Task) and not iscoroutinefunction(task) and not callable(task):
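
The relative cost of the three checks in that condition can be measured with a short `timeit` sketch. This is an illustration only: `time_checks` and `sample_task` are hypothetical names, and it uses `inspect.iscoroutinefunction` from the standard library, which may differ from the import used in ipv8.

```python
import timeit
from asyncio import Task
from inspect import iscoroutinefunction


async def sample_task():
    return None


def time_checks(number=100_000):
    """Return the total seconds spent in each check over `number` calls."""
    task = sample_task  # a coroutine function; it is never called here
    checks = {
        "isinstance": lambda: isinstance(task, Task),
        "iscoroutinefunction": lambda: iscoroutinefunction(task),
        "callable": lambda: callable(task),
    }
    return {name: timeit.timeit(fn, number=number) for name, fn in checks.items()}


if __name__ == "__main__":
    for name, seconds in time_checks().items():
        print(f"{name:<20} {seconds:.4f} s")
```

The absolute numbers depend on the Python version and machine, so the comparison between the three rows is what matters.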

The EVP Cython wrapper:
So I created a new module from scratch that wraps the OpenSSL evp.h header, which also provides the ChaCha20-Poly1305 cipher. On pure crypto benchmarks it is at least 400% faster than any available library, which is great news. I have also tested its performance with ipv8 tunnels, and the takeaway is that normal tunnels are 10% faster and E2E tunnels are 20% faster than with libnacl* in practical tunnel tests.

Here is the Cython library:
https://github.com/hbiyik/evp/blob/master/asyncaead/evp.pyx

Here is the benchmark output:

Here is the benchmark data and graphs:
tunneltest1.ods

PS: I have to admit that I don't 100% know what I am doing internally in the tunnel test scripts. E.g. when I establish a tunnel without the E2E flavour, ipv8 still uses the crypto backend ~~which confuses me a little bit~~ (after reading the documentation I'm still a noob, but I understand it clearly now). In any case the numbers should be similar, give or take; you are the expert on the ipv8 internals. But I think there is an easy performance grab here.

PS2: Here is how to patch ipv8 to use evp:
Initialize the class somewhere with

Evp = evp.Aead()

and then patch tunnelcrypto.py:

    def encrypt_str(self, content, key, salt, salt_explicit):
        # return the encrypted content prepended with salt_explicit
        ciphertext = Evp.encrypt(content, key, salt + struct.pack('!q', salt_explicit), b"")
        return struct.pack('!q', salt_explicit) + ciphertext

    def decrypt_str(self, content, key, salt):
        # content contains the tag and salt_explicit in plaintext
        if len(content) < 24:
            raise CryptoException("truncated content")

        block = salt + content
        return Evp.decrypt(block[12:], key, block[:12], b"")
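
The byte layout used by the patch can be made explicit with a standalone sketch. It assumes a 4-byte `salt`, which is inferred from the 12-byte slice and the 8-byte `'!q'` counter rather than stated in the thread; `build_nonce` and `split_packet` are hypothetical helper names. The 24-byte minimum in the patch corresponds to 8 bytes of `salt_explicit` plus the 16-byte Poly1305 tag.

```python
import struct

SALT_LEN = 4       # assumed: implied by the 12-byte slice and the 8-byte counter
EXPLICIT_LEN = 8   # struct.pack('!q', ...) always yields 8 bytes
TAG_LEN = 16       # Poly1305 authentication tag


def build_nonce(salt, salt_explicit):
    """Return the 12-byte nonce: fixed salt followed by the packed counter."""
    if len(salt) != SALT_LEN:
        raise ValueError("salt must be 4 bytes")
    return salt + struct.pack("!q", salt_explicit)


def split_packet(salt, content):
    """Return (nonce, ciphertext_with_tag) from a received packet body."""
    if len(content) < EXPLICIT_LEN + TAG_LEN:
        raise ValueError("truncated content")
    block = salt + content
    return block[:12], block[12:]


if __name__ == "__main__":
    salt = b"\x01\x02\x03\x04"
    packet = struct.pack("!q", 7) + b"c" * 20  # counter + fake ciphertext/tag
    nonce, body = split_packet(salt, packet)
    print(nonce.hex(), len(body))
```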

1 reply

qstokkink Jan 2, 2022
Maintainer

Thanks for pointing out the TaskManager call. I agree that we should probably not touch this line: the typing module cannot serve as a substitute for runtime type checking in this case, which would make debugging a nightmare.

Your results look promising. However, changing the core cryptography of our anonymization layer and making the switch into shipping C code is not something that should be done lightly. I'll take some time going through your code (cross-referencing the OpenSSL docs to make sure it's secure, there are no memory leaks, etc.) and run my own benchmarks. In the process I'd also switch this to CFFI, because it integrates nicely with setuptools and I don't want to maintain custom compiler localization code. At the end of the ride, there should probably be a Tribler dev team vote on the trade-offs between security, maintainability and performance.

hbiyik · 2022-01-02T11:00:13Z

hbiyik
Jan 2, 2022
Author

The cryptography module already uses this method with CFFI and it is horribly slow, because to use the EVP interface from OpenSSL you have to initialize the context, configure it, feed the data and clean up the context, which is several calls. When used with CFFI, Python object conversion eats all the speed improvements, because all of those OpenSSL calls are already very fast compared to the Python conversion overhead. cryptography also does some paranoid exception handling at the Python level, which makes it even worse.

I already tested a CFFI-based version derived from cryptography without that paranoid exception handling, but the Python overhead is still too big, and it barely matches libnacl performance.

If there were a single-call interface directly from OpenSSL, CFFI would be a great solution, but I think there is not.

The module I wrote is certainly not safe. First of all there is no input validation, but input validation in Cython is quite fast, so there should be no performance loss. There is also no handling of OpenSSL errors. To my knowledge OpenSSL should not raise an error as long as the input is valid, but even in the worst case the performance impact of exception handling in Cython should be minuscule, simply because it happens at the C level.

BTW Cython also works quite well with setuptools, if that's the concern.

2 replies

qstokkink Jan 3, 2022
Maintainer

Completely agreed, ping-ponging back and forth between Python and C is horrible. CFFI lets you declare and integrate with C functions directly though: https://cffi.readthedocs.io/en/latest/overview.html#purely-for-performance-api-level-out-of-line

In your next message I see you already resolve this, but let me stress this anyway: skipping asserts or exceptions in the crypto layer is unacceptable. You should be paranoid about cryptography.

Regarding Cython vs. CFFI, the path juggling for Windows in your example is an example of why I prefer CFFI. In my opinion (you may certainly disagree) almost everything about CFFI is just a bit easier and neater than Cython. To add some nuance: I don't think Cython is bad. I just think CFFI is better.

hbiyik Jan 3, 2022
Author

Ah, I did not know it was possible to define a block of inline functions with CFFI. Now I see what you mean, thanks for the explanation.

hbiyik · 2022-01-02T18:26:19Z

hbiyik
Jan 2, 2022
Author

Update: exceptions are handled now, and OpenSSL has built-in input validation, so the implementation is safe now imho. I also added a few percent of performance boost :). I think there is still considerable margin at the C level, but let's keep it portable atm.

qstokkink · 2022-01-16T14:55:47Z

qstokkink
Jan 16, 2022
Maintainer

Finally found some time to test this out, I ported your code to CFFI (you can find it here) and did some benchmarking and testing in the wild. I think I just hit some new highscores for throughput (same machine previously didn't go over 3.6MBps for 1-hop downloads for this download):

7 replies

hbiyik Jan 17, 2022
Author

It is good to know the options anyway, in case tunnel performance needs to be improved.

Two things about the implementation. First, I think it might be beneficial to release the GIL during the crypto operations, so that the rest of the futures pending in the event loop can make progress while the crypto runs. But I am not sure whether synchronous function calls can really free the event loop just by releasing the GIL.

Coming to the second topic: my initial idea was actually to make the crypto functions async, but I did not bring this up to keep things less complicated. I have a very buggy prototype, filled with deadlocks and segfaults, here just to show the idea. The idea is that encrypt/decrypt calls return Future instances, and those instances, along with the input parameters, are put in a FIFO queue built on pthread mutexes with a maxsize equal to the max thread count. The threads, one per logical CPU core, consume the backlog of Futures and compute the crypto operation without the GIL; after finishing, they acquire the GIL and set the ~~thread~~ Future result from the thread. The reactor is therefore completely freed from the crypto CPU bottleneck, which by itself should benefit performance a lot. The crypto operations also run completely in parallel, so with multiple circuits, where crypto operations are redundant in the ipv8 backend, overall crypto performance should gain at least 2x (according to my crystal ball) compared to the synchronous C-level evp implementation, even considering Future object overhead, GIL release delays, etc. When I have the time I plan to finish the implementation and benchmark the async variant. I think my synthetic tests are already giving more or less correct output, as far as I understand from the previous discussions.
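
The shape of that idea can be sketched in pure Python. This is not the prototype described above: it swaps the pthread-based queue for a standard `ThreadPoolExecutor`, and `AsyncCrypto`, `stand_in_crypto` and `demo` are hypothetical names. The stand-in workload is a SHA-256 hash, not an AEAD call; it is used here only because `hashlib` releases the GIL for larger inputs.

```python
import asyncio
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def stand_in_crypto(data):
    # Placeholder for the real AEAD call, NOT encryption.
    return hashlib.sha256(data).digest()


class AsyncCrypto:
    def __init__(self, workers=None):
        # One worker per logical core unless told otherwise.
        self._pool = ThreadPoolExecutor(max_workers=workers or os.cpu_count())

    def encrypt(self, data):
        """Queue the work and return an awaitable Future immediately."""
        loop = asyncio.get_running_loop()
        return loop.run_in_executor(self._pool, stand_in_crypto, data)

    def close(self):
        self._pool.shutdown(wait=True)


async def demo():
    crypto = AsyncCrypto(workers=2)
    try:
        # All four operations are queued before any result is awaited.
        futures = [crypto.encrypt(bytes([i]) * 4096) for i in range(4)]
        return await asyncio.gather(*futures)
    finally:
        crypto.close()


if __name__ == "__main__":
    print([digest.hex()[:8] for digest in asyncio.run(demo())])
```

Every call pays for a Future, a queue handoff and a wake-up of the event loop, so whether this helps depends on how that cost compares to the crypto work itself.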

qstokkink Jan 17, 2022
Maintainer

I'd love to see if that works out, please keep us updated. In the past I've also played around with something similar, but (shockingly) the GIL operations were so slow that throughput was actually worse: synchronous without GIL release was faster.

hbiyik Jan 17, 2022
Author

I think that's a concern when the input size is small, but it can also be mitigated by spawning 1 extra thread that holds the GIL and consumes the queue of already processed items, which is another madness. But first I have to get the simpler version working. Currently its quality is as below :):

RuntimeError("some stuff")

hbiyik Feb 28, 2022
Author

I tried this approach thoroughly, I mean I really tried. The takeaway is that the GIL is very slow to release. To work with a Future, one needs to release the GIL and acquire it again at least once per operation. Whatever you do, that is never faster than the actual crypto operation, so it slows the operation down. The result is a miserable failure. For the record: theoretically it might make sense when the async Future covers not only the crypto operation but a bigger task, e.g. handling the overall packet request/response. That would be a longer, more CPU-intensive task, so the GIL acquire/release delay would be negligible. However, that is a totally different design of the stack and not a practical route to follow.
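
The per-operation handoff cost can be observed with a rough sketch like the one below. It is an assumption-laden illustration, not the benchmark that was run: `work` is a cheap CRC32 stand-in for one fast crypto call on a packet-sized input, and `compare` is a hypothetical helper that times the same number of calls made directly and made through a worker thread.

```python
import asyncio
import time
import zlib
from concurrent.futures import ThreadPoolExecutor


def work(data):
    # Cheap stand-in for one fast crypto call on a packet-sized input.
    return zlib.crc32(data)


async def compare(n=2000, size=1024):
    """Return (direct_seconds, handed_off_seconds) for n calls each."""
    data = bytes(size)
    loop = asyncio.get_running_loop()

    start = time.perf_counter()
    for _ in range(n):
        work(data)
    direct = time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.perf_counter()
        for _ in range(n):
            # Each call creates a Future and crosses threads twice.
            await loop.run_in_executor(pool, work, data)
        handed_off = time.perf_counter() - start

    return direct, handed_off


if __name__ == "__main__":
    direct, handed_off = asyncio.run(compare())
    print(f"direct: {direct:.4f} s, via executor: {handed_off:.4f} s")
```

When the work per call is this small, the handoff would typically be expected to dominate, which is consistent with the result reported above.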

qstokkink Feb 28, 2022
Maintainer

Thanks for letting us know. Good to hear that at least it wasn't my crummy implementation that made it slow before 😄 At any rate, this further cements my belief that we should make a full C port of the lower level overlay code.

synctext · 2022-01-17T15:44:50Z

synctext
Jan 17, 2022
Maintainer

very impressive work 🎉 🏅 Fascinating performance boost!

Nobody is maintaining the tunnels currently. Anonymity and tunnels are on-hold till at least summer of 2022. The focus is on general usability, bug fixing and content search/tagging. Making such an invasive change at this moment is a bit too daring I'm afraid. We should definitely try to re-produce your results in the lab and on Mac plus Windows also. Hopefully they replicate.

Edit: just to add further: due to a shortage of people (not budget) we're focused on fixes currently. A "don't break it if it's not broken" policy.

1 reply

hbiyik Jan 18, 2022
Author

I think that's quite reasonable. The main question still stands, though: even if these tricks improve tunnel performance, the underlying method for the improvement is to not use Python. Is it really worth keeping ipv8 in Python?

And what would be the point of maintaining foreign code in the current Python codebase if you only partially benefit from it (say in the crypto backend, the serializer backend or some other hotspot) and still have to maintain the foreign code? Why not just write the whole of IPv8 in another language that is performant? To me at least, it is clear that the only thing (P)erformance and (P)ython have in common is their initial, nothing else.
