-
I'll start with the answers to your questions:
At this point, I think the team consensus is to stop investing in optimizing the Python part of the tunnels and to build a dedicated C (or Rust?) implementation instead. The main reason is that the suboptimal performance is largely due to Python itself. As you also mentioned, handing data back and forth between C and Python absolutely kills performance.
-
Thanks for the tests, they will help me check it. I am picking this up out of personal interest. Let's see where I end up; maybe the findings will be useful for a future C/Rust implementation as well. One thing I am sure of: at the C level, no general-purpose crypto library is faster than OpenSSL.
-
Another interesting thing: by Python standards, achieving 1 GByte/s of throughput per core is not bad at all.
-
I have tested the overall tunnel performance of the crypto backends with various options.
Finally, here is my test script. It runs n hops + seeder + exit nodes + bootstrap tracker locally without going over the WAN, which makes it possible to test tunnel performance on the loopback adapter, free from network conditions. You can also pin each peer to specific CPU cores and profile the code line by line. It also includes the AEADPY version of ChaCha20 for testing, but you have to build it yourself if you want to use it. https://github.com/hbiyik/junkyard/blob/master/testtunnel.py Here is what you can configure in the script:
And here is the pure crypto benchmark script, with abused versions of the Cryptography library as a bonus, in case you find it useful: https://github.com/hbiyik/junkyard/blob/master/cryptobench.py
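For readers who have not looked at the script yet, the pinning and profiling idea can be sketched with the standard library alone. This is not taken from testtunnel.py: `pin_to_core`, `profile_call` and `fake_hop` are hypothetical names, the SHA-256 loop only stands in for the per-packet work of a peer, and `cProfile` reports per function rather than per line (line-level profiling would need a third-party tool such as line_profiler).

```python
import cProfile
import hashlib
import io
import os
import pstats


def pin_to_core(core):
    """Pin this process to one CPU core. Returns False where that is not possible."""
    if not hasattr(os, "sched_setaffinity"):  # only available on some Unix systems
        return False
    try:
        os.sched_setaffinity(0, {core})
    except OSError:  # e.g. the core is not in the allowed set
        return False
    return True


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


def fake_hop(payload, rounds=1000):
    """Hypothetical stand-in for the per-packet work done by one tunnel peer."""
    for _ in range(rounds):
        payload = hashlib.sha256(payload).digest()
    return payload


if __name__ == "__main__":
    print("pinned:", pin_to_core(0))
    _, report = profile_call(fake_hop, b"x" * 1024)
    print(report)
```

Pinning before measuring keeps the scheduler from moving the process between cores mid-run, which is one way to make per-core throughput numbers more repeatable.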
-
Thanks for sharing your code; interesting insights. I think you'll find any throughput increase completely nullified once you start sending across long distances over the Internet. However, not wasting a user's CPU cycles is also a nice improvement to strive for, IMO. AFAIK the C implementation is still pretty far off, so if you have found any easy performance optimizations in the Python code I'd love to hear about them. We can definitely examine any particularly bad calls carefully and see if they are still necessary (some parts of the codebase are 14 years old by now and still very Java-esque and defensively programmed).
-
I think libuv is already cross-platform, but for some reason uvloop still does not support Windows.

Coming to Python optimizations: TaskManager calls iscoroutinefunction on each task registration. This function is costly because it inspects the function's code object each time it is executed. I benchmarked the code and I think there is a 2% to 3% optimization margin there. I think the best way to handle this is to use the typing module and not check any data types at runtime, but again that's a big refactor, and how worthwhile it is is questionable. You can find the benchmark results in the ODS file below. I didn't bother to put them here because I think there is better news.

    if not isinstance(task, Task) and not iscoroutinefunction(task) and not callable(task):

The EVP Cython wrapper:
Here is the Cython library
Here is the benchmark data and graphs

PS: I have to admit that I don't 100% know what I am doing internally in the tunnel test scripts, i.e. when I establish a tunnel without the e2e flavour, ipv8 still uses the crypto backend.

PS2: Here is how to patch ipv8 to use EVP:

    Evp = evp.Aead()

and then patch tunnelcrypto.py:

    def encrypt_str(self, content, key, salt, salt_explicit):
        # return the encrypted content prepended with salt_explicit
        ciphertext = Evp.encrypt(content, key, salt + struct.pack('!q', salt_explicit), b"")
        return struct.pack('!q', salt_explicit) + ciphertext

    def decrypt_str(self, content, key, salt):
        # content contains the tag and salt_explicit in plaintext
        if len(content) < 24:
            raise CryptoException("truncated content")
        # the nonce is salt + salt_explicit (first 12 bytes); the rest is ciphertext + tag
        block = salt + content
        return Evp.decrypt(block[12:], key, block[:12], b"")
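The wire layout those two methods produce (8-byte salt_explicit, then ciphertext, then a 16-byte tag) can be checked without the Cython module. The sketch below is only an illustration: `ToyAead` is a hypothetical stand-in built from hashlib and hmac, it is NOT real encryption, and it merely copies the argument order of the `Evp.encrypt` / `Evp.decrypt` calls shown above. It also assumes a 4-byte salt, which is what `block[:12]` implies when salt_explicit is packed as 8 bytes.

```python
import hashlib
import hmac
import struct

TAG_LEN = 16       # size of a Poly1305 tag; the toy below only mimics the size
EXPLICIT_LEN = 8   # struct.pack('!q', ...) always yields 8 bytes


class ToyAead:
    """NOT real encryption: a stdlib stand-in with the same call shape."""

    def _stream(self, key, nonce, n):
        return hashlib.shake_256(key + nonce).digest(n)

    def _tag(self, key, nonce, aad, body):
        mac = hmac.new(key, nonce + aad + body, hashlib.sha256)
        return mac.digest()[:TAG_LEN]

    def encrypt(self, content, key, nonce, aad):
        stream = self._stream(key, nonce, len(content))
        body = bytes(a ^ b for a, b in zip(content, stream))
        return body + self._tag(key, nonce, aad, body)

    def decrypt(self, content, key, nonce, aad):
        body, tag = content[:-TAG_LEN], content[-TAG_LEN:]
        if not hmac.compare_digest(tag, self._tag(key, nonce, aad, body)):
            raise ValueError("authentication failed")
        stream = self._stream(key, nonce, len(body))
        return bytes(a ^ b for a, b in zip(body, stream))


def encrypt_str(aead, content, key, salt, salt_explicit):
    explicit = struct.pack("!q", salt_explicit)  # 8-byte big-endian counter
    nonce = salt + explicit                      # 4 + 8 = 12-byte nonce
    return explicit + aead.encrypt(content, key, nonce, b"")


def decrypt_str(aead, content, key, salt):
    if len(content) < EXPLICIT_LEN + TAG_LEN:    # same 24-byte minimum as above
        raise ValueError("truncated content")
    nonce = salt + content[:EXPLICIT_LEN]
    return aead.decrypt(content[EXPLICIT_LEN:], key, nonce, b"")


if __name__ == "__main__":
    aead, key, salt = ToyAead(), bytes(32), b"\x01\x02\x03\x04"
    wire = encrypt_str(aead, b"hello tunnel", key, salt, 7)
    print(len(wire), decrypt_str(aead, wire, key, salt))
```

Because salt_explicit travels in plaintext in front of the ciphertext, the receiver can rebuild the full nonce from its own salt plus the first 8 bytes of the message.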
-
I had already tested a CFFI version based on cryptography without the paranoid exception handling, but the Python overhead is still too big, and it barely matches
If there were a single-call interface directly into OpenSSL, CFFI would be a great solution, but I think there is not. The module I wrote is certainly not safe: first of all, there is no input validation. Input validation in Cython is quite fast, though, so there should be no performance loss. There is also no handling of OpenSSL errors. As long as the input is valid, OpenSSL should not raise an error to my knowledge, and even in the worst case the performance impact of exception handling in Cython should be minuscule, simply because it happens at the C level. BTW, Cython also works quite well with Setuptools, if that's the concern.
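To make the input-validation point concrete, here is a minimal sketch of the kind of checks a wrapper could run before handing buffers to C. It is written in plain Python rather than Cython, `check_aead_args` is a hypothetical helper and not part of the module discussed here, and the sizes are the standard ChaCha20-Poly1305 ones (32-byte key, 12-byte nonce, 16-byte tag).

```python
KEY_LEN = 32    # ChaCha20-Poly1305 key size in bytes
NONCE_LEN = 12  # IETF ChaCha20-Poly1305 nonce size in bytes
TAG_LEN = 16    # Poly1305 tag size in bytes


def check_aead_args(key, nonce, data, decrypting=False):
    """Raise TypeError/ValueError if the arguments cannot be valid AEAD input."""
    for name, value in (("key", key), ("nonce", nonce), ("data", data)):
        if not isinstance(value, (bytes, bytearray)):
            raise TypeError(f"{name} must be bytes, got {type(value).__name__}")
    if len(key) != KEY_LEN:
        raise ValueError(f"key must be {KEY_LEN} bytes, got {len(key)}")
    if len(nonce) != NONCE_LEN:
        raise ValueError(f"nonce must be {NONCE_LEN} bytes, got {len(nonce)}")
    # Ciphertext always carries the tag, so it can never be shorter than it.
    if decrypting and len(data) < TAG_LEN:
        raise ValueError("ciphertext is shorter than the authentication tag")


if __name__ == "__main__":
    check_aead_args(bytes(32), bytes(12), b"payload")  # passes silently
    try:
        check_aead_args(bytes(16), bytes(12), b"payload")
    except ValueError as exc:
        print("rejected:", exc)
```

These are a handful of type and length comparisons per call, which is why checks of this kind are expected to cost little next to the encryption itself.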
-
Update: exceptions are handled now, and OpenSSL has built-in input validation, so the implementation is safe now IMHO. I also gained a few percent of extra performance :). I think there is still considerable margin at the C level, but let's keep it portable for now.
-
Finally found some time to test this out. I ported your code to CFFI (you can find it here) and did some benchmarking and testing in the wild. I think I just hit some new high scores for throughput (the same machine previously didn't go over 3.6 MB/s for 1-hop downloads of this download):
-
Very impressive work 🎉 🏅 Fascinating performance boost! Nobody is maintaining the tunnels currently. Anonymity and tunnels are on hold until at least the summer of 2022. The focus is on general usability, bug fixing and content search/tagging. Making such an invasive change at this moment is a bit too daring, I'm afraid. We should definitely try to reproduce your results in the lab, and also on Mac and Windows. Hopefully they replicate. Edit: just to add further: due to a shortage of people (not budget) we're focused on fixes currently. It's a "don't break it if it's not broken" policy.
-
Hello
Out of personal interest, I was benchmarking the tunnel crypto performance, focusing on ChaCha20-Poly1305.
I noticed that several Python libraries prioritize safety and flexibility over speed, so their performance is poor.
I have benchmarked several options for encryption throughput:
The graph below shows the input size (bytes) on the X axis and the encryption throughput (megabytes per second) on the Y axis, measured on a single core of a 64-bit AMD FX-8320 CPU on a Linux machine.
I find these results interesting. The main takeaway is that the less Python object interaction there is with the C code, the better the performance. For example, aeadpy creates the encryption context directly in C, encrypts the input, destroys the context and returns the encrypted data. The context is therefore never exposed to Python, and there is less overhead. This also shows how well optimized the actual encryption code is, since converting the context to Python takes nearly as much time as the encryption itself.
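The shape of such a measurement, and the effect of extra Python-level calls per message, can be sketched with the standard library. This is only an analogy: SHA-256 from hashlib stands in for encryption because it ships with Python, `one_shot`, `stepwise` and `throughput_mb_s` are hypothetical names, and the numbers it prints say nothing about ChaCha20-Poly1305 or aeadpy.

```python
import hashlib
import time


def one_shot(data):
    """A single call: the context is created and consumed in one expression."""
    return hashlib.sha256(data).digest()


def stepwise(data):
    """The context is held as a Python object and driven by separate calls."""
    ctx = hashlib.sha256()
    ctx.update(data)
    return ctx.digest()


def throughput_mb_s(func, data, repeats=2000):
    """Return megabytes per second processed by func on data."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(data)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return len(data) * repeats / elapsed / 1e6


if __name__ == "__main__":
    for size in (64, 1024, 16384):
        data = b"x" * size
        print(size, round(throughput_mb_s(one_shot, data), 1),
              round(throughput_mb_s(stepwise, data), 1))
```

With small inputs the fixed cost per call makes up a larger share of the total time, which is why plots like the one described above use input size as the X axis.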
Even though the synthetic benchmarks show gains of about 2x, I am not sure this will translate directly to tunnel performance. Therefore I have 2 questions:
I can share my code as well, but currently it is very junky.