Yet another PRNG, versatile RC4 implementation for multi-purpose and multi-platform usage #599

Knogle · 2024-09-07T21:48:11Z

Ahoy ahoy, I hope you're doing well.

I hope I'm not being too annoying with this stuff. As part of my testing of block and stream ciphers such as AES, I've also been taking a look at RC4. While RC4 is no longer suitable for security-critical applications like key exchange or actual encryption, it remains a viable option for tasks like disk wiping. One of the advantages of RC4 is that it's simple to implement, doesn't rely on external libraries, and is fully integrated into nwipe.
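
To illustrate how little code RC4 needs, here is a minimal sketch of the textbook algorithm (key scheduling plus keystream generation). The names `rc4_state`, `rc4_init` and `rc4_generate` are hypothetical and are not taken from this PR or from the nwipe sources; the sketch assumes a non-empty key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the identifiers used in nwipe. */
typedef struct {
    uint8_t S[256]; /* permutation of 0..255 */
    uint8_t i;      /* first index */
    uint8_t j;      /* second index */
} rc4_state;        /* 258 bytes in total */

/* Key-scheduling algorithm (KSA): shuffle S under control of the key.
 * Assumes keylen > 0. */
static void rc4_init(rc4_state *st, const uint8_t *key, size_t keylen)
{
    for (int k = 0; k < 256; k++)
        st->S[k] = (uint8_t)k;

    uint8_t j = 0;
    for (int k = 0; k < 256; k++) {
        j = (uint8_t)(j + st->S[k] + key[k % keylen]);
        uint8_t tmp = st->S[k];
        st->S[k] = st->S[j];
        st->S[j] = tmp;
    }
    st->i = 0;
    st->j = 0;
}

/* Pseudo-random generation algorithm (PRGA): fill buf with keystream. */
static void rc4_generate(rc4_state *st, uint8_t *buf, size_t len)
{
    uint8_t i = st->i, j = st->j;
    for (size_t n = 0; n < len; n++) {
        i = (uint8_t)(i + 1);
        j = (uint8_t)(j + st->S[i]);
        uint8_t tmp = st->S[i];
        st->S[i] = st->S[j];
        st->S[j] = tmp;
        buf[n] = st->S[(uint8_t)(st->S[i] + st->S[j])];
    }
    st->i = i;
    st->j = j;
}
```

A caller would seed the state once with `rc4_init` and then call `rc4_generate` repeatedly to fill wipe buffers.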

Its performance is similar to AES, since it is also used here as a keystream generator (RC4 is a stream cipher rather than a block cipher), but its output quality is lower: AES delivers close to perfect entropy.

An important difference between RC4 and other algorithms I've tested is that I've applied prefetching to keep the 258-byte state fully in the CPU's L1 cache and have unrolled the loops for performance optimization.
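
As a rough sketch of what prefetching and unrolling can look like for the RC4 generation loop: the code below is an assumption-laden illustration, not the code from this PR. The names (`rc4_fill_unrolled`, `RC4_STEP`), the unroll factor of 4 and the 64-byte cache line size are all hypothetical choices, and `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the identifiers used in nwipe. */
typedef struct { uint8_t S[256]; uint8_t i, j; } rc4_state;

/* Standard RC4 key schedule. Assumes keylen > 0. */
static void rc4_init(rc4_state *st, const uint8_t *key, size_t keylen)
{
    uint8_t j = 0;
    for (int k = 0; k < 256; k++)
        st->S[k] = (uint8_t)k;
    for (int k = 0; k < 256; k++) {
        j = (uint8_t)(j + st->S[k] + key[k % keylen]);
        uint8_t tmp = st->S[k];
        st->S[k] = st->S[j];
        st->S[j] = tmp;
    }
    st->i = st->j = 0;
}

/* One keystream step as a macro, so the unrolled loop has no call overhead. */
#define RC4_STEP(S, i, j, out)            \
    do {                                  \
        uint8_t a_, b_;                   \
        (i) = (uint8_t)((i) + 1);         \
        a_ = (S)[(i)];                    \
        (j) = (uint8_t)((j) + a_);        \
        b_ = (S)[(j)];                    \
        (S)[(i)] = b_;                    \
        (S)[(j)] = a_;                    \
        (out) = (S)[(uint8_t)(a_ + b_)];  \
    } while (0)

static void rc4_fill_unrolled(rc4_state *st, uint8_t *buf, size_t len)
{
    uint8_t *S = st->S;
    uint8_t i = st->i, j = st->j;
    size_t n = 0;

#if defined(__GNUC__) || defined(__clang__)
    /* Hint only: request the lines holding S for read/write with high
     * temporal locality. Assumes 64-byte cache lines. */
    for (int off = 0; off < 256; off += 64)
        __builtin_prefetch(S + off, 1, 3);
#endif

    /* Main loop: four keystream bytes per iteration. */
    for (; n + 4 <= len; n += 4) {
        RC4_STEP(S, i, j, buf[n]);
        RC4_STEP(S, i, j, buf[n + 1]);
        RC4_STEP(S, i, j, buf[n + 2]);
        RC4_STEP(S, i, j, buf[n + 3]);
    }
    /* Tail: remaining 0..3 bytes. */
    for (; n < len; n++)
        RC4_STEP(S, i, j, buf[n]);

    st->i = i;
    st->j = j;
}
```

Each step still depends on the previous one, so unrolling here only removes loop bookkeeping; whether it pays off on a given CPU and compiler is something to measure rather than assume.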

RC4 offers good randomness, which is slightly better than Xoroshiro-256, though it comes at the cost of some speed. The most interesting part is that I've optimized it to be used with SSE4.2 on supported platforms, as well as AVX2. On AVX2 platforms, the performance is incredibly fast.

Regarding the AVX2 part, I’ve run into some issues with how to automatically include the -mavx2 flag. Manually setting the flag with ./configure CFLAGS="-mavx2" works well, but I think it’s important to have it integrated properly through automake. As a result, I've temporarily commented out the AVX2-specific optimizations in the code, but they are included and ready for use.

The RC4 algorithm can be optimized quite effectively across a variety of architectures, as it benefits significantly from SIMD operations. I'm also currently working on an OpenCL/OpenGL/CUDA variant, which is incredibly fast if a suitable device is available.

To keep backwards compatibility, older systems that lack SSE4.2 or AVX2 default to the non-optimized implementation.
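
One possible way to get both the fallback and an AVX2 path without passing -mavx2 globally is to enable AVX2 per function and pick the implementation at run time. The sketch below assumes GCC or Clang on x86 and uses the `target` function attribute and `__builtin_cpu_supports`. The XOR kernel is only a stand-in for whatever the real code vectorizes, the function names are hypothetical, and an SSE4.2 path is omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Portable fallback: XOR src into dst one byte at a time. */
static void xor_block_scalar(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t n = 0; n < len; n++)
        dst[n] ^= src[n];
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with AVX2 enabled, so the rest of the
 * program does not need -mavx2 in CFLAGS. */
static __attribute__((target("avx2")))
void xor_block_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t n = 0;
    /* 32 bytes per iteration using unaligned loads and stores. */
    for (; n + 32 <= len; n += 32) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(dst + n));
        __m256i b = _mm256_loadu_si256((const __m256i *)(src + n));
        _mm256_storeu_si256((__m256i *)(dst + n), _mm256_xor_si256(a, b));
    }
    /* Tail bytes. */
    for (; n < len; n++)
        dst[n] ^= src[n];
}
#endif

/* Use the AVX2 version only if the running CPU reports support for it. */
static void xor_block(uint8_t *dst, const uint8_t *src, size_t len)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2")) {
        xor_block_avx2(dst, src, len);
        return;
    }
#endif
    xor_block_scalar(dst, src, len);
}
```

With this layout the binary still runs on CPUs without AVX2, because the AVX2 instructions are confined to a function that is never called there.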

Notes on the Optimizations:

Prefetching: The 258-byte state is prefetched so that it stays in the CPU's L1 cache, which minimizes cache misses.
Loop Unrolling: Unrolling the loops reduces the overhead of branching and condition checks.
SSE4.2 and AVX2: The code has an SSE4.2 path for supported CPUs and an even faster path for AVX2.
SIMD Benefits: RC4 benefits from SIMD operations, particularly on architectures that support AVX2. Four times as much data is produced per cycle.
Future OpenCL/OpenGL/CUDA: A GPU-based variant could exploit massive parallelism on supported hardware, which would further increase throughput for tasks like disk wiping.

Knogle · 2024-09-08T22:09:20Z

Ahoy,

Dropped for now. Unfortunately, due to the fragile nature of RC4, bad seeds lead to incredibly low stream quality.
Maybe I can fix this in the future.
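
A commonly cited mitigation for RC4's biased early output is to discard the first N keystream bytes after seeding, often called RC4-drop[N]. The sketch below shows the idea; the names and the drop count of 3072 in the usage note are assumptions for illustration, and whether this would have helped with the seeds in question is not known. It does not fix RC4's other known weaknesses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the identifiers used in nwipe. */
typedef struct { uint8_t S[256]; uint8_t i, j; } rc4_state;

/* Standard RC4 key schedule. Assumes keylen > 0. */
static void rc4_init(rc4_state *st, const uint8_t *key, size_t keylen)
{
    uint8_t j = 0;
    for (int k = 0; k < 256; k++)
        st->S[k] = (uint8_t)k;
    for (int k = 0; k < 256; k++) {
        j = (uint8_t)(j + st->S[k] + key[k % keylen]);
        uint8_t tmp = st->S[k];
        st->S[k] = st->S[j];
        st->S[j] = tmp;
    }
    st->i = st->j = 0;
}

/* Standard RC4 keystream generation. */
static void rc4_generate(rc4_state *st, uint8_t *buf, size_t len)
{
    uint8_t i = st->i, j = st->j;
    for (size_t n = 0; n < len; n++) {
        i = (uint8_t)(i + 1);
        j = (uint8_t)(j + st->S[i]);
        uint8_t tmp = st->S[i];
        st->S[i] = st->S[j];
        st->S[j] = tmp;
        buf[n] = st->S[(uint8_t)(st->S[i] + st->S[j])];
    }
    st->i = i;
    st->j = j;
}

/* Seed the state, then throw away the first `drop` keystream bytes. */
static void rc4_init_drop(rc4_state *st, const uint8_t *key, size_t keylen,
                          size_t drop)
{
    uint8_t scratch[64];
    rc4_init(st, key, keylen);
    while (drop > 0) {
        size_t chunk = drop < sizeof scratch ? drop : sizeof scratch;
        rc4_generate(st, scratch, chunk);
        drop -= chunk;
    }
}
```

For example, `rc4_init_drop(&st, seed, seed_len, 3072)` would seed the generator and skip the first 3072 bytes before any output is written to disk.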

Knogle · 2024-09-10T16:10:48Z

Ahoy,
I will completely drop this.
I have performed some tests, and given a certain amount of data it is possible to recover the encryption key and predict the rest of the data. So it is not secure, not even for data wiping.

Knogle added 2 commits September 7, 2024 23:15

b69ea6e: Added and optimized RC4 PRNG to generate 4096-byte random blocks using AVX2 and SSE 4.2 for improved performance.

65468f8: Commented sections for AVX2 support for RC4 due to issues with configure.ac, during check for AVX2 support in the compiler, until it's fixed.

Knogle closed this Sep 8, 2024
