Yet another PRNG, versatile RC4 implementation for multi-purpose and multi-platform usage #599
Ahoy ahoy, I hope you're doing well.
I hope I'm not being too annoying with this stuff. As part of my testing of block ciphers and cipher streams like AES, I've also been taking a look at RC4. While RC4 is no longer suitable for security-critical applications like key exchange or actual encryption, it still remains a viable option for tasks like disk wiping. One of the advantages of RC4 is that it's simple to implement, doesn't rely on external libraries, and is fully integrated into nwipe.
Its performance is similar to AES, since both are used here as cipher-based keystream generators (RC4 itself is a stream cipher, not a block cipher), but its data quality is lower, as AES delivers close-to-perfect entropy.
An important difference between RC4 and other algorithms I've tested is that I've applied prefetching to keep the 258-byte state fully in the CPU's L1 cache and have unrolled the loops for performance optimization.
RC4 offers good randomness, slightly better than Xoroshiro-256, though at the cost of some speed. The most interesting part is that I've optimized it to use SSE4.2 on supported platforms, as well as AVX2. On AVX2 platforms, it is very fast.
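As a rough illustration of the prefetching and loop unrolling described above, here is a minimal sketch of an RC4 keystream generator in plain C. It is not the code from this PR: the names `rc4_sketch_state`, `rc4_sketch_init` and `rc4_sketch_generate` are hypothetical, the 4x unroll factor is an assumption, and the prefetch uses the GCC/Clang builtin `__builtin_prefetch` as a hint only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the identifiers used in the PR. */
typedef struct {
    uint8_t s[256]; /* permutation table */
    uint8_t i, j;   /* two index bytes: 256 + 2 = 258 bytes of state */
} rc4_sketch_state;

/* Standard RC4 key schedule (KSA). keylen must be greater than zero. */
static void rc4_sketch_init(rc4_sketch_state *st, const uint8_t *key, size_t keylen)
{
    for (int n = 0; n < 256; n++)
        st->s[n] = (uint8_t)n;
    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {
        j = (uint8_t)(j + st->s[n] + key[n % keylen]);
        uint8_t t = st->s[n];
        st->s[n] = st->s[j];
        st->s[j] = t;
    }
    st->i = 0;
    st->j = 0;
}

/* One step of the RC4 output loop (PRGA): advance i and j, swap, emit a byte. */
#define RC4_SKETCH_STEP(out)                   \
    do {                                       \
        i = (uint8_t)(i + 1);                  \
        j = (uint8_t)(j + s[i]);               \
        uint8_t t = s[i];                      \
        s[i] = s[j];                           \
        s[j] = t;                              \
        (out) = s[(uint8_t)(s[i] + s[j])];     \
    } while (0)

static void rc4_sketch_generate(rc4_sketch_state *st, uint8_t *out, size_t len)
{
    uint8_t *s = st->s;
    uint8_t i = st->i, j = st->j;
    size_t n = 0;

#if defined(__GNUC__)
    /* Hint: pull the table into cache before the hot loop (assumes 64-byte lines). */
    for (int off = 0; off < 256; off += 64)
        __builtin_prefetch(s + off, 1, 3);
#endif

    /* Unrolled by four: one loop-condition check per four output bytes. */
    for (; n + 4 <= len; n += 4) {
        RC4_SKETCH_STEP(out[n]);
        RC4_SKETCH_STEP(out[n + 1]);
        RC4_SKETCH_STEP(out[n + 2]);
        RC4_SKETCH_STEP(out[n + 3]);
    }
    /* Tail: remaining zero to three bytes. */
    for (; n < len; n++)
        RC4_SKETCH_STEP(out[n]);

    st->i = i;
    st->j = j;
}
```

Each step depends on the previous values of `i`, `j` and the table, so unrolling here only removes loop overhead; it does not make the steps independent of each other.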
Regarding the AVX2 part, I've run into some issues with how to automatically include the `-mavx2` flag. Manually setting the flag with `./configure CFLAGS="-mavx2"` works well, but I think it's important to have it integrated properly through automake. As a result, I've temporarily commented out the AVX2-specific optimizations in the code, but they are included and ready for use.
The RC4 algorithm can be optimized quite effectively across a variety of architectures, as it benefits significantly from SIMD operations. I'm also currently working on an OpenCL/OpenGL/CUDA variant, which is incredibly fast if a suitable device is available.
Steps were taken to keep backwards compatibility: older systems that lack SSE4.2 or AVX2 default to the non-optimized implementation.
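One possible way to get this kind of fallback without passing `-mavx2` to the whole build is a per-function target attribute combined with a runtime CPU check, both of which GCC and Clang provide on x86. The sketch below is an assumption about how that could look, not what this PR does: `fill_fn`, `fill_generic`, `fill_sse42`, `fill_avx2` and `select_fill` are hypothetical names, and the function bodies are simple stand-ins for the real generators.

```c
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; the bodies are stand-ins for real generators. */
typedef void (*fill_fn)(uint8_t *buf, size_t len);

/* Portable fallback: compiled with the default flags, runs on any CPU. */
static void fill_generic(uint8_t *buf, size_t len)
{
    for (size_t n = 0; n < len; n++)
        buf[n] = (uint8_t)(n * 31u + 7u);
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))

/* The compiler may emit SSE4.2 instructions in this function only. */
__attribute__((target("sse4.2")))
static void fill_sse42(uint8_t *buf, size_t len)
{
    for (size_t n = 0; n < len; n++)
        buf[n] = (uint8_t)(n * 31u + 7u);
}

/* The compiler may emit AVX2 instructions in this function only. */
__attribute__((target("avx2")))
static void fill_avx2(uint8_t *buf, size_t len)
{
    for (size_t n = 0; n < len; n++)
        buf[n] = (uint8_t)(n * 31u + 7u);
}

/* Pick the fastest path the running CPU supports, checked at run time. */
static fill_fn select_fill(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return fill_avx2;
    if (__builtin_cpu_supports("sse4.2"))
        return fill_sse42;
    return fill_generic;
}

#else

/* Other compilers or architectures: always use the portable path. */
static fill_fn select_fill(void)
{
    return fill_generic;
}

#endif
```

A caller would run `select_fill()` once at start-up and keep the returned pointer, so the binary still starts on CPUs without AVX2 while using the faster path where it exists.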
Notes on the Optimizations:
Prefetching: I use prefetching to keep the 258-byte state in the CPU's L1 cache, which minimizes cache misses and improves performance.
Loop Unrolling: Unrolling the loops reduces the overhead of branching and condition checks, further improving performance.
SSE4.2 and AVX2: The code is optimized for SSE4.2 on supported CPUs, with an even faster path for AVX2.
SIMD Benefits: RC4 benefits from SIMD operations, particularly on architectures that support AVX2, where four times as much data is generated in one cycle.
Future OpenCL/OpenGL/CUDA: A GPU-based variant could exploit massive parallelism on supported hardware, which would further increase performance in high-throughput tasks like disk wiping.