Yet another PRNG, versatile RC4 implementation for multi-purpose and multi-platform usage #599

Closed
wants to merge 2 commits into from


@Knogle Knogle commented Sep 7, 2024

Ahoy ahoy, I hope you're doing well.

I hope I'm not being too annoying with this stuff. As part of my testing of block ciphers and cipher streams like AES, I've also been taking a look at RC4. While RC4 is no longer suitable for security-critical applications like key exchange or actual encryption, it still remains a viable option for tasks like disk wiping. One of the advantages of RC4 is that it's simple to implement, doesn't rely on external libraries, and is fully integrated into nwipe.

Its performance is similar to AES, since both are used here to generate a keystream (RC4 is a stream cipher, while AES is a block cipher run as a stream), but its output quality is inferior: AES delivers entropy that is close to perfect.

An important difference between RC4 and other algorithms I've tested is that I've applied prefetching to keep the 258-byte state fully in the CPU's L1 cache and have unrolled the loops for performance optimization.
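For readers who have not seen the algorithm, here is a minimal sketch of plain RC4 used as a byte generator. It is not the code from this PR: the names `rc4_state_t`, `rc4_init` and `rc4_genrand` are made up for illustration, and the layout assumes the 258 bytes are the 256-byte permutation plus the two index bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state layout: 256-byte permutation + two index bytes = 258 bytes. */
typedef struct {
    uint8_t S[256];
    uint8_t i;
    uint8_t j;
} rc4_state_t;

/* Key-scheduling algorithm (KSA): mix the seed/key into the permutation. */
static void rc4_init(rc4_state_t *st, const uint8_t *key, size_t keylen)
{
    assert(keylen > 0);

    for (int n = 0; n < 256; n++)
        st->S[n] = (uint8_t)n;

    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {
        j = (uint8_t)(j + st->S[n] + key[(size_t)n % keylen]);
        uint8_t tmp = st->S[n];
        st->S[n] = st->S[j];
        st->S[j] = tmp;
    }
    st->i = 0;
    st->j = 0;
}

/* Pseudo-random generation algorithm (PRGA): write len keystream bytes to buf. */
static void rc4_genrand(rc4_state_t *st, uint8_t *buf, size_t len)
{
    uint8_t i = st->i, j = st->j;

    for (size_t n = 0; n < len; n++) {
        i = (uint8_t)(i + 1);
        j = (uint8_t)(j + st->S[i]);
        uint8_t tmp = st->S[i];
        st->S[i] = st->S[j];
        st->S[j] = tmp;
        buf[n] = st->S[(uint8_t)(st->S[i] + st->S[j])];
    }
    st->i = i;
    st->j = j;
}
```

Calling `rc4_init` once with the seed and then `rc4_genrand` repeatedly on the wipe buffer is the intended usage of this sketch; every output byte depends on the swap before it, which is why the state has to stay hot in cache.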

RC4 offers good randomness, which is slightly better than Xoroshiro-256, though it comes at the cost of some speed. The most interesting part is that I've optimized it to be used with SSE4.2 on supported platforms, as well as AVX2. On AVX2 platforms, the performance is incredibly fast.

Regarding the AVX2 part, I’ve run into some issues with how to automatically include the -mavx2 flag. Manually setting the flag with ./configure CFLAGS="-mavx2" works well, but I think it’s important to have it integrated properly through automake. As a result, I've temporarily commented out the AVX2-specific optimizations in the code, but they are included and ready for use.

The RC4 algorithm can be optimized quite effectively across a variety of architectures, as it benefits significantly from SIMD operations. I'm also currently working on an OpenCL/OpenGL/CUDA variant, which is incredibly fast if a suitable device is available.

To keep backwards compatibility, older systems that lack SSE4.2 or AVX2 default to the non-optimized implementation.
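One way such a fallback could be selected at run time is sketched below. This is an assumption about the approach, not the PR's actual code: `rc4_backend_t`, `rc4_select_backend` and `rc4_backend_name` are hypothetical names, and the check relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which only exist on x86 targets.

```c
#include <assert.h>

/* Hypothetical backend identifiers, ordered from slowest to fastest path. */
typedef enum {
    RC4_BACKEND_GENERIC = 0,
    RC4_BACKEND_SSE42   = 1,
    RC4_BACKEND_AVX2    = 2
} rc4_backend_t;

/* Pick the best backend the running CPU supports; anything else gets the
 * portable implementation. */
static rc4_backend_t rc4_select_backend(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return RC4_BACKEND_AVX2;
    if (__builtin_cpu_supports("sse4.2"))
        return RC4_BACKEND_SSE42;
#endif
    return RC4_BACKEND_GENERIC;
}

/* Human-readable name, e.g. for a log line at startup. */
static const char *rc4_backend_name(rc4_backend_t b)
{
    static const char *const names[] = { "generic", "sse4.2", "avx2" };
    assert((unsigned)b < sizeof names / sizeof names[0]);
    return names[b];
}
```

With GCC and Clang, an individual function can also be marked `__attribute__((target("avx2")))`, so a check like this might be combined with per-function targets instead of a global `-mavx2` flag; whether that fits the automake setup here is untested.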


Notes on the Optimizations:

  • Prefetching: I've implemented prefetching so that the 258-byte state remains in the CPU's L1 cache, minimizing cache misses.

  • Loop Unrolling: Unrolling the loops reduces the overhead caused by branching and condition checks.

  • SSE4.2 and AVX2: The code is optimized for SSE4.2 on supported CPUs, with an even faster path for AVX2.

  • SIMD Benefits: RC4 benefits from SIMD operations, particularly on architectures that support AVX2, where the AVX2 path produces four times as much data per cycle.

  • Future OpenCL/OpenGL/CUDA: A GPU-based variant could exploit massive parallelism on supported hardware, which would further increase throughput for tasks like disk wiping.
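The prefetching and loop-unrolling points can be illustrated with a short sketch. It is not the PR's implementation: all names are hypothetical, the 64-byte cache-line stride is an assumption, and `__builtin_prefetch` is a GCC/Clang hint that the CPU is free to ignore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 258-byte state: permutation plus two index bytes. */
typedef struct { uint8_t S[256]; uint8_t i, j; } rc4_state_t;

/* Standard RC4 key schedule. */
static void rc4_init(rc4_state_t *st, const uint8_t *key, size_t keylen)
{
    assert(keylen > 0);
    for (int n = 0; n < 256; n++)
        st->S[n] = (uint8_t)n;
    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {
        j = (uint8_t)(j + st->S[n] + key[(size_t)n % keylen]);
        uint8_t tmp = st->S[n]; st->S[n] = st->S[j]; st->S[j] = tmp;
    }
    st->i = st->j = 0;
}

/* One PRGA step: advance the indices, swap, and return one keystream byte. */
static inline uint8_t rc4_step(uint8_t *S, uint8_t *i, uint8_t *j)
{
    *i = (uint8_t)(*i + 1);
    *j = (uint8_t)(*j + S[*i]);
    uint8_t tmp = S[*i]; S[*i] = S[*j]; S[*j] = tmp;
    return S[(uint8_t)(S[*i] + S[*j])];
}

/* Generate len bytes, four per loop iteration, after hinting the state into cache. */
static void rc4_genrand_unrolled(rc4_state_t *st, uint8_t *buf, size_t len)
{
    uint8_t *S = st->S;
    uint8_t i = st->i, j = st->j;
    size_t n = 0;

#if defined(__GNUC__) || defined(__clang__)
    /* Assumes 64-byte cache lines; requests the state for writing, high locality. */
    for (size_t off = 0; off < sizeof st->S; off += 64)
        __builtin_prefetch(S + off, 1, 3);
#endif

    /* Unrolled body: one loop-condition check per four output bytes. */
    for (; n + 4 <= len; n += 4) {
        buf[n]     = rc4_step(S, &i, &j);
        buf[n + 1] = rc4_step(S, &i, &j);
        buf[n + 2] = rc4_step(S, &i, &j);
        buf[n + 3] = rc4_step(S, &i, &j);
    }
    /* Tail: remaining 0..3 bytes. */
    for (; n < len; n++)
        buf[n] = rc4_step(S, &i, &j);

    st->i = i;
    st->j = j;
}
```

The unrolled loop produces exactly the same byte sequence as a one-byte-at-a-time loop; only the loop overhead changes, since each step still depends on the previous swap.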

Screenshot from 2024-09-07 23-07-57

…g AVX2 and SSE 4.2 for improved performance.
…ure.ac, during check for AVX2 support in the compiler, until it's fixed.

Knogle commented Sep 8, 2024

Ahoy,

Dropped for now. Unfortunately, due to the fragile nature of RC4, bad seeds lead to incredibly low stream quality.
Maybe I can fix this in the future, but it's dropped for now.

@Knogle Knogle closed this Sep 8, 2024

Knogle commented Sep 10, 2024

Ahoy,
I will drop this completely.
I ran some tests, and after a certain amount of data has been observed it is possible to recover the encryption key and predict the rest of the data. So it is not secure, not even for data wiping.
