Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump MSRV to 1.36 #1011

Merged
merged 17 commits into from
Aug 28, 2020
Merged

Bump MSRV to 1.36 #1011

merged 17 commits into from
Aug 28, 2020

Conversation

vks
Copy link
Collaborator

@vks vks commented Aug 1, 2020

This allows us to get rid of some unsafe code. I did not test whether this affects performance however.

cc @Shnatsel

@newpavlov
Copy link
Member

We also can replace:

#![cfg_attr(not(feature = "std"), no_std)]
#[cfg(all(feature = "alloc", not(feature = "std")))] extern crate alloc;

with:

#[cfg(feature = "alloc")]
extern crate alloc;
#[cfg(feature = "std")]
extern crate std;

@dhardy
Copy link
Member

dhardy commented Aug 1, 2020

Technically this shouldn't affect some crates, e.g. rand_pcg, but of course we aren't testing older Rust versions so bumping the "MSRV" still makes sense.

@vks
Copy link
Collaborator Author

vks commented Aug 1, 2020

Also see rust-secure-code/safety-dance#54.

It seems like I broke some code on big endian platforms, let me check...

src/distributions/mod.rs Outdated Show resolved Hide resolved
src/rngs/adapter/read.rs Outdated Show resolved Hide resolved
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of nice changes!

Since it wasn't linked, a large chunk of this PR is about these issues: rust-secure-code/safety-dance#54

Could you check the benchmarks please? Specifically relating to the int → bytes and bytes → int conversions (generators).

rand_core/src/le.rs Outdated Show resolved Hide resolved
Comment on lines 118 to 140
pub fn next_u32_via_fill<R: RngCore + ?Sized>(rng: &mut R) -> u32 {
let mut buf = [0; 4];
rng.fill_bytes(&mut buf);
u32::from_le_bytes(buf)
u32::from_ne_bytes(buf)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Indeed, the old version didn't byte-swap. But fill_bytes_via_next does byte-swap. Weird?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also puzzled.

rand_core/src/impls.rs Outdated Show resolved Hide resolved
@vks
Copy link
Collaborator Author

vks commented Aug 4, 2020

Could you check the benchmarks please? Specifically relating to the int → bytes and bytes → int conversions (generators).

Generating bytes is ca. 8 % slower, which is not ideal but I think acceptable. Generating u32 and u64 is not affected.

Before (master):

gen_bytes_chacha12:   2,733,838 ns/iter (+/- 181,694) = 374 MB/s
gen_bytes_chacha20:   4,339,602 ns/iter (+/- 237,793) = 235 MB/s
gen_bytes_chacha8:    1,918,279 ns/iter (+/- 103,581) = 533 MB/s
gen_u32_chacha20:      18,866 ns/iter (+/- 2,748) = 212 MB/s
gen_u64_chacha20:      36,510 ns/iter (+/- 6,994) = 219 MB/s
init_chacha:               44 ns/iter (+/- 5)

After (this PR):

gen_bytes_chacha12:   2,968,875 ns/iter (+/- 360,677) = 344 MB/s
gen_bytes_chacha20:   4,544,087 ns/iter (+/- 297,158) = 225 MB/s
gen_bytes_chacha8:    2,065,126 ns/iter (+/- 164,368) = 495 MB/s
gen_u32_chacha20:      18,793 ns/iter (+/- 1,655) = 212 MB/s
gen_u64_chacha20:      34,735 ns/iter (+/- 2,007) = 230 MB/s
init_chacha:               33 ns/iter (+/- 3)

@dhardy
Copy link
Member

dhardy commented Aug 5, 2020

Quick test on my side: 4-9% slower bytes generation from ChaCha; 23% slower from HC128. Other code paths (e.g. bytes from PCG and int outputs) appear unaffected.

The default byte-length is 1024; I tested 128 and 12000 with basically identical results (form the latter):

# master branch:
test gen_bytes_chacha12      ... bench:  29,536,911 ns/iter (+/- 560,458) = 406 MB/s
test gen_bytes_chacha20      ... bench:  47,129,314 ns/iter (+/- 715,162) = 254 MB/s
test gen_bytes_chacha8       ... bench:  20,503,464 ns/iter (+/- 290,756) = 585 MB/s
test gen_bytes_hc128         ... bench:   5,426,882 ns/iter (+/- 112,118) = 2211 MB/s

# bump-msrv branch:
test gen_bytes_chacha12      ... bench:  31,337,042 ns/iter (+/- 728,833) = 382 MB/s
test gen_bytes_chacha20      ... bench:  49,303,860 ns/iter (+/- 3,711,389) = 243 MB/s
test gen_bytes_chacha8       ... bench:  22,460,498 ns/iter (+/- 2,488,804) = 534 MB/s
test gen_bytes_hc128         ... bench:   7,034,898 ns/iter (+/- 671,394) = 1705 MB/s

I'm not really happy about this performance loss. Do we have any better options?

@newpavlov
Copy link
Member

newpavlov commented Aug 9, 2020

I think we have to change fill_bytes code, otherwise compiler will not be able to properly optimize out panics and various checks. Here is a draft, unfortunately it's not completely panic-free, since we can't constrain index to be less than length of results. But we can solve it with a bit of unsafe code.

UPD: On a second look, the linked code is not correct (it will not fill "tail" bytes), but I think the general idea can be understood.

@vks
Copy link
Collaborator Author

vks commented Aug 10, 2020

@newpavlov Are you talking about BlockRng::fill_bytes?

@newpavlov
Copy link
Member

Yes.

@vks
Copy link
Collaborator Author

vks commented Aug 11, 2020

@newpavlov With your suggestion, I get the same performance as before (without resorting to unsafe code), however, the results are not quite correct: test_chacha_true_values_c is failing.

UPD: On a second look, the linked code is not correct (it will not fill "tail" bytes), but I think the general idea can be understood.

Do you know what is wrong? I don't see it.

@newpavlov
Copy link
Member

fill_via_u32 should look something like this:

fn fill_via_u32(src: &[u32], dst: &mut [u8]) -> usize {
    let mut src = src.iter();
    let mut chunks = dst.chunks_exact_mut(4);
    for (v, chunk) in (&mut src).zip(&mut chunks) {
        chunk.copy_from_slice(&v.to_le_bytes());
    }
    let rem = chunks.into_remainder();
    if rem.len() != 0 {
        let v = src.next().unwrap();
        rem.copy_from_slice(&v.to_le_bytes()[..rem.len()]);
        dst.len()/4 + 1
    } else {
        dst.len()/4
    }
}

Unfortunately it adds a panic on the unwrap. We could use chunks_mut instead, but it would add a panic path as well. The core idea is to allow compiler to properly optimize out the core loop and I think it will work with such function as well.

But honestly, I don't quite like the current approach and I think it's worth to experiment with one proposed here.

@vks
Copy link
Collaborator Author

vks commented Aug 11, 2020

But honestly, I don't quite like the current approach and I think it's worth to experiment with one proposed here.

I agree, being able to efficiently provide bitwise randomness seems more future proof for algorithms that optimize for the consumed entropy, and it makes generating bool more efficient.

fill_via_u32 should look something like this:

At this point, we can go back to using rand_core::impls::fill_via_u32_chunks. Unfortunately, this brings back the old performance problems.

@vks
Copy link
Collaborator Author

vks commented Aug 11, 2020

@dhardy If you think the performance of the safe code is not acceptable, we can always go back to the old fill_via_chunks implementation. This restores the old performance.

@dhardy
Copy link
Member

dhardy commented Aug 11, 2020

@newpavlov If bit- or byte-level counters work better here (with reasonable perf. all round), I don't have a problem with that. Also I don't think changing behaviour in a breaking release before 1.0 is a big deal; as long as we document it we're doing better than many rand libs.

Personally I still feel that allowing loss-less bit-level consumption is not very useful in terms of performance, and even byte-level consumption may not be overall. But I may be wrong.

@newpavlov
Copy link
Member

newpavlov commented Aug 11, 2020

@dhardy
Calculation of an aligned index takes only 4 instructions, we would also need to add several instructions to check if resulting index points at the end of a buffer, but I think performance hit will be barely noticeable. So I think it will be reasonable to keep the old implementation of fill_via_chunks for now and experiment on a new bit-level index in a separate PR after this one will be merged.

@vks
Copy link
Collaborator Author

vks commented Aug 12, 2020

So I think it will be reasonable to keep the old implementation of fill_via_chunks for now and experiment on a new bit-level index in a separate PR after this one will be merged.

I did that, they performance is comparable to before now:

test gen_bytes_chacha12      ... bench:   2,725,956 ns/iter (+/- 468,335) = 375 MB/s
test gen_bytes_chacha20      ... bench:   4,182,217 ns/iter (+/- 197,325) = 244 MB/s
test gen_bytes_chacha8       ... bench:   1,882,080 ns/iter (+/- 140,729) = 544 MB/s
test gen_bytes_hc128         ... bench:     505,143 ns/iter (+/- 47,225) = 2027 MB/s

I think this can be merged now (maybe after squashing)?

rand_core/src/block.rs Outdated Show resolved Hide resolved
rand_core/src/impls.rs Outdated Show resolved Hide resolved
rand_core/src/impls.rs Show resolved Hide resolved
src/distributions/mod.rs Outdated Show resolved Hide resolved
vks added 8 commits August 27, 2020 19:21
The results from master, using unsafe code:
```
gen_bytes_chacha12:   2,733,838 ns/iter (+/- 181,694) = 374 MB/s
gen_bytes_chacha20:   4,339,602 ns/iter (+/- 237,793) = 235 MB/s
gen_bytes_chacha8:    1,918,279 ns/iter (+/- 103,581) = 533 MB/s
```

The results of the new code using `chunks_exact_mut` (this commit):
```
gen_bytes_chacha12:   3,049,147 ns/iter (+/- 220,631)   = 335 MB/s
gen_bytes_chacha20:   4,645,772 ns/iter (+/- 269,261)   = 220 MB/s
gen_bytes_chacha8:    2,214,954 ns/iter (+/- 1,745,600) = 462 MB/s
```

The results of using `chunks_mut` (before this commit):
```
gen_bytes_chacha12:   3,492,109 ns/iter (+/- 164,638) = 293 MB/s
gen_bytes_chacha20:   5,087,706 ns/iter (+/- 249,219) = 201 MB/s
gen_bytes_chacha8:    2,700,197 ns/iter (+/- 524,148) = 379 MB/s
```
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This time I can't just check the new commits thanks to a rebase, so I'll have to trust 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants