
Fix alignment issue of TensorData bytes #2416

Merged
merged 15 commits into tracel-ai:main from WorldSEnder:fix-vec-align
Dec 17, 2024

Conversation

WorldSEnder
Contributor

@WorldSEnder WorldSEnder commented Oct 24, 2024

Checklist

  • Confirmed that the run-checks all script has been executed.
    I have some issues with flaky tests; in particular, quantize_dynamic_int8 and downsample_interpolation are off by one in a few cases. I've ignored them locally; they seem to be fine in CI.
    Further, I've stumbled over a test failure that seems to be related to my AMD GPU driver:

ACO ERROR:
In file ../mesa-24.1.4/src/amd/compiler/aco_assembler.cpp:1056
Unsupported opcode: v4: %370:v[8-11] = v_wmma_f32_16x16x16_f16 (latekill)%369:v[36-43].xx, (latekill)%359:v[48-55].xx, %368:v[8-11].xx

  • Made sure the book is up-to-date with this PR. Not applicable; the changes affect internals only.

Related Issues/PRs

Fixes #2375.

Changes

Adds a new opaque structure, Bytes, which holds on to the tensor data. As explained in the linked issue, using a Vec<u8> is incorrect: the memory gets deallocated with an alignment/memory layout different from the one it was allocated with, which violates the allocator contract. Instead, the new structure remembers the alignment of the data and uses it when deallocating.
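To make the violation concrete, here is a minimal, self-contained illustration (not burn code) of the mismatch: a buffer allocated for f32 elements but handed back to the allocator as u8 has the same size but a different alignment, and dealloc requires the exact layout used at allocation time.

use std::alloc::Layout;

fn main() {
    // Layout the allocator was given when a Vec<f32> of 4 elements was created...
    let alloc_layout = Layout::array::<f32>(4).unwrap();
    // ...and the layout a Vec<u8> built via from_raw_parts would free it with.
    let dealloc_layout = Layout::array::<u8>(16).unwrap();

    // Same size, different alignment: GlobalAlloc::dealloc must be called with
    // the same layout that was used for allocation, so this is a contract violation.
    assert_eq!(alloc_layout.size(), dealloc_layout.size());
    assert_ne!(alloc_layout.align(), dealloc_layout.align());
}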

For serialization, only the bytes are written and no alignment is recorded. Instead, when deserializing, the data is allocated with over-alignment (currently align_of::<u128>), which should be sufficient to interpret a slice of the data as a slice of some larger element type, such as f32.
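As an illustration, a minimal std-only sketch (not the PR's code) of allocating such an over-aligned buffer by hand; the key point is that the very same layout must later be passed to dealloc:

use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

fn main() {
    let len = 64usize;
    // Over-align to u128 so the buffer can be viewed as f32/f64/u128/... slices.
    let layout = Layout::from_size_align(len, std::mem::align_of::<u128>()).unwrap();
    // SAFETY: the layout has a non-zero size.
    let ptr = unsafe { alloc(layout) };
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    assert_eq!(ptr as usize % std::mem::align_of::<u128>(), 0);
    // SAFETY: free with the exact layout that was used for the allocation.
    unsafe { dealloc(ptr, layout) };
}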

Note that this means serialization is not strictly a round-trip, which shows up in the difference between how slices and Vec work:

  • to re-interpret a [u8] as a [E], the slice needs to be aligned at least as strictly as the alignment of E (see the sketch after this list).
  • to re-interpret a Vec<A> as a Vec<B>, the alignments must be exactly equal. This means that a deserialized instance of Bytes may not be convertible via Bytes::try_into_vec. If this is needed in the future, one would either need access to the required alignment during deserialization or have to copy the data into a correctly aligned buffer. Such an API is possible but not included in this PR.
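For the slice-level rule, a minimal std-only sketch (not the PR's API) of an alignment-checked reinterpretation; the over-aligned deserialized buffer passes this check for any element type up to the alignment of u128:

// Reinterpret a byte slice as a slice of f32, but only if the pointer is
// sufficiently aligned and the length is an exact multiple of 4 bytes.
fn try_cast_slice_f32(bytes: &[u8]) -> Option<&[f32]> {
    // SAFETY: every bit pattern is a valid f32, so producing &[f32] from
    // initialized bytes is fine; align_to only yields a non-empty prefix when
    // the alignment requirement is not met at the start of the slice.
    let (prefix, elems, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(elems)
    } else {
        None
    }
}

fn main() {
    let floats = [1.0f32, 2.0, 3.0];
    // View the f32 array as bytes; this view is 4-aligned by construction.
    let (_, bytes, _) = unsafe { floats.align_to::<u8>() };
    assert!(try_cast_slice_f32(bytes).is_some());
    // An offset view of the same bytes is misaligned and gets rejected.
    assert!(try_cast_slice_f32(&bytes[1..5]).is_none());
}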

Testing

Some limited tests of TensorData have been run through miri manually to confirm that it doesn't complain after the patch. In case you want such tests in the repo, I would need more guidance on how to set that up, since miri should probably only run a limited number of tests for CI performance reasons.

introduce max alignment (which depends on platform anyway) and don't serialize that part
fixes Clone, Debug, and Eq impls to work on the bytes, not the pointers.

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 83.09524% with 71 lines in your changes missing coverage. Please review.

Project coverage is 82.06%. Comparing base (8a89293) to head (2239737).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
crates/burn-tensor/src/tensor/bytes.rs 82.10% 63 Missing ⚠️
crates/burn-tensor/src/tensor/data.rs 92.15% 4 Missing ⚠️
crates/burn-jit/src/ops/transaction.rs 0.00% 3 Missing ⚠️
crates/burn-core/src/record/serde/data.rs 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2416      +/-   ##
==========================================
- Coverage   82.06%   82.06%   -0.01%     
==========================================
  Files         831      832       +1     
  Lines      106003   106362     +359     
==========================================
+ Hits        86990    87284     +294     
- Misses      19013    19078      +65     


@WorldSEnder
Contributor Author

Re Codecov report:

The missed lines are for the most part in the Debug impl of Bytes. I think it makes sense to invest a bit more time into nice formatting there; any comments on what the output should be would be appreciated.

@laggui
Member

laggui commented Nov 4, 2024

Sorry for the delayed response! We discussed this offline but forgot to get back to you.

First, thanks for identifying the potential issue.

While these changes to handle alignment concerns more generically could certainly provide better safety, they also add complexity and might not be immediately necessary. That's why we're a bit hesitant about making this change for now. As you noticed, this is not an issue for most allocators.

Of course that doesn't mean it is not an issue, just that it does not affect the majority of use cases. So we should still keep this in mind, and we might circle back in the future thanks to your investigation.

@WorldSEnder
Contributor Author

WorldSEnder commented Nov 4, 2024

While I understand the added complexity makes this difficult to review, I am a bit disappointed. There is no way around the fact that the current implementation is misusing unsafe; miri rightfully complains about it. Yes, on the major platforms I have at hand (Linux and Windows) the allocator plays along, yet burn-tensor is no_std. I do not have a Raspberry Pi at hand to test this, but the Burn book mentions the embedded-alloc crate, and that already seems much more fickle about the allocation requirements. After all, this is Rust. That said, I understand that you don't have the time to properly review this right now, with the ongoing work on CubeCL and all. Don't read this as me trying to rush this in; I would also not be happy to see this merged without a proper review.

I only noticed this when I implemented the deserialization logic, but please note that the current impl will also happily allocate a Vec<u8> aligned to just a u8, even for element types such as f32 or whatever element type one has, and that buffer will then surprisingly and inconveniently fail to cast to a slice of its own element type, for example here.

In any case, I suppose there is no problem with just leaving this open for the time being.

Just one more question: Would you be less wary if the new unsafe code was in a tested outside crate such as bytemuck? Some parts such as the (de)serialization I think don't make sense to be outside burn, but most of the "byte container" could possibly make more sense upstream.

@laggui
Member

laggui commented Nov 4, 2024

Don't read this as me trying to rush this in; I would also not be happy to see this merged without a proper review.

No worries, we're on the same page here 🙂

I think our main concerns were more in terms of trade-offs, mostly time and the possible ramifications of this breaking change (though I think my position has changed on that, as you'll see below).

burn-tensor is no_std. I do not have a Raspberry Pi at hand to test this, but the Burn book mentions the embedded-alloc crate, and that already seems much more fickle about the allocation requirements

I'm glad you brought this up! Honestly, I think I kind of forgot to take that into consideration; this should be especially important for Burn as we aim for portability. I think our previous position was that the current implementation was not an issue for most modern processors and platforms, but that's not good enough.

I should really get a Raspberry Pi myself and start messing around with Burn on no_std so we can improve this.

Just one more question: Would you be less wary if the new unsafe code was in a tested outside crate such as bytemuck? Some parts such as the (de)serialization I think don't make sense to be outside burn, but most of the "byte container" could possibly make more sense upstream.

I'm not too wary about this tbh.

I gotta run but wanted to make sure to get back to you with my current thoughts.

In short, in my previous comment I was open to the change but did not think it was required in the short term. But now I think we should make sure that this is addressed (ideally without performance impacts to de/serialization, so we should make sure to benchmark that for some practical use cases).

@WorldSEnder
Contributor Author

I gotta run but wanted to make sure to get back to you with my current thoughts.

In short, in my previous comment I was open to the change but did not think it was required in the short term.

No worries, I too think we are on the same page :) While, strictly speaking, violating the allocator contract as the code currently does could lead to memory unsafety, I don't think any allocator actually uses the violated guarantees in a meaningful way (but I'm open to surprises). On the other hand, misaligned allocations would be a real concern, but as far as I can tell they mostly lead to panics, not undefined behaviour or violations of memory safety, so a relaxed approach to fixing this is fine 👍 Sorry if I came off as a bit panicky.

@laggui
Member

laggui commented Nov 8, 2024

@WorldSEnder I was looking at your PR yesterday and I think we'd like to stay away from a new opaque type for now.

I also benchmarked some solutions to preserve alignment, and sadly none of them comes close to the current pointer reinterpretation.

I think we can instead keep both approaches, selected per target architecture for now, since we know that heap allocations are over-aligned on mainstream systems. For other targets with stricter alignment rules, we fall back to something similar to what you implemented in the Bytes opaque type. We also add some documentation for anyone who might come across the code in the future 🙂

    /// Initializes a new tensor data structure from the provided values.
    fn init<E: Element, S: Into<Vec<usize>>>(
        mut value: Vec<E>,
        shape: S,
        dtype: DType,
    ) -> Self {
        // Ensure shape is valid
        let shape = shape.into();
        let shape_numel = Self::numel(&shape);
        value.truncate(shape_numel);
        let numel = value.len();
        assert_eq!(
            shape_numel, numel,
            "Shape {:?} is invalid for input of size {:?}",
            shape, numel,
        );

        let bytes = Self::to_bytes(value);

        Self {
            bytes,
            shape,
            dtype,
        }
    }

    fn to_bytes<E: Element>(value: Vec<E>) -> Vec<u8> {
        // Fast path for architectures with heap allocation overalignment.
        #[cfg(all(
            not(miri),
            any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")
        ))]
        {
            let mut value = value;
            let factor = std::mem::size_of::<E>() / std::mem::size_of::<u8>();
            let len = value.len() * factor;
            let capacity = value.capacity() * factor;
            let ptr = value.as_mut_ptr();

            std::mem::forget(value);

            // SAFETY: Practically, mainstream systems align heap allocations beyond
            // the minimum required alignment for the element type (e.g., 8 bytes on 64-bit
            // systems).
            //
            // This is not guaranteed by Rust's memory model, which follows strict alignment
            // requirements as specified by the type's layout.
            //
            // But we leverage the overalignment and efficient unaligned access to interpret
            // the same memory layout as individual bytes instead of `E` elements.
            // This simple trick makes this pointer cast acceptable (and requires no copy).
            unsafe { Vec::from_raw_parts(ptr as *mut u8, len, capacity) }
        }

        // Fallback for architectures with stricter alignment rules.
        #[cfg(any(
            miri,
            not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))
        ))]
        {
            let num_bytes = value.len() * std::mem::size_of::<E>();
            let mut bytes = vec![0u8; num_bytes];
            unsafe {
                std::ptr::copy_nonoverlapping(
                    value.as_ptr() as *const u8,
                    bytes.as_mut_ptr(),
                    num_bytes,
                );
            }
            bytes
        }
    }

I also added miri to the conditions to make sure it doesn't complain 😅 Let me know how you feel about this.

@WorldSEnder
Contributor Author

WorldSEnder commented Nov 8, 2024

@laggui I think in the long run an opaque wrapper should happen at some point (note that the wrapper in this PR does no mem-copy, so it should not have the overhead you mention). A publicly exposed field of Vec<u8> lets the user mem::swap in any Vec<u8> that they may have constructed themselves. I'd want to double-check whether that storage is then still guaranteed to be aligned, even if the system allocator always over-aligns; my current assumption is that it entails more weirdness that an opaque wrapper struct would cover.

Secondly, the target detection is perhaps 90% there. Sadly, I don't think there is a way to detect the specific allocator being used: the check is of course only detecting the target_*, not whether System is actually used as the #[global_allocator], and as far as I know there is no way to detect which allocator ends up being used in the end, as the forwarding Global does not expose this.
Note for future reference: it seems the choice of the default system allocator is made in the std here, if we want to go the route of just using system information as a proxy.
Another comment would be to just let miri complain. This would still technically violate the allocator contract, and since there is no way to check which allocator is used, I would still see it as unsound. So let miri panic and don't paint over the wound.

What exactly is the concern with an opaque struct? It seems there would be a simple, almost-zero-cost conversion from Vec<u8> to TensorData, a Deref<Target=[u8]> impl, and a reasonably efficient conversion back to Vec<u8> (though that might entail a mem-copy, or the downside of possibly-undefined-behaviour-maybe-just-panicking).
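To make the shape of that API concrete, a rough sketch of what such a wrapper could look like (hypothetical code, not this PR's exact implementation): it remembers the allocation layout so that Drop hands the allocator back exactly what it was given.

use std::alloc::{dealloc, Layout};
use std::ops::Deref;
use std::ptr::NonNull;

// Hypothetical sketch of the opaque wrapper discussed above.
struct Bytes {
    ptr: NonNull<u8>,
    len: usize,
    layout: Layout,
}

impl Bytes {
    // Take over a Vec<u8>'s allocation, remembering the layout it was created with.
    fn from_vec(mut v: Vec<u8>) -> Self {
        let layout = Layout::array::<u8>(v.capacity()).unwrap();
        let len = v.len();
        let ptr = NonNull::new(v.as_mut_ptr()).unwrap();
        std::mem::forget(v);
        Bytes { ptr, len, layout }
    }
}

impl Deref for Bytes {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        // SAFETY: `ptr` points to `len` initialized bytes owned by `self`.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len) }
    }
}

impl Drop for Bytes {
    fn drop(&mut self) {
        // SAFETY: deallocate with the very same layout used for the allocation;
        // a zero-sized layout corresponds to a dangling pointer and is not freed.
        if self.layout.size() != 0 {
            unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
        }
    }
}

fn main() {
    let bytes = Bytes::from_vec(vec![1u8, 2, 3, 4]);
    assert_eq!(bytes.len(), 4); // len() and the slice view come through Deref
    assert_eq!(bytes.to_vec(), vec![1, 2, 3, 4]);
}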

PS: Did you encounter any slowdown with this PR in your benchmarking? If so, this would be a concern and I would put in more work to make sure I didn't miss any unreasonable overhead.

@laggui
Member

laggui commented Nov 11, 2024

A publicly exposed field of Vec<u8> lets the user mem::swap in any Vec<u8> that they may have constructed themselves. I'd want to double-check whether that storage is then still guaranteed to be aligned, even if the system allocator always over-aligns; my current assumption is that it entails more weirdness that an opaque wrapper struct would cover.

Yeah I think we could (should) remove the public field and users could only get the slice with the currently available methods. So no direct access to the underlying bytes. Not sure why we left it public actually 😅

Secondly, the target detection is perhaps 90% there. Sadly, I don't think there is a way to detect the specific allocator being used: the check is of course only detecting the target_*, not whether System is actually used as the #[global_allocator], and as far as I know there is no way to detect which allocator ends up being used in the end, as the forwarding Global does not expose this.

That's what I found as well. Was trying to cover the most important cases with this condition where stricter alignment requirements are necessary, but I realize this is not 100%. I wasn't sure if a new type was warranted at this time, but I now realize that it might be required to do this cleanly.

Another comment would be to just let miri complain. [...] So let miri panic and don't paint over the wound.

Oh I agree! This was mostly for the current tests with miri 😅

What exactly is the concern with an opaque struct?

An opaque struct like Bytes is a nice way to encapsulate and control the alignment details, but it introduces a breaking change with potential overhead (and complexity, though in its current form it is fairly simple and contained). I think my main concern is/was potential redundancy, since the alignment provided by Vec<u8> is sufficient on common platforms with Rust's default allocator. I thought we might be able to keep away from the new opaque struct to avoid the breaking change and deal with the alignment requirements when necessary, though I'm not sure if this is feasible for 100% coverage (and I am not entirely informed on the different allocators). I'm bouncing back and forth between the two positions a bit, as with any trade-off. Thanks for the constructive feedback and discussion in this thread, btw!

Also tiny note on the current Bytes implementation: what about sending between threads? Might need to be more careful (as opposed to Vec<u8>).

PS: Did you encounter any slowdown with this PR in your benchmarking? If so, this would be a concern and I would put in more work to make sure I didn't miss any unreasonable overhead.

I ran some benchmarks for TensorData creation from existing floating-point data (32 x 512 x 1024) and record loading (12 linear layers with shape [2048, 2048]), on both the current state and with your changes.

The results seem to be on par with the current implementation, which is great 🙏 I incorrectly recalled, without looking back at the code, that there was a copy in from_elems, so I was expecting a slowdown. But that's not the case (it's in another path).

Current

Benchmark Feature Backend Device Median
from_data ndarray ndarray Cpu 95.000µs
load_record_lazy ndarray ndarray Cpu 105.000µs
load_record_manual ndarray ndarray Cpu 0.000ns
load_record_sync ndarray ndarray Cpu 253.399ms

This PR

Benchmark Feature Backend Device Median
from_data ndarray ndarray Cpu 96.000µs
load_record_lazy ndarray ndarray Cpu 112.000µs
load_record_manual ndarray ndarray Cpu 0.000ns
load_record_sync ndarray ndarray Cpu 269.023ms

@WorldSEnder
Contributor Author

WorldSEnder commented Nov 11, 2024

Yeah I think we could (should) remove the public field and users could only get the slice with the currently available methods. So no direct access to the underlying bytes. Not sure why we left it public actually 😅

I was wondering if it was intentional that you could push further bytes after construction, because that seems a bit weird: you could push e.g. a single u8 even when the TensorData contains u16 values. Good to hear that this was never intended; I thought I had missed something :D

Was trying to cover the most important cases with this condition where stricter alignment requirements are necessary, but I realize this is not 100%.

After more investigation, and a longer look than I had wished for, I found that the default allocator on Windows actually does use the passed alignment in a non-trivial way. Specifically, allocations with alignment at or below MIN_ALIGN are just allocated as-is, but anything beyond that will allocate additional space and contain a header with a pointer to the actual allocation. Pictogram:

x ... unused bytes
u ... user data

 v --- aligned to MIN_ALIGN, this pointer is allocated from HeapAlloc
 | x x x x Header(ptr) | u u u u u u u ..... |
                       ^ --- aligned to requested alignment, this pointer is returned to user-code

The reason the current code gets away with it, supposedly, is that on 64-bit platforms MIN_ALIGN is 16, which is larger than the alignment of the data types that currently get put into a tensor. But it's awfully brittle, and it feels like finding corner case after corner case in which the allocator happens to play along with the wrong alignment.

it introduces a breaking change

Yeah, the timing close to the release of 0.15 is a bit unfortunate. I'm not sure about your planned release cycle; I wouldn't mind holding this until the breaking 0.16 comes along naturally.

Also tiny note on the current Bytes implementation: what about sending between threads?

Thread-safety is indeed a good point. The current Bytes should be Send + Sync without a problem, since Element: Send + Sync and &Bytes only offers read access, but that could be encoded in some static asserts and tests.
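For example, a compile-time check along these lines (with a stand-in Bytes type so the snippet is self-contained; the real type lives in this PR) refuses to build if the type ever stops being Send + Sync:

use std::ptr::NonNull;

// Stand-in for the PR's Bytes type, only here so the snippet compiles on its own.
#[allow(dead_code)]
struct Bytes {
    ptr: NonNull<u8>,
    len: usize,
}

// SAFETY (assumption): Bytes owns its buffer and &Bytes only hands out shared
// read access to plain bytes, so moving/sharing it across threads is fine.
unsafe impl Send for Bytes {}
unsafe impl Sync for Bytes {}

// Static assert: this call only type-checks if Bytes is Send + Sync.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<Bytes>();
}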

The results seem to be on par with the current, which is great 🙏

There seems to be a slight slowdown in deserialization, which might perform an extra copy into an aligned buffer. I think in that case we can get away with actually checking the runtime alignment instead (still storing and deallocating with the alignment of the actual buffer) and not copying, or only copying when someone wants to convert it into an actual Vec<E>. I will take another look 👍
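A minimal std-only sketch of such a runtime check (not the PR's code): keep a deserialized buffer as-is when its pointer already satisfies the required alignment, and only fall back to a copy otherwise.

// Does this buffer happen to be laid out so it can be viewed as E values?
fn is_aligned_for<E>(bytes: &[u8]) -> bool {
    bytes.as_ptr() as usize % std::mem::align_of::<E>() == 0
        && bytes.len() % std::mem::size_of::<E>() == 0
}

fn main() {
    let buffer: Vec<u8> = vec![0; 64];
    if is_aligned_for::<f32>(&buffer) {
        println!("keep the buffer in place, no copy needed");
    } else {
        println!("copy into an f32-aligned allocation");
    }
}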

this already improves readability a bit by separating out alloc/dealloc logic
and adding a bunch of safety comments and better error messages
borrowing from the deserializer will not save a copy, and is moreover
inefficient when we could take ownership of an existing byte buffer
both changes only target miri's borrowing semantics, operationally
the pointers are the same, but they obey different borrow-stack rules.
@WorldSEnder
Contributor Author

Alright, I have added both suggestions (Send and Sync + avoiding a copy if the deserialized vector ends up being aligned). I have added further tests, and these can be run under miri. In case you want to do that yourself, try it with

cargo +nightly miri test --package burn-tensor --lib -- tensor::bytes::tests --show-output

@WorldSEnder
Contributor Author

@laggui I wanted to merge the changes from quantization and in the process came up with a few questions, again about serialization, where I'm not sure what is intended and what is not.

There should be no difference between this PR and the state on main, but (de)serialization seems to assume that data in storage is in system byte order; I don't see any conversion from/to network byte order. So serializing a tensor on a little-endian (LE) system such as x86, then deserializing on a BE system (well, mostly SPARC, and fairly uncommon I think) would shuffle the bytes and not recover the data as intended. Further, network order is big endian (which is a pity, as most systems then have to perform a byte shuffle for merely historical reasons). So the question is this: since serialization is performed as a byte sequence where endianness matters, is the current state, which just assumes little endian everywhere and does not correct for it, correct?
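To illustrate the concern, a small std-only sketch (not burn's serialization code) of making the byte order explicit, which round-trips regardless of the host endianness:

// Serialize f32 values with an explicit little-endian byte order and read them
// back the same way; this round-trips on both LE and BE hosts.
fn f32s_to_le_bytes(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn f32s_from_le_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|chunk| f32::from_le_bytes(chunk.try_into().unwrap()))
        .collect()
}

fn main() {
    let data = [1.0f32, -2.5, 3.25];
    let bytes = f32s_to_le_bytes(&data);
    assert_eq!(f32s_from_le_bytes(&bytes), data);
}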

If the answer to the above is yes, and there are no external constraints on the byte order, I don't see why the byte packing (which does seem to pad with 0s, then reinterpret as a Vec of big-endian u32) could not just use the system endianness, i.e. pad and then reinterpret without any endianness conversion, and do so in place. The current implementation seems to create two extra Vecs: one in pack_i8s_to_u32s, then another in to_vec().

In any case, I will implement a fn extend_from_slice for Bytes to accommodate storing the quantization strategy there.

A lot of methods seem to ignore the difference between values_as_bytes and as_bytes and try to make use of the additional quantization data, too (such as iter, to_vec). Is that intentional?

@laggui
Member

laggui commented Nov 16, 2024

@laggui I wanted to merge the changes from quantization and in the process came up with a few questions, again about serialization, where I'm not sure what is intended and what is not.

There should be no difference between this PR and the state on main, but (de)serialization seems to assume that data in storage is in system byte order; I don't see any conversion from/to network byte order. So serializing a tensor on a little-endian (LE) system such as x86, then deserializing on a BE system (well, mostly SPARC, and fairly uncommon I think) would shuffle the bytes and not recover the data as intended. Further, network order is big endian (which is a pity, as most systems then have to perform a byte shuffle for merely historical reasons). So the question is this: since serialization is performed as a byte sequence where endianness matters, is the current state, which just assumes little endian everywhere and does not correct for it, correct?

Yep, we don't check for endianness atm so we assume everything is fine for now. Might have to check in the future.

If the answer to the above is yes, and there are no external constraints on the byte order, I don't see why the byte packing (which does seem to pad with 0s, then reinterpret as a Vec of big-endian u32) could not just use the system endianness, i.e. pad and then reinterpret without any endianness conversion, and do so in place. The current implementation seems to create two extra Vecs: one in pack_i8s_to_u32s, then another in to_vec().

Funny you bring this up; I realized the other day while making some changes that the "packing" order might not be optimal. Quantization is still WIP, but we were just about to refactor the quantization kernels in jit along with the representation. Learning some things as we go :)

In any case, I will implement a fn extend_from_slice for Bytes to accommodate storing the quantization strategy there.

A lot of methods seem to ignore the difference between values_as_bytes and as_bytes and try to make use of the additional quantization data, too (such as iter, to_vec). Is that intentional?

In iter we chose not to dequantize the values, so the current behavior is intentional. It doesn't make as much sense in to_vec; we should change that.

@WorldSEnder
Contributor Author

WorldSEnder commented Nov 16, 2024

Yep, we don't check for endianness atm so we assume everything is fine for now. Might have to check in the future.

Ah okay! If you do in the future, that could be a good spot to move from the assumption of a MAX_ALIGN to something more dtype-dependent, as I would assume that converting the byte order also needs to know the underlying data type, which could in one swoop also fix the runtime alignment (try_enforce_runtime_align in this PR).

@laggui
Member

laggui commented Nov 18, 2024

I was wondering if it was intentional that you could push further bytes after construction, because that seems a bit weird: you could push e.g. a single u8 even when the TensorData contains u16 values.

I think there might be some use cases for that actually. I discussed this with @nathanielsimard last week and, iirc, this was one of the reasons for keeping it as Vec<u8> at the moment. Correct me if I'm wrong, but this was for some work in burn-remote? I can't remember exactly what you mentioned, so maybe you could chime in here with your points.

@nathanielsimard
Member

I don't recall a reason to push u8 into the vector.

@WorldSEnder
Contributor Author

WorldSEnder commented Nov 19, 2024

@syl20bnr I believe https://github.com/tracel-ai/github-actions/blob/b87057a7f68360d9d01a8da21de874c38e7acbab/setup-rust/action.yml#L23-L26 is accidentally being run on Windows machines as well, leading to tests failing there.

@syl20bnr
Member

Argh, thank you, fixing it.

@syl20bnr
Member

@WorldSEnder should be fixed, I restarted the job.

The new Allocation struct keeps the raw allocation and its layout,
the Bytes struct wraps an Allocation and asserts that len bytes of it are initialized
@WorldSEnder
Contributor Author

WorldSEnder commented Nov 19, 2024

Okay, I implemented Bytes::extend_from_byte_slice to support appending quantization data. This means TensorData::quantized should use at least one fewer Vec copy in most cases. Note that pack_i8s_to_u32s could probably be done in place, but that's left for another PR.

I realized that most methods that use the bytes, such as the aforementioned as[_mut]_slice and to_vec/into_vec, refuse on the grounds of mismatched data types ("normal" vs quantized) before ever touching the unrelated quantization data. Perhaps this could be a deliberate choice, telling users to dequantize before being allowed to use these methods anyway.

As a side effect of working Bytes into quantized, it now has a length check (unified with new); I suppose this was meant to be there in the first place. Let me know if I missed anything :)

@juntyr

juntyr commented Dec 4, 2024

TL;DR: This PR should be merged ASAP. Fixing a soundness issue that could turn into a security advisory should not be stalled for this long.

I recently started experimenting with burn and enjoyed it a lot, but now I came across this PR and I'm concerned.

As explained in the linked issue, using a Vec<u8> is incorrect: the memory gets deallocated with an alignment/memory layout different from the one it was allocated with, which violates the allocator contract.

This is still an understatement. When violating a safety condition of an unsafe method, you need to assume that this causes undefined behaviour (and in this case it immediately does and is caught by Miri).

While these changes to handle alignment concerns more generically could certainly provide better safety, they also add complexity and might not be immediately necessary. That's why we're a bit hesitant about making this change for now. As you noticed, this is not an issue for most allocators.

The Rust compiler is allowed to do anything with a program that exhibits undefined behaviour. So any program that uses burn somewhere could, in theory, be compiled into complete garbage because burn includes unsound code. The compiler is not so aggressively optimising/punishing that every soundness issue infects compilation this way, as can be seen by this code running just fine, but that can change with any new release of Rust. It would be a shame if some larger application upgraded their Rust version and suddenly crashed at runtime because the compiler had become smart enough to exploit this undefined behaviour for further optimisations.

While you are right that this issue does not manifest on the most popular platforms, Rust has a high standard for writing sound code: "it hasn't segfaulted on my machine yet" is not sufficient in Rust. Safe Rust code must never exhibit undefined behaviour (so any soundness issue should be fixed as soon as it is discovered) and should never segfault, and since you are exposing TensorData with a safe interface, you are required to make sure that no undefined behaviour can occur when calling it from safe code.

@laggui
Member

laggui commented Dec 4, 2024

"it hasn't segfaulted on my machine yet" is not sufficient in Rust.

To be fair, it hasn't segfaulted on any machine for Burn users 😅 The current implementation is safe on the target platforms using Rust's default allocator. The discussion in this PR details why it works in practice with the supported tensor data types, and other allocators were out of scope. Now, as I mentioned in my previous comments, this doesn't mean that it is a non-issue.

I actually would have liked to merge this two weeks ago following the planned release, but we had to work on other issues and improvements (here and in CubeCL), which caused us to push back our release schedule. The plan was to make this breaking change for the following release.

@juntyr

juntyr commented Dec 4, 2024

I’m happy to hear the PR will be merged soon <3

I really like burn so far and want to use it for a longer term project.

The default Rust std allocator is unspecified, but it currently delegates to the System allocator. After a few more layers of redirection, we end up with its implementation here (for Unix; adjacent files cover the other operating systems): https://github.com/rust-lang/rust/blob/master/library/std/src/sys/alloc/unix.rs You can see that on Unix it boils down to calls into libc.

However, different targets use different allocators to back System, so you would have to check all of their implementations to see whether your stronger requirements are met.

Furthermore, supplying a different global allocator is not a feature that a Rust crate can choose not to support. The global allocator is opaque to all libraries, and the only assumptions you can make about it are those in its safety contract.

So you could, with just safe code, change the allocator to one where freeing memory with a different (but same-sized) layout than it was allocated with (which is the effect of creating a Vec<u8> from the raw parts of a Vec of a type with a different size and alignment) does mess things up. In that case, the UB would still be burn's fault.
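As an illustration of that point (hypothetical allocator, std-only, not suggesting anyone actually do this): safe code can install a custom global allocator for the whole program, and such an allocator is allowed to rely on the layout it is handed back at dealloc time.

use std::alloc::{GlobalAlloc, Layout, System};

// An allocator wrapper that could, in principle, key its bookkeeping on the
// layout (e.g. per-alignment free lists). Freeing with a layout different from
// the one used at allocation time would then corrupt its internal state.
struct LayoutStrictAlloc;

unsafe impl GlobalAlloc for LayoutStrictAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // SAFETY: forwarded directly; the caller upholds GlobalAlloc's contract.
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // A stricter allocator could branch on layout.align() here.
        // SAFETY: forwarded directly; the caller upholds GlobalAlloc's contract.
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Installing it requires no unsafe code at the use site.
#[global_allocator]
static GLOBAL: LayoutStrictAlloc = LayoutStrictAlloc;

fn main() {
    // Every heap allocation in the program now goes through LayoutStrictAlloc,
    // including those made inside any library such as burn.
    let v = vec![0u32; 4]; // allocated with Layout::array::<u32>(4)
    drop(v); // must be freed with that exact layout
}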

All of this is to say that UB must be taken seriously.

Perhaps it might be a good idea to run your test suite in CI with Miri (probably only the pure Rust parts, so e.g. just the NdArray backend and only on smaller examples since Miri is quite slow) so that you can have more reassurance that your code is safe.

@laggui
Member

laggui commented Dec 16, 2024

@WorldSEnder thanks for keeping this up to date with the changes on main 🙏

Before merging, I wanted to be sure that there were no big performance impacts, but running a quick serialization/deserialization test showed an approx. 30% slowdown for deserialization. In my previous benchmarks I only tested module loading (without saving/loading on disk), so this discrepancy seems to be directly related to the Deserialize implementation.

~~Once the deserialization is on par, this should be good to merge 🙂~~

I'll look into this tomorrow but feel free to ping me otherwise.

Benchmark Code
use std::time::{Duration, Instant};

use burn::backend::ndarray::NdArrayDevice;
use burn::backend::NdArray;
use burn::record::{FullPrecisionSettings, NamedMpkFileRecorder};
use burn::tensor::backend::Backend;
use burn::{config::Config, module::Module, nn};

#[derive(Module, Debug)]
struct BenchmarkModule<B: Backend> {
    linears: Vec<nn::Linear<B>>,
}

#[derive(Config, Debug)]
struct BenchmarkConfig {
    linear: nn::LinearConfig,
    num_layers: usize,
    path: String,
}

impl BenchmarkConfig {
    pub fn init<B: Backend>(&self, device: &B::Device) -> BenchmarkModule<B> {
        BenchmarkModule {
            linears: (0..self.num_layers)
                .map(|_| self.linear.init(device))
                .collect(),
        }
    }
}

fn min_max_median_durations(durations: &[Duration]) -> (Duration, Duration, Duration) {
    let mut sorted = durations.to_vec();
    sorted.sort();
    let min = *sorted.first().unwrap();
    let max = *sorted.last().unwrap();
    let median = *sorted.get(sorted.len() / 2).unwrap();
    (min, max, median)
}

fn main() {
    type B = NdArray;
    let device = NdArrayDevice::Cpu;
    let n = 100;

    let config = BenchmarkConfig::new(
        nn::LinearConfig::new(2048, 2048),
        12,
        "/tmp/serialize.mpk".into(),
    );

    let recorder = NamedMpkFileRecorder::<FullPrecisionSettings>::new();

    let module = config.init::<B>(&device);
    module.save_file(config.path.clone(), &recorder).unwrap();

    println!("Running {n} samples");

    let mut time = Vec::new();
    for _ in 0..n {
        let module = config.init::<B>(&device);
        let start = Instant::now();
        module
            .load_file(config.path.clone(), &recorder, &device)
            .unwrap();
        let end = start.elapsed();
        time.push(end);
    }

    let (min, max, med) = min_max_median_durations(&time);
    let mean = time.iter().sum::<Duration>() / time.len() as u32;

    println!("Min: {:?}", min);
    println!("Max: {:?}", max);
    println!("Mean: {:?}", mean);
    println!("Median: {:?}\n", med);
}

/edit: I isolated the benchmark in a simple app; it looks like the results are pretty similar for lazy loading.

main

Running 100 samples
Min: 49.275069ms
Max: 73.876344ms
Mean: 60.649308ms
Median: 58.640956ms

Running 100 samples
Min: 43.957336ms
Max: 72.071386ms
Mean: 60.268724ms
Median: 59.620389ms

Running 100 samples
Min: 46.436204ms
Max: 75.203106ms
Mean: 59.823427ms
Median: 59.33012ms

Running 100 samples
Min: 44.280111ms
Max: 77.853776ms
Mean: 60.594653ms
Median: 59.622567ms

Running 100 samples
Min: 43.539663ms
Max: 91.156482ms
Mean: 64.837496ms
Median: 62.677896ms

Bytes struct

Running 100 samples
Min: 50.813914ms
Max: 81.84623ms
Mean: 60.453739ms
Median: 59.111315ms

Running 100 samples
Min: 52.386686ms
Max: 80.432294ms
Mean: 60.532419ms
Median: 59.04348ms

Running 100 samples
Min: 51.889802ms
Max: 81.399073ms
Mean: 60.823728ms
Median: 59.243563ms

Running 100 samples
Min: 56.4987ms
Max: 81.429559ms
Mean: 61.664302ms
Median: 59.56763ms

Running 100 samples
Min: 50.892315ms
Max: 82.501045ms
Mean: 63.365258ms
Median: 62.532148ms

@laggui laggui merged commit 28f99d1 into tracel-ai:main Dec 17, 2024
11 checks passed
@WorldSEnder WorldSEnder deleted the fix-vec-align branch December 18, 2024 08:35
Successfully merging this pull request may close these issues.

Wrong usage of unsafe Vec::from_raw_parts in TensorData