Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocator alignment problems - remove it? #452

Closed
wants to merge 10 commits into from

Conversation

MichaelRawson
Copy link
Contributor

Not intended for merge.

Recently I started running a sanitizer for undefined behaviour (-fsanitize=undefined) and to my surprise there are relatively few issues, mostly easy to fix.

Unfortunately, one of them is not so easy: Vampire's allocator returns memory that is not sufficiently aligned for some objects. Allocator seems to align to pointer-sized (8 bytes on my machine) boundaries, but some types - for example std::ifstream - have larger alignment requirements. So far I've seen up to std::max_align_t (16 bytes), but even greater alignment is theoretically possible, so-called "over-aligned" types.

There are several possible approaches here:

  1. Do nothing, maybe alignment isn't such a big issue. This has indeed worked so far, but compilers are getting smarter and assuming a greater amount of standards-compliance from code. I would not be surprised if compilers started emitting aligned loads where they are technically allowed to - which would be absolutely horrendous to track down, and then we'd have to fix Allocator anyway.
  2. Fix Allocator by always aligning to std::max_align_t intervals. This is what the standard says it must do, but I suspect carries some performance hit as each allocation must take up at least 16 bytes. It will also not work with over-aligned types, should these be necessary in future. System allocators tend to support over-aligned types: the GNU C++ environment supports over-aligned types with -faligned-new, switched on by default from C++17.
  3. Fix Allocator by working out what the alignment should be (or falling back to alignof(std::max_align_t)) and allocating with variable alignment. I tried this yesterday! Wiring up Vampire this way is possible if tedious, but changing Allocator to handle variably-aligned allocations seems near-impossible in the current implementation and would amount to a rewrite, possibly introducing significant slowdown along the way.
  4. (ノಠ益ಠ)ノ彡┻━┻ - remove Allocator and use the system allocator instead. This is what this PR does - the minimal change required to fall back to the system allocator. If we go down this route we should also change some other things, such as removing the need for STLAllocator, vvector and friends, and possibly removing e.g. USE_ALLOCATOR from the whole source tree. Pros and cons discussed below.

Solutions (2) or (4) seem promising, but we will probably have to accept some slowdown in either case. (2) is more conservative, but as you can tell, I'm considering (4). This has disadvantages:

  1. I'm led to believe that the initial motivation for a custom allocator was improved performance. System allocators have improved significantly in the meantime, but there is still a small (somewhat inconsistent) performance hit using the system allocator. One might argue that this is because the system allocator correctly aligns allocations. ;-)
  2. We no longer track how much memory we've used. This seems like a bigger problem than it is: if we really need this (and I'm not sure we do) we could write a much thinner wrapper around the system allocator or write platform-specific code to query the operating system.

But also some advantages:

  1. No undefined behaviour!
  2. System allocators get better over time.
  3. Compilers can increasingly recognise and transform heap allocation patterns. I expect that a custom allocator defeats some of these passes unless we can annotate Allocator suitably.
  4. Can use the system allocator without playing the traditional game of BYPASSING_ALLOCATOR whack-a-mole. This is not such a big deal for seasoned Vampire hackers, but new contributors consistently run into this. This is not as simple as "just use Lib::Stack rather than std::vector" - calling almost anything in std might allocate if you are not careful, like I/O routines. We also have to patch or work around third-party code like MiniSat or Z3.
  5. Smaller Vampire, no USE_ALLOCATOR oddities.

In summary, there is a bug wandering around, and it will most likely bite us eventually. In my view the best solution is to remove Allocator and use the system allocator, like the majority of other programs. There is a performance hit, but in exchange we get significant convenience and the system allocator is fully correct.

I'd like your thoughts on this.


To attempt to quantify the performance hit, I ran 10-second DISCOUNT runs on a random subset of TPTP until Michaela started complaining about the fan noise. Log attached - format is problem master system-allocator, problems where neither solved it are removed.

comparison.txt

@quickbeam123
Copy link
Collaborator

@MichaelRawson, how did you get your second column vampire - was it simply via -DUSE_SYSTEM_ALLOCATION=1?

@MichaelRawson
Copy link
Contributor Author

No, USE_SYSTEM_ALLOCATION seems to cause more performance regression than strictly necessary. I suspect inlining - the compiler can probably inline (some of) the system operator new across Vampire, but not stuff from Allocator.cpp.

To get the right-hand column, in this branch I manually removed Allocator and replaced it with a skeleton.

@quickbeam123
Copy link
Collaborator

So in my randomized setup, a discount10 strategy running for 50000Mi would lose around 80 problems (from first-order TPTP) on average.

This was done just with the USE_SYSTEM_ALLOCATION=1 switch. But generally, I would be hesitant to endorse such a degradation "just to be" standard complying.

(Do we know what is the alignment important on the compiler side?)

Having said that, here is a thought. With the TermList/Term representation in Vampire, we are ourselves relying on Term pointers to be 4-byte aligned (and use the two lowest bits for tagging). Since you brought up the issue with no alignment guarantee by Allocator, I suspect we could break vampire ourselves at the moment, by simply allocating a single char/byte from time to time. (To force some Terms to get allocated at non-aligned addresses.) Could that possibly work?

@MichaelRawson
Copy link
Contributor Author

So in my randomized setup, a discount10 strategy running for 50000Mi would lose around 80 problems (from first-order TPTP) on average.

Interesting - and not as bad as I feared. Thanks for running the experiment so quickly!

I would be hesitant to endorse such a degradation "just to be" standard complying.

Fair enough - it's arguably the major problem with this line of thinking. It might be possible to claw something back with some targeted optimisation if we knew what causes the slowdown - any ideas? I'll look into that.

Do we know what is the alignment important on the compiler side?

I suspect we could break vampire ourselves

The TermList scenario you suggest is actually OK - Allocator does do some alignment, but only to the level of pointers (8 bytes) - so we're OK for TermList, which only needs 4-byte alignment. We are already broken in some sense because the sanitizer reports that we have UB when allocating a std::ifstream here (and we actually reach it, too), but it doesn't seem to actually cause a crash in this case.

Do we know what is the alignment important on the compiler side?

I'm parsing this as "Why is alignment important for the compiler?" - sorry for condescension if I misunderstood.

Alignment is important for compilers - some architectures cannot do unaligned loads/stores, and for some it's inefficient, so the compiler has to work out if some data is aligned, and work around it if it is not. Intel is actually pretty permissive in this respect, it can do both and in most cases there's not much difference in terms of performance. Architectures may have separate aligned/unaligned instructions - e.g. movaps and movups respectively on Intel.

As a result, e.g. Clang will emit aligned instructions by preference if it can prove that something "should" (in a standard-compliant program) be aligned. If it turns out not to be aligned, even Intel chips will fault and the program will crash immediately. Contrived example: https://godbolt.org/z/r36Y5Kqj5 - if Vampire did not put the allocation on a 16-byte boundary, we'd crash here.

@MichaelRawson
Copy link
Contributor Author

An update - I will try to implement a less-complex allocator that is both standards-compliant and doesn't have a performance hit. If I succeed I will close this and open a new PR for that one. If not, we can return to this.

@MichaelRawson MichaelRawson deleted the michael-remove-allocator branch August 10, 2023 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants