I noticed that some builtins, such as bitreverse, compile to slower code than what the LLVM backend generates, so it'd be good to optimize them. This includes the SIMD intrinsics, since some of them are currently implemented without SIMD instructions.
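To illustrate the kind of gap I mean, here is a rough sketch in C (not the actual implementation in either backend). The function names `bitreverse32_loop` and `bitreverse32_swap` are made up for this example. The first moves one bit per iteration; the second is the well-known mask-and-shift approach that swaps progressively larger groups in five steps, which is closer to what an optimizing backend would typically emit when there is no dedicated instruction.

```c
#include <stdint.h>

/* Hypothetical naive lowering: moves one bit per iteration (32 iterations). */
uint32_t bitreverse32_loop(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u); /* append the lowest bit of x to r */
        x >>= 1;
    }
    return r;
}

/* Hypothetical faster lowering: swap adjacent bits, then pairs, nibbles,
 * bytes, and finally the two 16-bit halves. */
uint32_t bitreverse32_swap(uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    x = (x >> 16) | (x << 16);
    return x;
}
```

Both functions return the same result for every input; the difference is only in how many operations they take. On targets with a native bit-reverse or byte-swap instruction, a backend can likely do better still.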