The current implementations generate large binaries because they have one specialized implementation per bit width and unroll their loops. Add an implementation, enabled by a flag, that uses a single, more compact scalar code path. This would be useful for WebAssembly, for instance.
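
As a rough illustration of what a compact scalar path could look like, here is a minimal sketch. It assumes the routines in question pack and unpack `u32` values at a fixed bit width. The names `pack_scalar` and `unpack_scalar`, the byte layout (least significant bits first), and the `compact` feature name mentioned in the comments are all hypothetical and not taken from the existing code. The bit width is a runtime argument, so one loop covers every width instead of one unrolled routine per width.

```rust
// Hypothetical sketch: in a real crate these functions might sit behind
// something like `#[cfg(feature = "compact")]` (feature name is an assumption).

/// Appends `input` to `output`, using `num_bits` bits per value (0..=32).
/// Bits are written least-significant first; higher bits of each value are dropped.
pub fn pack_scalar(input: &[u32], num_bits: u8, output: &mut Vec<u8>) {
    assert!(num_bits <= 32);
    let mask: u64 = (1u64 << num_bits) - 1;
    let mut acc: u64 = 0; // bit accumulator
    let mut filled: u32 = 0; // number of valid bits currently in `acc`
    for &value in input {
        acc |= (value as u64 & mask) << filled;
        filled += num_bits as u32;
        // Flush every complete byte.
        while filled >= 8 {
            output.push(acc as u8);
            acc >>= 8;
            filled -= 8;
        }
    }
    // Flush the trailing partial byte, if any.
    if filled > 0 {
        output.push(acc as u8);
    }
}

/// Reads `count` values of `num_bits` bits each from `input`.
/// Panics if `input` holds fewer bits than requested.
pub fn unpack_scalar(input: &[u8], num_bits: u8, count: usize) -> Vec<u32> {
    assert!(num_bits <= 32);
    let mask: u64 = (1u64 << num_bits) - 1;
    let mut out = Vec::with_capacity(count);
    let mut acc: u64 = 0;
    let mut filled: u32 = 0;
    let mut bytes = input.iter();
    for _ in 0..count {
        // Refill the accumulator until it holds at least one full value.
        while filled < num_bits as u32 {
            let byte = *bytes.next().expect("input too short");
            acc |= (byte as u64) << filled;
            filled += 8;
        }
        out.push((acc & mask) as u32);
        acc >>= num_bits;
        filled -= num_bits as u32;
    }
    out
}
```

Calling `pack_scalar(&values, 3, &mut packed)` followed by `unpack_scalar(&packed, 3, values.len())` should return the original values as long as each fits in 3 bits. The trade-off is the one the request implies: a loop like this will likely be slower than the specialized, unrolled versions, but it compiles to a small amount of code.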