TODO

x Map large pages
  x Already done (check via tracefs)

x Copy keys to DRAM first
  x Huge speedup
  x Use stdio to read/write records
    x also speedup
  x Or PMDK with non-temporal writes of 100B size?
    x Made it worse
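
Sketch of the copy-then-sort idea, for reference. It assumes 100-byte
records that begin with a 10-byte key; the key length and all names
(struct entry, build_sorted_index) are illustrative assumptions, not taken
from the actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { KEY_LEN = 10, REC_LEN = 100 };   /* assumed record layout */

/* One DRAM-resident entry per record: a copy of the key plus a pointer
 * back to the full record, which stays where it is (e.g. on NVM). */
struct entry {
    unsigned char key[KEY_LEN];
    const unsigned char *rec;
};

static int entry_cmp(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return memcmp(x->key, y->key, KEY_LEN);
}

/* Copy the keys of n records starting at base into a malloc'd index and
 * sort it. Returns NULL on allocation failure; the caller frees it. */
static struct entry *build_sorted_index(const unsigned char *base, size_t n)
{
    assert(base != NULL || n == 0);
    struct entry *idx = malloc(n * sizeof *idx);
    if (idx == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        memcpy(idx[i].key, base + i * REC_LEN, KEY_LEN);
        idx[i].rec = base + i * REC_LEN;
    }
    qsort(idx, n, sizeof *idx, entry_cmp);
    return idx;
}
```

The sort then only touches DRAM; the records are read once more, in sorted
order via idx[i].rec, when writing the output.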

* Replace qsort/memcmp with something else?
  * AVX-512 sort? Didn't find one that handles arbitrary-length keys (only
    uint32_t and uint64_t).
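
One cheaper alternative to a vectorized sort, not tried in these notes:
keep qsort but compare the first 8 key bytes as a single big-endian integer
and fall back to memcmp only for the tail. A sketch, assuming a 10-byte key
(KEY_LEN, load_be64 and key_cmp are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { KEY_LEN = 10 };   /* assumed key length */
static_assert(KEY_LEN >= 8, "prefix compare needs at least 8 key bytes");

/* Load 8 bytes as a big-endian integer so that unsigned integer order
 * matches memcmp order on the same bytes. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Same ordering as memcmp(a, b, KEY_LEN), decided by one integer compare
 * whenever the first 8 bytes differ. */
static int key_cmp(const unsigned char *a, const unsigned char *b)
{
    uint64_t x = load_be64(a), y = load_be64(b);
    if (x != y)
        return x < y ? -1 : 1;
    return memcmp(a + 8, b + 8, KEY_LEN - 8);
}
```

Whether this beats the libc memcmp for such short keys would need
measuring.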

x Split keys from pointers in DRAM
  x Made it slightly slower

x Reduce pointer size
  x Small speedup
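
How the pointer was shrunk isn't recorded here. One way to do it, shown as
an assumption: store a 32-bit record index instead of a 64-bit pointer and
recompute the address when needed. KEY_LEN, REC_LEN and the struct names
are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { KEY_LEN = 10, REC_LEN = 100 };   /* assumed record layout */

/* Key plus full pointer: 24 bytes on a typical 64-bit ABI (with padding). */
struct wide_entry {
    unsigned char key[KEY_LEN];
    const unsigned char *rec;
};

/* Key plus 32-bit record index: 16 bytes on the same ABI. */
struct narrow_entry {
    unsigned char key[KEY_LEN];
    uint32_t idx;
};

/* Build the entry for record i of the input mapped at base. */
static struct narrow_entry make_entry(const unsigned char *base, size_t i)
{
    assert(i <= UINT32_MAX);   /* limits the input to 2^32 records */
    struct narrow_entry e;
    memcpy(e.key, base + i * REC_LEN, KEY_LEN);
    e.idx = (uint32_t)i;
    return e;
}

/* Recover the record address from the stored index. */
static const unsigned char *entry_record(const unsigned char *base,
                                         const struct narrow_entry *e)
{
    return base + (size_t)e->idx * REC_LEN;
}
```

Smaller entries mean qsort moves fewer bytes per swap and more entries fit
in cache.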

x Allow non-power-of-two number of mappers
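
How keys are assigned to mappers isn't spelled out here. Assuming each
mapper owns a contiguous key range chosen from the first two key bytes, an
arbitrary mapper count can be supported by scaling the prefix instead of
masking off its top bits (mapper_for_key is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* Map a key to one of n_mappers contiguous key ranges using its first two
 * bytes. The result is monotonic in the key prefix, so mapper i only sees
 * keys that are <= every key seen by mapper i + 1. */
static unsigned mapper_for_key(const unsigned char *key, unsigned n_mappers)
{
    assert(n_mappers > 0 && n_mappers <= 65536);
    uint32_t prefix = ((uint32_t)key[0] << 8) | key[1];   /* 0 .. 65535 */
    /* prefix * n_mappers < 2^32, so the product cannot overflow. */
    return (unsigned)((prefix * (uint32_t)n_mappers) >> 16);
}
```

This only balances well if the key prefixes are roughly uniformly
distributed.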

* qsort file-by-file (as we read them) and then merge-sort the lot?
  * To overlap with the read phase. Not sure if useful (read phase is very short).
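
For reference, the merge step of that idea could look like the sketch
below: each file's keys are sorted as they arrive, then the sorted runs are
merged. It assumes 10-byte keys and a small number of runs, so a linear
scan over the run heads is used rather than a heap. All names are
illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { KEY_LEN = 10 };   /* assumed key length */

/* Merge k sorted runs of KEY_LEN-byte keys into out, which must have room
 * for the sum of all lens[i] keys. runs[i] points at run i and lens[i] is
 * its key count. Returns 0 on success, -1 on allocation failure. */
static int merge_runs(const unsigned char *const *runs, const size_t *lens,
                      size_t k, unsigned char *out)
{
    assert(k == 0 || (runs != NULL && lens != NULL));
    size_t *pos = calloc(k ? k : 1, sizeof *pos);   /* next key per run */
    if (pos == NULL)
        return -1;
    for (;;) {
        size_t best = k;   /* k means "no run has keys left" */
        for (size_t i = 0; i < k; i++) {
            if (pos[i] == lens[i])
                continue;   /* run i is exhausted */
            if (best == k ||
                memcmp(runs[i] + pos[i] * KEY_LEN,
                       runs[best] + pos[best] * KEY_LEN, KEY_LEN) < 0)
                best = i;
        }
        if (best == k)
            break;
        memcpy(out, runs[best] + pos[best] * KEY_LEN, KEY_LEN);
        out += KEY_LEN;
        pos[best]++;
    }
    free(pos);
    return 0;
}
```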
   
x Look at membw_stdio to see if we're conducting IO efficiently
  x We are. 7.2 GB/s read & write throughput.

x qsort on NVM (to overlap compute with IO)
  x Slower. Probably latency-bound.

x Partitioning with fread instead of mmap made no perf difference

x Use quickselect instead of a full sort to find the lowest elements as
  quickly as possible, so that writing (the current bottleneck) can start
  while the rest is still being sorted
  x Made it slightly slower
  * Maybe start another thread to write the data asynchronously?
  * Use async IO? (Not sure that's supported with NVM.)
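
Sketch of the quickselect step, for reference. It is not the code that was
benchmarked; it assumes a flat array of 10-byte keys, and select_lowest is
a hypothetical name. After the call the m smallest keys sit at the front,
unsorted, so they can be sorted and handed to the writer first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { KEY_LEN = 10 };   /* assumed key length */

static void swap_keys(unsigned char *a, unsigned char *b)
{
    unsigned char tmp[KEY_LEN];
    if (a == b)
        return;   /* memcpy must not be given overlapping buffers */
    memcpy(tmp, a, KEY_LEN);
    memcpy(a, b, KEY_LEN);
    memcpy(b, tmp, KEY_LEN);
}

/* Rearrange n keys so that the m smallest occupy keys[0 .. m-1], in
 * arbitrary order. Average cost is O(n) instead of O(n log n). */
static void select_lowest(unsigned char *keys, size_t n, size_t m)
{
    assert(m <= n);
    size_t lo = 0, hi = n;   /* boundary m lies inside [lo, hi] */
    while (m > lo && m < hi) {
        /* Lomuto partition of [lo, hi) around the middle key. */
        size_t mid = lo + (hi - lo) / 2;
        swap_keys(keys + mid * KEY_LEN, keys + (hi - 1) * KEY_LEN);
        const unsigned char *pivot = keys + (hi - 1) * KEY_LEN;
        size_t store = lo;
        for (size_t i = lo; i < hi - 1; i++) {
            if (memcmp(keys + i * KEY_LEN, pivot, KEY_LEN) < 0) {
                swap_keys(keys + i * KEY_LEN, keys + store * KEY_LEN);
                store++;
            }
        }
        swap_keys(keys + store * KEY_LEN, keys + (hi - 1) * KEY_LEN);
        /* The pivot is now in its final position at index store. */
        if (m <= store)
            hi = store;
        else
            lo = store + 1;
    }
}
```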