From 3b6eb55c775e8c22d89c2a584160e3fe81cf9f5a Mon Sep 17 00:00:00 2001
From: lindstro
Date: Thu, 22 Aug 2024 12:45:17 -0700
Subject: [PATCH] Add FAQ on zfp block size

---
 docs/source/faq.rst | 88 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/docs/source/faq.rst b/docs/source/faq.rst
index 35908b600..a4db2d1b5 100644
--- a/docs/source/faq.rst
+++ b/docs/source/faq.rst
@@ -42,6 +42,7 @@ Questions answered in this FAQ:
   #. :ref:`How large a buffer is needed for compressed storage? `
   #. :ref:`How can I print array values? `
   #. :ref:`What is known about zfp compression errors? `
+  #. :ref:`Why are zfp blocks 4 * 4 * 4 values? <q-block-size>`
 
 -------------------------------------------------------------------------------
 
@@ -1312,3 +1313,90 @@ done to combat such issues by supporting optional
 (left), errors are biased and depend on the relative location within a |zfp|
 block, resulting in errors not centered on zero. With proper rounding
 (right), errors are both smaller and unbiased.
+
+-------------------------------------------------------------------------------
+
+.. _q-block-size:
+
+Q31: *Why are zfp blocks 4 * 4 * 4 values?*
+
+One might ask why |zfp| uses *d*-dimensional blocks of |4powd| values rather
+than some other, perhaps configurable block size, *n*\ :sup:`d`. There are
+several reasons why *n* = 4 was chosen:
+
+* For good performance, *n* should be an integer power of two so that
+  indexing can be done efficiently using bit masks and shifts rather than
+  more expensive division and modulo operations (see the indexing sketch
+  following this list). As compression demands *n* > 1, the possible
+  choices for *n* are 2, 4, 8, 16, ...
+
+* When *n* = 2, blocks are too small to exhibit significant redundancy;
+  there simply is too little spatial correlation to exploit for effective
+  data reduction. Additionally, excessive software cache thrashing would
+  likely occur in stencil computations, as even the smallest centered
+  difference stencil spans more than one block. Finally, per-block overhead
+  in storage (e.g., shared exponent, bit offset) and computation (e.g.,
+  software cache lookup) could be amortized over only a few values. Such
+  small blocks were immediately dismissed.
+
+* When *n* = 8, blocks are too large, for several reasons:
+
+  * Each uncompressed block occupies a large number of hardware cache lines.
+    For example, a single 3D block of 8 |times| 8 |times| 8 double-precision
+    values would occupy 4,096 bytes, a significant fraction of L1 cache.
+    |zfp| reduces data movement in computations by ensuring that repeated
+    accesses are to cached data rather than to main memory.
+
+  * A generalization of the |zfp| :ref:`decorrelating transform ` to
+    *n* = 8 would require many more operations as well as "arbitrary"
+    numerical constants that demand expensive multiplications instead of
+    cheap bit shifts. The number of operations in this more general case
+    scales as *d* |times| *n*\ :sup:`d+1`. For *d* = 4, *n* = 8, this
+    implies 2\ :sup:`17` = 131,072 multiplications and 114,688 additions
+    per block. Contrast this with the algorithm optimized for *n* = 4,
+    which uses only 1,536 bit shifts and 2,560 additions or subtractions
+    per 4D block.
+
+  * The additional computational cost would also significantly increase the
+    latency of decoding a single block or of filling a pipeline of
+    concurrently (de)compressed blocks, as in existing |zfp| hardware
+    implementations.
+
+  * The computational and cache storage overhead of accessing a single
+    value in a block would be very large: 8\ :sup:`4` = 4,096 values in
+    *d* = 4 dimensions would have to be decoded even if only one value
+    were requested.
+
+  * "Skinny" arrays would have to be padded to multiples of *n* = 8, which
+    could introduce an unacceptable storage overhead. For instance, a
+    30 |times| 20 |times| 3 array of 1,800 values would be padded to
+    32 |times| 24 |times| 8 = 6,144 values, about 3.4 times the original
+    size. In contrast, when *n* = 4, only 32 |times| 20 |times| 4 = 2,560
+    values would be needed, representing a 42% overhead.
+
+  * The opportunity for data parallelism would be reduced by a factor of
+    2\ :sup:`d` compared to using *n* = 4. The finer granularity and larger
+    number of blocks provided by *n* = 4 help with load balancing and map
+    well to today's GPUs, which can concurrently process thousands of
+    blocks.
+
+  * With blocks comprising as many as 8\ :sup:`4` = 4,096 values, register
+    spilling would be substantial in GPU kernels for compression and
+    decompression.
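+
+For illustration, below is a minimal sketch, not |zfp|'s actual
+implementation, of how a power-of-two block size reduces value lookup to
+shifts and masks; the function names are hypothetical::
+
+  #include <stddef.h>
+
+  /* index of the 4 * 4 * 4 block containing element (x, y, z) of an
+     array partitioned into bx * by * bz blocks; x / 4 becomes x >> 2 */
+  static size_t
+  block_index(size_t x, size_t y, size_t z, size_t bx, size_t by)
+  {
+    return (x >> 2) + bx * ((y >> 2) + by * (z >> 2));
+  }
+
+  /* offset (x % 4) + 4 * (y % 4) + 16 * (z % 4) of element (x, y, z)
+     within its block; x % 4 becomes x & 3 */
+  static size_t
+  block_offset(size_t x, size_t y, size_t z)
+  {
+    return (x & 3u) + ((y & 3u) << 2) + ((z & 3u) << 4);
+  }
+
+For general *n*, the shifts and masks would have to be replaced by integer
+division and modulo, which are considerably more expensive on most
+processors.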
+
+The choice *n* = 4 seems to be a sweet spot that balances all of the above
+factors well. Additionally, *n* = 4 has these benefits:
+
+* *n* = 4 admits a very simple lifted implementation of the decorrelating
+  transform that can be performed using only integer addition, subtraction,
+  and bit shifts (see the sketch following this list).
+
+* *n* = 4 allows taking advantage of AVX/SSE instructions designed for
+  vectors of length four, both in the (de)compression algorithm and in
+  application code.
+
+* For 2D and 3D data, a block is 16 and 64 values, respectively, which
+  either evenly divides or is a multiple of the 32-thread warp size on
+  current GPU hardware. This allows multiple cooperating threads to execute
+  the same instruction, each on one value, whether during (de)compression
+  or in the numerical application code.
+
+* Using a rate of 16 bits/value (a common choice for numerical
+  computations), a compressed 3D block occupies 128 bytes, or one to two
+  hardware cache lines on contemporary computers. Hence, a fair number of
+  *compressed* blocks can also fit in hardware cache.
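+
+To illustrate the first point, here is a sketch of the 1D forward lifting
+step along the lines of the transform described in |zfp|'s documentation;
+|zfp|'s actual code differs in details such as strided access and type
+genericity. Each four-vector costs 6 bit shifts and 10 additions or
+subtractions, and applying the step along every dimension of a 4D block
+(4 dimensions |times| 64 four-vectors each) yields the 1,536 shifts and
+2,560 additions quoted above::
+
+  #include <stdint.h>
+
+  /* forward decorrelating transform of four integers via lifting */
+  static void
+  fwd_lift(int32_t* x, int32_t* y, int32_t* z, int32_t* w)
+  {
+    int32_t a = *x, b = *y, c = *z, d = *w;
+    a += d; a >>= 1; d -= a;
+    c += b; c >>= 1; b -= c;
+    a += c; a >>= 1; c -= a;
+    d += b; d >>= 1; b -= d;
+    d += b >> 1; b -= d >> 1;
+    *x = a; *y = b; *z = c; *w = d;
+  }
+
+Because each lifting step pairs an addition or subtraction with a shift,
+the transform is invertible by executing the steps in reverse, which is
+what makes the lifted formulation attractive.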