Reduce allocations #22
base: main
Conversation
This is a prelude to trying to reduce allocations. Initial result for the high-level benchmark:

Allocated total: 348802552
Allocated per:   1331

That is, 1331 bytes per IO operation. This includes allocations from the benchmark itself; indeed, profiling indicates that most of the allocations come from the benchmark itself rather than from the library.
Previously we generated a large list containing a random permutation of all the 4k disk blocks in the input file, and then broke that up into groups of 32 blocks. We used forConcurrently_ over the groups, which starts one thread per group. This approach did excessive work and allocated a lot. For a start, the random permutation is expensive in time and space: it needs to track a set of remaining choices, and it has to allocate the whole list because forConcurrently_ forces the list (of groups). And a thread per 32 blocks is a lot of threads once one uses large files: for our example target file of 262144 blocks, it amounts to 8192 threads, each with its own stack, so it is a lot of allocation in total.

The new approach is:
1. Use random block indexes rather than a permutation. This leads to duplicates but is much cheaper because it needs little state. The best solution to the duplicates will be to use O_DIRECT to avoid caching effects.
2. Only use a small number of threads, and have each one submit a series of batches of work (still of size 32). This uses much less state.

In addition, we use a lower-level approach to generating the random block indexes to reduce heap allocations.

The new allocation results for the high-level benchmark are:

Allocated total: 57904976
Allocated per:   221

That is, 221 bytes per IO op. This is without any changes yet to the actual library, just the benchmark.
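For a sense of the new shape, here is a minimal sketch (the names runTasks, submitBatch and randomBlocks are placeholders for this illustration, not the actual benchmark code): a handful of tasks, each submitting a fixed number of batches of reads at uniformly random block indexes, with the RNG split per task and per batch.

```haskell
import           Control.Monad (forM_)
import qualified Control.Concurrent.Async as Async
import qualified System.Random as Random

runTasks :: Int              -- ^ number of tasks
         -> Int              -- ^ batches per task
         -> Int              -- ^ operations per batch (e.g. 32)
         -> Int              -- ^ last valid block index
         -> Random.StdGen
         -> ([Int] -> IO ()) -- ^ submit one batch of block indexes
         -> IO ()
runTasks ntasks nbatches batchsz lastBlock rng0 submitBatch = do
    tasks <- mapM (Async.async . runTask) (take ntasks (splits rng0))
    mapM_ Async.wait tasks
  where
    -- An infinite supply of independent generators.
    splits g = let (g1, g2) = Random.split g in g1 : splits g2

    -- Each task submits a series of batches, splitting its RNG per batch.
    runTask rngTask =
      forM_ (take nbatches (splits rngTask)) $ \rngBatch ->
        submitBatch (randomBlocks batchsz lastBlock rngBatch)

    -- Random block indexes; duplicates are allowed, which is far cheaper
    -- than maintaining the state needed for a full permutation.
    randomBlocks n lb g = take n (Random.randomRs (0, lb) g)
```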
In the time and allocation profiling (using late cost centres) it helps to have things inlined or not inlined in order to get a readable structure in the profile results. This helps with identifying the source of the remaining allocations.
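For example, this is the kind of pragma placement meant here (an illustrative module, not the library's code; late cost centres come from compiling with -prof and, on GHC 9.4+, -fprof-late):

```haskell
module ProfilingSketch (hotLoop) where

-- Kept out of line: it remains a separate binding after optimisation, so the
-- late-cost-centre profile attributes its allocations to it by name.
{-# NOINLINE expensiveStep #-}
expensiveStep :: Int -> [Int]
expensiveStep n = [n, n * 2, n * 3]

-- Inlined away: it disappears from the profile, so its trivial cost is folded
-- into the caller instead of cluttering the output.
{-# INLINE tinyWrapper #-}
tinyWrapper :: Int -> Int
tinyWrapper = (+ 1)

hotLoop :: Int -> Int
hotLoop n = sum (map tinyWrapper (expensiveStep n))
```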
This tries to do two things:
1. reduce allocations and improve performance slightly;
2. improve the sync and async exception safety of submitIO.

Achieving exception safety is non-trivial and needs masking and multiple levels of exception handlers (a sketch of the general shape follows below). The one remaining difficulty is that if we get a sync exception from the lower-level submit operation then we do not yet reset the ring to clear the SQEs that we filled out.
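A minimal sketch of the kind of structure this needs, assuming a QSemN guards the submission capacity (hypothetical names, not the library's actual submitIO):

```haskell
import Control.Concurrent.QSemN (QSemN, signalQSemN, waitQSemN)
import Control.Exception        (mask_, onException)

submitBatchSketch :: QSemN -> Int -> IO () -> IO r -> IO r
submitBatchSketch qsem nops lowLevelSubmit awaitCompletion =
    mask_ $ do
      -- waitQSemN is interruptible, so an async exception here is fine:
      -- nothing has been acquired yet.
      waitQSemN qsem nops
      -- From here on, a failure in the low-level submit must hand the
      -- capacity back, otherwise it leaks.
      lowLevelSubmit `onException` signalQSemN qsem nops
      -- Capacity for successfully submitted IO is returned later by the
      -- completion path, so only the wait itself remains here.
      awaitCompletion
```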
High-level benchmark allocations are down to:

Allocated total: 28566920
Allocated per:   109

That is just under 14 words per operation, which isn't too bad, at least when using 32 operations per batch (many of the allocations are per-batch).
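As a rough cross-check (taking roughly 262144 operations per run, as with the example file above): 28566920 bytes / 262144 ops ≈ 109 bytes per op, and 109 bytes / 8 bytes per word ≈ 13.6 words per op.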
Force-pushed the branch from 12f05d4 to ad3fd1e.
They need GADTs or TypeFamilies enabled in a module to solve constraints arising from other modules.
Force-pushed the branch from ad3fd1e to 50a8a63.
I'm still looking at the implementation changes.
forRngSplitM_ nbatches rng_task $ \ !rng_batch ->
  submitIO ioctx $
    generateIOOpsBatch fd buf lastBlock batchsz rng_batch

_ <- Async.waitAnyCancel tasks
Is that right? Shouldn't all async actions finish?
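For context on the question (illustrative helpers, not the benchmark code): waitAnyCancel returns as soon as the first task completes and cancels the others, whereas waiting for every task would look like waitForAll below.

```haskell
import qualified Control.Concurrent.Async as Async

-- Wait for every task; an exception from any of them is re-thrown.
waitForAll :: [Async.Async a] -> IO [a]
waitForAll = mapM Async.wait

-- What the quoted code does: return after the first task finishes and
-- cancel the remaining ones.
waitForFirst :: [Async.Async a] -> IO a
waitForFirst tasks = snd <$> Async.waitAnyCancel tasks
```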
go :: V.MVector s (IOOp IO) -> Random.StdGen -> Int -> ST s ()
go !_ !_ !i | i == size = return ()
go !v !rng !i = do
  let (!block, !rng') = Random.uniformR (0, lastBlock) rng
Just to clarify, previously we read each block exactly once, now it's random, so some will be read multiple times, some never? That should be fine for this benchmark.
Looking good. How is running time affected?
nbatches = lastBlock `div` (ntasks * batchsz) -- batches per task
totalOps = nbatches * batchsz * ntasks
So, we might ignore a few pages, depending on whether the division rounds down?
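For illustration, with assumed values ntasks = 4 and batchsz = 32 (not necessarily the benchmark's actual settings): if lastBlock = 262144 the division is exact, so nbatches = 262144 `div` 128 = 2048 and totalOps = 2048 * 128 = 262144; if lastBlock were, say, 262200, nbatches would still round down to 2048, so 56 fewer operations than blocks would be issued.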
Do these new INLINE/NOINLINE pragmas in 6217dab change benchmark results in any way?
takeMVar iobatchCompletion

submitIO ioctx@IOCtx {ioctxBatchSizeLimit'} !ioops0 =
  -- General case. Needs multiple batches and combining results.
Or it needs no batch at all, if V.null ioops. The code handles this case correctly, but the comment only mentions the V.length ioops > ioctxBatchSizeLimit' case.
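A sketch of the case analysis under discussion (hypothetical helper names, and Int as a stand-in result type; the real submitIO differs):

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU

submitDispatch :: Int                                 -- ^ batch size limit (> 0)
               -> V.Vector op                         -- ^ operations to submit
               -> (V.Vector op -> IO (VU.Vector Int)) -- ^ submit one batch
               -> IO (VU.Vector Int)
submitDispatch limit ioops submitBatch
  | V.null ioops            = return VU.empty   -- no batch at all
  | V.length ioops <= limit = submitBatch ioops -- common single-batch case
  | otherwise               =                   -- general case: split, submit,
      VU.concat <$> mapM submitBatch (chunks ioops) -- and combine the results
  where
    chunks v
      | V.null v  = []
      | otherwise = V.take limit v : chunks (V.drop limit v)
```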
awaitIOBatches iobatchCompletions =
  VU.concat <$> mapM takeMVar (reverse iobatchCompletions)
I was giving some thought to whether it is okay that takeMVar is interruptible here. I suppose if the submitting thread gets interrupted and doesn't take the MVars, that should be fine because the MVars can be GC'ed. No explicit cleanup required.
awaitIOBatches iobatchCompletions =
  VU.concat <$> mapM takeMVar (reverse iobatchCompletions)
In terms of allocations / number of allocated objects, would reverse on the vector be better than reverse on the list?
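The two shapes being compared, for reference (Int is a stand-in for the real completion result type): reversing the list of completion MVars, versus holding them in a boxed vector and reversing that. Note that VU.concat still takes a list, so the vector variant converts back at the end.

```haskell
import           Control.Concurrent.MVar (MVar, takeMVar)
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU

-- Current shape: reverse the list, then await each batch in order.
awaitViaList :: [MVar (VU.Vector Int)] -> IO (VU.Vector Int)
awaitViaList cs = VU.concat <$> mapM takeMVar (reverse cs)

-- Alternative shape: keep the completions in a boxed vector instead.
awaitViaVector :: V.Vector (MVar (VU.Vector Int)) -> IO (VU.Vector Int)
awaitViaVector cs = VU.concat . V.toList <$> V.mapM takeMVar (V.reverse cs)
```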
-- We're called with async exceptions masked, but 'waitQSemN' can block and
-- receive exceptions. That's ok. But once we acquire the semaphore
Because waitQSemN is interruptible, right?
URing.submitIO uring

-- More async exception safety: we want to inform the completionThread
-- /if and only if/ we successfully submitted a bathc of IO. So now that
Suggested change:
- -- /if and only if/ we successfully submitted a bathc of IO. So now that
+ -- /if and only if/ we successfully submitted a batch of IO. So now that
awaitIO :: URing -> IO IOCompletion
awaitIO (URing uringptr) =
  alloca $ \cqeptrptr -> do
awaitIO !URing {uringptr, cqeptrfptr} = do
Redundant bang
Reduce allocations in submitIO and awaitIO, and adjust submitIO to be async-exception safe.

Results:
This is about 14 words (64-bit) remaining per operation, of which 6 words are the IOOp itself. This is when using batches of 32 IOOps, so some per-batch allocations are amortised over the operations.
Further improvements are probably possible, but the low-hanging fruit is now probably elsewhere. One remaining idea is to make IOOp an instance of Vector.Unbox, to allow the vector of IOOps to be an unboxed vector. This would not directly improve allocations much (only 1.5 words per op), but it might allow for reduced conversions in applications, and/or reuse of (mutable) vectors. A sketch of what such an instance could look like follows.
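Here is one way such an instance could be written, using the deriving-via support in vector >= 0.13 and a drastically simplified stand-in for IOOp (the real IOOp also carries a buffer reference and a read/write tag, which make the unboxed representation less straightforward):

```haskell
{-# LANGUAGE DerivingVia, MultiParamTypeClasses, StandaloneDeriving, TypeFamilies #-}

import qualified Data.Vector.Generic         as VG
import qualified Data.Vector.Generic.Mutable as VGM
import qualified Data.Vector.Unboxed         as VU

-- Hypothetical, simplified operation: file descriptor, byte offset, length.
data MiniOp = MiniOp !Int !Int !Int

-- Describe MiniOp's unboxed representation as a tuple of its fields.
instance VU.IsoUnbox MiniOp (Int, Int, Int) where
  toURepr (MiniOp fd off len) = (fd, off, len)
  fromURepr (fd, off, len)    = MiniOp fd off len
  {-# INLINE toURepr #-}
  {-# INLINE fromURepr #-}

-- Unboxed vectors of MiniOp are just unboxed vectors of the tuple.
newtype instance VU.MVector s MiniOp = MV_MiniOp (VU.MVector s (Int, Int, Int))
newtype instance VU.Vector    MiniOp = V_MiniOp  (VU.Vector    (Int, Int, Int))

deriving via (MiniOp `VU.As` (Int, Int, Int))
  instance VGM.MVector VU.MVector MiniOp
deriving via (MiniOp `VU.As` (Int, Int, Int))
  instance VG.Vector VU.Vector MiniOp

instance VU.Unbox MiniOp
```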