Reduce allocations #22
base: main
Conversation
This is a prelude to trying to reduce allocations. Initial result for the high-level benchmark:

Allocated total: 348802552
Allocated per:   1331

That is, 1331 bytes per IO operation. This includes allocations from the benchmark itself; indeed, profiling indicates that most of the allocations come from the benchmark itself rather than from the library.
Previously we generated a large list containing a random permutation of all the 4k disk blocks in the input file, and then broke that up into groups of 32 blocks. We used forConcurrently_ over the groups, which starts one thread per group. This approach did excessive work and allocated a lot. For a start, the random permutation is expensive in time and space: it needs to track a set of remaining choices, and it has to allocate the whole list because forConcurrently_ forces the list (of groups). And a thread per 32 blocks is a lot of threads once one uses large files: for our example target file of 262144 blocks, it amounts to 8192 threads, each with its own stack, so it is a lot of allocation in total.

The new approach is:
1. Use random block indexes rather than a permutation. This leads to duplicates but is much cheaper because it needs little state. The best solution to the duplicates will be to use O_DIRECT to avoid caching effects.
2. Only use a small number of threads, and have each one submit a series of batches of work (still of size 32). This uses much less state.

In addition, we use a lower-level approach to generating the random block indexes to reduce heap allocations.

The new allocation results for the high-level benchmark are:

Allocated total: 57904976
Allocated per:   221

That is, 221 bytes per IO op. This is without any changes yet to the actual library, just the benchmark.
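For a sense of the new shape, here is a minimal sketch (the names runTasks, submitBatch and randomBlocks are placeholders for this illustration, not the actual benchmark code): a handful of tasks, each submitting a fixed number of batches of reads at uniformly random block indexes, with the RNG split per task and per batch.

```haskell
import           Control.Monad (forM_)
import qualified Control.Concurrent.Async as Async
import qualified System.Random as Random

runTasks :: Int              -- ^ number of tasks
         -> Int              -- ^ batches per task
         -> Int              -- ^ operations per batch (e.g. 32)
         -> Int              -- ^ last valid block index
         -> Random.StdGen
         -> ([Int] -> IO ()) -- ^ submit one batch of block indexes
         -> IO ()
runTasks ntasks nbatches batchsz lastBlock rng0 submitBatch = do
    tasks <- mapM (Async.async . runTask) (take ntasks (splits rng0))
    mapM_ Async.wait tasks
  where
    -- An infinite supply of independent generators.
    splits g = let (g1, g2) = Random.split g in g1 : splits g2

    -- Each task submits a series of batches, splitting its RNG per batch.
    runTask rngTask =
      forM_ (take nbatches (splits rngTask)) $ \rngBatch ->
        submitBatch (randomBlocks batchsz lastBlock rngBatch)

    -- Random block indexes; duplicates are allowed, which is far cheaper
    -- than maintaining the state needed for a full permutation.
    randomBlocks n lb g = take n (Random.randomRs (0, lb) g)
```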
In the time and allocation profiling (using late cost centres) it helps to have things inlined or not inlined in order to get a readable structure in the profile results. This helps with identifying the source of the remaining allocations.
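For example, this is the kind of pragma placement meant here (an illustrative module, not the library's code; late cost centres come from compiling with -prof and, on GHC 9.4+, -fprof-late):

```haskell
module ProfilingSketch (hotLoop) where

-- Kept out of line: it remains a separate binding after optimisation, so the
-- late-cost-centre profile attributes its allocations to it by name.
{-# NOINLINE expensiveStep #-}
expensiveStep :: Int -> [Int]
expensiveStep n = [n, n * 2, n * 3]

-- Inlined away: it disappears from the profile, so its trivial cost is folded
-- into the caller instead of cluttering the output.
{-# INLINE tinyWrapper #-}
tinyWrapper :: Int -> Int
tinyWrapper = (+ 1)

hotLoop :: Int -> Int
hotLoop n = sum (map tinyWrapper (expensiveStep n))
```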
This tries to do two things:
1. reduce allocations and improve performance slightly;
2. improve the sync and async exception safety of submitIO.

Achieving exception safety is non-trivial and needs masking and multiple levels of exception handlers (a sketch of the general shape follows below). The one remaining difficulty is that if we get a sync exception from the lower-level submit operation then we do not yet reset the ring to clear the SQEs that we filled out.
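A minimal sketch of the kind of structure this needs, assuming a QSemN guards the submission capacity (hypothetical names, not the library's actual submitIO):

```haskell
import Control.Concurrent.QSemN (QSemN, signalQSemN, waitQSemN)
import Control.Exception        (mask_, onException)

submitBatchSketch :: QSemN -> Int -> IO () -> IO r -> IO r
submitBatchSketch qsem nops lowLevelSubmit awaitCompletion =
    mask_ $ do
      -- waitQSemN is interruptible, so an async exception here is fine:
      -- nothing has been acquired yet.
      waitQSemN qsem nops
      -- From here on, a failure in the low-level submit must hand the
      -- capacity back, otherwise it leaks.
      lowLevelSubmit `onException` signalQSemN qsem nops
      -- Capacity for successfully submitted IO is returned later by the
      -- completion path, so only the wait itself remains here.
      awaitCompletion
```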
High-level benchmark allocations are down to:

Allocated total: 28566920
Allocated per:   109

That is just under 14 words per operation, which isn't too bad, at least when using 32 operations per batch (many of the allocations are per-batch).
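As a rough cross-check (taking roughly 262144 operations per run, as with the example file above): 28566920 bytes / 262144 ops ≈ 109 bytes per op, and 109 bytes / 8 bytes per word ≈ 13.6 words per op.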
Force-pushed the branch from 12f05d4 to ad3fd1e.
They need GADTs or TypeFamilies enabled in a module to solve constraints arising from other modules.
Force-pushed the branch from ad3fd1e to 50a8a63.
I'm still looking at the implementation changes.
forRngSplitM_ nbatches rng_task $ \ !rng_batch ->
  submitIO ioctx $
    generateIOOpsBatch fd buf lastBlock batchsz rng_batch

_ <- Async.waitAnyCancel tasks
Is that right? Shouldn't all async actions finish?
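For context on the question (illustrative helpers, not the benchmark code): waitAnyCancel returns as soon as the first task completes and cancels the others, whereas waiting for every task would look like waitForAll below.

```haskell
import qualified Control.Concurrent.Async as Async

-- Wait for every task; an exception from any of them is re-thrown.
waitForAll :: [Async.Async a] -> IO [a]
waitForAll = mapM Async.wait

-- What the quoted code does: return after the first task finishes and
-- cancel the remaining ones.
waitForFirst :: [Async.Async a] -> IO a
waitForFirst tasks = snd <$> Async.waitAnyCancel tasks
```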
go :: V.MVector s (IOOp IO) -> Random.StdGen -> Int -> ST s ()
go !_ !_ !i | i == size = return ()
go !v !rng !i = do
  let (!block, !rng') = Random.uniformR (0, lastBlock) rng
Just to clarify, previously we read each block exactly once, now it's random, so some will be read multiple times, some never? That should be fine for this benchmark.
Looking good. How is running time affected?
nbatches = lastBlock `div` (ntasks * batchsz) -- batches per task
totalOps = nbatches * batchsz * ntasks
So, we might ignore a few pages, depending on whether the division rounds down?
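For illustration, with assumed values ntasks = 4 and batchsz = 32 (not necessarily the benchmark's actual settings): if lastBlock = 262144 the division is exact, so nbatches = 262144 `div` 128 = 2048 and totalOps = 2048 * 128 = 262144; if lastBlock were, say, 262200, nbatches would still round down to 2048, so 56 fewer operations than blocks would be issued.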
Do these new INLINE/NOINLINE pragmas in 6217dab change benchmark results in any way?
takeMVar iobatchCompletion

submitIO ioctx@IOCtx {ioctxBatchSizeLimit'} !ioops0 =
  -- General case. Needs multiple batches and combining results.
Or it needs no batch at all, if V.null ioops. The code handles this case correctly, but the comment only mentions the V.length ioops > ioctxBatchSizeLimit' case.
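A sketch of the case analysis under discussion (hypothetical helper names, and Int as a stand-in result type; the real submitIO differs):

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU

submitDispatch :: Int                                 -- ^ batch size limit (> 0)
               -> V.Vector op                         -- ^ operations to submit
               -> (V.Vector op -> IO (VU.Vector Int)) -- ^ submit one batch
               -> IO (VU.Vector Int)
submitDispatch limit ioops submitBatch
  | V.null ioops            = return VU.empty   -- no batch at all
  | V.length ioops <= limit = submitBatch ioops -- common single-batch case
  | otherwise               =                   -- general case: split, submit,
      VU.concat <$> mapM submitBatch (chunks ioops) -- and combine the results
  where
    chunks v
      | V.null v  = []
      | otherwise = V.take limit v : chunks (V.drop limit v)
```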
awaitIOBatches iobatchCompletions =
  VU.concat <$> mapM takeMVar (reverse iobatchCompletions)
I was giving some thought to whether it is okay that takeMVar is interruptible here. I suppose if the submitting thread gets interrupted and doesn't take the MVars, that should be fine because the MVars can be GC'ed. No explicit cleanup required.
awaitIOBatches iobatchCompletions =
  VU.concat <$> mapM takeMVar (reverse iobatchCompletions)
In terms of allocations / number of allocated objects, would reverse on the vector be better than reverse on the list?
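The two shapes being compared, for reference (Int is a stand-in for the real completion result type): reversing the list of completion MVars, versus holding them in a boxed vector and reversing that. Note that VU.concat still takes a list, so the vector variant converts back at the end.

```haskell
import           Control.Concurrent.MVar (MVar, takeMVar)
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU

-- Current shape: reverse the list, then await each batch in order.
awaitViaList :: [MVar (VU.Vector Int)] -> IO (VU.Vector Int)
awaitViaList cs = VU.concat <$> mapM takeMVar (reverse cs)

-- Alternative shape: keep the completions in a boxed vector instead.
awaitViaVector :: V.Vector (MVar (VU.Vector Int)) -> IO (VU.Vector Int)
awaitViaVector cs = VU.concat . V.toList <$> V.mapM takeMVar (V.reverse cs)
```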
-- We're called with async exceptions masked, but 'waitQSemN' can block and
-- receive exceptions. That's ok. But once we acquire the semaphore
Because waitQSemN is interruptible, right?
URing.submitIO uring

-- More async exception safety: we want to inform the completionThread
-- /if and only if/ we successfully submitted a bathc of IO. So now that
Suggested change:
- -- /if and only if/ we successfully submitted a bathc of IO. So now that
+ -- /if and only if/ we successfully submitted a batch of IO. So now that
awaitIO :: URing -> IO IOCompletion
awaitIO (URing uringptr) =
  alloca $ \cqeptrptr -> do
awaitIO !URing {uringptr, cqeptrfptr} = do
Redundant bang
Reduce allocations in submitIO and awaitIO, and adjust submitIO to be async-exception safe.

Results:
This is about 14 words (64-bit) remaining per operation, of which 6 words are the IOOp itself. This is when using batches of 32 IOOps, so some per-batch allocations are amortised over the operations.
Further improvements are probably possible, but the low-hanging fruit is now probably elsewhere. One remaining idea is to make IOOp an instance of Vector.Unbox, to allow the vector of IOOps to be an unboxed vector. This would not directly improve allocations much (only 1.5 words per op), but it might allow for reduced conversions in applications, and/or reuse of (mutable) vectors. A sketch of what such an instance could look like follows.
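Here is one way such an instance could be written, using the deriving-via support in vector >= 0.13 and a drastically simplified stand-in for IOOp (the real IOOp also carries a buffer reference and a read/write tag, which make the unboxed representation less straightforward):

```haskell
{-# LANGUAGE DerivingVia, MultiParamTypeClasses, StandaloneDeriving, TypeFamilies #-}

import qualified Data.Vector.Generic         as VG
import qualified Data.Vector.Generic.Mutable as VGM
import qualified Data.Vector.Unboxed         as VU

-- Hypothetical, simplified operation: file descriptor, byte offset, length.
data MiniOp = MiniOp !Int !Int !Int

-- Describe MiniOp's unboxed representation as a tuple of its fields.
instance VU.IsoUnbox MiniOp (Int, Int, Int) where
  toURepr (MiniOp fd off len) = (fd, off, len)
  fromURepr (fd, off, len)    = MiniOp fd off len
  {-# INLINE toURepr #-}
  {-# INLINE fromURepr #-}

-- Unboxed vectors of MiniOp are just unboxed vectors of the tuple.
newtype instance VU.MVector s MiniOp = MV_MiniOp (VU.MVector s (Int, Int, Int))
newtype instance VU.Vector    MiniOp = V_MiniOp  (VU.Vector    (Int, Int, Int))

deriving via (MiniOp `VU.As` (Int, Int, Int))
  instance VGM.MVector VU.MVector MiniOp
deriving via (MiniOp `VU.As` (Int, Int, Int))
  instance VG.Vector VU.Vector MiniOp

instance VU.Unbox MiniOp
```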