Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fusible IntSet.fromDistinctAscList definition #951

Open
meooow25 opened this issue Jun 18, 2023 · 3 comments
Open

Fusible IntSet.fromDistinctAscList definition #951

meooow25 opened this issue Jun 18, 2023 · 3 comments

Comments

@meooow25
Copy link
Contributor

Sister issue to #949. I adapted the old algorithm from #658 into a fold to see if it improves things for IntSet like it did for Set.

fromDistinctAscList :: [Key] -> IntSet
fromDistinctAscList = finish . Foldable.foldl' next StateEmpty
  where
    next StateEmpty z = State (prefixOf z) (bitmapOf z) Nada
    next (State kx bm stk) z
      | kx == pz  = State pz (bm .|. bmz) stk
      | otherwise = State pz bmz (linkTop (branchMask z kx) kx (Tip kx bm) stk)
      where
        pz = prefixOf z
        bmz = bitmapOf z
    finish StateEmpty = Nil
    finish (State kx bm stk0) = linkAll kx (Tip kx bm) stk0
{-# INLINE fromDistinctAscList #-}

linkTop :: Mask -> Prefix -> IntSet -> Stack -> Stack
linkTop !m !px !tx (Push py ty stk)
  | shorter m mxy = linkTop m pxy (Bin pxy mxy ty tx) stk
  where
    mxy = branchMask px py
    pxy = mask px mxy
linkTop _ px tx stk = Push px tx stk
{-# INLINABLE linkTop #-}

linkAll :: Prefix -> IntSet -> Stack -> IntSet
linkAll !_ tx Nada = tx
linkAll px tx (Push py ty stk)
  | m < 0     = Bin p m tx ty
  | otherwise = linkAll p (Bin p m ty tx) stk
  where
    m = branchMask px py
    p = mask px m
{-# INLINABLE linkAll #-}

data FromDistinctMonoState = State {-# UNPACK #-} !Prefix {-# UNPACK #-} !BitMap !Stack | StateEmpty
data Stack = Push {-# UNPACK #-} !Prefix !IntSet !Stack | Nada

Benchmarking 4 scenarios

  bench "fromDistinctAscList:dense" $ whnf IS.fromDistinctAscList elems -- elems = [1..n]
, bench "fromDistinctAscList:dense:fusion" $
    whnf (\n -> IS.fromDistinctAscList [1..n]) bound
, bench "fromDistinctAscList:sparse" $ whnf IS.fromDistinctAscList elems_sparse -- elems_sparse = map (*64) [1..n]
, bench "fromDistinctAscList:sparse:fusion" $
    whnf (\n -> IS.fromDistinctAscList (map (*64) [1..n])) bound

GHC 9.2.5, current

  fromDistinctAscList:dense:         OK (0.22s)
    13.6 μs ± 701 ns, 4.0 KB allocated,   3 B  copied, 7.0 MB peak memory
  fromDistinctAscList:dense:fusion:  OK (0.20s)
    24.3 μs ± 1.5 μs, 291 KB allocated, 150 B  copied, 7.0 MB peak memory
  fromDistinctAscList:sparse:        OK (0.20s)
    52.2 μs ± 3.3 μs, 256 KB allocated,  16 KB copied, 7.0 MB peak memory
  fromDistinctAscList:sparse:fusion: OK (0.12s)
    68.3 μs ± 6.1 μs, 541 KB allocated,  17 KB copied, 7.0 MB peak memory

GHC 9.2.5, new

  fromDistinctAscList:dense:         OK (0.23s)
    14.0 μs ± 1.1 μs, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory,       same as baseline
  fromDistinctAscList:dense:fusion:  OK (0.15s)
    4.49 μs ± 349 ns, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory, 81% less than baseline
  fromDistinctAscList:sparse:        OK (0.14s)
    70.2 μs ± 5.5 μs, 383 KB allocated,  12 KB copied, 7.0 MB peak memory, 34% more than baseline
  fromDistinctAscList:sparse:fusion: OK (0.13s)
    60.9 μs ± 5.3 μs, 382 KB allocated,  12 KB copied, 7.0 MB peak memory, 10% less than baseline

GHC 9.6.2, current

  fromDistinctAscList:dense:         OK (0.13s)
    17.0 μs ± 1.5 μs, 4.0 KB allocated,   4 B  copied, 7.0 MB peak memory
  fromDistinctAscList:dense:fusion:  OK (0.21s)
    24.7 μs ± 1.5 μs, 291 KB allocated, 150 B  copied, 7.0 MB peak memory
  fromDistinctAscList:sparse:        OK (0.19s)
    45.9 μs ± 2.9 μs, 256 KB allocated,  16 KB copied, 7.0 MB peak memory
  fromDistinctAscList:sparse:fusion: OK (0.26s)
    62.3 μs ± 2.7 μs, 542 KB allocated,  17 KB copied, 7.0 MB peak memory

GHC 9.6.2, new

  fromDistinctAscList:dense:         OK (0.29s)
    16.9 μs ± 766 ns, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory,       same as baseline
  fromDistinctAscList:dense:fusion:  OK (0.14s)
    4.19 μs ± 361 ns, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory, 83% less than baseline
  fromDistinctAscList:sparse:        OK (0.22s)
    51.8 μs ± 2.8 μs, 383 KB allocated,  12 KB copied, 7.0 MB peak memory, 12% more than baseline
  fromDistinctAscList:sparse:fusion: OK (0.19s)
    46.0 μs ± 3.0 μs, 383 KB allocated,  12 KB copied, 7.0 MB peak memory, 26% less than baseline

As expected

  • It is more efficient in both time and memory when there is fusion
  • It is less efficient in memory when there is no fusion

However, it is worse in time for fromDistinctAscList:sparse, which is disappointing.
I'll try to figure out if it can be improved. Happy to hear if you have any ideas for that.

If it does not improve, I think this is still not a bad idea. The improvement on fromDistinctAscList:dense:fusion looks great. We could use rewrite rules to try to fuse and fall back to the current implementation, as we considered in #949.

@meooow25
Copy link
Contributor Author

meooow25 commented Aug 5, 2023

I tweaked the implementation to store masks on the stack instead of prefixes, saving some calculations since the mask is what is compared every time in linkTop.

With it, the new timings are

GHC 9.2.5, new

  fromDistinctAscList:dense:         OK (0.21s)
    13.2 μs ± 657 ns, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory,       same as baseline
  fromDistinctAscList:dense:fusion:  OK (0.14s)
    4.24 μs ± 396 ns, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory, 82% less than baseline
  fromDistinctAscList:sparse:        OK (0.22s)
    55.3 μs ± 3.1 μs, 383 KB allocated,  12 KB copied, 7.0 MB peak memory,  7% more than baseline
  fromDistinctAscList:sparse:fusion: OK (0.17s)
    45.8 μs ± 2.9 μs, 384 KB allocated,  12 KB copied, 7.0 MB peak memory, 32% less than baseline

GHC 9.6.2, new

  fromDistinctAscList:dense:         OK (0.14s)
    16.4 μs ± 1.4 μs, 6.0 KB allocated,   4 B  copied, 7.0 MB peak memory,       same as baseline
  fromDistinctAscList:dense:fusion:  OK (0.26s)
    4.07 μs ± 177 ns, 6.0 KB allocated,   3 B  copied, 7.0 MB peak memory, 83% less than baseline
  fromDistinctAscList:sparse:        OK (0.17s)
    40.4 μs ± 2.8 μs, 383 KB allocated,  12 KB copied, 7.0 MB peak memory, 11% less than baseline
  fromDistinctAscList:sparse:fusion: OK (0.15s)
    34.9 μs ± 3.1 μs, 383 KB allocated,  12 KB copied, 7.0 MB peak memory, 43% less than baseline

Now it always takes less time vs current on 9.6.2, even for fromDistinctAscList:sparse 🎉
Though on 9.2.5 it's still slightly slower.

The implementation
fromDistinctAscList :: [Key] -> IntSet
fromDistinctAscList = finish . Foldable.foldl' next StateEmpty
  where
    next st x = case st of
      StateEmpty -> State px bx Nada
      State py by stk
        | px == py  -> State px (by .|. bx) stk
        | otherwise -> State px bx (linkTop (branchMask px py) py (Tip py by) stk)
      where
        px = prefixOf x
        bx = bitmapOf x
    finish StateEmpty = Nil
    finish (State pr br stk) = linkAll pr (Tip pr br) stk
{-# INLINE fromDistinctAscList #-}

linkTop :: Mask -> Prefix -> IntSet -> Stack -> Stack
linkTop !m !pr !tr (Push mlr tl stk)
  | shorter m mlr = linkTop m pr (Bin (mask pr mlr) mlr tl tr) stk
linkTop m _ tl stk = Push m tl stk

linkAll :: Prefix -> IntSet -> Stack -> IntSet
linkAll !_ !tr Nada = tr
linkAll pr tr (Push mlr tl stk)
  | mlr < 0   = Bin 0 mlr tr tl
  | otherwise = linkAll pr (Bin (mask pr mlr) mlr tl tr) stk

data FromDistinctMonoState = State {-# UNPACK #-} !Prefix {-# UNPACK #-} !BitMap !Stack | StateEmpty
data Stack = Push {-# UNPACK #-} !Mask !IntSet !Stack | Nada

@treeowl
Copy link
Contributor

treeowl commented Oct 13, 2023

Do you want to open a PR?

@meooow25
Copy link
Contributor Author

I do, but I would prefer to be done with the similar Set/Map stuff before getting to this properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants