Skip to content

Commit

Permalink
Fix Churn No Timeout Test Failure
Browse files Browse the repository at this point in the history
  • Loading branch information
bolt12 committed Feb 19, 2025
1 parent d32c7ed commit 79dd002
Show file tree
Hide file tree
Showing 6 changed files with 219 additions and 101 deletions.
65 changes: 65 additions & 0 deletions log.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
Build profile: -w ghc-9.8.2 -O1
In order, the following will be built (use -v for more details):
- ouroboros-network-0.19.0.2 (lib) (file src/Ouroboros/Network/PeerSelection/Churn.hs changed)
- ouroboros-network-0.19.0.2 (lib:sim-tests-lib) (dependency rebuilt)
- ouroboros-network-0.19.0.2 (test:sim-tests) (dependency rebuilt)
Preprocessing library for ouroboros-network-0.19.0.2...
Building library for ouroboros-network-0.19.0.2...
[54 of 66] Compiling Ouroboros.Network.PeerSelection.Churn ( src/Ouroboros/Network/PeerSelection/Churn.hs, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/build/Ouroboros/Network/PeerSelection/Churn.o, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/build/Ouroboros/Network/PeerSelection/Churn.dyn_o ) [Source file changed]
[57 of 66] Compiling Ouroboros.Cardano.PeerSelection.Churn ( src/Ouroboros/Cardano/PeerSelection/Churn.hs, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/build/Ouroboros/Cardano/PeerSelection/Churn.o, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/build/Ouroboros/Cardano/PeerSelection/Churn.dyn_o ) [Source file changed]
Preprocessing library 'sim-tests-lib' for ouroboros-network-0.19.0.2...
Building library 'sim-tests-lib' for ouroboros-network-0.19.0.2...
[29 of 32] Compiling Test.Ouroboros.Network.Diffusion.Testnet.Cardano ( sim-tests-lib/Test/Ouroboros/Network/Diffusion/Testnet/Cardano.hs, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/l/sim-tests-lib/build/sim-tests-lib/Test/Ouroboros/Network/Diffusion/Testnet/Cardano.o, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/l/sim-tests-lib/build/sim-tests-lib/Test/Ouroboros/Network/Diffusion/Testnet/Cardano.dyn_o ) [Source file changed]
Preprocessing test suite 'sim-tests' for ouroboros-network-0.19.0.2...
Building test suite 'sim-tests' for ouroboros-network-0.19.0.2...
[1 of 1] Compiling Main ( sim-tests/Main.hs, /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/t/sim-tests/build/sim-tests/sim-tests-tmp/Main.o ) [Test.Ouroboros.Network.Diffusion.Testnet.Cardano changed]
[2 of 2] Linking /home/bolt/Desktop/Bolt/UMinho/Profissional/Well-Typed/Projects/IOHK/ouroboros-network/dist-newstyle/build/x86_64-linux/ghc-9.8.2/ouroboros-network-0.19.0.2/t/sim-tests/build/sim-tests/sim-tests [Objects changed]
ouroboros-network:sim-tests
Ouroboros.Network.Testnet
IOSim
Churn
no timeouts: OK (32.90s)
+++ OK, passed 100 tests:
57% sim args: SimArgs 1s 10
43% sim args: SimArgs 1s 6

57% Nº nodes: 2
43% Nº nodes: 3

45% Nº nodes in InitiatorOnlyDiffusionMode: 0
44% Nº nodes in InitiatorOnlyDiffusionMode: 1
11% Nº nodes in InitiatorOnlyDiffusionMode: 2

19% Nº active peers: 4
16% Nº active peers: 6
15% Nº active peers: 5
14% Nº active peers: 3
9% Nº active peers: 2
5% Nº active peers: 8
4% Nº active peers: 0
4% Nº active peers: 1
4% Nº active peers: 7
4% Nº active peers: 9
2% Nº active peers: 10
2% Nº active peers: 12
1% Nº active peers: 11
1% Nº active peers: 15

28% Nº active big ledger peers: 3
14% Nº active big ledger peers: 4
12% Nº active big ledger peers: 6
11% Nº active big ledger peers: 2
11% Nº active big ledger peers: 5
8% Nº active big ledger peers: 7
5% Nº active big ledger peers: 0
4% Nº active big ledger peers: 1
4% Nº active big ledger peers: 9
2% Nº active big ledger peers: 10
1% Nº active big ledger peers: 11

62% average number of active local roots: 1.0
18% average number of active local roots: 1.6666666
15% average number of active local roots: 1.3333334
5% average number of active local roots: 2.0

All 1 tests passed (32.90s)
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ module Test.Ouroboros.Network.Data.Signal
-- * Set-based temporal operations
, keyedTimeout
, keyedLinger
, keyedLinger'
, keyedUntil
) where

Expand Down Expand Up @@ -303,6 +304,18 @@ until start stop =
(const False)


-- | Make a signal that keeps track of recent activity, based on observing an
-- underlying signal.
--
-- This is based on keyedLinger' function but uses a constant timeout value.
--
keyedLinger :: forall a b. Ord b
=> DiffTime
-> (a -> Set b) -- ^ The activity set signal
-> Signal a
-> Signal (Set b)
keyedLinger d arm = keyedLinger' (fmap (\x -> (x, d)) arm)

-- | Make a signal that keeps track of recent activity, based on observing an
-- underlying signal.
--
Expand All @@ -322,20 +335,19 @@ until start stop =
-- with all promotion opportunities and all the failed attempts and discard
-- those. This allow us to correctly identify valid promotion opportunities.
--
keyedLinger :: forall a b. Ord b
=> DiffTime
-> (a -> Set b) -- ^ The activity set signal
keyedLinger' :: forall a b. Ord b
=> (a -> (Set b, DiffTime)) -- ^ The activity set signal
-> Signal a
-> Signal (Set b)
keyedLinger d arm =
keyedLinger' arm =
Signal Set.empty
. go Set.empty PSQ.empty
. toTimeSeries
. fmap arm
where
go :: Set b
-> OrdPSQ b Time ()
-> [E (Set b)]
-> [E (Set b, DiffTime)]
-> [E (Set b)]
go !_ !_ [] = []

Expand All @@ -348,9 +360,9 @@ keyedLinger d arm =
= E (TS t' 0) lingerSet' : go lingerSet' lingerPSQ'' (E ts xs : txs)

go !lingerSet !lingerPSQ (E ts@(TS t _) x : txs) =
let lingerSet' = lingerSet <> x
t' = addTime d t
lingerPSQ' = Set.foldl' (\s y -> PSQ.insert y t' () s) lingerPSQ x
let lingerSet' = lingerSet <> fst x
t' = addTime (snd x) t
lingerPSQ' = Set.foldl' (\s y -> PSQ.insert y t' () s) lingerPSQ (fst x)
in if lingerSet' /= lingerSet
then E ts lingerSet' : go lingerSet' lingerPSQ' txs
else go lingerSet' lingerPSQ' txs
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4029,41 +4029,82 @@ unit_peer_sharing =
-- | This property verifies that when nodes are running without network
-- attenuation, decreasing numbers by churn never timeouts.
--
--
-- This test revealed a scenario where the governor can become blocked due to
-- the way `EstablishedPeers` enforces connection constraints. Specifically,
-- each `EstablishedPeers` action has a guard that prevents further peer
-- selection until a new connection can be established. In a real-world
-- network with many peers, this is unlikely to be an issue. However, in our
-- controlled test environment with a limited number of peers, all of them may
-- be in a timeout state. Because the governor selects the peer with the
-- minimum `connectTime`, and exponential backoff increases this value
-- significantly after multiple failed attempts, the governor can end up
-- blocking for an extended period (e.g., ~150s). This blocking behavior
-- disrupts churn, as the governor cannot proceed with peer selection while
-- waiting. To address this, the test takes into consideration when such a
-- blocking occurs and identifies whether the governor is stuck due to a high
-- `minConnectTime`, and bypass the churn timeout in such cases. While this is
-- a workaround, it ensures the test remains meaningful without being
-- invalidated by an artificial limitation of the test environment.
--
prop_churn_notimeouts :: SimTrace Void
-> Int
-> Property
prop_churn_notimeouts ioSimTrace traceNumber =
let events :: [Events DiffusionTestTrace]
events = Trace.toList
. fmap ( Signal.eventsFromList
. fmap (\(WithName _ (WithTime t b)) -> (t, b))
)
. splitWithNameTrace
. fmap (\(WithTime t (WithName name b)) -> WithName name (WithTime t b))
. withTimeNameTraceEvents
@DiffusionTestTrace
@NtNAddr
. Trace.take traceNumber
$ ioSimTrace
. fmap ( Signal.eventsFromList
. fmap (\(WithName _ (WithTime t b)) -> (t, b))
)
. splitWithNameTrace
. fmap (\(WithTime t (WithName name b)) -> WithName name (WithTime t b))
. withTimeNameTraceEvents
@DiffusionTestTrace
@NtNAddr
. Trace.take traceNumber
$ ioSimTrace
in conjoin
$ (\evs ->
let evsList :: [TracePeerSelection Cardano.ExtraState PeerTrustable (Cardano.ExtraPeers NtNAddr) NtNAddr]
evsList = snd <$> eventsToList (selectDiffusionPeerSelectionEvents evs)
in property $ counterexample (List.intercalate "\n" (show <$> eventsToList evs))
$ all noChurnTimeout evsList
let psSig :: Signal (TracePeerSelection Cardano.ExtraState PeerTrustable (Cardano.ExtraPeers NtNAddr) NtNAddr)
psSig = fromChangeEvents TraceGovernorWakeup (selectDiffusionPeerSelectionEvents evs)

-- We have 'True' when the governor is blocked
-- waiting for peers to be available to connect.
isGovernorStuck :: Signal Bool
isGovernorStuck = fmap (not . Set.null)
. keyedLinger'
(\d -> if d > 0
then (Set.singleton (), d)
else (Set.empty, d)
)
. Signal.fromChangeEvents 0
. Signal.selectEvents
(\case
DiffusionDebugPeerSelectionTrace (TraceGovernorState _ (Just dt) _)
-> Just dt
_ -> Nothing)
$ evs

-- We can't decrease peers that are in progress sets
noChurnTimeoutSig :: Signal Bool
noChurnTimeoutSig =
(\peerSelection ->
case peerSelection of
TraceChurnTimeout _ DecreasedActivePeers _ -> False
TraceChurnTimeout _ DecreasedActiveBigLedgerPeers _ -> False
TraceChurnTimeout _ DecreasedEstablishedPeers _ -> False
TraceChurnTimeout _ DecreasedEstablishedBigLedgerPeers _ -> False
TraceChurnTimeout _ DecreasedKnownPeers _ -> False
TraceChurnTimeout _ DecreasedKnownBigLedgerPeers _ -> False
_ -> True
) <$> psSig

in signalProperty 20 show
(\(churnTimeout, governorStuck)
-> governorStuck || churnTimeout)
((,) <$> noChurnTimeoutSig <*> isGovernorStuck)
)
<$> events
where
noChurnTimeout :: TracePeerSelection extraDebugState extraFlags extraPeers NtNAddr -> Bool
noChurnTimeout (TraceChurnTimeout _ DecreasedActivePeers _) = False
noChurnTimeout (TraceChurnTimeout _ DecreasedActiveBigLedgerPeers _) = False
noChurnTimeout (TraceChurnTimeout _ DecreasedEstablishedPeers _) = False
noChurnTimeout (TraceChurnTimeout _ DecreasedEstablishedBigLedgerPeers _) = False
noChurnTimeout (TraceChurnTimeout _ DecreasedKnownPeers _) = False
noChurnTimeout (TraceChurnTimeout _ DecreasedKnownBigLedgerPeers _) = False
noChurnTimeout TraceChurnTimeout {} = True
noChurnTimeout _ = True


-- | Verify that churn trace consists of repeated list of actions:
--
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ data ExtraState =
--
, minNumberOfBigLedgerPeers :: NumberOfBigLedgerPeers
}
deriving Show
deriving (Eq, Show)

empty :: ConsensusMode -> NumberOfBigLedgerPeers -> ExtraState
empty cm minActiveBigLedgerPeers =
Expand Down
68 changes: 34 additions & 34 deletions ouroboros-network/src/Ouroboros/Cardano/PeerSelection/Churn.hs
Original file line number Diff line number Diff line change
Expand Up @@ -486,10 +486,17 @@ peerChurnGovernor PeerChurnArgs {
churnMode <- atomically updateChurnMode
traceWith tracer $ TraceChurnMode churnMode

-- Purge the worst active big ledger peers.
updateTargets DecreasedActiveBigLedgerPeers
numberOfActiveBigLedgerPeers
deactivateTimeout
decreaseActiveBigLedgerPeers
checkActiveBigLedgerPeersDecreased

-- Purge the worst active peers.
updateTargets DecreasedActivePeers
numberOfActivePeers
deactivateTimeout -- chainsync might timeout after 5mins
deactivateTimeout
decreaseActivePeers
checkActivePeersDecreased

Expand All @@ -500,33 +507,26 @@ peerChurnGovernor PeerChurnArgs {
increaseActivePeers
checkActivePeersIncreased

-- Purge the worst active big ledger peers.
updateTargets DecreasedActiveBigLedgerPeers
numberOfActiveBigLedgerPeers
deactivateTimeout -- chainsync might timeout after 5mins
decreaseActiveBigLedgerPeers
(checkActiveBigLedgerPeersDecreased)

-- Pick new active big ledger peers.
updateTargets IncreasedActiveBigLedgerPeers
numberOfActiveBigLedgerPeers
shortTimeout
increaseActiveBigLedgerPeers
checkActiveBigLedgerPeersIncreased
-- Forget the worst performing established big ledger peers.
updateTargets DecreasedEstablishedBigLedgerPeers
numberOfEstablishedBigLedgerPeers
(1 + closeConnectionTimeout)
decreaseEstablishedBigLedgerPeers
checkEstablishedBigLedgerPeersDecreased

-- Forget the worst performing established peers.
updateTargets DecreasedEstablishedPeers
numberOfEstablishedPeers
(1 + closeConnectionTimeout)
decreaseEstablishedPeers
(checkEstablishedPeersDecreased)
checkEstablishedPeersDecreased

-- Forget the worst performing established big ledger peers.
updateTargets DecreasedEstablishedBigLedgerPeers
numberOfEstablishedBigLedgerPeers
(1 + closeConnectionTimeout)
decreaseEstablishedBigLedgerPeers
checkEstablishedBigLedgerPeersDecreased
-- Forget the worst performing known big ledger peers.
updateTargets DecreasedKnownBigLedgerPeers
numberOfKnownBigLedgerPeers
shortTimeout
decreaseKnownBigLedgerPeers
checkKnownBigLedgerPeersDecreased

-- Forget the worst performing known peers (root peers, ledger peers)
updateTargets DecreasedKnownPeers
Expand All @@ -542,33 +542,33 @@ peerChurnGovernor PeerChurnArgs {
increaseKnownPeers
checkKnownPeersIncreased

-- Forget the worst performing known big ledger peers.
updateTargets DecreasedKnownBigLedgerPeers
numberOfKnownBigLedgerPeers
shortTimeout
decreaseKnownBigLedgerPeers
checkKnownBigLedgerPeersDecreased

-- Pick new known big ledger peers
updateTargets IncreasedKnownBigLedgerPeers
numberOfKnownBigLedgerPeers
(2 * requestPeersTimeout + shortTimeout)
increaseKnownBigLedgerPeers
checkKnownBigLedgerPeersIncreased

-- Pick new non-active big ledger peers
updateTargets IncreasedEstablishedBigLedgerPeers
numberOfEstablishedBigLedgerPeers
churnEstablishConnectionTimeout
increaseEstablishedBigLedgerPeers
checkEstablishedBigLedgerPeersIncreased

-- Pick new non-active peers
updateTargets IncreasedEstablishedPeers
numberOfEstablishedPeers
churnEstablishConnectionTimeout
increaseEstablishedPeers
checkEstablishedPeersIncreased

-- Pick new non-active big ledger peers
updateTargets IncreasedEstablishedBigLedgerPeers
numberOfEstablishedBigLedgerPeers
churnEstablishConnectionTimeout
increaseEstablishedBigLedgerPeers
checkEstablishedBigLedgerPeersIncreased
-- Pick new active big ledger peers.
updateTargets IncreasedActiveBigLedgerPeers
numberOfActiveBigLedgerPeers
shortTimeout
increaseActiveBigLedgerPeers
checkActiveBigLedgerPeersIncreased

endTs <- getMonotonicTime

Expand Down
Loading

0 comments on commit 79dd002

Please sign in to comment.