tests: flaky tests fixes #5487

algorandskiy · 2023-06-22T00:37:39Z

Summary

Fix libsodium copying when re-running libsodium build. This caused libsodium-fork/libsodium-fork nested directory for example in this path: ./crypto/copies/darwin/amd64/libsodium-fork/libsodium-fork/src/libsodium/crypto_sign/ed25519/ref10. This prevents from making changes in libsodium sources and compiling it with make like make crypto/libs/darwin/amd64/lib/libsodium.a.
Fix TestLedgerErrorValidate that fails when ledger flushes accounts between EnsureValidatedBlock and EnsureBlock for the same round. This causes very specific lookup error that comes from evaluator. Fix handles this specific error.
Fix TestSrvSort. Sometimes random generator draws values 0, 1 for uniform between [0, 204] and breaks test assumption. Retry up to 3 times in this case.

Test Plan

Tested manually - no nested dup dirs anymore.
Cannot repro locally.
Reproducible with go test ./tools/network/dnssec -run TestSrvSort -count 100 and gone after the fix.

codecov · 2023-06-22T00:56:17Z

Codecov Report

Merging #5487 (63cf2c2) into master (b06de44) will increase coverage by 0.29%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5487      +/-   ##
==========================================
+ Coverage   55.41%   55.70%   +0.29%     
==========================================
  Files         446      446              
  Lines       63289    63289              
==========================================
+ Hits        35069    35253     +184     
+ Misses      25826    25664     -162     
+ Partials     2394     2372      -22

see 37 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Makefile

tools/network/dnssec/sort_test.go

bbroder-algo · 2023-06-22T14:31:13Z

tools/network/dnssec/sort_test.go

+		}
+		retryCounter++
+		goto retry
+	}
 	a.Equal(net.SRV{Priority: 1, Weight: 200}, *arr[0])


The error seems to be here, in faulty expectations about who the first value of a randomized sort is going to be, rather than in not retrying it enough times? I don't understand the point of this test or the fix.

true, but most of the cases (202/204) we expect this entry to be at the very beginning of the randomized list according to its weight. Happy to change to better scenario

maybe set the weight to a million? and always run it ten times, and assert the large weight is only out of place at most once.

This should likely be rewritten with Rapid at some point - can generate the values with appropriate constraints defined.

and maybe run one array with just with weights 2 and 1 and run it a hundred times and put sigma expectation bars around the count of the unique orderings.

I do not want waste time on this minor test with fancy non-obvious conditions. I guess having a w=1M and re-running it once is good enough. I'll update to reflect this.

oh, weight is 16 bit, well

3 attempts with max uint16

gmalouf · 2023-06-22T17:33:03Z

tools/network/dnssec/sort_test.go

 	srvRecArray(arr).sortAndRand()
-	a.Equal(net.SRV{Priority: 1, Weight: 200}, *arr[0])
+	if (*arr[0] != net.SRV{Priority: 1, Weight: 0xFFFF}) {
+		// there is a small change that a random number from 0 to max uint15 would be 0 or 1


I'm going to approve in the interest of fixing tests, but why not just limit the range of the random number to be greater than 1?

the spec requires to pick a random number from 0 to sum of all weights to be ordered... I guess it is better to be more-less spec compliant.

algonautshant · 2023-06-22T16:22:28Z

data/ledger_test.go

@@ -629,6 +632,32 @@ func TestLedgerErrorValidate(t *testing.T) {
 			case <-expectedMessages:
 				// only debug messages should be reported
 			case um := <-unexpectedMessages:
+				if strings.Contains(um, "before dbRound") {


ledger_test.go:632: Error Trace: /Users/shantkarakashian/go/src/github.com/algorand/go-algorand/data/ledger_test.go:632 Error: Should be empty, but was data.EnsureBlock: could not write block 1926 to the ledger: round 1925 before dbRound 1933 Test: TestLedgerErrorValidate Messages: data.EnsureBlock: {C} could not write block 1926 to the ledger: round 1925 before dbRound 1933 --- FAIL: TestLedgerErrorValidate (0.27s) ledger_test.go:628: Error Trace: /Users/shantkarakashian/go/src/github.com/algorand/go-algorand/data/ledger_test.go:628 Error: Received unexpected error: round 0 before dbRound 2 Test: TestLedgerErrorValidate

First is coming from data/ledger.go:

363 case ledgercore.BlockInLedgerError: 364 // The block is already in the ledger. Catchup and agreement could be competing 365 // It is sufficient to report this as a Debug message 366 l.log.Debugf("data.EnsureBlock: {B} could not write block %d to the ledger: %v", round, err) 367 return 368 default: 369 l.log.Errorf("data.EnsureBlock: {C} could not write block %d to the ledger: %v", round, err) 370 } 371 372 // If there was an error add a short delay before the next attempt. 373 time.Sleep(100 * time.Millisecond)

The second from acctupdates.go:

1403// lookupWithoutRewards returns the account data for a given address at a given round. 1404func (au *accountUpdates) lookupWithoutRewards(rnd basics.Round, addr basics.Address, synchronized bool) (data ledgercore.AccountData, validThrough basics.Round\ , rewardsVersion protocol.ConsensusVersion, rewardsLevel uint64, err error) { 1405 needUnlock := false 1406 if synchronized { 1407 au.accountsMu.RLock() 1408 needUnlock = true 1409 } 1410 defer func() { 1411 if needUnlock { 1412 au.accountsMu.RUnlock() 1413 } 1414 }() 1415 var offset uint64 1416 var persistedData trackerdb.PersistedAccountData 1417 for { 1418 currentDbRound := au.cachedDBRound 1419 currentDeltaLen := len(au.deltas) 1420 offset, err = au.roundOffset(rnd, "{F}") 1421 if err != nil { 1422 return 1423 }

There are two places I could reproduce that send messages which break the test.
Are both of these legitimate? If so, the filter below should take care of the second case as well.
Note: {B} {C} {F} annotations are not in the code, I added them to track down where these messages are originated from.

round 0 before dbRound 2

Handled this as well. This is from Lookup path that is part of block evaluation. The first one handled came from EnsureBlock and the second from AddBlock but because of the same reason - while AddBlock goroutine was suspended, the EnsureValidatedBlock add 6+ blocks so that ledger committed account changed and trackers advanced.

Now we uncover another error...

ledger_test.go:677: Error Trace: /Users/shantkarakashian/go/src/github.com/algorand/go-algorand/data/ledger_test.go:677 Error: Should be empty, but was failed to delete hash '000001830087367e5228afa552f1eaf821f979a8352e527bb1e8119b761ae4598d5a5bf8' from merkle trie for account 7777777777777777777777777777777777777777777777777774MSJUVU Test: TestLedgerErrorValidate Messages: failed to delete hash '000001830087367e5228afa552f1eaf821f979a8352e527bb1e8119b761ae4598d5a5bf8' from merkle trie for account 7777777777777777777777777777777777777777777777777774MSJUVU

coming from catchpointtracker.go:

952// accountsUpdateBalances applies the given compactAccountDeltas to the merkle trie 953func (ct *catchpointTracker) accountsUpdateBalances(accountsDeltas compactAccountDeltas, resourcesDeltas compactResourcesDeltas, kvDeltas map[string]modifiedKvV$ 954 if !ct.catchpointEnabled() { 955 return nil 956 } 957 accumulatedChanges := 0 958 959 for i := 0; i < accountsDeltas.len(); i++ { 960 delta := accountsDeltas.getByIdx(i) 961 if !delta.oldAcct.AccountData.IsEmpty() { 962 deleteHash := trackerdb.AccountHashBuilderV6(delta.address, &delta.oldAcct.AccountData, protocol.Encode(&delta.oldAcct.AccountData)) 963 deleted, err := ct.balancesTrie.Delete(deleteHash) 964 if err != nil { 965 return fmt.Errorf("failed to delete hash '%s' from merkle trie for account %v: %w", hex.EncodeToString(deleteHash), delta.addres$ 966 } 967 if !deleted { 968 ct.log.Errorf("failed to delete hash '%s' from merkle trie for account %v", hex.EncodeToString(deleteHash), delta.address) 969 } else {

algonautshant

Thank you for addressing the issues.
Looks good. One more failure detected, but it is extremely rare.
The cause of the database error needs consideration outside the contest of this test.

tests: flaky tests fixes

827570a

algorandskiy added Team Carbon-11 Skip-Release-Notes labels Jun 22, 2023

algorandskiy requested review from gmalouf, algonautshant and algobarb June 22, 2023 00:37

algorandskiy self-assigned this Jun 22, 2023

algonautshant reviewed Jun 22, 2023

View reviewed changes

Makefile Show resolved Hide resolved

tools/network/dnssec/sort_test.go Outdated Show resolved Hide resolved

bbroder-algo reviewed Jun 22, 2023

View reviewed changes

CR fixes

8c15531

algorandskiy requested review from algonautshant and bbroder-algo June 22, 2023 15:47

gmalouf reviewed Jun 22, 2023

View reviewed changes

gmalouf previously approved these changes Jun 22, 2023

View reviewed changes

algonautshant reviewed Jun 22, 2023

View reviewed changes

handle "X before Y dbRound" error as well

63cf2c2

algorandskiy dismissed gmalouf’s stale review via 63cf2c2 June 22, 2023 18:02

algorandskiy mentioned this pull request Jun 22, 2023

Resolve TestVotersReloadFromDiskPassRecoveryPeriod, TestLedgerErrorValidate tests #5432

Closed

algonautshant approved these changes Jun 22, 2023

View reviewed changes

gmalouf approved these changes Jun 22, 2023

View reviewed changes

algorandskiy merged commit f5ad6a5 into algorand:master Jun 22, 2023
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tests: flaky tests fixes #5487

tests: flaky tests fixes #5487

algorandskiy commented Jun 22, 2023

codecov bot commented Jun 22, 2023 •

edited

Loading

bbroder-algo Jun 22, 2023

algorandskiy Jun 22, 2023

bbroder-algo Jun 22, 2023

gmalouf Jun 22, 2023

bbroder-algo Jun 22, 2023

algorandskiy Jun 22, 2023

algorandskiy Jun 22, 2023

algorandskiy Jun 22, 2023

gmalouf Jun 22, 2023

algorandskiy Jun 22, 2023

algonautshant Jun 22, 2023

algonautshant Jun 22, 2023

algorandskiy Jun 22, 2023 •

edited

Loading

algonautshant Jun 22, 2023

algonautshant left a comment

tests: flaky tests fixes #5487

tests: flaky tests fixes #5487

Conversation

algorandskiy commented Jun 22, 2023

Summary

Test Plan

codecov bot commented Jun 22, 2023 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

algorandskiy Jun 22, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

algonautshant left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 22, 2023 •

edited

Loading

algorandskiy Jun 22, 2023 •

edited

Loading