Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: flaky tests fixes #5487

Merged
merged 3 commits into from
Jun 22, 2023

Conversation

algorandskiy
Copy link
Contributor

Summary

  1. Fix libsodium copying when re-running libsodium build. This caused libsodium-fork/libsodium-fork nested directory for example in this path: ./crypto/copies/darwin/amd64/libsodium-fork/libsodium-fork/src/libsodium/crypto_sign/ed25519/ref10. This prevents from making changes in libsodium sources and compiling it with make like make crypto/libs/darwin/amd64/lib/libsodium.a.
  2. Fix TestLedgerErrorValidate that fails when ledger flushes accounts between EnsureValidatedBlock and EnsureBlock for the same round. This causes very specific lookup error that comes from evaluator. Fix handles this specific error.
  3. Fix TestSrvSort. Sometimes random generator draws values 0, 1 for uniform between [0, 204] and breaks test assumption. Retry up to 3 times in this case.

Test Plan

  1. Tested manually - no nested dup dirs anymore.
  2. Cannot repro locally.
  3. Reproducible with go test ./tools/network/dnssec -run TestSrvSort -count 100 and gone after the fix.

@codecov
Copy link

codecov bot commented Jun 22, 2023

Codecov Report

Merging #5487 (63cf2c2) into master (b06de44) will increase coverage by 0.29%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5487      +/-   ##
==========================================
+ Coverage   55.41%   55.70%   +0.29%     
==========================================
  Files         446      446              
  Lines       63289    63289              
==========================================
+ Hits        35069    35253     +184     
+ Misses      25826    25664     -162     
+ Partials     2394     2372      -22     

see 37 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Makefile Show resolved Hide resolved
tools/network/dnssec/sort_test.go Outdated Show resolved Hide resolved
}
retryCounter++
goto retry
}
a.Equal(net.SRV{Priority: 1, Weight: 200}, *arr[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error seems to be here, in faulty expectations about who the first value of a randomized sort is going to be, rather than in not retrying it enough times? I don't understand the point of this test or the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but most of the cases (202/204) we expect this entry to be at the very beginning of the randomized list according to its weight. Happy to change to better scenario

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe set the weight to a million? and always run it ten times, and assert the large weight is only out of place at most once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely be rewritten with Rapid at some point - can generate the values with appropriate constraints defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe run one array with just with weights 2 and 1 and run it a hundred times and put sigma expectation bars around the count of the unique orderings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want waste time on this minor test with fancy non-obvious conditions. I guess having a w=1M and re-running it once is good enough. I'll update to reflect this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, weight is 16 bit, well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 attempts with max uint16

srvRecArray(arr).sortAndRand()
a.Equal(net.SRV{Priority: 1, Weight: 200}, *arr[0])
if (*arr[0] != net.SRV{Priority: 1, Weight: 0xFFFF}) {
// there is a small change that a random number from 0 to max uint15 would be 0 or 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to approve in the interest of fixing tests, but why not just limit the range of the random number to be greater than 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the spec requires to pick a random number from 0 to sum of all weights to be ordered... I guess it is better to be more-less spec compliant.

gmalouf
gmalouf previously approved these changes Jun 22, 2023
@@ -629,6 +632,32 @@ func TestLedgerErrorValidate(t *testing.T) {
case <-expectedMessages:
// only debug messages should be reported
case um := <-unexpectedMessages:
if strings.Contains(um, "before dbRound") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    ledger_test.go:632:
        	Error Trace:	/Users/shantkarakashian/go/src/github.com/algorand/go-algorand/data/ledger_test.go:632
        	Error:      	Should be empty, but was data.EnsureBlock: could not write block 1926 to the ledger: round 1925 before dbRound 1933
        	Test:       	TestLedgerErrorValidate
        	Messages:   	data.EnsureBlock: {C} could not write block 1926 to the ledger: round 1925 before dbRound 1933
--- FAIL: TestLedgerErrorValidate (0.27s)
    ledger_test.go:628:
        	Error Trace:	/Users/shantkarakashian/go/src/github.com/algorand/go-algorand/data/ledger_test.go:628
        	Error:      	Received unexpected error:
        	            	round 0 before dbRound 2
        	Test:       	TestLedgerErrorValidate

First is coming from data/ledger.go:

363                case ledgercore.BlockInLedgerError:
364                        // The block is already in the ledger. Catchup and agreement could be competing
365                        // It is sufficient to report this as a Debug message
366                        l.log.Debugf("data.EnsureBlock: {B} could not write block %d to the ledger: %v", round, err)
367                        return
368                default:
369                        l.log.Errorf("data.EnsureBlock: {C} could not write block %d to the ledger: %v", round, err)
370                }
371
372                // If there was an error add a short delay before the next attempt.
373                time.Sleep(100 * time.Millisecond)

The second from acctupdates.go:

1403// lookupWithoutRewards returns the account data for a given address at a given round.
1404func (au *accountUpdates) lookupWithoutRewards(rnd basics.Round, addr basics.Address, synchronized bool) (data ledgercore.AccountData, validThrough basics.Round\
    , rewardsVersion protocol.ConsensusVersion, rewardsLevel uint64, err error) {
1405        needUnlock := false
1406        if synchronized {
1407                au.accountsMu.RLock()
1408                needUnlock = true
1409        }
1410        defer func() {
1411                if needUnlock {
1412                        au.accountsMu.RUnlock()
1413                }
1414        }()
1415        var offset uint64
1416        var persistedData trackerdb.PersistedAccountData
1417        for {
1418                currentDbRound := au.cachedDBRound
1419                currentDeltaLen := len(au.deltas)
1420                offset, err = au.roundOffset(rnd, "{F}")
1421                if err != nil {
1422                        return
1423                }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two places I could reproduce that send messages which break the test.
Are both of these legitimate? If so, the filter below should take care of the second case as well.
Note: {B} {C} {F} annotations are not in the code, I added them to track down where these messages are originated from.

Copy link
Contributor Author

@algorandskiy algorandskiy Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

round 0 before dbRound 2

Handled this as well. This is from Lookup path that is part of block evaluation. The first one handled came from EnsureBlock and the second from AddBlock but because of the same reason - while AddBlock goroutine was suspended, the EnsureValidatedBlock add 6+ blocks so that ledger committed account changed and trackers advanced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we uncover another error...

    ledger_test.go:677:
        	Error Trace:	/Users/shantkarakashian/go/src/github.com/algorand/go-algorand/data/ledger_test.go:677
        	Error:      	Should be empty, but was failed to delete hash '000001830087367e5228afa552f1eaf821f979a8352e527bb1e8119b761ae4598d5a5bf8' from merkle trie for account 7777777777777777777777777777777777777777777777777774MSJUVU
        	Test:       	TestLedgerErrorValidate
        	Messages:   	failed to delete hash '000001830087367e5228afa552f1eaf821f979a8352e527bb1e8119b761ae4598d5a5bf8' from merkle trie for account 7777777777777777777777777777777777777777777777777774MSJUVU

coming from catchpointtracker.go:

 952// accountsUpdateBalances applies the given compactAccountDeltas to the merkle trie
 953func (ct *catchpointTracker) accountsUpdateBalances(accountsDeltas compactAccountDeltas, resourcesDeltas compactResourcesDeltas, kvDeltas map[string]modifiedKvV$
 954        if !ct.catchpointEnabled() {
 955                return nil
 956        }
 957        accumulatedChanges := 0
 958
 959        for i := 0; i < accountsDeltas.len(); i++ {
 960                delta := accountsDeltas.getByIdx(i)
 961                if !delta.oldAcct.AccountData.IsEmpty() {
 962                        deleteHash := trackerdb.AccountHashBuilderV6(delta.address, &delta.oldAcct.AccountData, protocol.Encode(&delta.oldAcct.AccountData))
 963                        deleted, err := ct.balancesTrie.Delete(deleteHash)
 964                        if err != nil {
 965                                return fmt.Errorf("failed to delete hash '%s' from merkle trie for account %v: %w", hex.EncodeToString(deleteHash), delta.addres$
 966                        }
 967                        if !deleted {
 968                                ct.log.Errorf("failed to delete hash '%s' from merkle trie for account %v", hex.EncodeToString(deleteHash), delta.address)
 969                        } else {

Copy link
Contributor

@algonautshant algonautshant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing the issues.
Looks good. One more failure detected, but it is extremely rare.
The cause of the database error needs consideration outside the contest of this test.

@algorandskiy algorandskiy merged commit f5ad6a5 into algorand:master Jun 22, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants