Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gossip: Random test failure in gossip test "test filtering values works" #157

Open
InKryption opened this issue Jun 5, 2024 · 10 comments · Fixed by #257
Open

gossip: Random test failure in gossip test "test filtering values works" #157

InKryption opened this issue Jun 5, 2024 · 10 comments · Fixed by #257
Labels
bug Something isn't working

Comments

@InKryption
Copy link
Contributor

Description

At random, this line in the referenced test will fail:

try std.testing.expect(values.items.len > 0);

I currently do not have a stacktrace demonstrating this behavior, because I do not know the circumstances under which this failure occurs.

How to Reproduce the Bug

run zig build test -Dfilter="gossip.pull_response: test filtering values works" and pray

Additional Context

It should be noted that the function being tested, named filterSignedGossipDatas, currently instantiates and uses a PRNG seeded with std.time.milliTimeStamp(), which could explain its seemingly random and unreproducible behavior.

What would be ideal is capturing a stack trace of this occurring, as well as the exact or approximate timestamp at which this failure occurs, to diagnose what random values can cause such a failure, and identify what circumstances they lead to which must be guarded against.

At the time of writing I plan on pushing a change which would help in the above goals, without needing to actively search for this mythical seed.

@InKryption
Copy link
Contributor Author

Sorry, I'd like to keep this open until we actually have it solved. The code for helping us solve it is in place, it just needs to randomly occur at some point, or the code needs to fundamentally change such that we can guarantee it won't.

@0xNineteen
Copy link
Contributor

just hit it - The failing seed is: '1718031575764'

@InKryption
Copy link
Contributor Author

Can confirm that using '1718031575764' as the seed consistently reproduces the failure state.

@dnut
Copy link
Contributor

dnut commented Jun 12, 2024

another one: (first in list below)

EDIT: i'll keep adding them to this list as i see them:

@InKryption
Copy link
Contributor Author

Can't reproduce the behavior using '1718158590375'.

@InKryption
Copy link
Contributor Author

InKryption commented Jun 12, 2024

I just noticed there's another non-deterministic element, the filter parameter:

generated here:

var filters = try buildGossipPullFilters(
std.testing.allocator,
&gossip_table_rw,
&failed_pull_hashes,
max_bytes,
100,
);
defer deinitGossipPullFilters(&filters);
var filter = filters.items[0];

is random as well, because of this chain of code:

  • buildGossipPullFilters calls:
    var filter_set = try GossipPullFilterSet.init(alloc, num_items, bloom_size);
  • GossipPullFilterSet.init calls:
    const filter = try Bloom.random(
    alloc,
    @intFromFloat(max_items),
    FALSE_RATE,
    @intFromFloat(bloom_size_bits),
    );
  • Bloom.random does this:

    sig/src/bloom/bloom.zig

    Lines 94 to 101 in 94f2212

    const seed = @as(u64, @intCast(std.time.milliTimestamp()));
    var rnd = RndGen.init(seed);
    var keys = try ArrayList(u64).initCapacity(alloc, n_keys);
    for (0..n_keys) |_| {
    const v = rnd.random().int(u64);
    keys.appendAssumeCapacity(v);
    }

which means there's another source of RNG also based on std.time.milliTimestamp.

I think we should do an audit on this, and overall replace all uses of std.time.milliTimestamp as a seed that aren't obviously at the highest level of code, with a std.Random interface parameter, or a pointer to a concrete PRNG implementation if specific random behavior is desired/required.

@0xNineteen
Copy link
Contributor

0xNineteen commented Jun 12, 2024

yeah pulling out the random seeds to deterministic would be good - we can create a ticket - kind of low priority right now

https://github.com/orgs/Syndica/projects/2/views/1?pane=issue&itemId=67168458

@InKryption
Copy link
Contributor Author

New failing seed caught in CI:
https://github.com/Syndica/sig/actions/runs/9811929838/job/27094994026?pr=189
Seed number is '1720202203195'.

@InKryption
Copy link
Contributor Author

0xNineteen added a commit that referenced this issue Sep 3, 2024
- changed the imports across various files to match the current approach (sig.core.account; not @import)
- remove excess path naming of tests (test "bloom.bloom: ..." -> test "...")
- move `getWallclockMs` into `sig.time` (doesnt make sense to be in gossip)
- add seed parameter to `buildMessages` which is used for all random operations (prev seed was wallclock)
- add `seed` parameter to `PullRequestTask` (instead of using wallclock)
- add `now` parameter to `buildPullRequests` for more deterministic results (prev was reading wallclock)
- modifications to `test "test build pull requests"` to be a more deterministic test

note: the last two changes were aimed to make `test "test build pull requests"` more deterministic and fix #157 -- ive re-run the tests many times and it seems to be fixed

- remove unnecessary log lines in gossip
- add additional context to readme.md
@dnut
Copy link
Contributor

dnut commented Sep 4, 2024

This is still a problem. I saw a failure on the latest commit for seed 1725458185024. See my comment above for a list I'll keep updating as I see more failures.

@dnut dnut reopened this Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: 📋 Backlog
Development

Successfully merging a pull request may close this issue.

3 participants