Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(code): Some WAL entries fail to decode #840

Merged
merged 11 commits into from
Feb 10, 2025
Merged

Conversation

ancazamfir
Copy link
Collaborator

@ancazamfir ancazamfir commented Feb 7, 2025

Closes: #XXX
Sometime wal entries fail to decode with errors like this:

2025-02-07T20:30:44.054030Z ERROR node{moniker=test-2}:wal{height=144}: Failed to decode WAL entry 4: failed to fill whole buffer

Looks like for PrevoteTimeLimit and PrecommitTimeLimit we still write something to the WAL (just the tag) and later fail to decode.

  • Add test that shows the error
  • Fix the error

Note: A better fix might be to move the entry filtering logic in consensus who would decide which entries should be logged. For timeouts it could check the type here


PR author checklist

For all contributors

For external contributors

Copy link

codecov bot commented Feb 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.73%. Comparing base (614a46c) to head (ea04fba).
Report is 2 commits behind head on main.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #840      +/-   ##
==========================================
- Coverage   75.79%   75.73%   -0.06%     
==========================================
  Files         169      172       +3     
  Lines       14682    14801     +119     
==========================================
+ Hits        11128    11209      +81     
- Misses       3554     3592      +38     
Flag Coverage Δ
integration 75.44% <ø> (-0.06%) ⬇️
mbt 14.63% <ø> (-6.00%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
core ∅ <ø> (∅)
engine ∅ <ø> (∅)
app ∅ <ø> (∅)
starknet ∅ <ø> (∅)

// Write timeout
encode_timeout(timeout, &mut buf)?;
// Write tag and timeout if applicable
encode_timeout(Self::TAG_TIMEOUT, timeout, &mut buf)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Thanks so much for figuring this out 🙌

Copy link
Collaborator Author

@ancazamfir ancazamfir Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we should have consensus decide which entries should be logged and have wal write everything it's told to (by consensus).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break 'inner;
}
}
Event::WalReplayError => {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is the best place to catch errors (WAL or others) in tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure either, but since we already detect errors in the background task, I moved it there.

@ancazamfir ancazamfir marked this pull request as ready for review February 10, 2025 13:31
@ancazamfir ancazamfir added this pull request to the merge queue Feb 10, 2025
Merged via the queue into main with commit eaf84cc Feb 10, 2025
22 checks passed
@ancazamfir ancazamfir deleted the anca/wal_decode branch February 10, 2025 16:32
romac added a commit that referenced this pull request Feb 11, 2025
* Add test that shows: failed to fill whole buffer

* Add WAL replay error when unable to decode all entries

* Don't add tag w/o value for skipped timers or empty buf to wal

* Move new test to wal

* Logging cleanup

* Fix missing info in logs

* Add error detail to WalReplayError event

* Add tag to wal only if vote encoding is succesful

* Do not instruct WAL to persist step timeouts

* Move WalReplayError detection to background task

* WAL writes all timeouts it is given, leave choice up to consensus

---------

Co-authored-by: Romain Ruetschi <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Feb 11, 2025
* feat(test): Add test app based on the example app

* Improve logging in tests

* Better error handling

* Formatting

* Fix test suite

* Implement `RestreamProposal` in example and test app

* Disable `proposal_only` consensus mode tests

* Start from latest height in store plus one

* Do not run MBT tests in integration test suite

* Fix missing logs

* Store synced values in database for later retrieval, increase history length to 100

* Formatting

* Run integration tests for test app in their own job

* Run test app tests with `nextest` for now

* Do not fail CI for test app integration test failure

* Go back to running tests with maelstrom

* Exclude test app from code coverage for now

* Ignore unsupported tests

* Add back full nodes tests

* Wait a bit until starting the first height

* Fix off-by-one error in starting height of test and example apps

* Fix typo

* Slow down test app to ease debugging

* Update state.rs

Co-authored-by: Anca Zamfir <[email protected]>
Signed-off-by: Romain Ruetschi <[email protected]>

* feat(code/app/starknet): Adapt Starknet app to latest P2P protos (#819)

* Update Starknet protos to their latest version

* Adapt Starknet app to new protos

* Use height and round as stream id in example app

* Use `ConsensusStreamId` proto message for stream id in Starknet app

* Fix formatting of stream ids

* Update protos for interop

* Add `ProposalCommitment` type

* Add `BlockInfo` type

* Update proposal building and assembly from parts to account for BlockInfo and ProposalCommitment

* Update protos

* Adapt to latest proto changes

* Post-merge fixes

* Disable integration tests with proposal mode, keep parts only

* Set parts-only in spwan.bash

* Add detail to panic when explicit proposal

* Fix fmt

---------

Co-authored-by: Anca Zamfir <[email protected]>

* chore(code): Some WAL entries fail to decode (#840)

* Add test that shows: failed to fill whole buffer

* Add WAL replay error when unable to decode all entries

* Don't add tag w/o value for skipped timers or empty buf to wal

* Move new test to wal

* Logging cleanup

* Fix missing info in logs

* Add error detail to WalReplayError event

* Add tag to wal only if vote encoding is succesful

* Do not instruct WAL to persist step timeouts

* Move WalReplayError detection to background task

* WAL writes all timeouts it is given, leave choice up to consensus

---------

Co-authored-by: Romain Ruetschi <[email protected]>

* Post-merge fixes

* Refactor parts signature verification

* Add parts signature verification to test app

* Restream proposal parts when reproposing a value

* Formatting

* Remove decided proposal from undecided proposals table

* Remove duplicated code

* Include test app tests in code coverage

* Abort whole test on failure

* Fix full node test

* Show log when request sync value

---------

Signed-off-by: Romain Ruetschi <[email protected]>
Co-authored-by: Anca Zamfir <[email protected]>
Co-authored-by: Anca Zamfir <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Feb 12, 2025
…ny app (#836)

* feat(test): Add test app based on the example app

* Improve logging in tests

* Better error handling

* Formatting

* Fix test suite

* Implement `RestreamProposal` in example and test app

* Disable `proposal_only` consensus mode tests

* Start from latest height in store plus one

* Do not run MBT tests in integration test suite

* Fix missing logs

* Store synced values in database for later retrieval, increase history length to 100

* Formatting

* Run integration tests for test app in their own job

* Run test app tests with `nextest` for now

* Do not fail CI for test app integration test failure

* Generalize the test framework to make it work with any app

* Fix unit test

* Fix bug in config generation

* Cleanup

* Typo

* Cleanup

* Fix late starting full node test

* Cleanup

* Introduce `NodeHandle` abstraction

* Cleanup

* Wait until the main actor is dead before returning

* Go back to running tests with maelstrom

* Fix full node tests

* Fix config generation in test app tests

* Exclude test app from code coverage for now

* Ignore unsupported tests

* Add back full nodes tests

* Remove to initiate logger externally

* Wait a bit until starting the first height

* Fix off-by-one error in starting height of test and example apps

* Fix typo

* Slow down test app to ease debugging

* Cleanup

* Update state.rs

Co-authored-by: Anca Zamfir <[email protected]>
Signed-off-by: Romain Ruetschi <[email protected]>

* feat(code/app/starknet): Adapt Starknet app to latest P2P protos (#819)

* Update Starknet protos to their latest version

* Adapt Starknet app to new protos

* Use height and round as stream id in example app

* Use `ConsensusStreamId` proto message for stream id in Starknet app

* Fix formatting of stream ids

* Update protos for interop

* Add `ProposalCommitment` type

* Add `BlockInfo` type

* Update proposal building and assembly from parts to account for BlockInfo and ProposalCommitment

* Update protos

* Adapt to latest proto changes

* Post-merge fixes

* Disable integration tests with proposal mode, keep parts only

* Set parts-only in spwan.bash

* Add detail to panic when explicit proposal

* Fix fmt

---------

Co-authored-by: Anca Zamfir <[email protected]>

* chore(code): Some WAL entries fail to decode (#840)

* Add test that shows: failed to fill whole buffer

* Add WAL replay error when unable to decode all entries

* Don't add tag w/o value for skipped timers or empty buf to wal

* Move new test to wal

* Logging cleanup

* Fix missing info in logs

* Add error detail to WalReplayError event

* Add tag to wal only if vote encoding is succesful

* Do not instruct WAL to persist step timeouts

* Move WalReplayError detection to background task

* WAL writes all timeouts it is given, leave choice up to consensus

---------

Co-authored-by: Romain Ruetschi <[email protected]>

* Post-merge fixes

* Refactor parts signature verification

* Add parts signature verification to test app

* Restream proposal parts when reproposing a value

* Formatting

* Remove decided proposal from undecided proposals table

* Remove duplicated code

* Include test app tests in code coverage

* Abort whole test on failure

* Fix full node test

* Show log when request sync value

* Make sure Starknet app does not go into round 1 at height 1

* Fix WAL tests for test app

* Fix off by one error in logs due to potential race condition in test framework

* Clarify that `Default` impl of Height must return the zero height

* Require both Starknet and test app tests to pass on CI

* Update coverage exclude list

* Post-merge fixes

---------

Signed-off-by: Romain Ruetschi <[email protected]>
Co-authored-by: Anca Zamfir <[email protected]>
Co-authored-by: Anca Zamfir <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants