Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestMaxBlockSize flake #3086

Closed
rootulp opened this issue Feb 7, 2024 · 20 comments · Fixed by #3183
Closed

TestMaxBlockSize flake #3086

rootulp opened this issue Feb 7, 2024 · 20 comments · Fixed by #3183
Assignees
Labels
flake A test flake that occurred on Github CI warn:blocked item is not currently being worked on but is still blocked WS: V2 ✌️ lemongrass hardfork related
Milestone

Comments

@rootulp
Copy link
Collaborator

rootulp commented Feb 7, 2024

Problem

--- FAIL: TestIntegrationTestSuite (21.86s)
    --- FAIL: TestIntegrationTestSuite/TestMaxBlockSize (13.12s)
        integration_test.go:[27](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:28)1: 
            	Error Trace:	/home/runner/work/celestia-app/celestia-app/app/test/integration_test.go:271
            	            				/home/runner/work/celestia-app/celestia-app/app/test/integration_test.go:174
            	            				/home/runner/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:112
            	Error:      	Not equal: 
            	            	expected: []byte{0xe8, 0xf8, 0x5b, 0xba, 0xe5, 0x85, 0xc6, 0x19, 0xf3, 0xbd, 0x98, 0x4, 0x5e, 0xe5, 0x4d, 0x63, 0xc1, 0xd, 0x77, 0x8d, 0x22, 0x8a, 0x33, 0xea, 0xc6, 0x60, 0x[28](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:29), 0x78, 0x82, 0x28, 0xbf, 0xbd}
            	            	actual  : []byte{0xd5, 0x5c, 0xaf, 0xbb, 0x44, 0x65, 0x4e, 0xc1, 0xb5, 0x7e, 0xf9, 0x43, 0x83, 0x97, 0x9e, 0x7c, 0x[31](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:32), 0xc, 0xd, 0xcc, 0x92, 0x75, 0x7c, 0xde, 0xfd, 0xdd, 0x9b, 0xa5, 0x53, 0x6b, 0xc4, 0x5}
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,4 +1,4 @@
            	            	 ([]uint8) (len=[32](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:33)) {
            	            	- 00000000  e8 f8 5b ba e5 85 c6 19  f3 bd 98 04 5e e5 4d 63  |..[.........^.Mc|
            	            	- 00000010  c1 0d 77 8d 22 8a [33](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:34) ea  c6 60 28 78 82 28 bf bd  |..w.".3..`(x.(..|
            	            	+ 00000000  d5 5c af bb 44 65 4e c1  b5 7e f9 [43](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:44) 83 97 9e 7c  |.\..DeN..~.C...||
            	            	+ 00000010  31 0c 0d cc 92 75 7c de  fd dd 9b a5 53 6b c4 05  |1....u|.....Sk..|
            	            	 }
            	Test:       	TestIntegrationTestSuite/TestMaxBlockSize
    network.go:48: tearing down testnode
FAIL

Occurences

@rootulp rootulp added the flake A test flake that occurred on Github CI label Feb 7, 2024
@rootulp
Copy link
Collaborator Author

rootulp commented Feb 7, 2024

I got this to reproduce locally so here's more info: bad_block_EJVzDj.json

My hypothesis was that this test starts with app version 1 and then the bad block gets produced with app version 2 because some other test upgraded the testnode version but I re-ran the test locally with debug logs and this time it passed and all blocks appear to use app version 2.

height 238 app version 2
height 239 app version 2
height 240 app version 2
height 241 app version 2
height 242 app version 2

@rootulp rootulp self-assigned this Feb 7, 2024
@rootulp
Copy link
Collaborator Author

rootulp commented Feb 7, 2024

It looks like the test node app version is on 0 but the blocks are getting produced with app version 2.

app version 0 block header 0
app version 0 block header 2
app version 0 block header 0
app version 0 block header 2

@rootulp
Copy link
Collaborator Author

rootulp commented Feb 8, 2024

Notes while revisiting this:

  1. The test name TestMaxBlockSize isn't accurate because this test verifies the max square size is hit, not the max block size. Word choice matters here because block size != square size so the max block size != max square size.
  2. I don't understand why the TestMaxSquareSize performs ExtendBlockTest. IMO these should be two separate tests to make it easier to identify what broke when one test fails.
  3. I'm trying to figure out why blocks get produced with app version 2 when it looks like app version is 0 in prepare and process proposal.
app version in prepare proposal 0
app version 0 block header 0
app version in process proposal 0
app version 0 block header 2

@cmwaters
Copy link
Contributor

cmwaters commented Feb 9, 2024

I'm trying to figure out why blocks get produced with app version 2 when it looks like app version is 0 in prepare and process proposal.

The app version of the block is actually set by comet which is probably on version 2 whereas prepare proposal and process proposal have v0 because the context its not correct set

@rootulp
Copy link
Collaborator Author

rootulp commented Feb 26, 2024

Sorry I put this on pause because a few weeks ago we thought this would be resolved by some app version fixes:

  1. fix: pass version when converting between proto consensus params types celestia-core#1202
  2. fix: use genesis file app version if it is set celestia-core#1227

TBH I thought there was one other issue / PR version related but can't find it (cc: @cmwaters ). Regardless, since this is still occurring on main, it's likely that none of the version fixes actually resolved this issue.

@rootulp
Copy link
Collaborator Author

rootulp commented Feb 28, 2024

To repro more quickly, one can run:

cd app/test && go test -run TestIntegrationTestSuite/TestMaxBlockSize -count 5 -v

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 1, 2024

The test was skipped in #3142 so this issue should make it unflaky and then re-enable it

@rootulp rootulp added the WS: V2 ✌️ lemongrass hardfork related label Mar 1, 2024
@rootulp rootulp added this to the v2 milestone Mar 1, 2024
@evan-forbes evan-forbes self-assigned this Mar 4, 2024
@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

I wasn't able to get visibility because I was trying to log the version in prepare/process proposal using:

	fmt.Printf("app version in prepare proposal %v\n", sdkCtx.ConsensusParams().Version.AppVersion)

which was silently panicing so I only observed test timeouts. It is possible to print the version via:

	fmt.Printf("app version in prepare proposal %v\n", sdkCtx.ConsensusParams().Version.GetAppVersion())

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

Interestingly can't repro with single blob or multi blob test cases. Can only repro with randomTxGen. The app version in prepare / process proposal was 2 before and after:

app version in prepare proposal 2
app version in process proposal 2
=== NAME  TestIntegrationTestSuite/TestMaxBlockSize
    integration_test.go:277:
        	Error Trace:	/Users/rootulp/git/rootulp/celestiaorg/celestia-app/app/test/integration_test.go:277
        	            				/Users/rootulp/git/rootulp/celestiaorg/celestia-app/app/test/integration_test.go:174
        	            				/Users/rootulp/go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:115
        	Error:      	Not equal:
        	            	expected: []byte{0xd7, 0x72, 0x7a, 0xd0, 0x5f, 0xef, 0xa3, 0x9f, 0x52, 0x7f, 0xe3, 0x4f, 0x82, 0xcf, 0xa2, 0x87, 0xbc, 0xbb, 0x8a, 0xc0, 0xcf, 0x52, 0x72, 0x3e, 0xc4, 0x21, 0x3d, 0x43, 0x8c, 0x97, 0xf9, 0x62}
        	            	actual  : []byte{0x35, 0x48, 0x18, 0x18, 0x2b, 0xba, 0x90, 0xb5, 0xd8, 0x6b, 0xde, 0xad, 0x9c, 0x36, 0x29, 0x75, 0x60, 0xa1, 0x6, 0x6d, 0x3, 0xe4, 0x1, 0x2e, 0x4c, 0x4b, 0x92, 0x1f, 0x81, 0x19, 0x67, 0x18}

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,4 +1,4 @@
        	            	 ([]uint8) (len=32) {
        	            	- 00000000  d7 72 7a d0 5f ef a3 9f  52 7f e3 4f 82 cf a2 87  |.rz._...R..O....|
        	            	- 00000010  bc bb 8a c0 cf 52 72 3e  c4 21 3d 43 8c 97 f9 62  |.....Rr>.!=C...b|
        	            	+ 00000000  35 48 18 18 2b ba 90 b5  d8 6b de ad 9c 36 29 75  |5H..+....k...6)u|
        	            	+ 00000010  60 a1 06 6d 03 e4 01 2e  4c 4b 92 1f 81 19 67 18  |`..m....LK....g.|
        	            	 }
        	Test:       	TestIntegrationTestSuite/TestMaxBlockSize
app version in prepare proposal 2
app version in process proposal 2

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

There is a discrepancy in the square sizes used in process_proposal.go and extend_block.go. The latter looks incorrect because it always uses the square size upper bound rather than the max effective square size (i.e. the minimum of: gov max square size, upper bound versioned const).

app.GovSquareSizeUpperBound(sdkCtx),

appconsts.SquareSizeUpperBound(appVersion),

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

At least locally, I was able to resolve flakiness via the diff in #3150

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

Fortunately it looks like celestia-node doesn't use the ExtendBlock function exported from celestia-app (see this search). But it does have it's own buggy implementation (see here) because I think that should be using the max effective square size.

So ExtendBlock in this repo should be updated to account for the max effective square size. This means it needs access to the app (i.e. it could be moved from a stand-alone exported function to a method on app).

Separately, celestia-node needs a way to figure out the max effective square size from app and then use that to extend the block.

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

A few concrete action items:

  1. [breaking] Modify ExtendBlock to be a method on app. It likely also needs an sdkCtx so the function signature will change.
  2. [breaking] Consider un-exporting ExtendBlock because it isn't used by celestia-node.
  3. [breaking] Rename GovSquareSizeUpperBound to MaxEffectiveSquareSize.
  4. Add unit tests to ExtendBlock that are stand-alone from the TestMaxBlockSize.

@evan-forbes
Copy link
Member

evan-forbes commented Mar 4, 2024

this might stem all the way back to #1772, where we might have
covered up an underlying issue with square by accidently using the same stateful param in ProcessProposal instead of fixing the non-determinism issue we're seeing now

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 4, 2024

Copying from a Slack convo, we may actually want to use the upper bound in process proposal and extend block so that celestia-node isn't stateful (i.e. doesn't need access to gov max square size) when it tries to extend squares.

@cmwaters
Copy link
Contributor

cmwaters commented Mar 5, 2024

Separately, celestia-node needs a way to figure out the max effective square size from app and then use that to extend the block.

It can query the gov max square size and perform the same method of taking the min of the two. The problem is that if this were to change celestia-node wouldn't know unless it was constantly querying the square size each block. I remember talking about this problem several months ago and just concluding that it wasn't a problem if light nodes just use the upper bound and we rely on the validator set to ensure that it is within bounds.

As an aside, I thought the whole point of ExtendBlock was for celestia-node to use it

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 5, 2024

As an aside, I thought the whole point of ExtendBlock was for celestia-node to use it

+1 but celestia-node doesn't use it currently.

I remember talking about this problem several months ago and just concluding that it wasn't a problem if light nodes just use the upper bound and we rely on the validator set to ensure that it is within bounds.

I vaguely recall the same outcome. I think we agreed to try and avoid celestia-node having to query the gov max square size param each block. If celestia-node doesn't query the gov max square size then this should hold: square.Construct(_, 128, _) == square.Construct(_, 64, _) for the same set of transactions but based on the flakiness of this test case, I don't think that's the case.

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 5, 2024

On celestia-app v1.x, ProcessProposal and ExtendBlock both have a worstCaseShareIndexes that always uses 128.

func worstCaseShareIndexes(blobs int, appVersion uint64) []uint32 {
maxSquareSize := appconsts.SquareSizeUpperBound(appVersion)

On celestia-app main, we use go-square's implementation of worstCaseShareIndexes which uses a max square size plumbed in via square.Construct(). See here:

func worstCaseShareIndexes(blobs, maxSquareSize int) []uint32 {
	shareIndexes := make([]uint32, blobs)

ProcessProposal calls square.Construct with 64 and ExtendBlock calls square.Construct with 128.

Therefore, the bug is not present on celestia-app v1.x or any celestia-node release. The bug is present on go-square v1.0.0 and celestia-app main.

@rootulp
Copy link
Collaborator Author

rootulp commented Mar 5, 2024

Created celestiaorg/go-square#47

@rootulp rootulp added the warn:blocked item is not currently being worked on but is still blocked label Mar 14, 2024
@rootulp
Copy link
Collaborator Author

rootulp commented Mar 14, 2024

Blocked on a release of go-square with celestiaorg/go-square#49. The team has different opinions on whether it should be released as 1.0.1 or 2.0.0. We're meeting tomorrow to decide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flake A test flake that occurred on Github CI warn:blocked item is not currently being worked on but is still blocked WS: V2 ✌️ lemongrass hardfork related
Projects
None yet
3 participants