-
Notifications
You must be signed in to change notification settings - Fork 798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failing test zombienet-cumulus-0005-migrate_solo_to_para
#7105
Comments
With the collation fairness (the PR you linked) we limit the number of collations a collator can provide up to the number of claims for its para id in the claim queue. So the collator is building two blocks per relay parent no matter what and they are built on top of each other(right?). In this case sooner or later a collation will be dropped and the rest of the collations will be unbackable, because of the missing candidate. DQ but solo chain (Dave) means just a single collator node producing blocks on its own. Is this correct? If yes - there are no validators to throttle it and that's why it is producing blocks. I haven't run the test locally but I'm a bit puzzled why Eve doesn't produce blocks at all. I think it should produce at least one or two blocks depending on the claim queue and async backing parameter values. I also believe the test will work fine with |
The way this works is that the lookahead collator is producing two blocks per sslot until the unincluded segment is full. Then it will produce one block per relay.
Dave is a normal parachain here. Eve will stay at 0 and not produce blocks until dave has emitted its custom header from the pvf. This is expected. Eve does not find its own included header on the relay chain. Basically dave plants it there, then dave stops production and eve starts.
The main problem I am seeing is that this is a relay chain side change. So if this goes live it will cause this situation for all live parachains with the lookahead right? We don't have control when parachains will upgrade. I am not familiar with the changes done in that PR, so the question is, will this cause problems for running collators? |
As discussed offline the change should be fine once lookahead is bumped to 3 on Polkadot and Kusama (#6491). With |
Let's make sure to bump these values to 3 for polkadot and kusama before the next release (in march). @tdimitrov can you handle that please? |
@tdimitrov can we disable the test until we bump the value? |
I find it very weird that even outside of this test, with With
With polkadot-parachain, there are no such logs |
Isn't it better to set the |
I think it's just pure coincidence, due to timing. I investigated what's going on and it boils down to this: we have a lookahead of 2. let's say we have three collations. A B C. This wasn't a problem before #6491 because we had more relaxed rate-limiting. Setting the lookahead to 3 is what I think is best, to not break the lookahead collator |
We could set this in the default testnet genesis |
Quick PR that does this and should fix this test #7252 |
The proper fix should be that |
you mean that the lookahead collator should do this? |
But that is a different story. For now I would fix it properly and then backport the fix. Not that many people are using the lookahead and we could communicate with these teams to update their collators. If that doesn't work, we can still increase the number on the production networks before the change goes live.
Apparently it is violating the protocol? Otherwise it would not get banned or whatever happens? Thus, it should be written in a way that doesn't require "changing" the protocol. |
We changed the protocol to take into consideration allowed_ancestry_len (which was 2), is equivalent to a lookahead of 3 now. lookahead used to be 2 by default, so we want to preserve backwards compatibility. |
I don't see a reason why this cannot go in incrementally. Fix the relay chain side and preserve compatibility (this step). Work on more robust collators as a second step (which I presume @skunert wants to do, or we'd probably need to do anyway as part of the collator protocol revamp) |
Yes, I also think having the lookahead bumped on the relay chain side is the right step. With |
As I said, the production networks are different and can be fixed this way. Still not correct that this was done this way, but now it is too late... For the testing I mainly want that the test will start working after lookahead got fixed. |
zombienet-cumulus-0005-migrate_solo_to_para
has been failing consistently.File:
polkadot-sdk/cumulus/zombienet/tests/0005-migrate_solo_to_para.toml
Line 23 in 4059282
Expectation
This test spawns two parachains with the same para id but different genesis, dave and eve. Node dave will produce blocks and after a few blocks output custom
head_data
through the PVF. Thishead_data
contains the header that matches node eves parachain, which should now start producing blocks.Reality
Dave starts producing blocks, but after the custom head data is submitted, eve does not start producing blocks. Instead dave continues to produce blocks.
Investigation
The culprit seems to be #4880, before that PR the test was passing consistently. The question is why the custom head data presented by dave is not set as included head.
In the logs the following messages appear a lot:
Apparently some seconding limit was reached and the block in question not included. Modifying the lookahead collator here to build always only a single block fixes the issue and makes the test pass:
polkadot-sdk/cumulus/client/consensus/aura/src/collators/lookahead.rs
Line 360 in 4059282
cc @tdimitrov any ideas what is happening here?
The text was updated successfully, but these errors were encountered: