[Bug]: avalanche blockchain addValidator is failed for a PoS L1 deployed to Fuji #2526

Closed
0xbarchitect opened this issue Jan 7, 2025 · 13 comments

@0xbarchitect

Describe the bug
Our team at derachain is trying to leverage ACP-77 to create our PoS L1 blockchain. We successfully created the blockchain using avalanche-cli and deployed it to the Fuji testnet. However, when we tried to add new validators to the L1 on Fuji, it did not seem to work as expected.

To Reproduce

  1. Create the blockchain using CLI, selecting PoS as the chain type.
  2. Deploy L1 to Fuji Testnet successfully.
  3. Add new validators to the L1 on Fuji; the logs report that the transactions succeeded.
  4. Verify the validator set using the CLI; it turns out to be empty.

Expected behavior
The validators are added to the L1 validator set successfully.

Screenshots

  1. Blockchain config
    chain-config
  • SubnetID 2L6CmVuASKTG99FKhkqbRpitvgP4p4aXh52bFPwhc1FQFNnF6t
  • BlockchainID HLySu3YcULhhhW19VSeeLwKkrbujGBJ2tzmiNB7y5mRcQRKiV
  2. Deployment to the Fuji Testnet succeeded, and the L1 is tracked on the Subnets site: https://subnets-test.avax.network/subnets/2L6CmVuASKTG99FKhkqbRpitvgP4p4aXh52bFPwhc1FQFNnF6t

  3. The validators were added using the CLI command avalanche blockchain addValidator . We added 5 validators in total, each with a weight of 20. All the nodes are configured to track the L1 and have fully bootstrapped it.

bootstrapped

NODE1
Validator weight: 20
ValidationID: p3msdN2oiSPeiTYii8Mjv1DzkoVp798uHjFnmvc91G3z155i
RegisterL1ValidatorTx fee: 0.000099502 AVAX
RegisterL1ValidatorTx ID: 21eoE6tHNaMma5J7S39ayihVW4QU5pHo5zjMkEVeGvFH6Hqw3d

NODE2
Validator weight: 20
ValidationID: wpLhcSPJMTDBFtyb8PiFUE8CHcVnwD3M6gu1eVjk5aBvWfK9Y
RegisterL1ValidatorTx fee: 0.000099502 AVAX
RegisterL1ValidatorTx ID: 2HxztTo2xP85fadfM1JPmgyc9xXY8WGUbwHdvUuLKjjqDaoqAc

NODE3
Validator weight: 20
ValidationID: uE21rPFMqFbZy1TnbnWaGoMuPiUzt87iYrKXFp2WuTWav4HVu
RegisterL1ValidatorTx fee: 0.000093580 AVAX
RegisterL1ValidatorTx ID: 8KJzUpjSWakC7BbpjP8N86yyiFjhwFbuxMZPSTdmc4YktnBd7

NODE4
Validator weight: 20
ValidationID: 2aJu2G5Gbtdgu5s2Hh3Vtqeq9aD86miDC5CjhxqiW2c9uTpr44
RegisterL1ValidatorTx fee: 0.000093620 AVAX
RegisterL1ValidatorTx ID: TVtGp69Za3tmpmE9bwcUavwRhy3YGfNe5ZPhWTYs3SN188Thn

NODE5
Validator weight: 20
ValidationID: 67gYBXqmVTqeXweYzTnaAX9MWwq1hfMTYLyo3hWvSxQA2tyPz
RegisterL1ValidatorTx fee: 0.000093660 AVAX
RegisterL1ValidatorTx ID: jcmzk2UoRaHxF3RuB4NBew1rFDqhvgiJxdAwPi42GdQsQa6FL

  4. Verifying the validator set using the CLI command avalanche blockchain validators returns an empty list.
    validator-list

Logs
HLySu3YcULhhhW19VSeeLwKkrbujGBJ2tzmiNB7y5mRcQRKiV.zip

Operating System
Ubuntu 22.04


@0xbarchitect 0xbarchitect added the bug Something isn't working label Jan 7, 2025
@felipemadero
Collaborator

Thanks for the issue. Please use the command avalanche validator list and check the other commands under the avalanche validator command group. We will deprecate or adapt avalanche blockchain validators pretty soon.

@felipemadero felipemadero self-assigned this Jan 8, 2025
@0xbarchitect
Author

Hi @felipemadero
Using the avalanche validator list command as you suggested, I retrieved the list shown in the following screenshot.
validator-list-2

I recognize the last 5 nodes in the list as exactly the 5 validators that I added earlier. But I am confused by the first node, NodeID-111111111111111111116DBWJs, which has a weight of 100 and zero remaining balance. What is this node, and how was it added to our L1 validator set?

There is also an error shown in the screenshot:

could not get balance for node NodeID-111111111111111111116DBWJs due to failed to decode client response: fetching L1 validator "11111111111111111111111111111111LpoYY" failed: not found

I worry that this error will lead to a more significant problem for the operation of the L1.

Moreover, when I check the logs of the local bootstrap validator node, which was created during deployment of the L1 to the Fuji testnet and resides in the folder ~/.avalanche-cli/local/deratest250102-local-node-fuji/node1 , I see this error:

[01-04|16:16:49.501] WARN health/worker.go:252 check started failing {"name": "health", "name": "HLySu3YcULhhhW19VSeeLwKkrbujGBJ2tzmiNB7y5mRcQRKiV", "tags": ["2L6CmVuASKTG99FKhkqbRpitvgP4p4aXh52bFPwhc1FQFNnF6t"], "error": "not connected to enough stake: connected to 50.000000%; required at least 100.000000%"}

Screenshots:
error

How can we fix this error?

@felipemadero
Collaborator

felipemadero commented Jan 8, 2025

The local bootstrap validator node created during deploy was indeed a 6th validator in your setup. The issue is that the local bootstrap validator used all its balance to pay validation fees (what was the bootstrap validator's balance, and the balance of the other validators?).

@felipemadero
Collaborator

In order to recover the L1, you should add more balance to the validator whose balance dropped to 0.
I advise always keeping the balance of every validator above 0.
In this case the validator ID is yZTCuyx14TrdQCQ7Z6hxvDYaAQNxBTLXJ1vVK2tas6yWhg1KZ
Please try avalanche validator increaseBalance with that validator ID to increase its balance, then check the list command again and give us feedback.

@michaelkaplan13
Collaborator

Yeah I agree with @felipemadero's suggestion above. For some additional context, when a validator runs out of AVAX balance for the pay-as-you-go fee on the P-Chain, it becomes "inactive", meaning that other nodes are unable to connect to it, and that it is unable to participate in ICM signatures. In this case, it looks like the validator that became inactive due to running out of balance represented 50% of the weight of the L1, so other nodes are unable to progress at all while it is inactive, and ICM messages (which require 67% of stake) can't be constructed.
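
To make the numbers concrete, here is a small sketch of the stake arithmetic. It assumes the weights reported earlier in this thread (one bootstrap validator with weight 100 and five added validators with weight 20 each). The function name and the node labels are illustrative only, and the 67% figure is the ICM threshold mentioned above.

```python
from fractions import Fraction

def active_stake_fraction(weights, inactive):
    """Return the fraction of total weight held by validators that are still active."""
    total = sum(weights.values())
    active = sum(w for name, w in weights.items() if name not in inactive)
    return Fraction(active, total)

# Weights as reported in this thread; the labels are illustrative only.
weights = {"bootstrap": 100, "node1": 20, "node2": 20,
           "node3": 20, "node4": 20, "node5": 20}

# The bootstrap validator is the one that ran out of balance.
fraction = active_stake_fraction(weights, inactive={"bootstrap"})
print(f"active stake: {float(fraction):.0%}")                   # 50%
print("enough for ICM (67%):", fraction >= Fraction(67, 100))   # False
```

With these weights the five added validators together hold exactly half of the total, which matches the "connected to 50.000000%" value in the health check log.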

@0xbarchitect
Author

Thanks for the valuable advice and suggestions.

I have increased the node balance using the avalanche validator increaseBalance command as suggested. The balance was added successfully.
increase-balance

I rechecked the validator set using the avalanche validator list command: the error is gone and the new balance of the local bootstrap node is shown correctly.
validator-list-3

The validator list is also updated on the Subnets site properly.

I checked the logs of the local bootstrap validator node, and it now passes the stake-amount health check.
stake-heathcheck

The L1 has resumed working: it completes transactions and continues building new blocks.

However, there is a minor issue on the Subnets site that I think might be a UI error: the number of validators is shown as zero, as in the screenshot.
validators-number-zero

Moreover, a bigger problem that I found while resolving this issue is how to figure out a proper validator set configuration for our L1. Let's discuss it.
During creation and deployment of the L1 to the Fuji Testnet, the CLI automatically created the local bootstrap validator node and gave it a fixed weight of 100. After that, when adding a new validator to the L1, we can give it a maximum weight of only 20; any weight above 20 leads to a churn rate exceeded error, as shown in the screenshot (I tried to set a weight of 50).
churn-rate-exceeded

The problem lies in our limited computing resources: we can allocate at most 5 validator nodes for the L1 testnet, including the bootstrap node. With a weight of 20 for each non-bootstrap node and a weight of 100 for the bootstrap node, the total weight of all validators of the L1 is 180 (or 200 if we add 5 non-bootstrap nodes). In my opinion, this weight distribution is biased quite heavily towards the bootstrap node. If the bootstrap node stops working due to hardware problems, or because it runs out of AVAX balance as in our case, the whole L1 is halted.

Is it possible to adjust the maximum weight that we can set for a non-bootstrap validator node when adding it to the L1, for example to a weight of 50? With this possibility, I think we could operate the L1 with 5 physical nodes more safely and with better fault tolerance.

@sukantoraymond sukantoraymond added this to the Reported Issues milestone Jan 9, 2025
@felipemadero
Collaborator

Hey! Taking note of the validator count showing as 0 on the Subnets site.

At the moment there is no way to circumvent the 20% total-weight churn restriction for validators created or removed after the bootstrap one (the churn restriction is a security measure). A procedure that can get you to a setup similar to the one you want is to add more validators until you have enough weight to remove the bootstrap one (you need to reach a total weight of 500; you can always add up to 20% of the current total weight).
Afterwards it will be easier to add and remove validators.

@0xbarchitect
Author

I understand the 20% total-weight churn restriction, but to reach a total weight of 500 under this limitation, we need to provision at least (correct me if I'm wrong) (500 - 100) / 20 = 20 non-bootstrap nodes. That is a huge number and unaffordable for us.
I wondered whether we could use avalanche blockchain changeWeight to increase the weight of validators. I tried it but it failed, as shown in the screenshot.
change-weight-error

The command was executed after the churn period had expired, and the error message says that the node is not a validator of the L1, while it is indeed a validator.

Is it possible to change a validator's weight using a CLI command? Can you give us some advice?

@felipemadero
Collaborator

Another possibility is to use, say, 5 bootstrap validators on blockchain deploy (--num-local-nodes=5). In that case the total weight will be 500, and it will be easy to add new validators with a weight of around 100 and also to remove bootstrap validators.

@felipemadero
Collaborator

I described the first option (1 bootstrap validator, then adding validators) because you already have an L1. In that case you need to add 9 validators, not 20, since 100*1.2**9 > 500. You can increase the weight each time you add a validator, taking the current total weight into account.
The changeWeight command is currently being fixed for certain issues (including the one you mention) in PR #2545. In any case, it is also subject to the churn rate restriction.
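
The arithmetic behind the two estimates (20 versus 9 validators) can be sketched as follows. This is a simplified model, not the validator manager's actual logic: it assumes each addition lands in its own churn period, that weights are whole numbers, and that a new validator may receive at most 20% of the current total weight, rounded down. The function names are hypothetical.

```python
def additions_with_fixed_weight(start_total, target_total, weight):
    """Count validators needed when every new validator gets the same weight."""
    count = 0
    total = start_total
    while total < target_total:
        total += weight
        count += 1
    return count

def additions_with_max_churn(start_total, target_total, churn_percent=20):
    """Count validators needed when each one takes the largest weight the cap allows."""
    count = 0
    total = start_total
    while total < target_total:
        step = total * churn_percent // 100  # largest whole weight within the cap
        if step == 0:
            raise ValueError("total weight too small to add any validator")
        total += step
        count += 1
    return count

# Start from the single bootstrap validator (weight 100) and aim for a total of 500,
# the point at which a weight of 100 is 20% of the total.
print(additions_with_fixed_weight(100, 500, 20))  # 20
print(additions_with_max_churn(100, 500))         # 9
```

In this model the successive maximum weights are 20, 24, 28, 34, 41, 49, 59, 71 and 85, which brings the total from 100 to 511 after nine additions.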

@felipemadero
Collaborator

Also, considering the option of adding 9 validators, you can opt to add some of them locally on the same machine where the bootstrap validator is running. Use my local machine to spin up an additional validator.
In that case, you can add a number of temporary validators so that you are able to remove the bootstrap one, and then remove the temporary ones as well.

@0xbarchitect
Author

0xbarchitect commented Jan 13, 2025

I went with the option of launching the L1 with 5 local bootstrap validator nodes. The result is good so far, so I want to wrap things up and share the result here, hoping it will provide valuable insight to other teams struggling with this PoS issue.

Step 1. Create the L1 blockchain using the avalanche blockchain create command as usual.
step1-create-blockchain

Step 2. Deploy the L1 to the Fuji testnet, with the --num-local-nodes argument set to 5

$ avalanche blockchain deploy deratest250111 --num-local-nodes=5 -f

Wait until the L1 is deployed to the Fuji testnet and bootstrapped with 5 local validator nodes, each with a weight of 100. The initial validator set can be confirmed using the CLI command

$ avalanche validator list deratest250111 -f

Note that the balance of each validator is low, only 0.1 AVAX.

The L1 should be deployed to Fuji successfully and be shown on the Subnets site.

Step 3. Spin up a non-bootstrap validator node and replace a local bootstrap validator node with the physical node.

  • Set up the AvalancheGo full node and configure it to track the L1. I leave this procedure to the official developer docs; please check it there. Once the node has fully bootstrapped the L1, this can be confirmed with a call to the RPC method info.isBootstrapped.
    step3-boostrap-nodes

  • Retrieve the NodeID using the RPC method info.getNodeID
    step3-nodeID

  • Add the node to the L1 validator set using the avalanche blockchain addValidator command, with the NodeID retrieved in the previous step.

$ avalanche blockchain addValidator deratest250111 -f

The weight is 100 and the AVAX balance to deposit is 1 AVAX (you can choose to deposit a greater balance). The command's logs should look like the following:

==============================================
Initializing a validator registration with PoS validator manager
Using rpcURL: http://127.0.0.1:9658/ext/bc/A19HfLGD92ZbCUW6uycTM5N69BCR12JihXVp2qdDLw7Mev4Hg/rpc
NodeID: NodeID-2mhirVhzPrgDMc1nZVJwXSXg8dKr9YwGh staking 100 for 2592000s
==============================================
Validator weight: 100
ValidationID: AGZiRSc8MRpkaNA5t8a5BLTafzhPxntT5HJyFrL6czD3bKNHo
RegisterL1ValidatorTx fee: 0.000099622 AVAX
RegisterL1ValidatorTx ID: 2hPQV6xssLaTEW7A7zX5c3epX3uvDQ6LsYmYgeZ28L72Yw45p9
Waiting for P-Chain to update validator information ... 100% [===============]           

  NodeID: NodeID-2mhirVhzPrgDMc1nZVJwXSXg8dKr9YwGh
  Network: Fuji
  Balance: 1
✓ Validator successfully added to the L1
  • Remove the local bootstrap node from the validator set using the avalanche blockchain removeValidator command.
$ avalanche blockchain removeValidator deratest250111 --node-id=<local-bootstrap-nodeID> -f

The local bootstrap NodeID can be found using the avalanche validator list command. The local bootstrap node should be removed successfully, with logs similar to the following:

ValidationID: 2rtk3pNWfHiRct7CxiyuZbBEcQEg5CKqdQ5izkoztcYBJMuJzR
SetL1ValidatorWeightTx fee: 0.000078876 AVAX
SetL1ValidatorWeightTx ID: yupG8iFbAxsgFn1vcFBp8PmXoJGhUs9YDcNZdYZ41drcLYzjh
Waiting for P-Chain to update validator information ... 100% [===============]           

✓ Validator successfully removed from the Subnet
  • Repeating these steps for all 5 physical nodes, we replaced all local bootstrap nodes with physical nodes. The result should be similar to this.
    step3-validator-set

The new validator set is also updated in the Subnet detail page.
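
The info.isBootstrapped and info.getNodeID calls used in Step 3 could be scripted roughly as below. This is a minimal sketch using only the Python standard library. It assumes the node exposes the AvalancheGo Info API at the /ext/info path; the node address and blockchain ID in the usage comments are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request

def build_info_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 payload for the AvalancheGo Info API."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or {}}

def call_info_api(node_url, method, params=None):
    """POST the payload to <node_url>/ext/info and return the 'result' field."""
    body = json.dumps(build_info_request(method, params)).encode()
    request = urllib.request.Request(
        node_url + "/ext/info",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["result"]

# Example usage against a running node (address and ID are placeholders):
#   node = "http://127.0.0.1:9650"
#   call_info_api(node, "info.isBootstrapped", {"chain": "<blockchain-id>"})
#   call_info_api(node, "info.getNodeID")
```

The calls are left commented out so that the snippet can be imported without a node running; uncomment them once the node address and the L1's blockchain ID are filled in.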

Step 4. Fault tolerance test.

  • To verify that the L1 can tolerate the failure of 1 of 5 validators, I conducted the following fault tolerance test.
    Stop one validator node. Because I have configured AvalancheGo as a systemd service, I stop it using the systemctl stop command.
    step4-stopnode

  • With one node stopped, so that only 4/5 validator nodes are operating, we try to process a transaction on the L1. The transaction completes normally, i.e. the L1 tolerated 1 stopped node. The tx detail is shown here.

  • Next, stop one more validator, so that only 3/5 validator nodes are operating. We observe that the L1 is halted: the transaction hangs, as shown in the screenshot.
    IMG_3733

  • Restart one validator, so that 4/5 validators are operating again. The L1 resumes working and the tx is processed successfully.

  • This fault tolerance test shows that the L1 keeps working with 4/5 of the validators online, i.e. with 5 equal-weight validators, the L1 can tolerate 1 validator going offline.
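
The outcome of this test is consistent with a simple stake-threshold model. The sketch below assumes that the L1 needs some minimum share of the total weight online to make progress. The 67% used here is borrowed from the ICM figure mentioned earlier in this thread and is an assumption, not a measured consensus parameter. The function name is illustrative.

```python
def max_tolerated_failures(weights, required_percent=67):
    """Return how many of the heaviest validators can go offline while the
    remaining weight still meets the required share of the total."""
    total = sum(weights)
    online = total
    failures = 0
    for weight in sorted(weights, reverse=True):  # worst case: heaviest fail first
        if (online - weight) * 100 < required_percent * total:
            break
        online -= weight
        failures += 1
    return failures

# Five equal-weight validators, as in the setup tested above.
print(max_tolerated_failures([100] * 5))         # 1
# The original setup: one bootstrap validator (100) plus five validators of 20.
print(max_tolerated_failures([100] + [20] * 5))  # 0
```

Under this assumption, 4 of 5 equal-weight validators hold 80% of the weight (above the threshold) while 3 of 5 hold 60% (below it), which matches the behaviour observed in the test.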

We are satisfied with this result and confident about going to the next step. We wish all other teams the same good result.

@felipemadero
Collaborator

Thanks so much for this detailed feedback for other teams! Also happy you got unblocked.

@github-project-automation github-project-automation bot moved this from Backlog 🗄️ to Done ✅ in Platform Engineering Group Jan 14, 2025