breakage against Noble testnet v8.0.0-rc.2 #4899

Open
conorsch opened this issue Oct 21, 2024 · 2 comments

conorsch commented Oct 21, 2024

Describe the bug

An upcoming Noble chain upgrade to v8 is being prepared on the Noble testnet. For the Penumbra Labs testnet (https://testnet.plinfra.net), we've been running a version of Hermes that relays between penumbra-testnet-phobos-2 and grand-1. On or around 2024-10-17, we started observing breakage when communicating with the Noble testnet node endpoint run by Polkachu.

I confirmed out of band with Polkachu that this breakage corresponded to deployment of the https://github.com/noble-assets/noble/releases/tag/v8.0.0-rc.2 tag to the testnet endpoint.

We first discovered this breakage when testing the behavior of the diff in #4878. Similar breakage is also evident in the hermes relayer that PL is running.

Example error messages

When running on the feature branch for #4878:

❯ cargo run -q --release --bin pcli --  --home ~/.local/share/pcli view noble-address --channel channel-221 --noble-node http://noble-testnet-grpc.polkachu.com:21590/
Error: status: Internal, message: "failed to decode Protobuf message: TxResponse.raw_log: BroadcastTxResponse.tx_response: invalid string value: data is not UTF-8 encoded", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "15378766"} }
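
For context on the message above: a Protobuf `string` field must hold valid UTF-8, and a decoder rejects the whole message when it does not. The following is a minimal sketch of that check using only the Rust standard library. `decode_string_field` is a hypothetical helper written for illustration, not a function from prost, tonic, or pcli, and the byte payload is made up rather than taken from the Noble node.

```rust
/// Sketch of the validity check a Protobuf decoder applies to a `string`
/// field: the raw bytes must be valid UTF-8, otherwise decoding fails.
/// Hypothetical helper for illustration, not a prost or pcli API.
fn decode_string_field(raw: &[u8]) -> Result<String, String> {
    String::from_utf8(raw.to_vec())
        .map_err(|e| format!("invalid string value: {}", e.utf8_error()))
}

fn main() {
    // Plain ASCII text is valid UTF-8 and decodes without trouble.
    println!("{:?}", decode_string_field(b"out of gas"));

    // Made-up payload: the byte 0xff never appears in valid UTF-8,
    // so strict decoding fails at the very first byte.
    println!("{:?}", decode_string_field(&[0xff, b'l', b'o', b'g']));
}
```

The second call fails with the standard library's wording "invalid utf-8 sequence of 1 bytes from index 0", the same phrasing that shows up in the hermes logs in this issue. That suggests both symptoms come from strict UTF-8 validation of bytes returned by the node, though this sketch does not show which field carries those bytes.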

When viewing the logs for the hermes relayer instance between testnets:

Oct 17 23:17:22 hermes hermes[1971991]: 2024-10-17T23:17:22.462927Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: invalid utf-8 sequence of 1 bytes from index 0 at line 1 column 312, retrying in 1.5s... height=15322789
Oct 17 23:17:24 hermes hermes[1971991]: 2024-10-17T23:17:24.305214Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: invalid utf-8 sequence of 1 bytes from index 0 at line 1 column 312, retrying in 2s... height=15322789
Oct 17 23:17:26 hermes hermes[1971991]: 2024-10-17T23:17:26.646205Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events after 4 attempts: RPC error: serde parse error: invalid utf-8 sequence of 1 bytes from index 0 at line 1 column 312 height=15322789
Oct 17 23:37:58 hermes hermes[1971991]: 2024-10-17T23:37:58.192927Z ERROR ThreadId(23) spawn:chain{chain=grand-1}:client{client=07-tendermint-317}:connection{connection=connection-267}:channel{channel=channel-221}:worker.client.refresh{client=07-tendermint-317 src_chain=penumbra-testnet-phobos-2 dst_chain=grand-1}:foreign_client.refresh{client=penumbra-testnet-phobos-2->grand-1:07-tendermint-317}:foreign_client.validated_client_state{client=penumbra-testnet-phobos-2->grand-1:07-tendermint-317}: client state is not valid: latest height is outside of trusting period! latest_height=2-465929 network_timestamp=2024-10-17T23:37:51.810809503Z consensus_state_timestamp=2024-10-17T15:24:04.628116411Z elapsed=29627.182693092s
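
The last log line is a different failure from the parse errors: the client was not refreshed within its trusting period. As a rough sketch, the elapsed value in that line can be reproduced from the two timestamps it prints. The 20-minute trusting period below is an assumed placeholder, since the log does not state the real value, and both function names are hypothetical.

```rust
use std::time::Duration;

/// Time between the consensus state timestamp and the network timestamp.
/// Both arguments are offsets since midnight UTC on 2024-10-17.
fn elapsed_between(consensus_state: Duration, network: Duration) -> Duration {
    network - consensus_state
}

/// A client stays usable only while the elapsed time is inside the trusting period.
fn within_trusting_period(elapsed: Duration, trusting_period: Duration) -> bool {
    elapsed < trusting_period
}

fn main() {
    // consensus_state_timestamp=2024-10-17T15:24:04.628116411Z
    let consensus_state = Duration::new(15 * 3600 + 24 * 60 + 4, 628_116_411);
    // network_timestamp=2024-10-17T23:37:51.810809503Z
    let network = Duration::new(23 * 3600 + 37 * 60 + 51, 810_809_503);

    let elapsed = elapsed_between(consensus_state, network);
    println!("elapsed = {:?}", elapsed);

    // ASSUMPTION: placeholder value, the real trusting period is not in the log.
    let trusting_period = Duration::from_secs(20 * 60);
    println!(
        "within trusting period: {}",
        within_trusting_period(elapsed, trusting_period)
    );
}
```

The computed difference is a little over eight hours, matching the logged elapsed=29627.182693092s, so any trusting period shorter than that leaves the client expired.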

Additional context

The gRPC endpoint is at least functional enough to return service descriptors:

❯ grpcurl -plaintext noble-testnet-grpc.polkachu.com:21590 list | grep noble.forwarding
noble.forwarding.v1.Query

We also know that the cometbft rpc is returning structured data:

❯ curl -s https://noble-testnet-rpc.polkachu.com/status | jq -r .result.node_info.version
0.38.12

Given the parse error, though, we still need to determine how that structure violates assumptions in the code.

github-actions bot added the needs-refinement label Oct 21, 2024
conorsch changed the title from "breakage against Noble testnet v8.0.0-rc2" to "breakage against Noble testnet v8.0.0-rc.2" Oct 21, 2024
aubrika added this to the Sprint 15 milestone Oct 21, 2024
conorsch added a commit to prax-wallet/registry that referenced this issue Oct 22, 2024
conorsch removed the needs-refinement label Oct 23, 2024
conorsch commented

Based on a research spike by @avahowell in collaboration with Astria, we tried setting compat_mode = '0.37' in the hermes config for the noble testnet. With that setting, Hermes was able to create new channels, and can read chain state while starting up, but quickly lapses back into failing to parse rpc messages from the noble testnet node. Debug logs:

Oct 23 17:52:05 hermes hermes[2249147]: 2024-10-23T17:52:05.570757Z DEBUG ThreadId(27) event_source.rpc{chain.id=grand-1}: incoming response status=200 OK body={"jsonrpc":"2.0","id":"62c2c7de-e025-4fdb-b615-bb7946bc25d8","result":{"height":"15720881","txs_results":null,"finalize_block_events":null,"validator_updates":null,"consensus_param_updates":{"block":{"max_bytes":"5242880","max_gas":"-1"},"evidence":{"max_age_num_blocks":"100000","max_age_duration":"172800000000000","max_bytes":"1048576"},"validator":{"pub_key_types":["ed25519"]}},"app_hash":"nKlinSRSovQLIAX/VprNAPdNEVmw+ePctUKF0nS4o4s="}}
Oct 23 17:52:05 hermes hermes[2249147]: 2024-10-23T17:52:05.570822Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: subtle encoding error: bad encoding at line 1 column 441, retrying in 1.5s... height=15720881
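
One way to narrow down the "bad encoding" error is to check what sits at the reported column of the logged body. The sketch below, using only the Rust standard library, locates the `app_hash` value in that body and tests whether it could be a hex string. The helper names are hypothetical. The idea that the decoder in '0.37' mode expects hex here while the node sends base64 is an unverified hypothesis, not something confirmed in this thread.

```rust
/// Response body copied verbatim from the DEBUG log line above.
const BODY: &str = r#"{"jsonrpc":"2.0","id":"62c2c7de-e025-4fdb-b615-bb7946bc25d8","result":{"height":"15720881","txs_results":null,"finalize_block_events":null,"validator_updates":null,"consensus_param_updates":{"block":{"max_bytes":"5242880","max_gas":"-1"},"evidence":{"max_age_num_blocks":"100000","max_age_duration":"172800000000000","max_bytes":"1048576"},"validator":{"pub_key_types":["ed25519"]}},"app_hash":"nKlinSRSovQLIAX/VprNAPdNEVmw+ePctUKF0nS4o4s="}}"#;

/// Finds the string value stored under `key` and returns it together with
/// the 1-based column of its closing quote. Naive substring search, good
/// enough for this single-line body; not a general JSON parser.
fn string_value_span<'a>(body: &'a str, key: &str) -> Option<(&'a str, usize)> {
    let needle = format!("\"{}\":\"", key);
    let value_start = body.find(&needle)? + needle.len();
    let value_len = body[value_start..].find('"')?;
    let value = &body[value_start..value_start + value_len];
    Some((value, value_start + value_len + 1))
}

/// True when `s` could be a hex encoding: even length, hex digits only.
fn looks_like_hex(s: &str) -> bool {
    s.len() % 2 == 0 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

fn main() {
    if let Some((value, column)) = string_value_span(BODY, "app_hash") {
        println!("app_hash value: {}", value);
        println!("closing quote at column: {}", column);
        println!("could be hex: {}", looks_like_hex(value));
    }
}
```

The closing quote of `app_hash` lands within one column of the reported column 441, and the value contains characters such as `/`, `+`, and `=` that cannot occur in hex. That points at `app_hash` as the field the parser rejected, which would be worth confirming against the tendermint-rs serializers.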

We're surprised, because this setting did resolve testnet relaying for Astria but hasn't for us. Another possible resolution is bumping the version of tendermint-rs that we rely on, to pick up bug fixes such as those described in the v0.38.10 release notes:

This release fixes a bug in v0.38.x that prevented ABCI responses from being correctly read when upgrading from v0.37.x or below. It also includes a few other bug fixes and performance improvements.

It's unclear whether upgrading the tendermint-rs version would constitute a consensus-breaking change. At the very least, we should understand whether bumping the dep resolves the issue we're seeing.

conorsch commented

Paired with @avahowell to investigate the hermes setup. Turns out that despite the logged error messages, hermes does still properly relay packets. The current penumbra testnet has a short unbonding period, which results in short-lived ibc clients (on the order of 20m or so currently). We confirmed that:

  1. Hermes can create channels fine, and reports no errors
  2. Testnet withdrawals from penumbra-testnet-phobos-2 -> grand-1 are relayed successfully by hermes
  3. Hermes properly posts client update msgs to keep the channel open

The error messages are unfortunate, but also present on the penumbra/osmosis testnet service, which also uses cometbft v0.38.x on the counterparty side. We should rebase Hermes on latest upstream main, but that work should be tracked separately. We're also investigating a plan to publish the Penumbra workspace crates to crates.io, to support upstreaming the Penumbra config into hermes.

Still unresolved is the grpc problem that originally motivated this ticket. But as for the potential of breakage when Noble v8 is released, it appears that hermes operators should add compat_mode = '0.37' to relevant chain configs (i.e., for any chain that's using cometbft v0.38.x), and then relaying will continue to work.
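
For operators, the setting would look roughly like the fragment below in the Hermes config. This is a sketch, not a complete chain entry: only `compat_mode = '0.37'` and the chain ID `grand-1` come from this thread, the `[[chains]]` layout is assumed to match the operator's existing Hermes config, and all other required fields are omitted.

```toml
[[chains]]
id = 'grand-1'
# Setting discussed in this thread, applied to the counterparty chain
# that runs cometbft v0.38.x. Other required fields omitted.
compat_mode = '0.37'
```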

conorsch added a commit to prax-wallet/registry that referenced this issue Oct 23, 2024
@zbuc zbuc self-assigned this Nov 4, 2024