Node hangs when started on broken network, after worker tries to shut down #3175

Closed
teor2345 opened this issue Oct 28, 2024 · 12 comments · Fixed by #3176 or #3177
teor2345 (Member) commented Oct 28, 2024

On macOS 13.7 on M1 Max, I ran the following command:

./subspace-node run --chain devnet --tmp --bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30333/p2p/12D3KooWKd4qisMjuBXU4YvpbJgepiX278iumqMjCC3GiUjjZEZS --dsn-bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30533/p2p/12D3KooWSKAQm66N7jQme72obbPPquBSZuSqMyx5mH5ki1aZ4pGv --pot-external-entropy 0xddca5a66 --farmer

(Yes, I know this network is partly shut down!)

Using the binaries from:
https://github.com/autonomys/subspace/actions/runs/11418607109

This caused the node to hang after snap sync panicked and the worker tried to shut down. After the panic, the node didn't exit when I pressed Ctrl-C, or when I used kill (SIGTERM) on its pid.

The logs were:

2024-10-28T00:20:47.863466Z  INFO subspace_node::commands::run: Subspace
2024-10-28T00:20:47.863993Z  INFO subspace_node::commands::run: ✌️  version 0.1.0-38f06b5d94a
2024-10-28T00:20:47.864005Z  INFO subspace_node::commands::run: ❤️  by Subspace Labs <https://subspace.network>
2024-10-28T00:20:47.864010Z  INFO subspace_node::commands::run: 📋 Chain specification: Subspace Dev network
2024-10-28T00:20:47.864013Z  INFO subspace_node::commands::run: 🏷  Node name: tawdry-branch-7197
2024-10-28T00:20:47.864017Z  INFO subspace_node::commands::run: 💾 Node path: /var/folders/ys/4_gnfm5j1_1g7h7v0r2_k_2m0000gp/T/subspace-node-gwOSXZ
2024-10-28T00:20:48.535420Z  INFO Consensus: sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xe647…796a, header-hash: 0x1666…c57d)    
2024-10-28T00:20:49.343101Z  INFO Consensus: subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA protocol_version=/subspace/2/1666067b80762b374703f5267360f8c63c0a8c068579b63c7455c6b2302ec57d
2024-10-28T00:20:49.345350Z  INFO Consensus: libp2p_swarm: local_peer_id=12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.346760Z  INFO Consensus: subspace_service: Subspace networking initialized: Node ID is 12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.348459Z  INFO Consensus: subspace_service: DSN listening on /ip4/127.0.0.1/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.349216Z  INFO Consensus: block_relay: relay::consensus block server: starting
2024-10-28T00:20:49.350207Z  INFO Consensus: subspace_service: DSN listening on /ip6/::1/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.350222Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.222/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.350232Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.2/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.350967Z  INFO Consensus: sub-libp2p: 🏷  Local node identity is: 12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA    
2024-10-28T00:20:49.351027Z  INFO Consensus: sub-libp2p: Running libp2p network backend    
2024-10-28T00:20:49.353106Z  WARN Consensus: sub-libp2p: Can't listen on /ip4/0.0.0.0/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:20:49.353555Z  WARN Consensus: sub-libp2p: Can't listen on /ip6/::/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:20:49.353697Z  INFO Consensus: sc_consensus_subspace::archiver: Starting archiving from genesis
2024-10-28T00:20:49.353722Z  INFO Consensus: sc_consensus_subspace::archiver: Archiving already produced blocks 0..=0
2024-10-28T00:20:50.272233Z  WARN Consensus: libp2p_kad::behaviour: Failed to trigger bootstrap: No known peers.
thread 'tokio-runtime-worker' panicked at /Users/hetzner/actions-runner/_work/subspace/subspace/crates/subspace-service/src/sync_from_dsn/snap_sync.rs:169:9:
Snap sync is impossible - not enough archived history: wipe the DB folder and rerun with --sync=full
stack backtrace:
2024-10-28T00:21:24.583020Z  INFO Consensus: subspace: 🧑‍🌾 Starting Subspace Authorship worker
2024-10-28T00:21:24.583073Z  INFO Consensus: sc_sysinfo: 💻 Operating system: macos    
2024-10-28T00:21:24.583077Z  INFO Consensus: sc_sysinfo: 💻 CPU architecture: aarch64    
2024-10-28T00:21:24.583161Z  INFO Consensus: subspace_service::task_spawner: 📦 Highest known block at #0
2024-10-28T00:21:24.584955Z  INFO Consensus: sc_rpc_server: Running JSON-RPC server: addr=127.0.0.1:59846,    
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: subspace_service::sync_from_dsn::snap_sync::sync::{{closure}}
   3: subspace_service::new_full::{{closure}}::{{closure}}
   4: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
   5: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
   6: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-10-28T00:21:24.710924Z ERROR Consensus: sc_service::task_manager: Essential task `worker` failed. Shutting down service.

I'll re-run to get a full stack trace, and also see which threads/stacks are still active.

teor2345 added the bug (Something isn't working) and node (Node (service library/node app)) labels on Oct 28, 2024
teor2345 self-assigned this on Oct 28, 2024
nazar-pc (Member):

Did you try Ctrl+C after the above panic? We'll want to fix that panic for sure, but I suspect it is the reason Ctrl+C didn't work; it probably broke the signal handler somehow.

teor2345 (Member, Author):

> The node didn't exit when I pressed Ctrl-C

teor2345 (Member, Author) commented Oct 28, 2024

The panic is in snap sync, saying we don't have enough archived history. (The DB folder should be empty, because I used --tmp.)

This appears to be a race condition: the second time I ran it, the node exited with an error:

$ export RUST_BACKTRACE=full
$ ./subspace-node run --chain devnet --tmp --bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30333/p2p/12D3KooWKd4qisMjuBXU4YvpbJgepiX278iumqMjCC3GiUjjZEZS --dsn-bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30533/p2p/12D3KooWSKAQm66N7jQme72obbPPquBSZuSqMyx5mH5ki1aZ4pGv --pot-external-entropy 0xddca5a66 --farmer
2024-10-28T00:30:43.516307Z  INFO subspace_node::commands::run: Subspace
2024-10-28T00:30:43.516318Z  INFO subspace_node::commands::run: ✌️  version 0.1.0-38f06b5d94a
2024-10-28T00:30:43.516320Z  INFO subspace_node::commands::run: ❤️  by Subspace Labs <https://subspace.network>
2024-10-28T00:30:43.516322Z  INFO subspace_node::commands::run: 📋 Chain specification: Subspace Dev network
2024-10-28T00:30:43.516323Z  INFO subspace_node::commands::run: 🏷  Node name: frightened-bridge-3106
2024-10-28T00:30:43.516325Z  INFO subspace_node::commands::run: 💾 Node path: /var/folders/ys/4_gnfm5j1_1g7h7v0r2_k_2m0000gp/T/subspace-node-q78nt5
2024-10-28T00:30:43.887243Z  INFO Consensus: sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xe647…796a, header-hash: 0x1666…c57d)    
2024-10-28T00:30:44.311245Z  INFO Consensus: subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K protocol_version=/subspace/2/1666067b80762b374703f5267360f8c63c0a8c068579b63c7455c6b2302ec57d
2024-10-28T00:30:44.311438Z  INFO Consensus: libp2p_swarm: local_peer_id=12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.311813Z  INFO Consensus: subspace_service: Subspace networking initialized: Node ID is 12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.311920Z  INFO Consensus: block_relay: relay::consensus block server: starting
2024-10-28T00:30:44.311929Z  INFO Consensus: subspace_service: DSN listening on /ip4/127.0.0.1/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312010Z  INFO Consensus: subspace_service: DSN listening on /ip6/::1/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312019Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.222/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312033Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.2/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312097Z  INFO Consensus: sub-libp2p: 🏷  Local node identity is: 12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K    
2024-10-28T00:30:44.312104Z  INFO Consensus: sub-libp2p: Running libp2p network backend    
2024-10-28T00:30:44.312453Z  WARN Consensus: sub-libp2p: Can't listen on /ip4/0.0.0.0/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:30:44.312496Z  WARN Consensus: sub-libp2p: Can't listen on /ip6/::/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:30:44.312619Z  INFO Consensus: sc_consensus_subspace::archiver: Starting archiving from genesis
2024-10-28T00:30:44.312634Z  INFO Consensus: sc_consensus_subspace::archiver: Archiving already produced blocks 0..=0
2024-10-28T00:30:45.227849Z  WARN Consensus: libp2p_kad::behaviour: Failed to trigger bootstrap: No known peers.
2024-10-28T00:30:52.045921Z  INFO Consensus: subspace: 🧑‍🌾 Starting Subspace Authorship worker
2024-10-28T00:30:52.045960Z  INFO Consensus: sc_sysinfo: 💻 Operating system: macos    
2024-10-28T00:30:52.045963Z  INFO Consensus: sc_sysinfo: 💻 CPU architecture: aarch64    
2024-10-28T00:30:52.045994Z  INFO Consensus: subspace_service::task_spawner: 📦 Highest known block at #0
thread 'tokio-runtime-worker' panicked at /Users/hetzner/actions-runner/_work/subspace/subspace/crates/subspace-service/src/sync_from_dsn/snap_sync.rs:169:9:
Snap sync is impossible - not enough archived history: wipe the DB folder and rerun with --sync=full
stack backtrace:
2024-10-28T00:30:52.046449Z  INFO Consensus: sc_rpc_server: Running JSON-RPC server: addr=127.0.0.1:61647,    
   0:        0x1033b5e7c - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h48e477a2a6df82fa
   1:        0x1029bdfa0 - core::fmt::write::ha2b7a9a42bc45315
   2:        0x1033af4e4 - std::io::Write::write_fmt::h864ab19bb099f451
   3:        0x1033c5978 - std::panicking::default_hook::{{closure}}::h22d4e057cb9d116f
   4:        0x1033c66bc - std::panicking::rust_panic_with_hook::h0a56be67e390565a
   5:        0x1033b62f0 - std::panicking::begin_panic_handler::{{closure}}::hc09ab2aa629b2aba
   6:        0x1033b6288 - std::sys::backtrace::__rust_end_short_backtrace::h3891c78f0c9067f6
   7:        0x1033c5cf4 - _rust_begin_unwind
   8:        0x10434cef4 - core::panicking::panic_fmt::hb948a9c9b33f98e8
   9:        0x1035f5e64 - subspace_service::sync_from_dsn::snap_sync::sync::{{closure}}::hfe6011ba45229784
  10:        0x1035f160c - subspace_service::new_full::{{closure}}::{{closure}}::h82ae7daa6f602f33
  11:        0x103203ca0 - <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll::he85d9384076bdddf
  12:        0x10320640c - <tracing_futures::Instrumented<T> as core::future::future::Future>::poll::h23279404acc12293
  13:        0x1031f164c - tokio::runtime::task::raw::poll::h7a299ca9ca38aae1
  14:        0x1040410cc - std::sys::backtrace::__rust_begin_short_backtrace::h64a6dbaa1e51bcce
  15:        0x1040391c8 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h703a83a7d5d5ae69
  16:        0x1033bb160 - std::sys::pal::unix::thread::Thread::new::thread_start::h8578131b5b53b3d0
  17:        0x195a33fa8 - __pthread_joiner_wake
2024-10-28T00:30:52.051577Z ERROR Consensus: sc_service::task_manager: Essential task `worker` failed. Shutting down service.    
2024-10-28T00:30:52.052481Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2024-10-28T00:30:52.052488Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2024-10-28T00:30:52.052821Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2024-10-28T00:30:52.052823Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
Error: SubstrateService(Other("Essential task failed."))

nazar-pc (Member):

I understand that the panic happened; I'm asking about when you were trying to press Ctrl+C: before or after the panic?

teor2345 (Member, Author) commented Oct 28, 2024

> I understand that the panic happened; I'm asking about when you were trying to press Ctrl+C: before or after the panic?

After the panic. Our signal handler is an async future run by the tokio runtime, so as soon as the runtime starts to shut down, it stops working. (Runtimes drop pending tasks when they shut down, then wait for running tasks to yield.)

That's why in PR #3170 I launched a Ctrl-C handler in another runtime. Maybe that's something we could think about here, too.
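
Roughly, the idea is something like this (a minimal sketch, not necessarily what #3170 actually does; the function name and exit code are placeholders):

```rust
use std::process;
use std::thread;

use tokio::runtime::Builder;
use tokio::signal;

// Sketch: a dedicated thread with its own single-threaded runtime that only
// waits for Ctrl-C, independently of the main node runtime. Shutting down the
// main runtime can't starve this future.
fn spawn_ctrl_c_exit_handler() {
    thread::spawn(|| {
        let runtime = Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build Ctrl-C runtime");
        runtime.block_on(async {
            signal::ctrl_c().await.expect("failed to listen for Ctrl-C");
            eprintln!("Ctrl-C received, exiting");
            // Exit immediately, even if the main runtime is stuck shutting down.
            process::exit(130);
        });
    });
}
```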

nazar-pc (Member):

In general we don't use .unwrap() and do not expect any panics to happen at runtime at all. So I think the fix here is to remove the possibility of the panic occurring; then the handler should work as well. I find it very annoying that Tokio doesn't crash the whole app when a panic happens. I wish we just used panic = "abort" (which hypothetically allows for interesting optimizations as well), but apparently Substrate's runtime still requires unwinding (and I would prefer not to do special casing just for the runtime).

teor2345 (Member, Author) commented Oct 28, 2024

The stack trace of the stuck node isn't that helpful; all it shows is that the tokio runtime is blocked waiting for an async task to yield or a spawn_blocking() thread to finish:
[flamegraph image]

(Any task or thread could have yielded to tokio and been in the stack trace below the tokio task reaper.)

teor2345 (Member, Author):

> I find it very annoying that Tokio doesn't crash the whole app when a panic happens

We could build our runtimes with Builder::unhandled_panic(UnhandledPanic::ShutdownRuntime)?
https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.unhandled_panic
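
Something like this minimal sketch (assuming I'm reading the docs right; it requires building with RUSTFLAGS="--cfg tokio_unstable"):

```rust
use tokio::runtime::{Builder, Runtime, UnhandledPanic};

// Sketch only: unhandled_panic() is behind the unstable tokio_unstable cfg,
// and is currently only accepted on the current-thread runtime.
fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_current_thread()
        .enable_all()
        // Shut the runtime down as soon as any spawned task panics,
        // instead of silently swallowing the panic.
        .unhandled_panic(UnhandledPanic::ShutdownRuntime)
        .build()
}
```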

nazar-pc (Member):

Actually yes, we should probably do that. The runtime is typically created by Substrate and not customizable, but we create it using much lower-level APIs and we have full control over it. I think we should use that for all apps.

teor2345 (Member, Author):

> Actually yes, we should probably do that. The runtime is typically created by Substrate and not customizable, but we create it using much lower-level APIs and we have full control over it. I think we should use that for all apps.

Oh, actually, that method is currently only supported in the current-thread runtime, which we can't use:
https://docs.rs/tokio/latest/src/tokio/runtime/builder.rs.html#1021

teor2345 (Member, Author):

We could install a panic hook that exits the process instead?
(After running the default hook, which logs a panic message and a backtrace if that's configured.)
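
For example, a minimal sketch (the helper name and exit code are placeholders):

```rust
use std::panic;
use std::process;

// Sketch: run the default panic hook first (it prints the panic message and,
// if RUST_BACKTRACE is set, a backtrace), then exit the whole process so a
// panicked essential task can't leave the node hanging.
fn install_exit_on_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |panic_info| {
        default_hook(panic_info);
        process::exit(1);
    }));
}
```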

nazar-pc (Member):

And it is also unstable. I really don't like these implicit and statically uncheckable Tokio APIs. They are easy to write, but blow up at runtime, which is exactly the kind of thing I'm able to avoid in Rust most of the time, except when dealing with Tokio and Tracing.

Exiting process on panic is fine by me.
