Node hangs when started on broken network, after worker tries to shut down #3175

Closed
teor2345 opened this issue Oct 28, 2024 · 12 comments · Fixed by #3176 or #3177
teor2345 (Member) commented Oct 28, 2024

On macOS 13.7 on M1 Max, I ran the following command:

./subspace-node run --chain devnet --tmp --bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30333/p2p/12D3KooWKd4qisMjuBXU4YvpbJgepiX278iumqMjCC3GiUjjZEZS --dsn-bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30533/p2p/12D3KooWSKAQm66N7jQme72obbPPquBSZuSqMyx5mH5ki1aZ4pGv --pot-external-entropy 0xddca5a66 --farmer

(Yes, I know this network is partly shut down!)

Using the binaries from:
https://github.com/autonomys/subspace/actions/runs/11418607109

This caused the node to hang after snap sync panicked and the worker tried to shut down. After the panic, the node didn't exit when I pressed Ctrl-C, or when I used kill (SIGTERM) on its pid.

The logs were:

2024-10-28T00:20:47.863466Z  INFO subspace_node::commands::run: Subspace
2024-10-28T00:20:47.863993Z  INFO subspace_node::commands::run: ✌️  version 0.1.0-38f06b5d94a
2024-10-28T00:20:47.864005Z  INFO subspace_node::commands::run: ❤️  by Subspace Labs <https://subspace.network>
2024-10-28T00:20:47.864010Z  INFO subspace_node::commands::run: 📋 Chain specification: Subspace Dev network
2024-10-28T00:20:47.864013Z  INFO subspace_node::commands::run: 🏷  Node name: tawdry-branch-7197
2024-10-28T00:20:47.864017Z  INFO subspace_node::commands::run: 💾 Node path: /var/folders/ys/4_gnfm5j1_1g7h7v0r2_k_2m0000gp/T/subspace-node-gwOSXZ
2024-10-28T00:20:48.535420Z  INFO Consensus: sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xe647…796a, header-hash: 0x1666…c57d)    
2024-10-28T00:20:49.343101Z  INFO Consensus: subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA protocol_version=/subspace/2/1666067b80762b374703f5267360f8c63c0a8c068579b63c7455c6b2302ec57d
2024-10-28T00:20:49.345350Z  INFO Consensus: libp2p_swarm: local_peer_id=12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.346760Z  INFO Consensus: subspace_service: Subspace networking initialized: Node ID is 12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.348459Z  INFO Consensus: subspace_service: DSN listening on /ip4/127.0.0.1/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.349216Z  INFO Consensus: block_relay: relay::consensus block server: starting
2024-10-28T00:20:49.350207Z  INFO Consensus: subspace_service: DSN listening on /ip6/::1/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.350222Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.222/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.350232Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.2/tcp/30433/p2p/12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA
2024-10-28T00:20:49.350967Z  INFO Consensus: sub-libp2p: 🏷  Local node identity is: 12D3KooWD2szwQe5XQaGf4FvpLSKPaPNgGoHidTBXDVJDTUZTwKA    
2024-10-28T00:20:49.351027Z  INFO Consensus: sub-libp2p: Running libp2p network backend    
2024-10-28T00:20:49.353106Z  WARN Consensus: sub-libp2p: Can't listen on /ip4/0.0.0.0/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:20:49.353555Z  WARN Consensus: sub-libp2p: Can't listen on /ip6/::/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:20:49.353697Z  INFO Consensus: sc_consensus_subspace::archiver: Starting archiving from genesis
2024-10-28T00:20:49.353722Z  INFO Consensus: sc_consensus_subspace::archiver: Archiving already produced blocks 0..=0
2024-10-28T00:20:50.272233Z  WARN Consensus: libp2p_kad::behaviour: Failed to trigger bootstrap: No known peers.
thread 'tokio-runtime-worker' panicked at /Users/hetzner/actions-runner/_work/subspace/subspace/crates/subspace-service/src/sync_from_dsn/snap_sync.rs:169:9:
Snap sync is impossible - not enough archived history: wipe the DB folder and rerun with --sync=full
stack backtrace:
2024-10-28T00:21:24.583020Z  INFO Consensus: subspace: 🧑‍🌾 Starting Subspace Authorship worker
2024-10-28T00:21:24.583073Z  INFO Consensus: sc_sysinfo: 💻 Operating system: macos    
2024-10-28T00:21:24.583077Z  INFO Consensus: sc_sysinfo: 💻 CPU architecture: aarch64    
2024-10-28T00:21:24.583161Z  INFO Consensus: subspace_service::task_spawner: 📦 Highest known block at #0
2024-10-28T00:21:24.584955Z  INFO Consensus: sc_rpc_server: Running JSON-RPC server: addr=127.0.0.1:59846,    
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: subspace_service::sync_from_dsn::snap_sync::sync::{{closure}}
   3: subspace_service::new_full::{{closure}}::{{closure}}
   4: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
   5: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
   6: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-10-28T00:21:24.710924Z ERROR Consensus: sc_service::task_manager: Essential task `worker` failed. Shutting down service.

I'll re-run to get a full stack trace, and also see which threads/stacks are still active.

teor2345 added the bug (Something isn't working) and node (Node (service library/node app)) labels on Oct 28, 2024
teor2345 self-assigned this on Oct 28, 2024
nazar-pc (Member):

Did you try Ctrl+C after the above panic? We'll want to fix that panic for sure, but I suspect it is the reason Ctrl+C didn't work; it probably broke the signal handler somehow.

teor2345 (Member, Author):

> The node didn't exit when I pressed Ctrl-C

teor2345 (Member, Author) commented Oct 28, 2024

The panic is in snap sync, saying we don't have enough archived history. (The DB folder should be empty, because I used --tmp.)

This appears to be a race condition: the second time I ran it, the node exited with an error:

$ export RUST_BACKTRACE=full
$ ./subspace-node run --chain devnet --tmp --bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30333/p2p/12D3KooWKd4qisMjuBXU4YvpbJgepiX278iumqMjCC3GiUjjZEZS --dsn-bootstrap-node /dns/bootstrap-0.devnet.subspace.foundation/tcp/30533/p2p/12D3KooWSKAQm66N7jQme72obbPPquBSZuSqMyx5mH5ki1aZ4pGv --pot-external-entropy 0xddca5a66 --farmer
2024-10-28T00:30:43.516307Z  INFO subspace_node::commands::run: Subspace
2024-10-28T00:30:43.516318Z  INFO subspace_node::commands::run: ✌️  version 0.1.0-38f06b5d94a
2024-10-28T00:30:43.516320Z  INFO subspace_node::commands::run: ❤️  by Subspace Labs <https://subspace.network>
2024-10-28T00:30:43.516322Z  INFO subspace_node::commands::run: 📋 Chain specification: Subspace Dev network
2024-10-28T00:30:43.516323Z  INFO subspace_node::commands::run: 🏷  Node name: frightened-bridge-3106
2024-10-28T00:30:43.516325Z  INFO subspace_node::commands::run: 💾 Node path: /var/folders/ys/4_gnfm5j1_1g7h7v0r2_k_2m0000gp/T/subspace-node-q78nt5
2024-10-28T00:30:43.887243Z  INFO Consensus: sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0xe647…796a, header-hash: 0x1666…c57d)    
2024-10-28T00:30:44.311245Z  INFO Consensus: subspace_networking::constructor: DSN instance configured. allow_non_global_addresses_in_dht=false peer_id=12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K protocol_version=/subspace/2/1666067b80762b374703f5267360f8c63c0a8c068579b63c7455c6b2302ec57d
2024-10-28T00:30:44.311438Z  INFO Consensus: libp2p_swarm: local_peer_id=12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.311813Z  INFO Consensus: subspace_service: Subspace networking initialized: Node ID is 12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.311920Z  INFO Consensus: block_relay: relay::consensus block server: starting
2024-10-28T00:30:44.311929Z  INFO Consensus: subspace_service: DSN listening on /ip4/127.0.0.1/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312010Z  INFO Consensus: subspace_service: DSN listening on /ip6/::1/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312019Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.222/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312033Z  INFO Consensus: subspace_service: DSN listening on /ip4/192.168.18.2/tcp/30433/p2p/12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K
2024-10-28T00:30:44.312097Z  INFO Consensus: sub-libp2p: 🏷  Local node identity is: 12D3KooWRxEZdLgr5guC91bXj2HtKeYTQiFKk2AGtjwGegPJbr1K    
2024-10-28T00:30:44.312104Z  INFO Consensus: sub-libp2p: Running libp2p network backend    
2024-10-28T00:30:44.312453Z  WARN Consensus: sub-libp2p: Can't listen on /ip4/0.0.0.0/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:30:44.312496Z  WARN Consensus: sub-libp2p: Can't listen on /ip6/::/tcp/30333 because: Other(Custom { kind: Other, error: Custom { kind: Other, error: Other(Left(Left(Left(Left(Right(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" }))))))) } })    
2024-10-28T00:30:44.312619Z  INFO Consensus: sc_consensus_subspace::archiver: Starting archiving from genesis
2024-10-28T00:30:44.312634Z  INFO Consensus: sc_consensus_subspace::archiver: Archiving already produced blocks 0..=0
2024-10-28T00:30:45.227849Z  WARN Consensus: libp2p_kad::behaviour: Failed to trigger bootstrap: No known peers.
2024-10-28T00:30:52.045921Z  INFO Consensus: subspace: 🧑‍🌾 Starting Subspace Authorship worker
2024-10-28T00:30:52.045960Z  INFO Consensus: sc_sysinfo: 💻 Operating system: macos    
2024-10-28T00:30:52.045963Z  INFO Consensus: sc_sysinfo: 💻 CPU architecture: aarch64    
2024-10-28T00:30:52.045994Z  INFO Consensus: subspace_service::task_spawner: 📦 Highest known block at #0
thread 'tokio-runtime-worker' panicked at /Users/hetzner/actions-runner/_work/subspace/subspace/crates/subspace-service/src/sync_from_dsn/snap_sync.rs:169:9:
Snap sync is impossible - not enough archived history: wipe the DB folder and rerun with --sync=full
stack backtrace:
2024-10-28T00:30:52.046449Z  INFO Consensus: sc_rpc_server: Running JSON-RPC server: addr=127.0.0.1:61647,    
   0:        0x1033b5e7c - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h48e477a2a6df82fa
   1:        0x1029bdfa0 - core::fmt::write::ha2b7a9a42bc45315
   2:        0x1033af4e4 - std::io::Write::write_fmt::h864ab19bb099f451
   3:        0x1033c5978 - std::panicking::default_hook::{{closure}}::h22d4e057cb9d116f
   4:        0x1033c66bc - std::panicking::rust_panic_with_hook::h0a56be67e390565a
   5:        0x1033b62f0 - std::panicking::begin_panic_handler::{{closure}}::hc09ab2aa629b2aba
   6:        0x1033b6288 - std::sys::backtrace::__rust_end_short_backtrace::h3891c78f0c9067f6
   7:        0x1033c5cf4 - _rust_begin_unwind
   8:        0x10434cef4 - core::panicking::panic_fmt::hb948a9c9b33f98e8
   9:        0x1035f5e64 - subspace_service::sync_from_dsn::snap_sync::sync::{{closure}}::hfe6011ba45229784
  10:        0x1035f160c - subspace_service::new_full::{{closure}}::{{closure}}::h82ae7daa6f602f33
  11:        0x103203ca0 - <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll::he85d9384076bdddf
  12:        0x10320640c - <tracing_futures::Instrumented<T> as core::future::future::Future>::poll::h23279404acc12293
  13:        0x1031f164c - tokio::runtime::task::raw::poll::h7a299ca9ca38aae1
  14:        0x1040410cc - std::sys::backtrace::__rust_begin_short_backtrace::h64a6dbaa1e51bcce
  15:        0x1040391c8 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h703a83a7d5d5ae69
  16:        0x1033bb160 - std::sys::pal::unix::thread::Thread::new::thread_start::h8578131b5b53b3d0
  17:        0x195a33fa8 - __pthread_joiner_wake
2024-10-28T00:30:52.051577Z ERROR Consensus: sc_service::task_manager: Essential task `worker` failed. Shutting down service.    
2024-10-28T00:30:52.052481Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2024-10-28T00:30:52.052488Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2024-10-28T00:30:52.052821Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2024-10-28T00:30:52.052823Z  WARN trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
Error: SubstrateService(Other("Essential task failed."))

nazar-pc (Member):

I understand that the panic happened; I'm asking about when you were trying to press Ctrl+C: before or after the panic?

teor2345 (Member, Author) commented Oct 28, 2024

> I understand that the panic happened; I'm asking about when you were trying to press Ctrl+C: before or after the panic?

After the panic. Our signal handler is an async future run by the tokio runtime, so as soon as the runtime starts to shut down, it stops working. (Runtimes drop pending tasks when they shut down, then wait for running tasks to yield.)

That's why in PR #3170 I launched a Ctrl-C handler in another runtime. Maybe that's something we could think about here, too.
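
Roughly, the idea is something like this (a minimal sketch, not necessarily what #3170 actually does; the function name and exit code are placeholders):

```rust
use std::process;
use std::thread;

use tokio::runtime::Builder;
use tokio::signal;

// Sketch: a dedicated thread with its own single-threaded runtime that only
// waits for Ctrl-C, independently of the main node runtime. Shutting down the
// main runtime can't starve this future.
fn spawn_ctrl_c_exit_handler() {
    thread::spawn(|| {
        let runtime = Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build Ctrl-C runtime");
        runtime.block_on(async {
            signal::ctrl_c().await.expect("failed to listen for Ctrl-C");
            eprintln!("Ctrl-C received, exiting");
            // Exit immediately, even if the main runtime is stuck shutting down.
            process::exit(130);
        });
    });
}
```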

nazar-pc (Member):

In general we don't use .unwrap() and do not expect any panics to happen at runtime at all. So I think the fix here is to remove the possibility of the panic occurring; then the handler should work as well. I find it very annoying that Tokio doesn't crash the whole app when a panic happens. I wish we just used panic = "abort" (which hypothetically allows for interesting optimizations as well), but apparently Substrate's runtime still requires unwinding (and I would prefer not to do special casing just for the runtime).

teor2345 (Member, Author) commented Oct 28, 2024

The stack trace of the stuck node isn't that helpful; all it shows is that the tokio runtime is blocked waiting for an async task to yield or a spawn_blocking() thread to finish:
[flamegraph image]

(Any task or thread could have yielded to tokio and been in the stack trace below the tokio task reaper.)

teor2345 (Member, Author):

> I find it very annoying that Tokio doesn't crash the whole app when a panic happens

We could build our runtimes with Builder::unhandled_panic(UnhandledPanic::ShutdownRuntime)?
https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.unhandled_panic
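
Something like this minimal sketch (assuming I'm reading the docs right; it requires building with RUSTFLAGS="--cfg tokio_unstable"):

```rust
use tokio::runtime::{Builder, Runtime, UnhandledPanic};

// Sketch only: unhandled_panic() is behind the unstable tokio_unstable cfg,
// and is currently only accepted on the current-thread runtime.
fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_current_thread()
        .enable_all()
        // Shut the runtime down as soon as any spawned task panics,
        // instead of silently swallowing the panic.
        .unhandled_panic(UnhandledPanic::ShutdownRuntime)
        .build()
}
```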

nazar-pc (Member):

Actually yes, we should probably do that. The runtime is typically created by Substrate and not customizable, but we create it using much lower-level APIs and we have full control over it. I think we should use that for all apps.

teor2345 (Member, Author):

> Actually yes, we should probably do that. The runtime is typically created by Substrate and not customizable, but we create it using much lower-level APIs and we have full control over it. I think we should use that for all apps.

Oh, actually, that method is currently only supported in the current-thread runtime, which we can't use:
https://docs.rs/tokio/latest/src/tokio/runtime/builder.rs.html#1021

teor2345 (Member, Author):

We could install a panic hook that exits the process instead?
(After running the default hook, which logs a panic message and a backtrace if that's configured.)
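
For example, a minimal sketch (the helper name and exit code are placeholders):

```rust
use std::panic;
use std::process;

// Sketch: run the default panic hook first (it prints the panic message and,
// if RUST_BACKTRACE is set, a backtrace), then exit the whole process so a
// panicked essential task can't leave the node hanging.
fn install_exit_on_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |panic_info| {
        default_hook(panic_info);
        process::exit(1);
    }));
}
```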

nazar-pc (Member):

And it is also unstable. I really don't like these implicit and statically uncheckable Tokio APIs. They are easy to write, but blow up at runtime, which is exactly the kind of thing I'm able to avoid in Rust most of the time, except when dealing with Tokio and Tracing.

Exiting process on panic is fine by me.
