global write-lock held for two minutes by RPC next_receiving_address. #223

Open
dan-da opened this issue Nov 3, 2024 · 2 comments

dan-da commented Nov 3, 2024

I've begun experimenting with logging the duration that the GlobalState write-lock is held, and I encountered a behavior I can't yet explain.

Anything that holds the write-lock effectively blocks all other tasks, including the RPC server, until it completes, so shorter hold times are always better.

Some of the RPC timeout behavior observed and documented in issue #219 made me think that the write-lock is probably being held too long. To find out where, I modified AtomicRw to log the duration and the caller location when the write-lock guard gets dropped, but only if the duration exceeds 100 milliseconds, so we only log slow lock holders.
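
For reference, here is a minimal sketch of the idea (hedged: this is not the actual AtomicRw code in neptune-core, and names like MAX_WRITE_LOCK_HOLD and LoggedWriteGuard are illustrative only): wrap tokio's write guard in a type that remembers when and where it was acquired, and warn on Drop if the hold exceeded the threshold.

    use std::panic::Location;
    use std::time::{Duration, Instant};
    use tokio::sync::{RwLock, RwLockWriteGuard};

    const MAX_WRITE_LOCK_HOLD: Duration = Duration::from_millis(100);

    /// Wraps tokio's write guard, remembering when and where it was acquired.
    struct LoggedWriteGuard<'a, T> {
        guard: RwLockWriteGuard<'a, T>,
        acquired_at: Instant,
        location: &'static Location<'static>,
    }

    impl<T> std::ops::Deref for LoggedWriteGuard<'_, T> {
        type Target = T;
        fn deref(&self) -> &T {
            &self.guard
        }
    }

    impl<T> std::ops::DerefMut for LoggedWriteGuard<'_, T> {
        fn deref_mut(&mut self) -> &mut T {
            &mut self.guard
        }
    }

    impl<T> Drop for LoggedWriteGuard<'_, T> {
        // Logging in Drop catches every release path: normal scope exit,
        // early returns, and unwinding.
        fn drop(&mut self) {
            let held = self.acquired_at.elapsed();
            if held > MAX_WRITE_LOCK_HOLD {
                tracing::warn!(
                    "write-lock held for {:?} (exceeds max: {:?})  location: {}",
                    held,
                    MAX_WRITE_LOCK_HOLD,
                    self.location
                );
            }
        }
    }

    struct AtomicRw<T>(RwLock<T>);

    impl<T> AtomicRw<T> {
        // #[track_caller] sits on the outer, non-async fn so that
        // Location::caller() records the call site of lock_guard_mut()
        // rather than a location inside this wrapper.
        #[track_caller]
        fn lock_guard_mut<'a>(
            &'a self,
        ) -> impl std::future::Future<Output = LoggedWriteGuard<'a, T>> + 'a {
            let location = Location::caller();
            async move {
                LoggedWriteGuard {
                    guard: self.0.write().await,
                    acquired_at: Instant::now(),
                    location,
                }
            }
        }
    }

Measuring from acquisition to Drop is what makes the reported duration pure hold time, independent of any time spent waiting for the lock.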

I then followed the steps in issue #219 and observed an RPC timeout after visiting the Receive screen. Some time later, I found this warning in the log:

2024-11-02T21:41:48.416065456Z  WARN ThreadId(126) RPC{rpc.trace_id=00 rpc.deadline=2024-11-02T21:39:56.658040570Z otel.kind="server" otel.name="RPC.next_receiving_address"}: neptune_core::locks::tokio::atomic_rw: write-lock held for 121.75746 seconds. (exceeds max: 100 milli)  location: /home/danda/neptune-core/src/rpc_server.rs:621:64

The referenced code that acquires the write-lock is:

    async fn next_receiving_address(
        mut self,
        _context: tarpc::context::Context,
        key_type: KeyType,
    ) -> ReceivingAddress {
        let mut global_state_mut = self.state.lock_guard_mut().await;

        let address = global_state_mut
            .wallet_state
            .next_unused_spending_key(key_type)
            .to_address();

        // persist wallet state to disk
        global_state_mut.persist_wallet().await.expect("flushed");

        address
    }

It is very strange to me that this code is holding the write-lock for 2 minutes.

(Note: after an hour of running a node, mining 2 blocks, and attempting to send a tx, there are fewer than 10 warnings in the log, and the rest are under 1 second, so 2 minutes is a huge anomaly.)

Observations:

  1. The 2 minutes elapse after we have already acquired the lock, so the figure does not include any time spent waiting to acquire it; it only covers actual processing (or sleeps) until the lock is released. That work should take only a few milliseconds, if that.
  2. The write-lock is held across calls to both next_unused_spending_key() and persist_wallet(), so seemingly one of those is the offender.
  3. The primary work in next_unused_spending_key() is deriving a Generation address at index 0. This is done over and over again in tests and never takes 2 minutes (that I have ever observed).
  4. persist_wallet() may be a suspect. Perhaps the DB could be doing some kind of compaction or maintenance? Strange, though, for a DB that has just been created and should be tiny.
  5. If we acquired the lock separately for each call, it would be obvious which one causes it (if we can ever recreate this); see the sketch after this list.
  6. Hypothesis: perhaps a proving task was executing at this time and hogging so much CPU that this task didn't make progress. Two minutes seems awfully long, though, and the machine (mjolnir) did not seem to be in any kind of frozen state.
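
A purely diagnostic variant of the handler quoted above could take the write-lock twice, once per call, so the slow-write-lock warning would name the specific offender. This is a hedged sketch, not a proposed change; it assumes the surrounding types behave exactly as in the current code:

    async fn next_receiving_address(
        mut self,
        _context: tarpc::context::Context,
        key_type: KeyType,
    ) -> ReceivingAddress {
        // First hold: derive the next unused spending key.
        let address = {
            let mut global_state_mut = self.state.lock_guard_mut().await;
            global_state_mut
                .wallet_state
                .next_unused_spending_key(key_type)
                .to_address()
        }; // write-lock released here

        // Second hold: persist wallet state to disk.
        {
            let mut global_state_mut = self.state.lock_guard_mut().await;
            global_state_mut.persist_wallet().await.expect("flushed");
        } // write-lock released here

        address
    }

Note the trade-off: releasing the lock between the two steps means another task could in principle touch the wallet state between key derivation and persistence, so this split is only useful for narrowing down the measurement, not as a fix.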

For now my plan is to complete development of the lock logging feature and just monitor logs for any slow-write-lock warnings. Maybe this is not repeatable.

Also: I noticed that this particular RPC call logged warnings of up to 0.3 seconds on the other 2 nodes. Maybe deriving the Generation address is somehow more expensive than I would think.


dan-da commented Nov 4, 2024

The persist_wallet() call is looking more likely to be the culprit.

I just ran:
cargo +nightly test --features log-slow-write-lock rpc_server -- --nocapture

The output includes this warning:
2024-11-04T02:31:47.278758Z WARN send_to_many_test:Constructing transaction: neptune_core::locks::tokio::atomic_rw: write-lock held for 2.04554 seconds. (exceeds max: 0.001 secs) location: src/models/state/mod.rs:196:31

The code at the indicated line is:
self.lock_guard_mut().await.flush_databases().await

So here we have an isolated instance where flushing the databases took 2 seconds during the RPC server send_to_many_test. This is much less than 2 minutes, but still unexpectedly long for a nearly empty DB.
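
To separate "the flush itself is slow" from "something else keeps the guard alive", one could time the awaited call directly. The helper below is hypothetical (warn_if_slow is not a neptune-core function), and it assumes flush_databases() is an ordinary async method:

    use std::future::Future;
    use std::time::{Duration, Instant};

    /// Await any future and warn if it ran longer than `max`.
    async fn warn_if_slow<F: Future>(label: &str, max: Duration, fut: F) -> F::Output {
        let started = Instant::now();
        let output = fut.await;
        let elapsed = started.elapsed();
        if elapsed > max {
            tracing::warn!("{} took {:?} (exceeds max: {:?})", label, elapsed, max);
        }
        output
    }

Wrapping the flush from the line above, e.g. binding the guard and then awaiting warn_if_slow("flush_databases", Duration::from_millis(100), guard.flush_databases()), would show whether the flush accounts for the whole hold time reported by the lock logging.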


dan-da commented Nov 4, 2024

related:

2024-11-04T04:29:17.582661159Z ERROR ThreadId(127) RPC{rpc.trace_id=00 rpc.deadline=2024-11-04T04:26:29.418856282Z otel.kind="server" otel.name="RPC.send"}:Constructing transaction:RPC{rpc.trace_id=00 rpc.deadline=2024-11-04T04:29:17.579245784Z otel.kind="server" otel.name="RPC.next_receiving_address"}: tarpc::server::in_flight_requests: DeadlineExceeded
2024-11-04T04:29:20.446725646Z ERROR ThreadId(02) RPC{rpc.trace_id=00 rpc.deadline=2024-11-04T04:29:20.443506379Z otel.kind="server" otel.name="RPC.next_receiving_address"}: tarpc::server::in_flight_requests: DeadlineExceeded
2024-11-04T04:29:21.564633677Z ERROR ThreadId(127) RPC{rpc.trace_id=00 rpc.deadline=2024-11-04T04:29:21.562294527Z otel.kind="server" otel.name="RPC.dashboard_overview_data"}: tarpc::server::in_flight_requests: DeadlineExceeded

note that tarpc cancels tasks on the server when a client-provided deadline is exceeded. By default that deadline is 10 seconds.

observations:

  1. These RPC calls should not be exceeding 10 seconds on the server, especially since I am not seeing warnings about the write-lock being held for that long.
  2. Somehow next_receiving_address was able to execute for 2 minutes as reported in the OP, which exceeds the 10-second deadline. I guess this is possible because the RPC task did not hit an await point during those 2 minutes, and thus had no chance to be cancelled; see the sketch after this list.
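
A self-contained illustration of observation 2 (hedged: this uses plain tokio::time::timeout rather than tarpc's deadline machinery, and the 500 ms thread::sleep is just a stand-in for non-yielding, CPU-bound work): cancellation of an async task is cooperative, so a deadline can only take effect when the future yields at an await point.

    use std::time::{Duration, Instant};

    #[tokio::main]
    async fn main() {
        let started = Instant::now();

        // The inner future blocks the thread without ever yielding to the
        // executor, standing in for CPU-bound work such as proving.
        let result = tokio::time::timeout(Duration::from_millis(10), async {
            std::thread::sleep(Duration::from_millis(500));
        })
        .await;

        // Prints roughly 500 ms and Ok(()): the 10 ms deadline was never
        // enforced, because the inner future completed before it ever
        // reached an await point where it could be cancelled.
        println!("elapsed: {:?}, result: {:?}", started.elapsed(), result);
    }

If the RPC task spent its 2 minutes in a stretch of code like this, tarpc's 10-second deadline would only be observed once that stretch finally yielded.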
