
feat(utxo): prioritize electrum connections #1966

Open
wants to merge 180 commits into dev

Conversation

rozhkovdmitrii

@rozhkovdmitrii rozhkovdmitrii commented Sep 11, 2023

This PR refactors the electrum part of the RPC clients and also introduces a new connection-management feature.

Within the electrum enable request, one can now specify two parameters: min_connected and max_connected (a small validation sketch follows this list).

  • min_connected: KDF needs to keep at least min_connected working active (maintained) connections at all times.
  • max_connected: KDF needs to keep at most max_connected working active (maintained) connections at any time.
  • To achieve the selective/single-server effect, set these parameters to (1, 1).
  • To achieve the multiple/all-servers effect, set these parameters to (1, len(servers)) (or don't set them at all, as this is the default).
    • The connection manager will initially try to connect to at most len(servers) servers and keep those connections maintained. If some connections are lost for whatever reason, they will not be retried right away. Retries only take place:
      • every 10 minutes (should we decrease this?)
      • if we fall below min_connected, which is 1 in this case.
    • So one could use (10, 68), for instance, to make sure we are connected to at least 10 servers at all times; if we ever fall below 10, we fire another round of reconnections. After this round we are guaranteed to have <= 68 connections.
      • e.g. we have 10 connections now, we lose 1 and are down to 9. A round of reconnections starts and we end up with some number of active connections between 9 (assuming only our 9 connected servers were the non-faulty ones) and 68. If after the reconnection round we still fall below min_connected = 10 connections, we keep retrying.
  • These parameters must satisfy: 0 < min_connected <= max_connected <= len(servers)
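
As a rough illustration of the defaults and the constraint above (not the PR's actual code; ConnSettings and resolve_conn_settings are made-up names):

/// Illustrative settings type; the field names mirror the request parameters above.
struct ConnSettings {
    min_connected: usize,
    max_connected: usize,
}

/// Resolve the optional request parameters against the stated defaults and
/// enforce `0 < min_connected <= max_connected <= len(servers)`.
fn resolve_conn_settings(
    min_connected: Option<usize>,
    max_connected: Option<usize>,
    servers_len: usize,
) -> Result<ConnSettings, String> {
    // Defaults: (1, len(servers)) gives the multiple/all-servers behaviour.
    let min_connected = min_connected.unwrap_or(1);
    let max_connected = max_connected.unwrap_or(servers_len);

    if min_connected == 0 {
        return Err("min_connected must be greater than 0".into());
    }
    if min_connected > max_connected {
        return Err("min_connected must not exceed max_connected".into());
    }
    if max_connected > servers_len {
        return Err("max_connected must not exceed the number of servers".into());
    }
    Ok(ConnSettings { min_connected, max_connected })
}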

The connection manager maintains these working active connections, and they are the ones used for any electrum request: the request is sent over all of them concurrently, and whichever returns a successful result first is taken as the response for that request.
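
To make the "race all maintained connections, first success wins" behaviour concrete, here is a minimal sketch using futures::future::select_ok; the Connection type and send_request method are assumptions, not the PR's API:

use futures::future::select_ok;

/// Stand-in for a maintained electrum connection; not the PR's actual type.
struct Connection;

impl Connection {
    /// Hypothetical request method; a real electrum call would go here.
    async fn send_request(&self, _payload: &str) -> Result<String, String> {
        Ok("response".to_string())
    }
}

/// Race the request over every maintained connection; the first `Ok` wins.
async fn first_successful_response(conns: &[Connection], payload: &str) -> Result<String, String> {
    // `select_ok` panics on an empty iterator, so bail out early.
    if conns.is_empty() {
        return Err("no maintained connections".to_string());
    }
    let requests: Vec<_> = conns
        .iter()
        .map(|conn| Box::pin(conn.send_request(payload)))
        .collect();
    // Resolves with the first success, or the last error if every request fails.
    let (response, _losers) = select_ok(requests).await?;
    Ok(response)
}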

For requests directed at specific servers (like server_version, get_block_count_from, etc.), a connection to that server is established first if it is not in the maintained servers set, and then the request is performed. So even if a server isn't maintained/active, it can still be queried.
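
A rough sketch of that "connect on demand if not maintained, then query" flow; every name here (ConnectionManager, get_or_connect, establish_connection) is illustrative rather than the PR's actual API:

use std::collections::HashMap;

/// Illustrative connection handle; not the PR's type.
#[derive(Clone)]
struct Connection {
    address: String,
}

/// Illustrative manager holding the maintained connections keyed by address.
struct ConnectionManager {
    maintained: HashMap<String, Connection>,
}

impl ConnectionManager {
    /// Reuse the maintained connection for `address` if there is one,
    /// otherwise establish a fresh connection just for this request
    /// (e.g. a server_version or get_block_count_from call).
    async fn get_or_connect(&self, address: &str) -> Result<Connection, String> {
        if let Some(conn) = self.maintained.get(address) {
            return Ok(conn.clone());
        }
        // Not in the maintained set: connect on demand before querying.
        establish_connection(address).await
    }
}

/// Stand-in for the real connection establishment logic.
async fn establish_connection(address: &str) -> Result<Connection, String> {
    Ok(Connection { address: address.to_string() })
}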

The server order in the electrum enable request encodes an implicit priority (previously in this PR, this was encoded as primary & secondary). Servers that appear first in the list are assumed to be more robust and thus have higher priority than servers that appear later.

Since we do an altruistic round of retries every 10 minutes, if we ever hit the case where a high-priority server was down (and thus wasn't maintained by us) and is now back online, we will prefer maintaining it over another already-maintained low-priority server (this assumes the maintained servers set is full, i.e. len(maintained servers) == min_connected). Simply put, high-priority servers will kick out low-priority ones when they come back online if we have no room to fit both.
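
A tiny illustration of that priority rule, assuming the index in the servers list is the priority; select_maintained is a made-up helper, not the PR's code:

/// The position in the original `servers` list is the priority
/// (index 0 = highest). Given the indices of the servers that are currently
/// reachable, keep the `max_connected` highest-priority ones.
fn select_maintained(mut reachable: Vec<usize>, max_connected: usize) -> Vec<usize> {
    // Sorting by index puts higher-priority servers first.
    reachable.sort_unstable();
    reachable.truncate(max_connected);
    reachable
}

fn main() {
    // Servers 2 and 5 are maintained, max_connected == 2, and the
    // higher-priority server 0 comes back online during a retry round:
    // server 0 displaces the lowest-priority maintained server (5).
    let maintained = select_maintained(vec![2, 5, 0], 2);
    assert_eq!(maintained, vec![0, 2]);
}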

Addresses #791

@rozhkovdmitrii rozhkovdmitrii added the in progress Changes will be made from the author label Sep 11, 2023
@rozhkovdmitrii rozhkovdmitrii self-assigned this Sep 11, 2023
@onur-ozkan onur-ozkan changed the title feat(utxo): prioritize electerum nodes connection feat(utxo): prioritize electrum connections Sep 12, 2023
@shamardy shamardy removed the request for review from onur-ozkan September 12, 2023 08:26
Member

@onur-ozkan onur-ozkan left a comment


Thank you for your effort!

The PR seems to be in its early stages right now. This is not an actual review; I just took a quick look to point out some areas that should not be forgotten.

Resolved review threads (outdated): README.md, mm2src/adex_cli/src/tests/mod.rs, mm2src/coins/lp_coins.rs, mm2src/gossipsub/src/behaviour.rs, mm2src/coins/utxo/utxo_builder/utxo_coin_builder.rs
Collaborator

@shamardy shamardy left a comment


Since this is still in progress, I left some comments that can help you improve the code.
I will do a more thorough review when this is ready for review!

Please also merge with latest dev and start adding doc comments.

pub struct ElectrumClientImpl {
    weak_self: Mutex<Option<Weak<ElectrumClientImpl>>>,
Collaborator


This is a very bad design choice; the struct should never reference itself, so as not to create any unstable behavior.

Author


I'm sorry, what unstable behavior can there be if we use a Weak reference? The only purpose of this weak_self is to be passed as an argument to self-hosted spawned futures, e.g. server_ping.

In the previous implementation, the corresponding code was where it shouldn't be [1, 2]

Member


why not just introduce a new struct? e.g.

struct ElectrumClientImplWeak(Mutex<Option<Weak<ElectrumClientImpl>>>)
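
To make the suggestion concrete, a hedged sketch of how such a newtype could look and how a spawned ping task might use it; server_ping_loop and ping are placeholders, not the PR's actual methods:

use std::sync::{Arc, Mutex, Weak};

struct ElectrumClientImpl {
    // The wrapped weak self-reference, instead of a bare field.
    weak_self: ElectrumClientImplWeak,
}

struct ElectrumClientImplWeak(Mutex<Option<Weak<ElectrumClientImpl>>>);

impl ElectrumClientImplWeak {
    /// Upgrade the stored weak pointer while the client is still alive.
    fn upgrade(&self) -> Option<Arc<ElectrumClientImpl>> {
        let guard = self.0.lock().ok()?;
        guard.as_ref()?.upgrade()
    }
}

/// Hypothetical ping loop: the spawned task holds only the weak wrapper, so it
/// does not keep the client alive and stops once the client is dropped.
async fn server_ping_loop(weak: ElectrumClientImplWeak) {
    while let Some(client) = weak.upgrade() {
        client.ping().await;
        // A real loop would sleep between pings here.
    }
}

impl ElectrumClientImpl {
    async fn ping(&self) { /* placeholder */ }
}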

Resolved review threads (outdated): mm2src/adex_cli/src/rpc_data.rs, mm2src/coins/utxo/rpc_clients.rs, mm2src/coins/lp_coins.rs
struct ConnMng(Arc<ConnMngImpl>);

impl ConnMng {
    async fn suspend_server(self, address: String) -> Result<(), String> {
Collaborator


Since this is a new implementation, please start using specific error types and MmError


Comment on lines 2751 to 2754
spawner.spawn(async move {
    let state = conn_mng.0.guarded.lock().await;
    state.connecting.store(false, AtomicOrdering::Relaxed);
})
Collaborator


Can you please elaborate on this more? ConnectingStateCtx will be considered dropped before connecting is set to false if the lock was held by another operation for some time.

Author


I certainly should have been more detailed! Thanks for pointing this out! This place is the heart of selective connectivity.

done

Comment on lines 2732 to 2738
let address: String = {
    if address.is_some() {
        address.map(|internal| internal.to_string())
    } else {
        guard.active.as_ref().cloned()
    }
}?;
Collaborator


Suggested change
let address: String = {
    if address.is_some() {
        address.map(|internal| internal.to_string())
    } else {
        guard.active.as_ref().cloned()
    }
}?;
let address = address.map_or_else(|| guard.active.as_ref().cloned(), |internal| Some(internal.to_string()))?;

Resolved review threads (outdated): mm2src/coins/utxo/rpc_clients.rs (three threads)
splitting the big function into smaller ones and sharing the sharable parts between native and wasm
Collaborator

@shamardy shamardy left a comment


Next iteration! I wanted to finish the review completely this time, but decided to give it one more look after the comments below are addressed!

Please fix the failing CI steps. One thing to add to the tracking issue checklist (ref. #1966 (comment)) is an RPC to get the list of currently connected/used electrum servers, so we remember to add this in the future.

@mariocynicys mariocynicys force-pushed the 1520-prioritize-nodes branch 2 times, most recently from 22c21e0 to 555b958 Compare September 12, 2024 15:22
…s fail

this is kind of doing the connection manager's work as a last-resort fail-safe mechanism when an electrum request fails, that is, we establish all connections sequentially and try them.

things we need to consider:
- we probably need to introduce wait_till_connected back (now that we
  don't wait forever)
- we don't disconnect any of the established connections; they will
  disconnect automatically when not used for a long time, but during that
  time we might break the `max_connected` rule.
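
A minimal sketch of that sequential fallback loop, under the caveats listed above; Connection, establish_connection and send_request are stand-ins rather than the PR's API:

/// Stand-ins for the real connection type and establishment logic.
struct Connection;

impl Connection {
    async fn send_request(&self, _payload: &str) -> Result<String, String> {
        Ok("response".to_string())
    }
}

async fn establish_connection(_address: &str) -> Result<Connection, String> {
    Ok(Connection)
}

/// Last-resort fallback: when a request fails on the maintained connections,
/// walk the full server list in priority order, establishing each connection
/// on demand and retrying the request sequentially until one succeeds.
async fn fallback_request(servers: &[String], payload: &str) -> Result<String, String> {
    let mut last_err = "no servers configured".to_string();
    for address in servers {
        match establish_connection(address).await {
            Ok(conn) => match conn.send_request(payload).await {
                Ok(response) => return Ok(response),
                Err(e) => last_err = e,
            },
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}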
let the background thread of the connection manager respect the
`max_connected` threshold (not entirely, really; we can still reach up to
`max_connected + 1`).

This is done by refactoring the connection establishment to not
establish all connections but only the connections that we know we will
maintain for sure. This is also lighter overall than the old approach,
but it's also slower since connections are tried sequentially. We don't
need the speed here anyway, since if an electrum request fails we fall
back to trying all connections, which is logic that isn't controlled by
the connection manager in any way.
removed write_maintained_connections and added maintain & unmaintain methods.
disconnecting like that broke a lot of requests; we probably shouldn't
disconnect from one thread since we don't know if another thread is
holding the connection to use it
Collaborator

@shamardy shamardy left a comment


Thanks for the fixes, reviewed the latest changes only, still one more review to go!

This connection manager can become a great crate in the future, to be used by bitcoin wallet devs or any other UTXO wallet dev; I don't think they will find a better electrum client than this. But in time they will just use KDF straight away :)

A few notes for you @mariocynicys / me to remember:

This reverts commit d98fd21.
This should make a lot of tests fail again, proving we shouldn't
randomly just disconnect a connection from different threads.
… checking

when doing version checking, the connection isn't marked as maintained yet, which means flagging it as not needed will disconnect it immediately.
a fix for this is to keep a `Connecting` state, for example, to avoid disconnecting a connection which is in progress; an easier solution is to not ask for disconnection of connections that we didn't force-connect.
so only if we caused the connection to connect will we call for it to be not needed anymore
for the fallback case. But delegate whether to disconnect or not to the connection manager.
this helps with concurrent requests disconnecting connections for no reason (so no error) and breaking other requests
this was used to make sure the background task had found a working connection by the time we start issuing electrum requests. we don't need this now though, because if there are no working/maintained connections we fall back to using any/all other connections (and we establish the connection inline first)
the test was failing because the expected request id was 0 and not 1. this is because the server version request was constructed AFTER the get verbose transaction request. this might not always be the case: the background thread might in some (rare) instances be fast enough to establish the connection and query the version before get verbose transaction is called, so id 0 would then be reserved for the server version query.

let's just remove the id part from the expected string, as we don't really care about it
to avoid potential starvation due to a busy-waiting loop
This currently isn't really problematic, but if re-used later it could cause hard-to-debug nightmares
the solution here would be to totally order both operations, favoring abortion over finishing. In `AbortableQueue` we would realize this by checking the future ID in question and not calling `on_future_finished` if it's already aborted
this was the side effect we had (described in the previous commit) in the abortable queue. on_future_finished was called AFTER the abortable queue had been fully cleared and reset. We need to make sure first that an abort handle for the future actually exists in the abort handlers vector and that the future is aborted (i.e. the same future ID wasn't used to run a new future that is still running).
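
A minimal sketch of that guard, not KDF's actual AbortableQueue: before treating a future as finished, check that its ID still maps to a live abort handle, so a finish signal that arrives after an abort/reset is ignored.

use std::collections::HashMap;

/// Placeholder abort handle; the real one would be able to cancel its future.
struct AbortHandle;

/// Heavily simplified queue state, keyed by future ID.
struct QueueSketch {
    handles: HashMap<u64, AbortHandle>,
}

impl QueueSketch {
    /// Called when a future reports completion. If the queue was already
    /// cleared/reset or the future was aborted, its ID is no longer present,
    /// so the late "finished" signal is simply ignored instead of touching a
    /// slot that may now belong to a newer future.
    fn on_future_finished(&mut self, future_id: u64) {
        if self.handles.remove(&future_id).is_none() {
            // Late signal after abort/reset: nothing to clean up.
        }
    }
}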
this is a concatenation of multiple runs; some fail and some succeed. weirdly, all the succeeding ones were performed while connected to my vpn (vpn.mariocynicys.me), and all the failing ones are with no vpn!