
Setting TCP_QUICKACK reduces trylock latency by 40ms. #254

Open

madars wants to merge 1 commit into trunk
Conversation

@madars (Member) commented Apr 5, 2024

When running PArSEC on Linux (e.g. Debian bookworm on aarch64), there is a 40ms delay between requesting a lock and obtaining a response.

For example, in a 1-agent, 1-shard localhost configuration, lua_bench with 2 wallets produces the following trace for the agent:

[2024-04-05 02:16:17.990] [INFO ] using sha2:  standard
[2024-04-05 02:16:17.990] [INFO ] Connecting to shards...
[2024-04-05 02:16:17.990] [INFO ] Connected to shards, connecting to ticketer...
[2024-04-05 02:16:17.990] [INFO ] Requesting broker recovery...
[2024-04-05 02:16:17.991] [TRACE] 0xaaaae9d42170 Broker handled get_tickets for shard 0
[2024-04-05 02:16:18.026] [INFO ] Agent running
[2024-04-05 02:16:32.009] [TRACE] received request with m_function  7061790100000000000000  and param size  144
[2024-04-05 02:16:32.010] [TRACE] received request with m_function  7061790000000000000000  and param size  144
[2024-04-05 02:16:32.012] [TRACE] do_start  7061790100000000000000
[2024-04-05 02:16:32.012] [TRACE] do_start  7061790000000000000000
[2024-04-05 02:16:32.056] [TRACE] 0xaaaae9d42170 Broker locked key for 1000
(snipped)

In the shard log the relevant lines are:

[2024-04-05 02:16:32.055] [TRACE] 1000 requesting lock on 7061790100000000000000 0
[2024-04-05 02:16:32.056] [TRACE] Assigning read lock on 7061790100000000000000 to 1000
[2024-04-05 02:16:32.056] [TRACE] 0xffffb4001580 shard handled try_lock for 1000

Note that the agent starts executing the contract at 02:16:32.012, but the shard processes its first trylock a whole 43ms later, at 02:16:32.055, even though the communication happens entirely over the loopback interface. Inspecting the local traffic with tcpdump -i lo shows the following sequence of packets for the same timestamp range (the shard is listening on port 5556):

No.	Time	Source	Destination	Protocol	Length	Info
110	2024-04-05 02:16:32.012472	::1	::1	TCP	94	47948 → 5556 [PSH, ACK] Seq=1 Ack=1 Win=512 Len=8 TSval=956576386 TSecr=956562365
111	2024-04-05 02:16:32.055574	::1	::1	TCP	86	5556 → 47948 [ACK] Seq=1 Ack=9 Win=512 Len=0 TSval=956576429 TSecr=956576386
112	2024-04-05 02:16:32.055611	::1	::1	TCP	186	47948 → 5556 [PSH, ACK] Seq=9 Ack=1 Win=512 Len=100 TSval=956576429 TSecr=956576429

That is, there is an agent->shard TCP packet that only gets ACK'd by the shard 43ms later. However, once that ACK is sent, the shard's trylock response follows almost immediately.

I believe this comes from Linux's delayed ACK feature, where ACKs are briefly held back so they can be coalesced, improving throughput at the expense of latency. According to https://topic.alibabacloud.com/a/linux-tcp-delay-confirmation_1_16_31848681.html, the default delay is exactly 40ms, suspiciously close to the number observed above. This hypothesis is further strengthened by the fact that this PR (which enables TCP_QUICKACK) improves the 5-wallet lua_bench tx submission rate by around 100x (from ~15tps to ~1500tps).
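For context, the essence of the change is a single socket option; here is a minimal sketch (enable_quickack is a hypothetical helper name for illustration, not the PR's actual code):

#include <sys/socket.h>  // setsockopt
#include <netinet/in.h>  // IPPROTO_TCP
#include <netinet/tcp.h> // TCP_QUICKACK (Linux-specific)

// Ask the kernel to ACK incoming segments on this connection
// immediately instead of delaying up to ~40ms to coalesce ACKs.
void enable_quickack(int sock_fd) {
    static constexpr int one = 1;
    setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}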

As suggested by netty/netty#13610, this needs to be re-done after each recv; in fact, setting it only at the end of tcp_socket::receive (and once when setting up the socket) improved the tx submission rate by only 2x (to ~28tps).
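A sketch of the re-arming pattern this implies (the read loop below is illustrative, assuming the same includes as the sketch above; it is not the literal PArSEC code):

// TCP_QUICKACK is not sticky: the kernel can clear it again after
// sending a quick ACK, so it must be re-armed on every recv() call.
auto read_exact(int sock_fd, unsigned char* buf, size_t len) -> bool {
    size_t total{};
    while(total < len) {
        static constexpr int one = 1;
        setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
        auto n = recv(sock_fd, buf + total, len - total, 0);
        if(n <= 0) {
            return false; // peer closed the connection or recv failed
        }
        total += static_cast<size_t>(n);
    }
    return true;
}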

I think enabling TCP_NODELAY (which disables coalescing on the sending side) also makes sense for our protocol, but I did not see significant performance changes from it, so maybe it can be omitted. (The best option for our trylock interface would probably be to send each trylock request and response with a single writev, or to use TCP_CORK.) For now, this patch does no error handling and does not gate the setsockopt call to platforms where TCP_QUICKACK is defined (I guess only Linux). How should this be improved?
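One possible shape for the platform gating and error handling (an assumption about how it could look, not committed code):

#include <cstdio> // std::perror

inline void enable_quickack_checked(int sock_fd) {
#ifdef TCP_QUICKACK // only defined on Linux
    static constexpr int one = 1;
    if(setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one))
       != 0) {
        // Failure just degrades to delayed ACKs; log it, don't abort.
        std::perror("setsockopt(TCP_QUICKACK)");
    }
#else
    (void)sock_fd; // no-op on platforms without TCP_QUICKACK
#endif
}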

@madars requested a review from maurermi on April 5, 2024 at 06:39
@anders94 (Collaborator) left a comment


Looks reasonable - would want to check on the TCP_NODELAY setting.

@@ -36,6 +37,10 @@ namespace cbdc::network {
                 continue;
             }

+            static constexpr int one = 1;
+            setsockopt(m_sock_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); // sending side. needed?
@anders94 (Collaborator) commented Apr 5, 2024:

We should check this.

@maurermi (Collaborator) left a comment

Initial T-ACK. This works and greatly speeds up small-scale tests on Linux, fixing an issue where small-scale tests performed dramatically better on macOS than on Linux.

We should consider making this a build-time config option. It may also require header guards so it is only included in Linux builds.

@HalosGhost (Collaborator) commented:
Two failing CI checks need to be addressed.
