
Don't delay ACKs for significant window updates #935

Merged: 2 commits merged into smoltcp-rs:main from lrh2000:window-update on Jul 13, 2024

Conversation

lrh2000 (Contributor) commented on May 24, 2024:

Hi smoltcp maintainers, thanks for your great work!

I've been experiencing extremely low throughput (< 1 Gbps) between two TCP sockets connected via the loopback device (PoC attached at the end of this PR description). Further investigation reveals that ACKs carrying significant window updates are being delayed by the delayed ACK mechanism. Disabling the ACK delay works as a workaround, but I think it's better to fix the bug properly, so I'm submitting this PR.
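For reference, the workaround is a one-liner; a minimal sketch using smoltcp's tcp::Socket::set_ack_delay (passing None turns the delayed-ACK mechanism off):

```rust
use smoltcp::socket::tcp;

// Workaround sketch: disable delayed ACKs on the receiving socket entirely.
// Every received segment is then acknowledged immediately, at the cost of
// more ACK packets on the wire.
fn disable_ack_delay(socket: &mut tcp::Socket) {
    socket.set_ack_delay(None);
}
```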

According to the Linux kernel implementation, ACKs are always sent immediately when the receive window increases significantly, where "significantly" means at least doubling the receive window. However, smoltcp does not implement this logic: no ACK is sent until the delayed ACK timer expires:

smoltcp/src/socket/tcp.rs

Lines 2205 to 2207 in ef67e7b

} else if self.window_to_update() && self.delayed_ack_expired(cx.now()) {
    // If we have window length increase to advertise, do it.
    tcp_trace!("outgoing segment will update window");
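A minimal sketch of the Linux-style rule, matching the check that appears in the review diff further down (the function name here is mine):

```rust
// Sketch of the "significant window update" test described above: an ACK
// should go out immediately when the newly advertised window is nonzero and
// at least double the effective window last advertised to the peer.
fn window_update_is_significant(new_win: u32, last_win: u32) -> bool {
    new_win > 0 && new_win / 2 >= last_win
}
```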

Another Problem

While looking at the window_to_update function, I also noticed another problem. In the following code snippet, self.remote_last_win is the length of the last advertised receive window. It cannot be compared directly with the length of the current receive window, because the two windows start at different positions:

smoltcp/src/socket/tcp.rs

Lines 2133 to 2142 in ef67e7b

fn window_to_update(&self) -> bool {
    match self.state {
        State::SynSent
        | State::SynReceived
        | State::Established
        | State::FinWait1
        | State::FinWait2 => self.scaled_window() > self.remote_last_win,
        _ => false,
    }
}

For example, suppose the sender has sent enough packets to fill every slot in the receive window, so it cannot send anything more until the receiver responds with a window update. To capture this fact, the effective length of the last receive window should be treated as zero rather than as the previously advertised length.

To fix this second problem, I've added a new function called last_scaled_window, which works the same way as the tcp_receive_window function in the Linux kernel implementation.
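A minimal sketch of the idea (the parameter names are mine; in smoltcp the values would come from socket state, and in Linux the same quantity is computed by tcp_receive_window):

```rust
// Sketch: the effective remaining window is the distance from the next
// sequence number we expect (`rcv_nxt`) to the right edge of the window we
// last advertised (the last ACK number we sent plus the advertised length),
// clamped at zero. Real code must use wrapping TCP sequence arithmetic;
// plain i64 keeps the sketch simple.
fn last_effective_window(last_ack_sent: i64, last_advertised_len: i64, rcv_nxt: i64) -> u32 {
    let right_edge = last_ack_sent + last_advertised_len;
    (right_edge - rcv_nxt).max(0) as u32
}
```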

PoC

The PoC is modified from examples/loopback.rs:

#![cfg_attr(not(feature = "std"), no_std)]
#![allow(unused_mut)]
#![allow(clippy::collapsible_if)]

#[cfg(feature = "std")]
#[allow(dead_code)]
mod utils;

use log::debug;

use smoltcp::iface::{Config, Interface, SocketSet};
use smoltcp::phy::{Device, Loopback, Medium};
use smoltcp::socket::tcp;
use smoltcp::time::Instant;
use smoltcp::wire::{EthernetAddress, IpAddress, IpCidr};

fn main() {
    let device = Loopback::new(Medium::Ethernet);

    #[cfg(feature = "std")]
    let mut device = {
        utils::setup_logging("info");

        let (mut opts, mut free) = utils::create_options();
        utils::add_middleware_options(&mut opts, &mut free);

        let mut matches = utils::parse_options(&opts, free);
        utils::parse_middleware_options(&mut matches, device, /*loopback=*/ true)
    };

    // Create interface
    let mut config = match device.capabilities().medium {
        Medium::Ethernet => {
            Config::new(EthernetAddress([0x02, 0x00, 0x00, 0x00, 0x00, 0x01]).into())
        }
        Medium::Ip => Config::new(smoltcp::wire::HardwareAddress::Ip),
        Medium::Ieee802154 => todo!(),
    };

    let mut iface = Interface::new(config, &mut device, Instant::now());
    iface.update_ip_addrs(|ip_addrs| {
        ip_addrs
            .push(IpCidr::new(IpAddress::v4(127, 0, 0, 1), 8))
            .unwrap();
    });

    // Create sockets
    let server_socket = {
        // It is not strictly necessary to use a `static mut` and unsafe code here, but
        // on embedded systems that smoltcp targets it is far better to allocate the data
        // statically to verify that it fits into RAM rather than get undefined behavior
        // when stack overflows.
        static mut TCP_SERVER_RX_DATA: [u8; 65536] = [0; 65536];
        static mut TCP_SERVER_TX_DATA: [u8; 65536] = [0; 65536];
        let tcp_rx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_SERVER_RX_DATA[..] });
        let tcp_tx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_SERVER_TX_DATA[..] });
        tcp::Socket::new(tcp_rx_buffer, tcp_tx_buffer)
    };

    let client_socket = {
        static mut TCP_CLIENT_RX_DATA: [u8; 65536] = [0; 65536];
        static mut TCP_CLIENT_TX_DATA: [u8; 65536] = [0; 65536];
        let tcp_rx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_CLIENT_RX_DATA[..] });
        let tcp_tx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_CLIENT_TX_DATA[..] });
        tcp::Socket::new(tcp_rx_buffer, tcp_tx_buffer)
    };

    let mut sockets: [_; 2] = Default::default();
    let mut sockets = SocketSet::new(&mut sockets[..]);
    let server_handle = sockets.add(server_socket);
    let client_handle = sockets.add(client_socket);

    let dummy_data = vec![0; 4096];

    let start_time = Instant::now();

    let mut did_listen = false;
    let mut did_connect = false;
    let mut processed = 0;
    while processed < 1024 * 1024 * 1024 {
        iface.poll(Instant::now(), &mut device, &mut sockets);

        let mut socket = sockets.get_mut::<tcp::Socket>(server_handle);
        if !socket.is_active() && !socket.is_listening() {
            if !did_listen {
                debug!("listening");
                socket.listen(1234).unwrap();
                did_listen = true;
            }
        }

        while socket.can_recv() {
            let received = socket.recv(|buffer| (buffer.len(), buffer.len())).unwrap();
            debug!("got {:?}", received,);
            processed += received;
        }

        let mut socket = sockets.get_mut::<tcp::Socket>(client_handle);
        let cx = iface.context();
        if !socket.is_open() {
            if !did_connect {
                debug!("connecting");
                socket
                    .connect(cx, (IpAddress::v4(127, 0, 0, 1), 1234), 65000)
                    .unwrap();
                did_connect = true;
            }
        }

        while socket.can_send() {
            debug!("sending");
            socket.send_slice(&dummy_data[..]).unwrap();
        }

        // TODO: Call `poll_delay` and sleep
    }

    let duration = Instant::now() - start_time;
    println!(
        "done in {} s, bandwidth is {} Gbps",
        duration.total_millis() as f64 / 1000.0,
        (processed as u64 * 8 / duration.total_millis()) as f64 / 1000000.0
    );
}

Without this fix:

done in 10.778 s, bandwidth is 0.797002 Gbps
cargo run --release --example loopback  10.76s user 0.05s system 99% cpu 10.913 total

With this fix:

done in 0.564 s, bandwidth is 15.231305 Gbps
cargo run --release --example loopback  12.04s user 0.71s system 333% cpu 3.821 total

tomDev5 (Contributor) commented on Jul 8, 2024:

This seems like a small, well-explained, and important PR; I think all dependent projects would benefit.
@Dirbaio / @whitequark, could you take a look?
As for the PR itself, it looks like the PoC could be added as a benchmark in the benches directory with relative ease; would be cool to see.

Overall, impressive performance boost!

whitequark (Contributor) left a comment:

Looks OK to me, but I'd like to see some more comments.

| State::FinWait2 => {
    let new_win = self.scaled_window();
    if let Some(last_win) = self.last_scaled_window() {
        new_win > 0 && new_win / 2 >= last_win
whitequark (Contributor) commented on the snippet above:
Can you add a comment here explaining the calculation (the definition of "significantly" from the PR body)?

whitequark (Contributor) commented:

Also, can you update the benchmark results in the README please?

lrh2000 (Contributor, Author) commented on Jul 12, 2024:

Thanks to @tomDev5 and @whitequark for your comments and review!

As for the PR itself, looks like the POC can be added as a benchmark in the benches directory with relative ease, would be cool to see

I'll add it as examples/loopback_benchmark.rs in the following commit.

Also, can you update the benchmark results in the README please?

Do you mean the results of examples/benchmark.rs? After looking into it, I found that the interaction between the benchmark results and this PR is quite complex.

The benchmark itself needs some fixes

Since the TCP buffer is a ring buffer, the following recv call may not receive all the data:

if socket.can_recv() {
    if processed < AMOUNT {
        let length = socket
            .recv(|buffer| {
                let length = cmp::min(buffer.len(), AMOUNT - processed);
                (length, length)
            })
            .unwrap();
        processed += length;
    }
}

This is problematic: after receiving the trailing part of the ring buffer in one poll round, we can only receive the leading part in the next poll round, and when that next poll round starts depends on many things, such as whether the delayed ACK timer is running and whether the sender sends data immediately (it may not if the receive window is smaller than one packet).

I believe the intention here is to receive all the data, like:

        while socket.can_recv() && processed < AMOUNT {
            let length = socket
                .recv(|buffer| {
                    let length = cmp::min(buffer.len(), AMOUNT - processed);
                    (length, length)
                })
                .unwrap();
            processed += length;
        }

After making this change, the throughput of benchmark -- --tap tap0 writer increases from ~1 Gbps to ~8 Gbps on my laptop, even without the mechanisms in this PR.

The benchmark itself is not affected by this PR

The MTU of tap0 is 1500 (unlike the PoC in this PR, where the MTU of lo is 65536). This means that the receiver always receives a large number of packets within one poll round, so the delayed ACK timer never stays started and ACKs are always sent immediately, because:

smoltcp/src/socket/tcp.rs

Lines 2049 to 2055 in c937695

// RFC1122 says "in a stream of full-sized segments there SHOULD be an ACK
// for at least every second segment".
// For now, we send an ACK every second received packet, full-sized or not.
AckDelayTimer::Waiting(_) => {
    tcp_trace!("delayed ack timer already started, forcing expiry");
    AckDelayTimer::Immediate
}

If the ACK delay timer is not even started, the delayed_ack_expired method always returns true, so all subsequent window updates (not just significant ones) are sent without delay, as shown in the following code:

smoltcp/src/socket/tcp.rs

Lines 2240 to 2242 in c937695

} else if self.window_to_update() && self.delayed_ack_expired(cx.now()) {
    // If we have window length increase to advertise, do it.
    tcp_trace!("outgoing segment will update window");
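Putting the two quoted snippets together, the timer logic is roughly the following sketch (the Waiting and Immediate variants appear in the quotes above; the Idle variant for the not-yet-started state is my assumption):

```rust
use smoltcp::time::{Duration, Instant};

// Timer states as they appear in the quoted snippets; `Idle` (the
// not-yet-started state) is assumed for the sake of the sketch.
enum AckDelayTimer {
    Idle,
    Waiting(Instant),
    Immediate,
}

// On each received data segment: start the timer if it is idle, and force an
// immediate ACK on the second segment, per the RFC 1122 comment quoted above.
fn on_segment_received(timer: AckDelayTimer, now: Instant, delay: Duration) -> AckDelayTimer {
    match timer {
        AckDelayTimer::Idle => AckDelayTimer::Waiting(now + delay),
        AckDelayTimer::Waiting(_) | AckDelayTimer::Immediate => AckDelayTimer::Immediate,
    }
}

// The expiry check: an idle timer has nothing to wait for, so it counts as
// expired, which is why window updates are never delayed in this benchmark.
fn delayed_ack_expired(timer: &AckDelayTimer, now: Instant) -> bool {
    match timer {
        AckDelayTimer::Idle | AckDelayTimer::Immediate => true,
        AckDelayTimer::Waiting(deadline) => *deadline <= now,
    }
}
```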

Conclusion

Given the above analysis, I think I need to open a separate PR to fix the benchmarks and update the results instead of doing it in this PR. (EDIT: See #952.)

whitequark (Contributor) left a comment:
LGTM, thanks!

@whitequark added this pull request to the merge queue on Jul 13, 2024
Merged via the queue into smoltcp-rs:main with commit 0847643 on Jul 13, 2024
9 checks passed
@lrh2000 deleted the window-update branch on September 19, 2024