
Don't delay ACKs for significant window updates #935

Merged: 2 commits merged into smoltcp-rs:main from lrh2000:window-update on Jul 13, 2024

Conversation

lrh2000 (Contributor) commented on May 24, 2024:

Hi smoltcp maintainers, thanks for your great work!

I've been experiencing extremely low throughput (< 1 Gbps) between two TCP sockets connected via the loopback device (PoC attached at the end of this PR description). Further investigation reveals that ACKs carrying significant window updates are being delayed by the delayed ACK mechanism. Disabling the ACK delay works as a workaround, but I think it's better to fix the bug properly, so I'm submitting this PR.
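For reference, the workaround is a one-liner; a minimal sketch using smoltcp's tcp::Socket::set_ack_delay (passing None turns the delayed-ACK mechanism off):

```rust
use smoltcp::socket::tcp;

// Workaround sketch: disable delayed ACKs on the receiving socket entirely.
// Every received segment is then acknowledged immediately, at the cost of
// more ACK packets on the wire.
fn disable_ack_delay(socket: &mut tcp::Socket) {
    socket.set_ack_delay(None);
}
```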

According to the Linux kernel implementation, ACKs are always sent immediately when the receive window increases significantly, where "significantly" means at least doubling the receive window. However, smoltcp does not implement this logic: no ACK is sent until the delayed ACK timer expires:

smoltcp/src/socket/tcp.rs

Lines 2205 to 2207 in ef67e7b

} else if self.window_to_update() && self.delayed_ack_expired(cx.now()) {
    // If we have window length increase to advertise, do it.
    tcp_trace!("outgoing segment will update window");
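A minimal sketch of the Linux-style rule, matching the check that appears in the review diff further down (the function name here is mine):

```rust
// Sketch of the "significant window update" test described above: an ACK
// should go out immediately when the newly advertised window is nonzero and
// at least double the effective window last advertised to the peer.
fn window_update_is_significant(new_win: u32, last_win: u32) -> bool {
    new_win > 0 && new_win / 2 >= last_win
}
```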

Another Problem

While looking at the window_to_update function, I also noticed another problem. In the following code snippet, self.remote_last_win is the length of the last advertised receive window. It cannot be compared directly with the length of the current receive window, because the two windows start at different positions:

smoltcp/src/socket/tcp.rs

Lines 2133 to 2142 in ef67e7b

fn window_to_update(&self) -> bool {
    match self.state {
        State::SynSent
        | State::SynReceived
        | State::Established
        | State::FinWait1
        | State::FinWait2 => self.scaled_window() > self.remote_last_win,
        _ => false,
    }
}

For example, suppose the sender has sent enough packets to fill every slot in the receive window, so it cannot send anything more until the receiver responds with a window update. To capture this fact, the effective length of the last receive window should be treated as zero rather than as the previously advertised length.

To fix this second problem, I've added a new function called last_scaled_window, which works the same way as the tcp_receive_window function in the Linux kernel implementation.
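A minimal sketch of the idea (the parameter names are mine; in smoltcp the values would come from socket state, and in Linux the same quantity is computed by tcp_receive_window):

```rust
// Sketch: the effective remaining window is the distance from the next
// sequence number we expect (`rcv_nxt`) to the right edge of the window we
// last advertised (the last ACK number we sent plus the advertised length),
// clamped at zero. Real code must use wrapping TCP sequence arithmetic;
// plain i64 keeps the sketch simple.
fn last_effective_window(last_ack_sent: i64, last_advertised_len: i64, rcv_nxt: i64) -> u32 {
    let right_edge = last_ack_sent + last_advertised_len;
    (right_edge - rcv_nxt).max(0) as u32
}
```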

PoC

The PoC is modified from examples/loopback.rs:

#![cfg_attr(not(feature = "std"), no_std)]
#![allow(unused_mut)]
#![allow(clippy::collapsible_if)]

#[cfg(feature = "std")]
#[allow(dead_code)]
mod utils;

use log::debug;

use smoltcp::iface::{Config, Interface, SocketSet};
use smoltcp::phy::{Device, Loopback, Medium};
use smoltcp::socket::tcp;
use smoltcp::time::Instant;
use smoltcp::wire::{EthernetAddress, IpAddress, IpCidr};

fn main() {
    let device = Loopback::new(Medium::Ethernet);

    #[cfg(feature = "std")]
    let mut device = {
        utils::setup_logging("info");

        let (mut opts, mut free) = utils::create_options();
        utils::add_middleware_options(&mut opts, &mut free);

        let mut matches = utils::parse_options(&opts, free);
        utils::parse_middleware_options(&mut matches, device, /*loopback=*/ true)
    };

    // Create interface
    let mut config = match device.capabilities().medium {
        Medium::Ethernet => {
            Config::new(EthernetAddress([0x02, 0x00, 0x00, 0x00, 0x00, 0x01]).into())
        }
        Medium::Ip => Config::new(smoltcp::wire::HardwareAddress::Ip),
        Medium::Ieee802154 => todo!(),
    };

    let mut iface = Interface::new(config, &mut device, Instant::now());
    iface.update_ip_addrs(|ip_addrs| {
        ip_addrs
            .push(IpCidr::new(IpAddress::v4(127, 0, 0, 1), 8))
            .unwrap();
    });

    // Create sockets
    let server_socket = {
        // It is not strictly necessary to use a `static mut` and unsafe code here, but
        // on embedded systems that smoltcp targets it is far better to allocate the data
        // statically to verify that it fits into RAM rather than get undefined behavior
        // when stack overflows.
        static mut TCP_SERVER_RX_DATA: [u8; 65536] = [0; 65536];
        static mut TCP_SERVER_TX_DATA: [u8; 65536] = [0; 65536];
        let tcp_rx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_SERVER_RX_DATA[..] });
        let tcp_tx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_SERVER_TX_DATA[..] });
        tcp::Socket::new(tcp_rx_buffer, tcp_tx_buffer)
    };

    let client_socket = {
        static mut TCP_CLIENT_RX_DATA: [u8; 65536] = [0; 65536];
        static mut TCP_CLIENT_TX_DATA: [u8; 65536] = [0; 65536];
        let tcp_rx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_CLIENT_RX_DATA[..] });
        let tcp_tx_buffer = tcp::SocketBuffer::new(unsafe { &mut TCP_CLIENT_TX_DATA[..] });
        tcp::Socket::new(tcp_rx_buffer, tcp_tx_buffer)
    };

    let mut sockets: [_; 2] = Default::default();
    let mut sockets = SocketSet::new(&mut sockets[..]);
    let server_handle = sockets.add(server_socket);
    let client_handle = sockets.add(client_socket);

    let dummy_data = vec![0; 4096];

    let start_time = Instant::now();

    let mut did_listen = false;
    let mut did_connect = false;
    let mut processed = 0;
    while processed < 1024 * 1024 * 1024 {
        iface.poll(Instant::now(), &mut device, &mut sockets);

        let mut socket = sockets.get_mut::<tcp::Socket>(server_handle);
        if !socket.is_active() && !socket.is_listening() {
            if !did_listen {
                debug!("listening");
                socket.listen(1234).unwrap();
                did_listen = true;
            }
        }

        while socket.can_recv() {
            let received = socket.recv(|buffer| (buffer.len(), buffer.len())).unwrap();
            debug!("got {:?}", received,);
            processed += received;
        }

        let mut socket = sockets.get_mut::<tcp::Socket>(client_handle);
        let cx = iface.context();
        if !socket.is_open() {
            if !did_connect {
                debug!("connecting");
                socket
                    .connect(cx, (IpAddress::v4(127, 0, 0, 1), 1234), 65000)
                    .unwrap();
                did_connect = true;
            }
        }

        while socket.can_send() {
            debug!("sending");
            socket.send_slice(&dummy_data[..]).unwrap();
        }

        // TODO: Call `poll_delay` and sleep
    }

    let duration = Instant::now() - start_time;
    println!(
        "done in {} s, bandwidth is {} Gbps",
        duration.total_millis() as f64 / 1000.0,
        (processed as u64 * 8 / duration.total_millis()) as f64 / 1000000.0
    );
}

Without this fix:

done in 10.778 s, bandwidth is 0.797002 Gbps
cargo run --release --example loopback  10.76s user 0.05s system 99% cpu 10.913 total

With this fix:

done in 0.564 s, bandwidth is 15.231305 Gbps
cargo run --release --example loopback  12.04s user 0.71s system 333% cpu 3.821 total

tomDev5 (Contributor) commented on Jul 8, 2024:

This seems like a small, well-explained, and important PR; I think all dependent projects would benefit.
@Dirbaio / @whitequark, could you take a look?
As for the PR itself, it looks like the PoC could be added as a benchmark in the benches directory with relative ease; would be cool to see.

Overall, impressive performance boost!

whitequark (Contributor) left a comment:

Looks OK to me, but I'd like to see some more comments.

| State::FinWait2 => {
    let new_win = self.scaled_window();
    if let Some(last_win) = self.last_scaled_window() {
        new_win > 0 && new_win / 2 >= last_win
whitequark (Contributor) commented on the snippet above:
Can you add a comment here explaining the calculation (the definition of "significantly" from the PR body)?

whitequark (Contributor) commented:

Also, can you update the benchmark results in the README please?

lrh2000 (Contributor, Author) commented on Jul 12, 2024:

Thanks to @tomDev5 and @whitequark for your comments and review!

As for the PR itself, looks like the POC can be added as a benchmark in the benches directory with relative ease, would be cool to see

I'll add it as examples/loopback_benchmark.rs in the following commit.

Also, can you update the benchmark results in the README please?

Do you mean the results of examples/benchmark.rs? After looking into it, I found that the interaction between the benchmark results and this PR is quite complex.

The benchmark itself needs some fixes

Since the TCP buffer is a ring buffer, the following recv call may not receive all the data:

if socket.can_recv() {
    if processed < AMOUNT {
        let length = socket
            .recv(|buffer| {
                let length = cmp::min(buffer.len(), AMOUNT - processed);
                (length, length)
            })
            .unwrap();
        processed += length;
    }
}

This is problematic: after receiving the trailing part of the ring buffer in one poll round, we can only receive the leading part in the next poll round, and when that next poll round starts depends on many things, such as whether the delayed ACK timer is running and whether the sender sends data immediately (it may not if the receive window is smaller than one packet).

I believe the intention here is to receive all the data, like:

        while socket.can_recv() && processed < AMOUNT {
            let length = socket
                .recv(|buffer| {
                    let length = cmp::min(buffer.len(), AMOUNT - processed);
                    (length, length)
                })
                .unwrap();
            processed += length;
        }

After making this change, the throughput of benchmark -- --tap tap0 writer increases from ~1 Gbps to ~8 Gbps on my laptop, even without the mechanisms in this PR.

The benchmark itself is not affected by this PR

The MTU of tap0 is 1500 (unlike the PoC in this PR, where the MTU of lo is 65536). This means that the receiver always receives a large number of packets within one poll round, so the delayed ACK timer never stays started and ACKs are always sent immediately, because:

smoltcp/src/socket/tcp.rs

Lines 2049 to 2055 in c937695

// RFC1122 says "in a stream of full-sized segments there SHOULD be an ACK
// for at least every second segment".
// For now, we send an ACK every second received packet, full-sized or not.
AckDelayTimer::Waiting(_) => {
    tcp_trace!("delayed ack timer already started, forcing expiry");
    AckDelayTimer::Immediate
}

If the ACK delay timer is not even started, the delayed_ack_expired method always returns true, so all subsequent window updates (not just significant ones) are sent without delay, as shown in the following code:

smoltcp/src/socket/tcp.rs

Lines 2240 to 2242 in c937695

} else if self.window_to_update() && self.delayed_ack_expired(cx.now()) {
    // If we have window length increase to advertise, do it.
    tcp_trace!("outgoing segment will update window");
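Putting the two quoted snippets together, the timer logic is roughly the following sketch (the Waiting and Immediate variants appear in the quotes above; the Idle variant for the not-yet-started state is my assumption):

```rust
use smoltcp::time::{Duration, Instant};

// Timer states as they appear in the quoted snippets; `Idle` (the
// not-yet-started state) is assumed for the sake of the sketch.
enum AckDelayTimer {
    Idle,
    Waiting(Instant),
    Immediate,
}

// On each received data segment: start the timer if it is idle, and force an
// immediate ACK on the second segment, per the RFC 1122 comment quoted above.
fn on_segment_received(timer: AckDelayTimer, now: Instant, delay: Duration) -> AckDelayTimer {
    match timer {
        AckDelayTimer::Idle => AckDelayTimer::Waiting(now + delay),
        AckDelayTimer::Waiting(_) | AckDelayTimer::Immediate => AckDelayTimer::Immediate,
    }
}

// The expiry check: an idle timer has nothing to wait for, so it counts as
// expired, which is why window updates are never delayed in this benchmark.
fn delayed_ack_expired(timer: &AckDelayTimer, now: Instant) -> bool {
    match timer {
        AckDelayTimer::Idle | AckDelayTimer::Immediate => true,
        AckDelayTimer::Waiting(deadline) => *deadline <= now,
    }
}
```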

Conclusion

Given the above analysis, I think I need to open a separate PR to fix the benchmarks and update the results instead of doing it in this PR. (EDIT: See #952.)

whitequark (Contributor) left a comment:
LGTM, thanks!

@whitequark added this pull request to the merge queue on Jul 13, 2024
Merged via the queue into smoltcp-rs:main with commit 0847643 on Jul 13, 2024
9 checks passed
@lrh2000 deleted the window-update branch on September 19, 2024