
refactor(transport/server): always process each connection #1929

Merged
merged 14 commits into mozilla:main
Jul 2, 2024

Conversation

mxinden
Collaborator

@mxinden mxinden commented Jun 17, 2024

Problem

Say a `neqo_transport::Server` is managing a single `neqo_transport::Connection`
in `Server::connections`. Assume the following chain of events:

1. A user (e.g. `neqo-http3`) calls `Server::process`.

    ``` rust
    pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {
        if self.wake_at.map_or(false, |c| c <= now) {
            self.wake_at = None;
        }

        dgram
            .and_then(|d| self.process_input(d, now))
            .or_else(|| self.process_next_output(now))
            .map(|d| {
                qtrace!([self], "Send packet: {:?}", d);
                Output::Datagram(d)
            })
            .or_else(|| self.wake_at.take().map(|c| Output::Callback(c - now)))
            .unwrap_or_else(|| {
                qtrace!([self], "Go dormant");
                Output::None
            })
    }
    ```

    https://github.com/mozilla/neqo/blob/6664452e2ba25f028ebf07c404fff3d0193c0ef4/neqo-transport/src/server.rs#L660-L677

    1. `self.wake_at` is `None`.
    2. `dgram` is `None`, thus `self.process_input` is never called.
    3. `self.process_next_output(now)` is called.

        ``` rust
        fn process_next_output(&mut self, now: Instant) -> Option<Datagram> {
            qtrace!([self], "No packet to send, look at waiting connections");
            while let Some(c) = self.waiting.pop_front() {
                if let Some(d) = self.process_connection(&c, None, now) {
                    return Some(d);
                }
            }
        ```

        https://github.com/mozilla/neqo/blob/6664452e2ba25f028ebf07c404fff3d0193c0ef4/neqo-transport/src/server.rs#L636-L642

        1. It obtains a reference to the one `Connection` through `self.waiting.pop_front()`.
        2. It calls `self.process_connection`, which in turn calls `Connection::process`, returning an `Output::Callback(_)` that is stored in `self.wake_at`.
    4. `self.wake_at.take()` takes the callback and returns it to the user as `Output::Callback`.
2. The user calls `Server::process` again.
    1. `self.wake_at` is `None`.
    2. `dgram` is `None`, thus `self.process_input` isn't called.
    3. `Server::process` calls `Server::process_next_output`.
        1. `Server::process_next_output` finds no connection reference in `self.waiting` and thus returns `None`.
    4. `self.wake_at.take()` is `None`.
    5. `Server::process` returns `Output::None`.

The result is that the user receives an `Output::None` even though the one `Connection` managed
by the `Server` is waiting for a callback. Thus the server stalls.

Shout-out to @KershawChang for discovering this in #1917 (comment).
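
To see why this stalls the whole server rather than merely delaying it, consider a rough sketch of an event loop driving the `Server` (hypothetical code, for illustration only; `send_datagram`, `wait_for_input_with_timeout`, and `wait_for_input` are illustrative stand-ins, not neqo-bin's actual API):

``` rust
// Hypothetical driver loop, for illustration only.
loop {
    match server.process(dgram.as_ref(), Instant::now()) {
        // A packet to send: transmit it and immediately poll the server again.
        Output::Datagram(d) => {
            send_datagram(&d); // illustrative stand-in
            dgram = None;
        }
        // A timer: wait for input, but no longer than the requested delay.
        Output::Callback(delay) => {
            dgram = wait_for_input_with_timeout(delay); // illustrative stand-in
        }
        // "Nothing pending": block until a datagram arrives. With the bug
        // described above this arm is reached while a `Connection` still
        // needs a wake-up, so the server makes no progress unless a peer
        // happens to send another packet.
        Output::None => {
            dgram = wait_for_input(); // illustrative stand-in
        }
    }
}
```

With `Output::Callback` the loop would have woken up in time; with `Output::None` it waits indefinitely.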

Solution

The single source of truth on whether a `Connection` needs a callback is the `Connection` itself. Instead of duplicating this information in `Server::wake_at`, `Server::waiting`, and `ServerConnectionState::wake_at`, always ask the single source of truth, i.e. the `Connection`.

More concretely, with this patch `Server::process` always calls `Connection::process` for each of its `Connection`s. It does not try to be smart about whether a `Connection` needs processing or not.
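
As a rough illustration of the approach (a minimal sketch, not the code in this patch; the real connection storage uses shared references, field names differ, and `std::time::Duration` and `Instant` are assumed to be in scope), deriving the output by asking every connection directly could look like this:

``` rust
// Minimal sketch of the idea, assuming `self.connections` can be iterated
// mutably; the actual patch differs in detail.
fn process_next_output(&mut self, now: Instant) -> Output {
    let mut earliest_callback: Option<Duration> = None;
    for connection in self.connections.values_mut() {
        match connection.process(None, now) {
            // A connection has a datagram ready: return it right away.
            Output::Datagram(d) => return Output::Datagram(d),
            // A connection wants a timer: keep the earliest requested delay.
            Output::Callback(delay) => {
                earliest_callback =
                    Some(earliest_callback.map_or(delay, |c| c.min(delay)));
            }
            Output::None => {}
        }
    }
    // No datagrams anywhere: report the earliest pending timer, if any.
    earliest_callback.map_or(Output::None, Output::Callback)
}
```

Because every `Connection` is asked on every call, no wake-up can be lost to stale `Server`-side bookkeeping.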


Fixes #1917.
Alternative to #1926.
Needed for #1903.
Another step after #1784 in simplifying `neqo_transport::Server`.

@mxinden mxinden changed the title from refactor(transport/server): always iterate each connection to refactor(transport/server): always process each connection on Jun 17, 2024

github-actions bot commented Jun 17, 2024

Failed Interop Tests

QUIC Interop Runner, client vs. server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server


github-actions bot commented Jun 17, 2024

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

@larseggert
Collaborator

larseggert commented Jun 18, 2024

FYI #1926 doesn't fix the issue in all cases. I can still make the client time out if I run test/test.sh many times (it takes many more runs than before, so it definitely fixes some cases). I also need to reduce the logging and run it in release mode; it seems that debug mode and extra logging create delays that make it succeed.

30s 56ms INFO [Client 10196f69d8a10da1] idle timeout expired
thread 'main' panicked at neqo-transport/src/connection/mod.rs:964:13:
idle timeout expired
stack backtrace:
   0: rust_begin_unwind
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
   2: neqo_transport::connection::Connection::process_timer
             at ./neqo-transport/src/connection/mod.rs:964:13
   3: neqo_transport::connection::Connection::process_output
             at ./neqo-transport/src/connection/mod.rs:1092:17
   4: neqo_http3::connection_client::Http3Client::process_output
             at ./neqo-http3/src/connection_client.rs:953:19
   5: neqo_bin::client::http3::<impl neqo_bin::client::Client for neqo_http3::connection_client::Http3Client>::process_output
             at ./neqo-bin/src/client/http3.rs:132:9
   6: neqo_bin::client::Runner<H>::process_output::{{closure}}
             at ./neqo-bin/src/client/mod.rs:418:19
   7: neqo_bin::client::Runner<H>::run::{{closure}}
             at ./neqo-bin/src/client/mod.rs:381:35
   8: neqo_bin::client::client::{{closure}}
             at ./neqo-bin/src/client/mod.rs:587:18
   9: neqo_client::main::{{closure}}
             at ./neqo-bin/src/bin/client.rs:13:36
  10: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/park.rs:281:63
  11: tokio::runtime::coop::with_budget
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:107:5
  12: tokio::runtime::coop::budget
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:73:5
  13: tokio::runtime::park::CachedParkThread::block_on
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/park.rs:281:31
  14: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context/blocking.rs:66:9
  15: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  16: tokio::runtime::context::runtime::enter_runtime
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context/runtime.rs:65:16
  17: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  18: tokio::runtime::runtime::Runtime::block_on
             at /Users/lars/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/runtime.rs:351:45
  19: neqo_client::main
             at ./neqo-bin/src/bin/client.rs:13:5
  20: core::ops::function::FnOnce::call_once
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5

@larseggert
Collaborator

I think we might have a winner! I'm running this in a loop and I see no more client timeouts.

while RUST_LOG=neqo_transport=warn  cargo test --release  --bench main --features bench RPS; do :; done

@larseggert
Collaborator

@mxinden anything I can do to help land this?

@mxinden
Collaborator Author

mxinden commented Jul 1, 2024

I have another simplification locally that I will push later today. That will resolve the outstanding TODO, namely the allocation into a Vec in the hot path. I will prioritize this pull request today.
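
For readers unfamiliar with the TODO in question, the pattern being removed looks roughly like this (hypothetical illustration, not the actual neqo code; `self.connections` iteration details are assumed):

``` rust
// Hypothetical illustration of "allocation into a Vec in the hot path".
// Cloning the connection handles into a temporary Vec allocates on every
// `process` call:
let connections: Vec<_> = self.connections.values().cloned().collect();
for connection in connections {
    // ... process the connection ...
}

// Iterating the map in place avoids the per-call allocation once the
// borrow conflicts that forced the temporary copy are resolved:
for connection in self.connections.values_mut() {
    // ... process the connection ...
}
```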


codecov bot commented Jul 1, 2024

Codecov Report

Attention: Patch coverage is 92.55319% with 7 lines in your changes missing coverage. Please review.

Project coverage is 94.79%. Comparing base (ea54273) to head (a50e1de).

| Files | Patch % | Lines |
| --- | --- | --- |
| neqo-transport/src/server.rs | 91.13% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1929      +/-   ##
==========================================
- Coverage   94.81%   94.79%   -0.02%     
==========================================
  Files         110      110              
  Lines       35773    35725      -48     
==========================================
- Hits        33918    33867      -51     
- Misses       1855     1858       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


github-actions bot commented Jul 1, 2024

Benchmark results

Performance differences relative to ea54273.

coalesce_acked_from_zero 1+1 entries: 💚 Performance has improved.
       time:   [187.33 ns 187.80 ns 188.32 ns]
       change: [-2.1351% -1.7756% -1.4215%] (p = 0.00 < 0.05)
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  6 (6.00%) high severe
coalesce_acked_from_zero 3+1 entries: Change within noise threshold.
       time:   [230.43 ns 231.05 ns 231.71 ns]
       change: [-1.0369% -0.6201% -0.2308%] (p = 0.00 < 0.05)
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
coalesce_acked_from_zero 10+1 entries: Change within noise threshold.
       time:   [230.34 ns 231.22 ns 232.22 ns]
       change: [-1.8720% -1.3513% -0.8607%] (p = 0.00 < 0.05)
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  6 (6.00%) high severe
coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [212.11 ns 217.86 ns 231.09 ns]
       change: [-1.9215% +1.4013% +7.3856%] (p = 0.75 > 0.05)
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [119.09 ms 119.15 ms 119.22 ms]
       change: [+0.8188% +0.9155% +1.0211%] (p = 0.00 < 0.05)
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
transfer/Run multiple transfers with varying seeds: Change within noise threshold.
       time:   [122.72 ms 123.02 ms 123.31 ms]
       thrpt:  [32.437 MiB/s 32.514 MiB/s 32.594 MiB/s]
change:
       time:   [+2.3168% +2.6323% +2.9327%] (p = 0.00 < 0.05)
       thrpt:  [-2.8492% -2.5647% -2.2644%]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild
transfer/Run multiple transfers with the same seed: Change within noise threshold.
       time:   [123.18 ms 123.34 ms 123.51 ms]
       thrpt:  [32.385 MiB/s 32.429 MiB/s 32.474 MiB/s]
change:
       time:   [+2.4343% +2.6551% +2.8706%] (p = 0.00 < 0.05)
       thrpt:  [-2.7905% -2.5864% -2.3764%]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
1-conn/1-100mb-resp (aka. Download)/client: No change in performance detected.
       time:   [1.0628 s 1.0755 s 1.0915 s]
       thrpt:  [91.616 MiB/s 92.980 MiB/s 94.094 MiB/s]
change:
       time:   [-3.9674% -1.1739% +1.3846%] (p = 0.45 > 0.05)
       thrpt:  [-1.3657% +1.1879% +4.1313%]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
1-conn/10_000-parallel-1b-resp (aka. RPS)/client: 💔 Performance has regressed.
       time:   [436.07 ms 438.41 ms 440.73 ms]
       thrpt:  [22.690 Kelem/s 22.810 Kelem/s 22.932 Kelem/s]
change:
       time:   [+10.667% +11.663% +12.706%] (p = 0.00 < 0.05)
       thrpt:  [-11.274% -10.445% -9.6389%]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
1-conn/1-1b-resp (aka. HPS)/client: 💔 Performance has regressed.
       time:   [45.496 ms 45.716 ms 45.947 ms]
       thrpt:  [21.764  elem/s 21.874  elem/s 21.980  elem/s]
change:
       time:   [+6.9518% +7.6605% +8.3040%] (p = 0.00 < 0.05)
       thrpt:  [-7.6673% -7.1154% -6.4999%]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
| --- | --- | --- | --- | --- | --- | --- | --- |
| msquic | msquic | | | 128.2 ± 28.4 | 102.2 | 226.6 | 1.00 |
| neqo | msquic | reno | on | 267.4 ± 9.9 | 254.9 | 282.8 | 1.00 |
| neqo | msquic | reno | | 306.2 ± 91.5 | 246.0 | 557.2 | 1.00 |
| neqo | msquic | cubic | on | 269.6 ± 9.7 | 258.3 | 284.7 | 1.00 |
| neqo | msquic | cubic | | 263.9 ± 8.4 | 251.9 | 277.9 | 1.00 |
| msquic | neqo | reno | on | 948.7 ± 15.3 | 929.7 | 977.0 | 1.00 |
| msquic | neqo | reno | | 947.4 ± 19.4 | 932.2 | 994.1 | 1.00 |
| msquic | neqo | cubic | on | 914.1 ± 12.5 | 897.5 | 940.1 | 1.00 |
| msquic | neqo | cubic | | 899.9 ± 8.6 | 892.3 | 914.8 | 1.00 |
| neqo | neqo | reno | on | 978.0 ± 14.7 | 957.9 | 997.1 | 1.00 |
| neqo | neqo | reno | | 966.6 ± 15.8 | 939.3 | 987.9 | 1.00 |
| neqo | neqo | cubic | on | 954.8 ± 13.6 | 940.4 | 980.8 | 1.00 |
| neqo | neqo | cubic | | 924.3 ± 9.4 | 908.0 | 936.9 | 1.00 |

⬇️ Download logs

Collaborator Author

@mxinden mxinden left a comment


This pull request is ready for a first review.

It includes multiple intertwined changes. For example, the removal of the various Server fields allows processing each connection without an additional allocation into a Vec, thereby enabling the actual bug fix. Thus the pull request touches many components. I added a comment to each relevant change. Let me know if you would still prefer that I break the pull request up into multiple smaller patches.

I still need to look into the HPS benchmark regression, to see whether it is related, or just noise.

@mxinden mxinden marked this pull request as ready for review July 1, 2024 15:28
@mxinden
Collaborator Author

mxinden commented Jul 1, 2024

Thus far I am unable to reproduce the HPS (handshakes per second) benchmark regression locally:

➜  neqo-bin git:(refactor-callback) ✗ critcmp main pr  -f "HPS"
group                                 pr                                    main
-----                                 ------                                 ----
1-conn/1-1b-resp (aka. HPS)/client    1.00     26.5±2.99ms    37 Elem/sec    1.04     27.6±3.53ms    36 Elem/sec

Member

@martinthomson martinthomson left a comment


Lots of red. That's good.

@larseggert
Collaborator

neqo-transport/src/server.rs 91.13% 7 Missing ⚠️

Could we bump the coverage up?

@mxinden
Collaborator Author

mxinden commented Jul 2, 2024

neqo-transport/src/server.rs 91.13% 7 Missing ⚠️

Could we bump the coverage up?

Note that this pull request does not introduce any untested lines. In other words, each untested line in this patch has been untested before. See coverage report.

That said, the more tests the better! Though I suggest adding them in a separate pull request. Any specific code paths you have in mind, @larseggert?

@larseggert larseggert enabled auto-merge July 2, 2024 08:09
@larseggert
Collaborator

Note that this pull request does not introduce any untested lines. In other words, each untested line in this patch has been untested before. See coverage report.

Hm. Maybe I misread the coverage report, or lines were moved around. Anyway, WFM. Good to go.

@larseggert larseggert added this pull request to the merge queue Jul 2, 2024
Merged via the queue into mozilla:main with commit 8d23703 Jul 2, 2024
54 of 55 checks passed
Successfully merging this pull request may close these issues.

FRAME_TYPE_STREAMS_BLOCKED ignored?