refactor(transport/server): always process each connection #1929
Conversation
Problem

Say a `neqo_transport::Server` is managing a single `neqo_transport::Connection` in `Server::connections`. Assume the following chain of events:

1. A user (e.g. `neqo-http3`) calls `Server::process`:

   ```rust
   pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {
       if self.wake_at.map_or(false, |c| c <= now) {
           self.wake_at = None;
       }

       dgram
           .and_then(|d| self.process_input(d, now))
           .or_else(|| self.process_next_output(now))
           .map(|d| {
               qtrace!([self], "Send packet: {:?}", d);
               Output::Datagram(d)
           })
           .or_else(|| self.wake_at.take().map(|c| Output::Callback(c - now)))
           .unwrap_or_else(|| {
               qtrace!([self], "Go dormant");
               Output::None
           })
   }
   ```

   https://github.com/mozilla/neqo/blob/6664452e2ba25f028ebf07c404fff3d0193c0ef4/neqo-transport/src/server.rs#L660-L677

2. `self.wake_at` is `None`.
3. `dgram` is `None`, thus `self.process_input` is never called.
4. `self.process_next_output(now)` is called:

   ```rust
   fn process_next_output(&mut self, now: Instant) -> Option<Datagram> {
       qtrace!([self], "No packet to send, look at waiting connections");
       while let Some(c) = self.waiting.pop_front() {
           if let Some(d) = self.process_connection(&c, None, now) {
               return Some(d);
           }
       }
   ```

   https://github.com/mozilla/neqo/blob/6664452e2ba25f028ebf07c404fff3d0193c0ef4/neqo-transport/src/server.rs#L636-L642

   1. It obtains a reference to the one `Connection` through `self.waiting.pop_front()`.
   2. It calls `self.process_connection`, which in turn calls `Connection::process`, returning an `Output::Callback(_)` that is stored in `self.wake_at`.
5. `self.wake_at.take()` takes the callback and returns it to the user as `Output::Callback`.
6. The user calls `Server::process` again:
   1. `self.wake_at` is `None`.
   2. `dgram` is `None`, thus `self.process_input` isn't called.
   3. `Server::process` calls `Server::process_next_output`, which finds no connection reference in `self.waiting` and thus returns `None`.
   4. `self.wake_at.take()` is `None`.
   5. `Server::process` returns `Output::None`.

The result is that the user receives `Output::None` even though the one `Connection` managed by the `Server` is waiting for a callback. Thus the server stalls.

Shout-out to @KershawChang for discovering this in #1917 (comment).

Solution

The single source of truth on whether a `Connection` needs a callback is the `Connection` itself. Instead of duplicating this information in `Server::wake_at`, `Server::waiting`, and `ServerConnectionState::wake_at`, always ask the single source of truth, i.e. the `Connection`.

More concretely, with this patch `Server::process` always calls `Connection::process` for each of its `Connection`s. It does not try to be smart about whether a `Connection` needs `process`ing or not.

Fixes #1917.
Alternative to #1926.
Needed for #1903.
Another step after #1784 simplifying `neqo_transport::Server`.
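For illustration only, here is a minimal, self-contained sketch of the "always ask every connection" idea. The types and names below (`ToyServer`, `ToyConnection`, the simplified `Output`) are made up for this sketch and are not the actual `neqo_transport` API or the code in this patch; they only show the shape of the approach.

```rust
use std::time::{Duration, Instant};

// Toy stand-ins for illustration only; not the neqo_transport types.
enum Output {
    Datagram(Vec<u8>),
    Callback(Duration),
    None,
}

struct ToyConnection {
    // The connection itself is the source of truth for its next wake-up time.
    next_timeout: Option<Instant>,
}

impl ToyConnection {
    // Stand-in for `Connection::process`.
    fn process(&mut self, now: Instant) -> Output {
        match self.next_timeout {
            Some(t) if t > now => Output::Callback(t - now),
            _ => Output::None,
        }
    }
}

struct ToyServer {
    connections: Vec<ToyConnection>,
}

impl ToyServer {
    // Stand-in for `Server::process`: visit every connection on every call and
    // derive the earliest callback from their answers, instead of consulting
    // duplicated server-side state (`wake_at`, `waiting`) that can go stale.
    fn process(&mut self, now: Instant) -> Output {
        let mut earliest: Option<Duration> = None;
        for c in &mut self.connections {
            match c.process(now) {
                d @ Output::Datagram(_) => return d, // send immediately
                Output::Callback(delay) => {
                    earliest = Some(earliest.map_or(delay, |e| e.min(delay)));
                }
                Output::None => {}
            }
        }
        earliest.map_or(Output::None, Output::Callback)
    }
}

fn main() {
    let now = Instant::now();
    let mut server = ToyServer {
        connections: vec![ToyConnection {
            next_timeout: Some(now + Duration::from_millis(25)),
        }],
    };
    // No input datagram and no "waiting" bookkeeping, yet the pending callback
    // of the one connection is still reported instead of a stalling `Output::None`.
    match server.process(now) {
        Output::Callback(d) => println!("wake me in {d:?}"),
        _ => println!("nothing to do"),
    }
}
```

The point of the sketch is that the server derives its return value from the connections on every call, so there is no cached callback state that can silently disagree with them.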
FYI #1926 doesn't fix the issue in all cases. I can still make the client time out if I run
I think we might have a winner! I'm running this in a loop and I see no more client timeouts: `while RUST_LOG=neqo_transport=warn cargo test --release --bench main --features bench RPS; do :; done`

@mxinden anything I can do to help land this?

I have another simplification locally that I will push later today. That will resolve the outstanding TODO, namely the allocation into a `Vec`.
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1929      +/-   ##
==========================================
- Coverage   94.81%   94.79%   -0.02%
==========================================
  Files         110      110
  Lines       35773    35725      -48
==========================================
- Hits        33918    33867      -51
- Misses       1855     1858       +3

☔ View full report in Codecov by Sentry.
Benchmark results

Performance differences relative to ea54273.

- coalesce_acked_from_zero 1+1 entries: 💚 Performance has improved.
  time: [187.33 ns 187.80 ns 188.32 ns]; change: [-2.1351% -1.7756% -1.4215%] (p = 0.00 < 0.05)
  Found 15 outliers among 100 measurements (15.00%): 1 (1.00%) low mild, 8 (8.00%) high mild, 6 (6.00%) high severe
- coalesce_acked_from_zero 3+1 entries: Change within noise threshold.
  time: [230.43 ns 231.05 ns 231.71 ns]; change: [-1.0369% -0.6201% -0.2308%] (p = 0.00 < 0.05)
  Found 14 outliers among 100 measurements (14.00%): 4 (4.00%) high mild, 10 (10.00%) high severe
- coalesce_acked_from_zero 10+1 entries: Change within noise threshold.
  time: [230.34 ns 231.22 ns 232.22 ns]; change: [-1.8720% -1.3513% -0.8607%] (p = 0.00 < 0.05)
  Found 7 outliers among 100 measurements (7.00%): 1 (1.00%) low mild, 6 (6.00%) high severe
- coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
  time: [212.11 ns 217.86 ns 231.09 ns]; change: [-1.9215% +1.4013% +7.3856%] (p = 0.75 > 0.05)
  Found 8 outliers among 100 measurements (8.00%): 2 (2.00%) high mild, 6 (6.00%) high severe
- RxStreamOrderer::inbound_frame(): Change within noise threshold.
  time: [119.09 ms 119.15 ms 119.22 ms]; change: [+0.8188% +0.9155% +1.0211%] (p = 0.00 < 0.05)
  Found 7 outliers among 100 measurements (7.00%): 1 (1.00%) low mild, 5 (5.00%) high mild, 1 (1.00%) high severe
- transfer/Run multiple transfers with varying seeds: Change within noise threshold.
  time: [122.72 ms 123.02 ms 123.31 ms]; thrpt: [32.437 MiB/s 32.514 MiB/s 32.594 MiB/s]
  change: time: [+2.3168% +2.6323% +2.9327%] (p = 0.00 < 0.05); thrpt: [-2.8492% -2.5647% -2.2644%]
  Found 3 outliers among 100 measurements (3.00%): 3 (3.00%) low mild
- transfer/Run multiple transfers with the same seed: Change within noise threshold.
  time: [123.18 ms 123.34 ms 123.51 ms]; thrpt: [32.385 MiB/s 32.429 MiB/s 32.474 MiB/s]
  change: time: [+2.4343% +2.6551% +2.8706%] (p = 0.00 < 0.05); thrpt: [-2.7905% -2.5864% -2.3764%]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) high mild
- 1-conn/1-100mb-resp (aka. Download)/client: No change in performance detected.
  time: [1.0628 s 1.0755 s 1.0915 s]; thrpt: [91.616 MiB/s 92.980 MiB/s 94.094 MiB/s]
  change: time: [-3.9674% -1.1739% +1.3846%] (p = 0.45 > 0.05); thrpt: [-1.3657% +1.1879% +4.1313%]
  Found 2 outliers among 10 measurements (20.00%): 2 (20.00%) high severe
- 1-conn/10_000-parallel-1b-resp (aka. RPS)/client: 💔 Performance has regressed.
  time: [436.07 ms 438.41 ms 440.73 ms]; thrpt: [22.690 Kelem/s 22.810 Kelem/s 22.932 Kelem/s]
  change: time: [+10.667% +11.663% +12.706%] (p = 0.00 < 0.05); thrpt: [-11.274% -10.445% -9.6389%]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) low mild
- 1-conn/1-1b-resp (aka. HPS)/client: 💔 Performance has regressed.
  time: [45.496 ms 45.716 ms 45.947 ms]; thrpt: [21.764 elem/s 21.874 elem/s 21.980 elem/s]
  change: time: [+6.9518% +7.6605% +8.3040%] (p = 0.00 < 0.05); thrpt: [-7.6673% -7.1154% -6.4999%]
  Found 1 outliers among 100 measurements (1.00%): 1 (1.00%) high mild

Client/server transfer results

Transfer of 33554432 bytes over loopback.
It is no longer needed as each connection is always processed.
This pull request is ready for a first review.
It includes multiple intertwined changes. E.g. the removal of the various `Server` fields allows processing each connection without an additional allocation into a `Vec`, thereby enabling the actual bug fix. Thus the pull request touches many components. I added a comment to each relevant change. Let me know if you would still prefer that I break the pull request up into multiple smaller patches.

I still need to look into the HPS benchmark regression to see whether it is related or just noise.
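To illustrate the point about dropping the extra `Vec` allocation, here is a minimal sketch of iterating the server's connection map in place instead of first copying "waiting" handles into a separate queue. The types and names (`MiniServer`, `MiniConnection`, `process_next_output`) are simplified stand-ins invented for this sketch, not the actual neqo code.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration only; not the neqo_transport types.
struct MiniConnection {
    needs_send: bool,
}

impl MiniConnection {
    // Stand-in for `Connection::process`: maybe produce a datagram to send.
    fn process(&mut self) -> Option<Vec<u8>> {
        if self.needs_send {
            self.needs_send = false;
            Some(vec![0u8; 16]) // pretend this is an encoded datagram
        } else {
            None
        }
    }
}

struct MiniServer {
    connections: HashMap<u64, MiniConnection>,
}

impl MiniServer {
    // Walk the connection map in place. There is no separate `waiting` queue
    // to keep in sync and no intermediate `Vec` of connection handles.
    fn process_next_output(&mut self) -> Option<Vec<u8>> {
        self.connections.values_mut().find_map(MiniConnection::process)
    }
}

fn main() {
    let mut server = MiniServer {
        connections: HashMap::from([
            (1, MiniConnection { needs_send: true }),
            (2, MiniConnection { needs_send: false }),
        ]),
    };
    assert!(server.process_next_output().is_some()); // connection 1 sends once
    assert!(server.process_next_output().is_none()); // nothing left to send
}
```

The design point is only the shape of the change: one mutable pass over the owned connection collection per call, with the connections themselves as the sole source of truth.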
Thus far I am unable to reproduce the
Lots of red. That's good.
Could we bump the coverage up?

Note that this pull request does not introduce any untested lines. In other words, every untested line in this patch was already untested before; see the coverage report. That said, the more tests the better! Though I suggest adding them in a separate pull request. Are there any specific code paths you have in mind, @larseggert?

Hm. Maybe I misread the coverage report, or lines were moved around. Anyway, WFM. Good to go.