Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform 2-RTT Handshake to upgrade to PQ when possible #4526

Merged
merged 5 commits into from
Jun 24, 2024

Conversation

alexw91
Copy link
Contributor

@alexw91 alexw91 commented Apr 25, 2024

Resolved issues:

N/A

Description of changes:

Updates s2n to always prefer upgrading to a PQ Hybrid KeyShare whenever possible, even if doing so would require a 2 round trip handshake. s2n's current behavior is to choose the best KeyShare option possible in a 1-RTT handshake first, and only if a 1-RTT is not possible will a KeyShare algorithm that requires a 2-RTT handshake be chosen. After this change, s2n's server-side negotiation logic will consider a Hybrid PQ KeyShare as higher priority than completing the handshake in 1-RTT. This is to ensure that Hybrid PQ KeyShares will be guaranteed to be negotiated during any transition periods between one PQ KeyShare algorithm and another PQ KeyShare algorithm (where a newer PQ algorithm isn't mutually supported yet, but the older PQ algorithm can be negotiated in 2 round trips).

There is further discussion of this change in the IETF mailing list, and in a draft RFC by David Benjamin from Google:

Call-outs:

N/A

Testing:

Unit Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch 4 times, most recently from e923ece to e2d4a2c Compare May 21, 2024 16:35
@alexw91 alexw91 marked this pull request as ready for review May 21, 2024 17:16
@alexw91 alexw91 requested review from lrstewart and WillChilds-Klein and removed request for lrstewart May 21, 2024 17:59
@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch from e2d4a2c to bb5f254 Compare May 23, 2024 16:11
@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch from bb5f254 to 2013258 Compare June 3, 2024 20:21
Copy link
Contributor

@WillChilds-Klein WillChilds-Klein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code changes look good, just a few questions and nit-picks. perhaps consider adding two additional test cases:

  1. an integration test asserting that a client sending a server-usupported PQ keyshare negotiates Kyber512
  2. a unit test with the client sending multiple keyshares (perhaps the first unsupported by server, second supported by server)

@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch 2 times, most recently from 34b98d0 to db306ee Compare June 12, 2024 22:13
@alexw91
Copy link
Contributor Author

alexw91 commented Jun 14, 2024

code changes look good, just a few questions and nit-picks. perhaps consider adding two additional test cases:

  1. an integration test asserting that a client sending a server-usupported PQ keyshare negotiates Kyber512

s2n's current integration tests are fragile and hard to run and debug locally. I'd prefer to test this change using unit tests until this issue is resolved: #4342

  1. a unit test with the client sending multiple keyshares (perhaps the first unsupported by server, second supported by server)

These unit tests already exist in this test file. Found some that appear to be testing what you are asking:

  1. https://github.com/aws/s2n-tls/blob/main/tests/unit/s2n_client_key_share_extension_pq_test.c#L589-L591
  2. https://github.com/aws/s2n-tls/blob/main/tests/unit/s2n_client_key_share_extension_pq_test.c#L774-L776

@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch from db306ee to 46ae0e0 Compare June 14, 2024 23:15
@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch from 46ae0e0 to 0fa4c71 Compare June 18, 2024 23:34
Copy link
Contributor

@lrstewart lrstewart Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my discomfort with this behavior change is coming from the fact that we're making an expensive choice (HRR, and with large key shares) based on assumptions about both the client and server. For the client, we're ignoring any intention indicated by key_share as unreliable, and we've never considered client preference order for options like supported_groups. For the server, we keep our PQ and classic groups in separate preference lists, so it's impossible for an application to indicate relative preferences between the two.

But I think the argument here is that this is just "secure by default". We've unilaterally made the decision for both applications and their clients that PQ is always strongly preferable to classic.

It might be worth adding a comment to the security policy struct reminding developers of that implied preference so they have it in mind while writing new security policies. Like, a reminder that if you add kem preferences to a policy, those always take priority over ecc groups, even at the cost of HRR.

This choice will likely prevent us from adding PQ to the default policies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that after this change we'd only incur an HRR in 3 cases:

  1. Client supports PQ per supported_groups, but doesn't send a PQ keyshare.
    • result: PQ keyshare in HelloRetry
  2. Client supports PQ, but sends malformed PQ keyshare or PQ keyshare for some algorithm combination not supported by the server
    • result: PQ keyshare in HelloRetry
  3. Client doesn't support PQ, and sends a keyshare with with some ECDHE group not supported by the server
    • result: ECDHE-only keyshare in HelloRetry

1.) is somewhat odd behavior; the only use-case I can think of for that is "probing" server PQ support or clients readying themselves for seamless PQ transition once the PQ groups are enabled server-side. 2.) is possible for non-s2n clients, but somewhat unlikely -- Kyber/MLKEM are what the industry has settled on and the code points are well known.

I don't think 1.) or 2.) would be particularly common, and 3.) was existing behavior before this PR. What's the concern with adding PQ to default policies?

Copy link
Contributor Author

@alexw91 alexw91 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 will happen if there's ever a migration away from ML-KEM towards some "ML-KEM-v2" or equivalent. If we deploy this change today, little to nothing will change, but it will make discussions around PQ compliance much easier to have with customers, and save a lot of headache years from now if we ever need to migrate off of ML-KEM towards a better PQ algorithm.

Suppose 5 years from now "ML-KEM-v2" is invented that is twice as fast and uses half the bandwidth of today's ML-KEM (let's call it "ML-KEM-v1" for clarity). At that point in the future, clients will want to migrate to "ML-KEM-v2" and offer it in their KeyShare extension, but keep "ML-KEM-v1" as a fallback in their SupportedGroup extension without sending it in their KeyShare extension. In this scenario, AWS doesn't know if customers prefer PQ-security over first-byte latency, but if the customer is offering PQ at all, the safest option to ensure all customers are meeting all their compliance requirements around PQ, and select PQ whenever possible. This would perform a 2-RTT handshake and negotiate "ML-KEM-v1" from the SupportedGroups extension, even if it wasn't sent in the clients KeyShare extension.

Without this change customers who enable ML-KEM-v2 in their client configs early (before matching server-side support) will counterintuitively downgrade themselves to classical ECDHE KeyShares. Depending on the customers understanding of how TLS works, they might blame AWS TLS endpoints for downgrading their previously working PQ connections to ECDHE when it was correctly offered by their clients, and all they did was add a new PQ algorithm to the top of their preference list. To avoid these types of customer misconfiguration issues where changes to client-side PQ preference ordering might accidentally cause connections to lose PQ security that was previously working, we have decided that ensuring PQ algorithms are selected when they are offered is worth the small risk of a loss in performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the concern with adding PQ to default policies?

Applications use the default policies when they just want secure defaults and have no particular requirements for what those defaults are. Those applications may not want a HRR in order to select an option that they don't particularly need or care about. But with the current security policy structure, there's no way to indicate that PQ is supported but not strongly preferred. But maybe that's still fine, and we want everyone doing PQ all the time by default.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This choice will likely prevent us from adding PQ to the default policies.

I think this should be rephrased to

This choice will likely prevent us from adding PQ algorithms which were deployed before its final specification to the default policies.

If the algorithm is final and will not change, you can add it in the default policy.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment in the PQ policy to explain the logic as @lrstewart suggests is a good idea.

And one more comment about key_share as an order indicator. This was not intender. RFC8446-bis is trying to explain that and enforce that the order is depicted from the group, not the key_share.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And +1 about not having a unified order instead of two orders which have order between them. I think @alexw91 explained this would take a lot of work. One unified order for picking groups would be better, but as it is now, we need to enforce an order between the two order and I think the way done here is the better way because of secure-by-default and that we don't want to downgrade someone that chose to enable PQ and support it.

Copy link
Contributor

@jmayclin jmayclin Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we'd want to have a careful performance discussion before adding PQ to any of the default policies. On graviton 3

  • 3 cert RSA2048 chain, secp256 -> secp256/kyber768, 14% more cpu
  • 3 cert ECDSA256 chain, secp256 -> secp256/kyber768, 35% more cpu

@lrstewart lrstewart requested a review from jmayclin June 19, 2024 07:00
@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch 2 times, most recently from 793a895 to d2b0334 Compare June 19, 2024 20:25
@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch 3 times, most recently from 1845e0d to 61cec75 Compare June 21, 2024 22:11
@alexw91 alexw91 force-pushed the 2-rtt-pq-handshake branch from 61cec75 to 34a97bc Compare June 24, 2024 16:14
@jmayclin jmayclin merged commit 2edc3f8 into aws:main Jun 24, 2024
34 checks passed
rod-chapman pushed a commit to rod-chapman/s2n-tls that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants