Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add inverse NTTs for Kyber & Dilithium #37

Merged
merged 29 commits into from
Apr 11, 2024

Conversation

dop-amin
Copy link
Collaborator

@dop-amin dop-amin commented Mar 15, 2024

This PR introduces inverse NTTs for Kyber and Dilithium.
The type of transposition and reduction is supposed to match the code from PQClean [1,2].

TODO:

  • Add optimized Dilithium invNTTs
  • Simplify syntax as in NTTs: remove ldr/str macros that are no longer needed #36 (do this separately, we want to re-run every example for this)
  • Optional: Add intt_kyber_1234_567 (from experience, these variants perform not as well as the ones that merge more layers in the second merge)

[1] https://github.com/PQClean/PQClean/tree/8e221ae797b229858a0b0d784577a8cb149d5789/crypto_sign/dilithium3/aarch64
[2] https://github.com/PQClean/PQClean/tree/8e221ae797b229858a0b0d784577a8cb149d5789/crypto_kem/kyber768/aarch64

@hanno-becker
Copy link
Collaborator

@dop-amin I cancelled the CI which was spinning indefinitely on the example dry run.

@@ -120,7 +120,7 @@ def get_min_max_objective(slothy):
vqdmulh_lane,
vmull, vmlal,
vsrshr, vushr, vusra, vshl,
vand, vbic): ExecutionUnit.V(),
vand, vbic, cmge): ExecutionUnit.V(),
Copy link
Collaborator

@hanno-becker hanno-becker Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below you can now refer to cmge as ASimdCompare instead (assuming all compare instructions have similar performance characteristics). That makes the models more readable and simplifies the addition of further comparison instructions, because you only need to tweak the arch model.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I modified the models. However, if we have something like is_qform_form_of, then passing the parent class won't work. Do we want to keep it this way or do we want the function returned by is_qform_form_of to also return True on the child classes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dop-amin Good point. I think is_qform_of should be able to handle parent classes, yes. I bet the way I did this in aarch64_neon.py is not terribly pythonic, but you could use all_subclass_leaves() here.

@dop-amin dop-amin marked this pull request as ready for review April 3, 2024 12:28
@hanno-becker
Copy link
Collaborator

@dop-amin Are you going to investigate the CI failure or do you need help?

@hanno-becker
Copy link
Collaborator

@dop-amin Is there [going to be] a sibling PR to PQAX as well adding tests for the inverse NTT?

@dop-amin
Copy link
Collaborator Author

dop-amin commented Apr 5, 2024

@dop-amin Are you going to investigate the CI failure or do you need help?

Hi Hanno, I think the CI just times out because it takes too long to go through all the examples. Especially the ones using heuristics seem to take long because it involves so many individual calls to the solver. Do you have a suggestion on how to go about this? We could disable the CI for examples using heuristics.

@dop-amin
Copy link
Collaborator Author

dop-amin commented Apr 5, 2024

@dop-amin Is there [going to be] a sibling PR to PQAX as well adding tests for the inverse NTT?

Yes, I've been planning to submit it for a couple of days but now I finally did so. Thanks for reminding me.

@hanno-becker
Copy link
Collaborator

Hi Hanno, I think the CI just times out because it takes too long to go through all the examples. Especially the ones using heuristics seem to take long because it involves so many individual calls to the solver. Do you have a suggestion on how to go about this? We could disable the CI for examples using heuristics.

I am surprised by this because the dry run sets functional_only=True, allow_renaming=False and allow_reordering=False if I remember correctly -- this should not take long. Can you double-check that your scripts in example.py do not overwrite this?

example.py Outdated Show resolved Hide resolved
example.py Outdated Show resolved Hide resolved
modulus_addr: .quad 8380417
ninv_addr: .quad 16382
ninv_tw_addr: .quad 4197891
intt_dilithium_1234_5678_manual_ld4:
Copy link
Collaborator

@hanno-becker hanno-becker Apr 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again an optional improvement, but better to do while we're at it:

Could you go through the clean [inv]NTTs and add a comment at their entry symbol describing in what way (reduction / ordering) they deviate from a standard, fully-reduced NTT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point in general. I followed the way the current PQClean implementations handle the reductions sucht that we can have a fair comparison.

Regarding the ordering: All of our NTTs on aarch64 (should) output in the canonical ordering.

.endm

.macro montg_reduce a
srshr tmp.4S, \a\().4S, #23
Copy link
Collaborator

@hanno-becker hanno-becker Apr 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a Montgomery reduction? It looks more like Barrett, leveraging somehow that q is very close to 2^23.

In more detail: It looks like the absolute value |a/q - a/2**23| is at most 2**31 * |1/q - q/r| ~ 0.25, so the rounding is at most off by one. This neatly uses that the 'buffer' from q to the word boundary 2^32 is has about the same bitlength than the approximation q~2^23 (since q=2^23-2^13+1).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this appears to be a shortened version of reduce32 from the reference implementation. As mentioned above, I followed the current state in PQClean.
Anyhow, the naming is off.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you adjust the macro names?

str_vo data6, in, (6*(1024/8))
str_vo data7, in, (7*(1024/8))

mul_ninv data4, data5, data6, data7, data0, data1, data2, data3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to double-check if the mul_ninv and the canonical_reduce can be replaced by a refined Barrett multiplication the sense of https://eprint.iacr.org/2022/439.pdf.

Copy link
Collaborator

@hanno-becker hanno-becker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @dop-amin -- thank you very much for this work.

@hanno-becker hanno-becker merged commit d7b5296 into slothy-optimizer:main Apr 11, 2024
7 checks passed
@dop-amin
Copy link
Collaborator Author

LGTM, @dop-amin -- thank you very much for this work.

Great, thanks for your feedback in the process!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants