Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement NTRU Prime #384

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Implement NTRU Prime #384

wants to merge 5 commits into from

Conversation

Keelan10
Copy link

Hello,

This is a draft of NTRU Prime which implements all the parameters specified in the round 3 submission.
It passes the KAT tests provided in the NIST submission package.

Thanks for reviewing the PR!

@bwesterb
Copy link
Member

Thank you for this, we will have a look. That might take some time.

Where does this implementation come from? Did you implement it yourself from the spec or did you transliterate an existing implementation?

@bwesterb
Copy link
Member

Early comment: I see you use quite a lot of allocations:

BenchmarkDecapsulate/ntrulpr1013-8                           	      22	  48401778 ns/op	   57280 B/op	     135 allocs/op
BenchmarkDecapsulate/ntrulpr1277-8                           	      14	  76694781 ns/op	   72673 B/op	     145 allocs/op
BenchmarkDecapsulate/ntrulpr653-8                            	      57	  20311517 ns/op	   39968 B/op	     135 allocs/op
BenchmarkDecapsulate/ntrulpr761-8                            	      40	  28332384 ns/op	   43752 B/op	     135 allocs/op
BenchmarkDecapsulate/ntrulpr857-8                            	      30	  34925391 ns/op	   50016 B/op	     135 allocs/op
BenchmarkDecapsulate/ntrulpr953-8                            	      25	  43443037 ns/op	   56544 B/op	     135 allocs/op
BenchmarkEncapsulate/ntrulpr1013-8                           	      33	  32511700 ns/op	   48160 B/op	      97 allocs/op
BenchmarkEncapsulate/ntrulpr1277-8                           	      20	  53727275 ns/op	   60928 B/op	     103 allocs/op
BenchmarkEncapsulate/ntrulpr653-8                            	      91	  13530212 ns/op	   33472 B/op	      97 allocs/op
BenchmarkEncapsulate/ntrulpr761-8                            	      66	  19986460 ns/op	   36792 B/op	      97 allocs/op
BenchmarkEncapsulate/ntrulpr857-8                            	      45	  25415836 ns/op	   41728 B/op	      97 allocs/op
BenchmarkEncapsulate/ntrulpr953-8                            	      37	  31389920 ns/op	   47232 B/op	      97 allocs/op
BenchmarkDecapsulate/HPKE_KEM_P256_HKDF_SHA256-8             	   15246	     78954 ns/op	    3361 B/op	      44 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1013-8                       	      67	  17443585 ns/op	   37824 B/op	    1054 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-8                       	      45	  26197393 ns/op	   47760 B/op	    1320 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-8                        	     172	   6903091 ns/op	   25104 B/op	     694 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-8                        	     129	   9488787 ns/op	   28384 B/op	     802 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-8                        	     100	  11835992 ns/op	   32496 B/op	     898 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-8                        	      82	  14593648 ns/op	   36336 B/op	     994 allocs/op

Those should be around 1-5.

For comparison:

BenchmarkGenerateKeyPair/Kyber768-8                          	   36907	     32178 ns/op	    8256 B/op	       5 allocs/op
BenchmarkEncapsulate/Kyber768-8                              	   62809	     18425 ns/op	    1216 B/op	       3 allocs/op
BenchmarkDecapsulate/Kyber768-8                              	   62088	     19122 ns/op	      32 B/op	       1 allocs/op

Also, the operations are very slow compared to the cycle counts given in the round 3 submission. That one uses AVX2 intrinsics, but the difference is bigger than can be explained by that.

@Keelan10
Copy link
Author

Keelan10 commented Dec 12, 2022

Thank you for this, we will have a look. That might take some time.

Where does this implementation come from? Did you implement it yourself from the spec or did you transliterate an existing implementation?

I have translated it directly from the C reference/optimized implementation.

@Keelan10
Copy link
Author

Also, the operations are very slow compared to the cycle counts given in the round 3 submission. That one uses AVX2 intrinsics, but the difference is bigger than can be explained by that.

Noted, I am having a look at it. I will revert back to you afterwards.
Thanks.

@armfazh armfazh added the new feature New functionality or module label Dec 13, 2022
@Keelan10
Copy link
Author

Keelan10 commented Dec 18, 2022

I have managed to cut down the number of allocations.

BenchmarkGenerateKeyPair/ntrulpr1013-4                                50          34254035 ns/op           23591 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-4                                16          67169300 ns/op           28929 B/op         36 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-4                                 69          17044200 ns/op           15605 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-4                                 50          25825864 ns/op           17712 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-4                                 39          31935840 ns/op           20178 B/op         34 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-4                                 33          36169091 ns/op           22640 B/op         34 allocs/op
BenchmarkEncapsulate/ntrulpr1013-4                                    26          58675716 ns/op           39601 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr1277-4                                    16         102924990 ns/op           50145 B/op         95 allocs/op
BenchmarkEncapsulate/ntrulpr653-4                                     64          22068697 ns/op           27104 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr761-4                                     37          31186720 ns/op           30256 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr857-4                                     37          39697089 ns/op           33818 B/op         89 allocs/op
BenchmarkEncapsulate/ntrulpr953-4                                     24          47283570 ns/op           38069 B/op         89 allocs/op
BenchmarkDecapsulate/ntrulpr1013-4                                    18          86339886 ns/op           47548 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr1277-4                                     9         135162156 ns/op           59486 B/op        136 allocs/op
BenchmarkDecapsulate/ntrulpr653-4                                     42          35454582 ns/op           32521 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr761-4                                     28          43609793 ns/op           35601 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr857-4                                     21          55018576 ns/op           40652 B/op        126 allocs/op
BenchmarkDecapsulate/ntrulpr953-4                                     19          75017090 ns/op           45854 B/op        126 allocs/op

The number of allocations for streamlined NTRU prime is given below:

BenchmarkGenerateKeyPair/sntrup653-4                  10         110994258 ns/op           14825 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup761-4                   8         141809830 ns/op           16970 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup857-4                   4         294731995 ns/op           19296 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup953-4                   4         364868864 ns/op           21984 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup1013-4                  3         446066821 ns/op           23360 B/op         27 allocs/op
BenchmarkGenerateKeyPair/sntrup1277-4                  2         619764119 ns/op           28904 B/op         29 allocs/op
BenchmarkEncapsulate/sntrup653-4                      78          14519160 ns/op           17904 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup761-4                      88          21360334 ns/op           19784 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup857-4                      76          29744781 ns/op           22736 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup953-4                      38          32997382 ns/op           25456 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup1013-4                     27          44015633 ns/op           26288 B/op         74 allocs/op
BenchmarkEncapsulate/sntrup1277-4                     19          55111287 ns/op           33744 B/op         80 allocs/op
BenchmarkDecapsulate/sntrup653-4                      38          46424767 ns/op           23728 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup761-4                      28          66217428 ns/op           25752 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup857-4                      15          77064377 ns/op           30032 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup953-4                      10         100944968 ns/op           33520 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup1013-4                     13         108477276 ns/op           34160 B/op        114 allocs/op
BenchmarkDecapsulate/sntrup1277-4                      7         167687608 ns/op           43474 B/op        124 allocs/op

I have identified that the remaining allocations come mostly from the the following functions:

  • internal.Encode
  • internal.Decode
  • expand

Could I ask you to provide pointers on how to optimize these functions please?
Thank you.

@bwesterb
Copy link
Member

I have translated it directly from the C reference/optimized implementation.

Then you should appropriately attribute the original.

@bwesterb
Copy link
Member

Could I ask you to provide pointers on how to optimize these functions please? Thank you.

Are you familiar with Go's heap escape mechanism? Typically allocations can be removed by avoiding interfaces, function types and having the caller allocate values.

For an example of the latter:

func Pack(something T) []byte {
  ret := make([]byte, 123)
  // write into ret
  return ret
}

will have an unavoidable allocation, whereas

func Pack(ret []byte, something T) {
  // write into ret
}

func SomeOtherFunc() {
  var something T
   // ...
   out  := make([]byte, 123)
   Pack(out, something)
}

will not.

@armfazh
Copy link
Contributor

armfazh commented Dec 19, 2022

@Keelan10 could you please format the files so the linter checks passes.

@Keelan10
Copy link
Author

Then you should appropriately attribute the original.

I added it to the doc.go file.

Are you familiar with Go's heap escape mechanism? Typically allocations can be removed by avoiding interfaces, function types and having the caller allocate values.

Thank you for this.

@Keelan10
Copy link
Author

@Keelan10 could you please format the files so the linter checks passes.

Yes, I have formatted the files accordingly. Thanks.

@Keelan10
Copy link
Author

After optimizing the Encode function, the number of allocations is as follows:

BenchmarkGenerateKeyPair/ntrulpr1013-4                                57          19854325 ns/op           21216 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr1277-4                                37          32050562 ns/op           26336 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr653-4                                130           8618956 ns/op           13920 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr761-4                                100          15008387 ns/op           15840 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr857-4                                 76          17634686 ns/op           18144 B/op         16 allocs/op
BenchmarkGenerateKeyPair/ntrulpr953-4                                 34          30173744 ns/op           20192 B/op         16 allocs/op
BenchmarkEncapsulate/ntrulpr1013-4                                    18          86437308 ns/op           36608 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr1277-4                                     9         114803620 ns/op           46017 B/op         75 allocs/op
BenchmarkEncapsulate/ntrulpr653-4                                     57          27666982 ns/op           25408 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr761-4                                     31          50561442 ns/op           28048 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr857-4                                     18          66679760 ns/op           31552 B/op         71 allocs/op
BenchmarkEncapsulate/ntrulpr953-4                                     15          71584093 ns/op           35776 B/op         71 allocs/op
BenchmarkDecapsulate/ntrulpr1013-4                                    14          85484315 ns/op           43424 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr1277-4                                     7         150202433 ns/op           54818 B/op        115 allocs/op
BenchmarkDecapsulate/ntrulpr653-4                                     38          35952484 ns/op           30240 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr761-4                                     33          38854777 ns/op           33216 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr857-4                                     25          53645810 ns/op           37792 B/op        107 allocs/op
BenchmarkDecapsulate/ntrulpr953-4                                     20          60605877 ns/op           42784 B/op        107 allocs/op
BenchmarkGenerateKeyPair/sntrup653-4                  12         123409579 ns/op           11968 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup761-4                   8         132397397 ns/op           13889 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup857-4                   7         159702746 ns/op           15680 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup953-4                   6         218574910 ns/op           17984 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup1013-4                  4         295355986 ns/op           19264 B/op          8 allocs/op
BenchmarkGenerateKeyPair/sntrup1277-4                  2         567121358 ns/op           23616 B/op          8 allocs/op
BenchmarkEncapsulate/sntrup653-4                      66          16461438 ns/op           15056 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup761-4                      51          19784946 ns/op           16704 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup857-4                      51          25694787 ns/op           19120 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup953-4                      39          33983371 ns/op           21456 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup1013-4                     38          35351851 ns/op           22192 B/op         55 allocs/op
BenchmarkEncapsulate/sntrup1277-4                     18          58978942 ns/op           28464 B/op         59 allocs/op
BenchmarkDecapsulate/sntrup653-4                      39          39306603 ns/op           17360 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup761-4                      26          71803380 ns/op           18832 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup857-4                      19          70680971 ns/op           21936 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup953-4                      13         100234053 ns/op           24400 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup1013-4                     10         165934875 ns/op           24944 B/op         91 allocs/op
BenchmarkDecapsulate/sntrup1277-4                      6         167856732 ns/op           31666 B/op         99 allocs/op

Also, I am not sure how to optimize the Decode function and any suggestion would be greatly appreciated. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature New functionality or module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants