perf(gcm): shrink Shoup table and tune GCM loop (IDFGH-13409) #14314

Closed


bryghtlabs-richard
Contributor

Profiling showed a lot of time in gcm_mult() during downloads.

Tune GCM loop for pure 32-bit processors like Xtensa and RV32.

With ESP32-S3-GCC 12.2.0, -O2:

| Item | Before | After | Notes |
| --- | --- | --- | --- |
| len(.rodata.last4) | 128B | 64B | |
| len(.text.gcm_mult) | 328B | 368B | |
| gcm_mult() cycles | ~1200 | ~930 | IRAM/DRAM + xthal_get_ccount() |


github-actions bot commented Aug 6, 2024

Warnings
⚠️

Some issues found for the commit messages in this PR:

  • the commit message "change(mbedtls/port): optimize gcm_mult()":
    • summary looks too short

Please fix these commit messages - here are some basic tips:

  • follow Conventional Commits style
  • correct format of commit message should be: <type/action>(<scope/component>): <summary>, for example fix(esp32): Fixed startup timeout issue
  • allowed types are: change,ci,docs,feat,fix,refactor,remove,revert,test
  • a sufficiently descriptive message summary should be between 20 and 72 characters and start with an upper case letter
  • avoid Jira references in commit messages (unavailable/irrelevant for our customers)

TIP: Install pre-commit hooks and run this check when committing (uses the Conventional Precommit Linter).

👋 Hello bryghtlabs-richard, we appreciate your contribution to this project!


📘 Please review the project's Contributions Guide for key guidelines on code, documentation, testing, and more.

🖊️ Please also make sure you have read and signed the Contributor License Agreement for this project.

Click to see more instructions ...


This automated output is generated by the PR linter DangerJS, which checks if your Pull Request meets the project's requirements and helps you fix potential issues.

DangerJS is triggered with each push event to a Pull Request and modifies the contents of this comment.

Please consider the following:
- Danger mainly focuses on the PR structure and formatting and can't understand the meaning behind your code or changes.
- Danger is not a substitute for human code reviews; it's still important to request a code review from your colleagues.
- Resolve all warnings (⚠️ ) before requesting a review from human reviewers - they will appreciate it.
- To manually retry these Danger checks, please navigate to the Actions tab and re-run last Danger workflow.

Review and merge process you can expect ...


We do welcome contributions in the form of bug reports, feature requests and pull requests via this public GitHub repository.

This GitHub project is a public mirror of our internal git repository.

1. An internal issue has been created for the PR, we assign it to the relevant engineer.
2. They review the PR and either approve it or ask you for changes or clarifications.
3. Once the GitHub PR is approved, we synchronize it into our internal git repository.
4. In the internal git repository we do the final review, collect approvals from core owners and make sure all the automated tests are passing.
- At this point we may do some adjustments to the proposed change, or extend it by adding tests or documentation.
5. If the change is approved and passes the tests it is merged into the default branch.
6. On the next sync from the internal git repository, the merged change will appear in this public GitHub repository.

Generated by 🚫 dangerJS against 1bb9db8

@espressif-bot espressif-bot added the Status: Opened Issue is new label Aug 6, 2024
@github-actions github-actions bot changed the title perf(gcm): shrink Shoup table and tune GCM loop perf(gcm): shrink Shoup table and tune GCM loop (IDFGH-13409) Aug 6, 2024
@mahavirj
Member

mahavirj commented Aug 7, 2024

@bryghtlabs-richard

I see some recent improvements in the upstream code too: Mbed-TLS/mbedtls@0767fda. We will check, might as well align to the upstream version. Just fyi.

@bryghtlabs-richard
Contributor Author

It's certainly worth testing the upstream approach. Upstream seems to assume unaligned access is possible; on ESP32 it is not, so we would spend more time doing XORs. I haven't measured it, though.

@bryghtlabs-richard
Contributor Author

bryghtlabs-richard commented Aug 7, 2024

The new MbedTLS version is slower. Each function was built with IRAM_ATTR and last4 with DRAM_ATTR; cycles were counted with xthal_get_ccount():

| Implementation | Cycles/Block | Cycles/Byte |
| --- | --- | --- |
| OldMbed/EspUpstream | 1214-1219 | 75.9 |
| NewMbedSmall | 4139-4141 | 258.7 |
| NewMbedLarge | 2168 | 135.5 |
| ThisPatch | 917-920 | 57.3 |

Edit: added Mbed's New, LargeTable approach, same test setup. Function runtimes depend slightly on caller, and slightly on instruction alignment in memory.
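
For readers comparing the two columns: the Cycles/Byte figures are consistent with dividing Cycles/Block by the 16-byte GCM block size. A minimal sketch of that arithmetic follows; the helper names `cycles_per_byte` and `percent_fewer_cycles` are hypothetical and not part of the benchmark repository.

```c
#include <assert.h>

/* GCM operates on 128-bit (16-byte) blocks. */
#define GCM_BLOCK_BYTES 16

/* Convert a measured cycles-per-block figure to cycles per byte. */
static double cycles_per_byte(unsigned cycles_per_block)
{
    return (double)cycles_per_block / GCM_BLOCK_BYTES;
}

/* Relative saving of `after` versus `before`, in percent (hypothetical helper). */
static double percent_fewer_cycles(unsigned before, unsigned after)
{
    assert(before > 0); /* guard the division below */
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

For example, `cycles_per_byte(2168)` gives 135.5, matching the NewMbedLarge row.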

@KaeLL
Contributor

KaeLL commented Aug 8, 2024

@bryghtlabs-richard do you mind sharing the benchmark setup?

@rsaxvc

rsaxvc commented Aug 8, 2024

I should also include the large table mbedtls approach

@bryghtlabs-richard
Contributor Author

@KaeLL , I've put my cycle-counting code into https://github.com/bryghtlabs-richard/esp-gcm-bench

@mahavirj , I've also tested the mbedTLS new large-table function, but it's worse than the old mbedTLS / current ESP-IDF approach. My preshift-unroll approach still seems to be the best for Xtensa.

@AdityaHPatwardhan
Collaborator

Hi @bryghtlabs-richard Thanks for the PR.

The changes look good to me.

@AdityaHPatwardhan AdityaHPatwardhan added the PR-Sync-Merge Pull request sync as merge commit label Aug 13, 2024
@AdityaHPatwardhan
Collaborator

AdityaHPatwardhan commented Aug 13, 2024

@bryghtlabs-richard Can you please squash all the commits into one commit?

@bryghtlabs-richard
Contributor Author

@AdityaHPatwardhan done. I think #14317 should go in first though.

@AdityaHPatwardhan
Collaborator

AdityaHPatwardhan commented Aug 14, 2024

Okay, #14317 has been merged in the internal code-base, the GitHub PR should be updated once the code is available on GitHub.

@AdityaHPatwardhan
Collaborator

sha=9b6dab9edb71290061e7f718ba48d76a0dd93e13

1) pre-shift GCM last4 to use 32-bit shift

On 32-bit architectures like Aarch32, RV32, Xtensa,
shifting a 64-bit variable by 32-bits is free,
since it changes the register representing half of the 64-bit var.
Pre-shift the last4 array to take advantage of this.

2) unroll first GCM iteration

The first loop of gcm_mult() is different from
the others. By unrolling it separately from the
others, the other iterations may take advantage
of the zero-overhead loop construct, in addition
to saving a conditional branch in the loop.
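
The two changes described in this commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the merged patch: `gcm_tables_t`, `step_ref`, `step_pre`, `mult_ref`, `mult_unrolled` and `variants_agree` are hypothetical names, the 32-bit entry width is inferred from the 128B to 64B table size reported above, and the constants are the reduction values recalled from the classic Mbed TLS `gcm.c`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the HL/HH tables kept in the GCM context. */
typedef struct { uint64_t HL[16]; uint64_t HH[16]; } gcm_tables_t;

/* Reduction constants as 64-bit entries: 16 * 8 = 128 bytes. */
static const uint64_t last4_ref[16] = {
    0x0000, 0x1c20, 0x3840, 0x2460, 0x7080, 0x6ca0, 0x48c0, 0x54e0,
    0xe100, 0xfd20, 0xd940, 0xc560, 0x9180, 0x8da0, 0xa9c0, 0xb5e0
};

/* Assumed layout: same constants pre-shifted by 16 bits, 16 * 4 = 64 bytes. */
static const uint32_t last4_pre[16] = {
    0x00000000, 0x1c200000, 0x38400000, 0x24600000,
    0x70800000, 0x6ca00000, 0x48c00000, 0x54e00000,
    0xe1000000, 0xfd200000, 0xd9400000, 0xc5600000,
    0x91800000, 0x8da00000, 0xa9c00000, 0xb5e00000
};

/* One 4-bit step using the 64-bit table and a 48-bit shift. */
static void step_ref(const gcm_tables_t *t, uint64_t *zh, uint64_t *zl, unsigned nib)
{
    unsigned rem = (unsigned)(*zl & 0xf);
    *zl = (*zh << 60) | (*zl >> 4);
    *zh = (*zh >> 4) ^ (last4_ref[rem] << 48) ^ t->HH[nib];
    *zl ^= t->HL[nib];
}

/* Same step with the pre-shifted table: a shift by exactly 32 bits, which a
 * 32-bit target can implement by placing the value in the high register. */
static void step_pre(const gcm_tables_t *t, uint64_t *zh, uint64_t *zl, unsigned nib)
{
    unsigned rem = (unsigned)(*zl & 0xf);
    *zl = (*zh << 60) | (*zl >> 4);
    *zh = (*zh >> 4) ^ ((uint64_t)last4_pre[rem] << 32) ^ t->HH[nib];
    *zl ^= t->HL[nib];
}

/* Reference loop: the first iteration is special-cased by a branch. */
static void mult_ref(const gcm_tables_t *t, const uint8_t x[16], uint64_t z[2])
{
    uint64_t zh = t->HH[x[15] & 0xf], zl = t->HL[x[15] & 0xf];
    for (int i = 15; i >= 0; i--) {
        if (i != 15) {
            step_ref(t, &zh, &zl, x[i] & 0xfu);
        }
        step_ref(t, &zh, &zl, x[i] >> 4);
    }
    z[0] = zh;
    z[1] = zl;
}

/* First iteration peeled out, leaving a branch-free loop body. */
static void mult_unrolled(const gcm_tables_t *t, const uint8_t x[16], uint64_t z[2])
{
    uint64_t zh = t->HH[x[15] & 0xf], zl = t->HL[x[15] & 0xf];
    step_pre(t, &zh, &zl, x[15] >> 4);
    for (int i = 14; i >= 0; i--) {
        step_pre(t, &zh, &zl, x[i] & 0xfu);
        step_pre(t, &zh, &zl, x[i] >> 4);
    }
    z[0] = zh;
    z[1] = zl;
}

/* Returns 1 if both variants agree on arbitrary (non-cryptographic) tables. */
static int variants_agree(void)
{
    gcm_tables_t t;
    uint8_t x[16];
    uint64_t a[2], b[2], s = 0x9e3779b97f4a7c15ULL;
    for (int i = 0; i < 16; i++) {
        assert((last4_ref[i] << 16) == last4_pre[i]);
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        t.HL[i] = s;
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        t.HH[i] = s;
        x[i] = (uint8_t)(s >> 56);
    }
    mult_ref(&t, x, a);
    mult_unrolled(&t, x, b);
    return a[0] == b[0] && a[1] == b[1];
}
```

Because the two variants are algebraically identical for any table contents, `variants_agree()` can be run on a host machine; it says nothing about cycle counts, which still need to be measured on the target.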
@AdityaHPatwardhan
Collaborator

sha=1bb9db875896da2605cf96bc0fd29b0111af2283

@AdityaHPatwardhan AdityaHPatwardhan added PR-Sync-Update Pull request sync fetch new changes PR-Sync-Merge Pull request sync as merge commit and removed PR-Sync-Merge Pull request sync as merge commit PR-Sync-Update Pull request sync fetch new changes labels Aug 18, 2024
@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: NA Issue resolution is unavailable and removed Status: Opened Issue is new labels Aug 21, 2024
@bryghtlabs-richard bryghtlabs-richard deleted the perf/gcm branch September 13, 2024 18:16
Labels
PR-Sync-Merge Pull request sync as merge commit Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally