Question about the data packing lab #100

Open
jsjtxietian opened this issue Oct 16, 2024 · 2 comments

Comments

@jsjtxietian
Contributor

jsjtxietian commented Oct 16, 2024

Hi, thanks for the great lab.

I know that the data packing lab is marked as broken, and I can't reproduce the roughly 20% speedup mentioned in the video either. However, I do get about a 3-8% speedup when using clang 17 on Windows. Maybe we could investigate the current state of this lab further?

@dendibakh
Owner

Hi @jsjtxietian, sure, if you're interested, feel free to investigate. I'm currently very busy, so I won't be able to look into this in the next 1-2 months.

@jsjtxietian
Contributor Author

jsjtxietian commented Oct 17, 2024

The following data was collected with N = 50000 and 10000 iterations, on Windows 11, using VTune with clang 17.0.6.
(Note: I could not get a reliable optimization effect with the original value of N.)

Hotspot analysis shows that the time saved comes mainly from std::shuffle:

image
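
One way to probe the std::shuffle observation in isolation is sketched below. This is a minimal sketch, not the lab's code: `Loose` and `Reordered` are hypothetical structs invented for illustration, and `time_shuffle_ms` is an assumed helper name. The working assumption is that a smaller element means fewer bytes moved per swap; actual timings will vary with the machine, compiler, and N.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical layouts for illustration only; NOT the lab's actual struct.
// Members in this order typically force padding between fields.
struct Loose {
  int i;
  long long l;
  short s;
  double d;
  bool b;
};

// Same members, ordered from largest to smallest to reduce padding.
struct Reordered {
  long long l;
  double d;
  int i;
  short s;
  bool b;
};

// Shuffles a vector of n value-initialized elements of type T and
// returns the elapsed wall-clock time of the shuffle in milliseconds.
template <typename T>
double time_shuffle_ms(std::size_t n, unsigned seed) {
  std::vector<T> v(n);
  std::mt19937 rng(seed);
  const auto start = std::chrono::steady_clock::now();
  std::shuffle(v.begin(), v.end(), rng);
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_shuffle_ms<Loose>(50000, 42)` and `time_shuffle_ms<Reordered>(50000, 42)` several times and comparing the medians, alongside `sizeof(Loose)` and `sizeof(Reordered)`, gives a rough picture; a single run is too noisy to draw conclusions from.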

Microarchitecture exploration shows a slight decrease in Backend Bound:

image

Some things I observed when comparing hardware events:

  • Reduced Data Cache Miss Cycles:
    • MEMORY_ACTIVITY.STALLS_L1D_MISS P-Core 3,336,010,008 - 696,002,088 = 2,640,007,920
    • MEMORY_ACTIVITY.STALLS_L2_MISS P-Core 2,232,006,696 - 192,000,576 = 2,040,006,120
  • Fewer Split Loads and Stores:
    • MEM_INST_RETIRED.SPLIT_LOADS P-Core 1,574,447,232 - 537,616,128 = 1,036,831,104
    • MEM_INST_RETIRED.SPLIT_STORES P-Core 1,533,646,008 - 496,814,904 = 1,036,831,104
  • Reduced DTLB Misses:
    • DTLB_LOAD_MISSES.STLB_HIT:cmask=1 P-Core 576,017,280 - 345,610,368 = 230,406,912
    • DTLB_STORE_MISSES.STLB_HIT:cmask=1 P-Core 964,828,944 - 859,225,776 = 105,603,168
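
The absolute differences above are easier to compare as relative reductions. The sketch below assumes that the first number on each line is the baseline and the second is the optimized build (the list does not say so explicitly); `EventDelta`, `kEvents`, `reduction_percent`, and `print_reductions` are names made up for this example.

```cpp
#include <cstdint>
#include <cstdio>

// One row of the list above. Assumption: `first` is the baseline run
// and `second` is the optimized run.
struct EventDelta {
  const char* name;
  std::uint64_t first;
  std::uint64_t second;
};

// Counter values copied verbatim from the list above.
constexpr EventDelta kEvents[] = {
    {"MEMORY_ACTIVITY.STALLS_L1D_MISS", 3336010008ULL, 696002088ULL},
    {"MEMORY_ACTIVITY.STALLS_L2_MISS", 2232006696ULL, 192000576ULL},
    {"MEM_INST_RETIRED.SPLIT_LOADS", 1574447232ULL, 537616128ULL},
    {"MEM_INST_RETIRED.SPLIT_STORES", 1533646008ULL, 496814904ULL},
    {"DTLB_LOAD_MISSES.STLB_HIT:cmask=1", 576017280ULL, 345610368ULL},
    {"DTLB_STORE_MISSES.STLB_HIT:cmask=1", 964828944ULL, 859225776ULL},
};

// Relative reduction in percent: (first - second) / first * 100.
constexpr double reduction_percent(std::uint64_t first, std::uint64_t second) {
  return 100.0 * static_cast<double>(first - second) /
         static_cast<double>(first);
}

// Prints one line per event with its relative reduction.
void print_reductions() {
  for (const EventDelta& e : kEvents) {
    std::printf("%-36s %6.2f%%\n", e.name,
                reduction_percent(e.first, e.second));
  }
}
```

Calling `print_reductions()` from `main` lists all six events. Under the stated assumption, the stall-cycle counters shrink by a much larger fraction than the DTLB store misses, which may help decide where to look first.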
