ARM: Pair Memory Instructions Microops #1545
mahyarsamani
started this conversation in
gem5-dev
Replies: 2 comments
-
@ivanaamit can you add this to the agenda for the dev meeting on Thursday?
-
Thanks @mahyarsamani; can I ask where the HW information is coming from? Usually uops are HW-specific and shouldn't be visible to the software (profiler).
-
Hi,
I am working on building a reliable model of the Neoverse N1 platform found in the Ampere Altra Q80 processor. I've been using the NAS Parallel Benchmarks to compare measurements in detail between gem5 and real hardware. I use PAPI to measure things like the number of cycles and the number of load and store instructions.
I have started my experiments with a model of Neoverse N1 core found here:
https://github.com/binebrank/gem5/blob/neoverse_model/configs/common/cores/arm/O3_ARM_Neoverse_N1.py
Using the CHI protocol, I have configured a cache system "similar" to CMN-600 in gem5.
Looking at the stats in gem5, I can see up to a 4x difference between gem5 and real hardware in the number of committed store instructions. Digging deeper, I have found the source of the issue to be the way pair memory instructions are microcoded in gem5. Below is some information from four workloads, each simulated in gem5 and benchmarked on real hardware.
Workload 0:
Number of store instructions on gem5: 9134773
Number of store instructions on real hardware: 9125397
Details of pair memory instructions:
8243x ['strxi_uop x30, [ureg0, #8].', 'strxi_uop x29, [ureg0].']
Workload 1:
Number of store instructions on gem5: 222628
Number of store instructions on real hardware: 150693
Details of pair memory instructions:
65983x ['strxi_uop x30, [ureg0].', 'strxi_uop x19, [ureg0, #8].']
Workload 2:
Number of store instructions on gem5: 4258729
Number of store instructions on real hardware: 2967315
Details of pair memory instructions:
987698x ['strqbfpxr_uop w1, [w11, w24].', 'strqtfpxr_uop w1, [w11, w24].']
Workload 3:
Number of store instructions on gem5: 39952364
Number of store instructions on real hardware: 10040649
Details of pair memory instructions:
9949696x ['strqtfpxi_uop x0, [ureg0, #16].', 'strqbfpxi_uop x0, [ureg0, #16].', 'strqbfpxi_uop x1, [ureg0].', 'strqtfpxi_uop x1, [ureg0].']
From this data, it seems to me that most pair memory instructions take 1 microop on real hardware, as opposed to 1/2/4 in gem5. Given that this could result in a significant performance difference between gem5 and real hardware, I wonder if there are recommendations or plans in place to change the way these instructions are microcoded?
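A rough sanity check of this reading can be sketched in Python. It is only an illustration under stated assumptions: that the `Nx` figure in each workload is the dynamic count of the pair instruction, that the listed pattern is the only pair pattern contributing, and that hardware counts each pair instruction as one store. The names `WORKLOADS`, `adjusted_store_count`, and `summarize` are made up for this sketch and are not part of gem5 or PAPI.

```python
# Numbers copied from the four workloads above:
# (gem5 stores, hardware stores, pair-instruction count, store microops per pair in gem5)
WORKLOADS = {
    0: (9134773, 9125397, 8243, 2),
    1: (222628, 150693, 65983, 2),
    2: (4258729, 2967315, 987698, 2),
    3: (39952364, 10040649, 9949696, 4),
}


def adjusted_store_count(gem5_stores, pair_count, uops_per_pair):
    """gem5 store count if each pair instruction were counted once
    instead of once per store microop (assumption, see lead-in)."""
    return gem5_stores - pair_count * (uops_per_pair - 1)


def summarize(workloads=WORKLOADS):
    """Return raw and adjusted gem5/hardware ratios per workload."""
    rows = {}
    for wid, (gem5, hw, pairs, uops) in workloads.items():
        adjusted = adjusted_store_count(gem5, pairs, uops)
        rows[wid] = {
            "raw_ratio": gem5 / hw,
            "adjusted": adjusted,
            "adjusted_ratio": adjusted / hw,
        }
    return rows


if __name__ == "__main__":
    for wid, row in summarize().items():
        print(
            f"Workload {wid}: raw {row['raw_ratio']:.2f}x, "
            f"adjusted {row['adjusted']} ({row['adjusted_ratio']:.2f}x)"
        )
```

Under these assumptions the adjusted gem5 counts land much closer to the hardware counts; for Workload 3, for example, the adjusted count is 10103276 against 10040649 measured on hardware.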
For easier navigation in gem5's code, here are links to where the microops are defined:
https://github.com/gem5/gem5/blob/stable/src/arch/arm/insts/macromem.hh#L473
https://github.com/gem5/gem5/blob/stable/src/arch/arm/insts/macromem.cc#L244
Mentioning Giacomo and Tiago to get their opinion. @giactra @tiagormk