Using the library with much lighter overhead #602

gshanemiller · 2022-04-30T19:29:14Z

Consider this example, which I added to a fork of this repository. It is basically a redo of the supplied c_example.c:
https://github.com/rodgarrison/pcm/blob/master/examples1/example1.cpp

The nub of the code:

  PCM.pcm_c_init();
  PCM.pcm_c_start();

  // No memory accessed: no LLC refs or misses
  unsigned s=0;
  for (unsigned i=0; i<10000000; i++) {
    s+=1;
  }
 
  PCM.pcm_c_stop();

Run with:

# the first two events count core cycles and retired instructions; the last two count LLC references and LLC misses
./example1 umask=0x00,event=0x3c umask=0x00,event=0xc0 umask=0x4f,event=0x2e umask=0x41,event=0x2e

Giving this output:

test1: lcore: 0 cpu-cycles: 220075 instructions: 156010, instructions/cycle: 0.71, counter0: 220075, counter1: 156010, counter2: 16334, counter3: 720

While there will be sporadic LLC references and misses, the counts counter2: 16334 and counter3: 720 are far too high for a loop that touches no memory. My interpretation is that they reflect the overhead of the API, which collects a large amount of data, and that this overhead probably leaks into the instruction and cycle counts as well. All that noise makes the reported stats hard to understand.

Is there an example using this library that has much lighter overhead? For this kind of micro-benchmarking with the PMU we usually want:

1 pinned thread on 1 core only
1-4 counters such as cycles, retired instructions, and LLC references/misses
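For comparison, here is a minimal sketch of what "one thread, a few counters" can look like when going to the Linux `perf_event_open` system call directly instead of through PCM. This is not PCM's API; the helper names `raw_config`, `open_thread_counter`, and `count_event` are made up for illustration. It assumes Linux on an Intel core, where a raw event is encoded as `(umask << 8) | event`, and it assumes the kernel permits unprivileged perf events (otherwise the open simply fails).

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>

// Intel raw-event encoding as used by perf: event select in bits 0-7,
// unit mask in bits 8-15. E.g. umask=0x4f,event=0x2e -> 0x4f2e.
constexpr std::uint64_t raw_config(std::uint8_t umask, std::uint8_t event) {
    return (static_cast<std::uint64_t>(umask) << 8) | event;
}

// Open one counter for the calling thread (hypothetical helper).
// Returns a file descriptor, or -1 if the kernel refuses
// (permissions, container, no PMU in a VM, ...).
inline int open_thread_counter(std::uint32_t type, std::uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;        // start stopped; enabled around the region below
    attr.exclude_kernel = 1;  // count user-space only
    attr.exclude_hv = 1;
    return static_cast<int>(syscall(__NR_perf_event_open, &attr,
                                    0 /* calling thread */, -1 /* any CPU */,
                                    -1 /* no group */, 0UL));
}

// Count a single event across fn(). Returns false if the counter could
// not be opened or read; on success the count is stored in `out`.
template <typename Fn>
bool count_event(std::uint32_t type, std::uint64_t config, Fn&& fn,
                 std::uint64_t& out) {
    int fd = open_thread_counter(type, config);
    if (fd < 0) return false;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    bool ok = read(fd, &out, sizeof(out)) == static_cast<ssize_t>(sizeof(out));
    close(fd);
    return ok;
}
```

With this, the LLC events from the command line above would be requested as `count_event(PERF_TYPE_RAW, raw_config(0x4f, 0x2e), work, n)`. The enable/disable ioctls are themselves system calls, so this still adds some cost around the measured region; it only avoids the extra state a full library collects.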

opcm · 2022-05-04T08:47:02Z

Collaborator

The instruction count (156,010) is not even the same order of magnitude as the iteration count (10,000,000). Perhaps the thread was migrated? Could you try pinning the thread using pthread_setaffinity_np: https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html
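A minimal sketch of that pinning step, assuming Linux with glibc (where `pthread_setaffinity_np` and the `CPU_SET` macros are available; g++ defines `_GNU_SOURCE` for C++ by default). The wrapper name `pin_current_thread` is made up for illustration.

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single logical core.
// Returns 0 on success, otherwise an errno-style error code.
inline int pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Call it once before starting the counters, e.g. `pin_current_thread(5)`, and check the return value. On older glibc versions the program may need to be linked with `-pthread`.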


gshanemiller · 2022-05-07T17:13:46Z

Author

Thank you for taking time to respond. Intel PMUs are a great feature btw.

My error - pinning the thread is required as you rightly point out.

Corrected run:

# run with hyper-threading disabled

# vendor_id	: GenuineIntel
# cpu family	: 6
# model		: 158
# model name	: Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz [skylake/kaby/coffee]
# stepping	: 13
# microcode	: 0xea
# cpu MHz		: 3400.000
# cache size	: 16384 KB
# cpuid level	: 22

./example1 umask=0x00,event=0x3c umask=0x00,event=0xc0 umask=0x4f,event=0x2e umask=0x41,event=0x2e

=====  Processor information  =====
Linux arch_perfmon flag  : yes
Hybrid processor         : no
IBRS and IBPB supported  : yes
STIBP supported          : yes
Spec arch caps supported : yes
Max CPUID level          : 22
IBRS enabled in the kernel   : yes
STIBP enabled in the kernel  : no
The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS                     : yes
INFO: Linux perf interface to program uncore PMUs is present
building core event 'umask=0x00,event=0x3c' counter 0
building core event 'umask=0x00,event=0xc0' counter 1
building core event 'umask=0x4f,event=0x2e' counter 2
building core event 'umask=0x41,event=0x2e' counter 3
 Closed perf event handles
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
test1: lcore: 5 cpu-cycles: 508199958 instructions: 700409996, instructions/cycle: 1.38, counter0: 508199896, counter1: 700409996, counter2: 24300, counter3: 1003 s=100000000

Now contrast counters 2 and 3 (LLC references and misses) with this alternative run on the same machine, also on a pinned lcore.
The only differences from pcm are that the counters are read with rdpmc (not perf events) and that the setup
overhead is, I believe, lower:

test4: RAW VALUES
-------------------------------------------------------------------
test4:                : iterations run                 : 100000000
test4: fixed counter 0: retired instructions           : 700000187
test4: fixed counter 1: no-halt CPU cycles             : 543990093
test4: fixed counter 2: ref no-halt CPU cycles         : 371684148
test4: prog  counter 0: LLC references                 : 20
test4: prog  counter 1: LLC misses                     : 4
test4: prog  counter 2: brch instrct retired           : 100000069
test4: prog  counter 3: brch instrct not-taken retired : 3
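For readers unfamiliar with rdpmc, here is a rough sketch of how such a read can look. It is not the code that produced the output above. It assumes an x86 CPU, a compiler with GCC-style inline assembly, that the kernel has enabled user-space RDPMC (CR4.PCE), and that the counters have already been programmed by something else; without those, executing the instruction faults. The helper names are made up for illustration.

```cpp
#include <cstdint>

// Selector placed in ECX for RDPMC: programmable counters use their
// index directly; fixed-function counters additionally set bit 30.
constexpr std::uint32_t rdpmc_prog(std::uint32_t n)  { return n; }
constexpr std::uint32_t rdpmc_fixed(std::uint32_t n) { return (1u << 30) | n; }

#if defined(__x86_64__) || defined(__i386__)
// Read one performance counter. RDPMC returns the value in EDX:EAX.
// Assumption: user-space RDPMC is enabled and the counter is running.
inline std::uint64_t read_pmc(std::uint32_t selector) {
    std::uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif
```

A measurement then takes the difference of two reads, e.g. `read_pmc(rdpmc_fixed(0))` before and after the loop for retired instructions. Because the read is a single instruction with no system call, it perturbs the caches far less than a call into the kernel.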

I'm guessing the guidance for PCM is to run a no-op, measure the overhead PCM introduces, then run the code under test a number of times and subtract out (or average out) that overhead?
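The subtract-the-baseline idea could be sketched as follows. This is only an illustration of the approach, not documented PCM guidance. `read_counter` is a hypothetical callable that wraps whatever counter read is in use (a PCM state read, a perf file descriptor, rdpmc), and taking the median of the empty-region runs is one assumption among several reasonable choices.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Median cost, in counter units, of measuring an empty region.
// Returns 0 if runs <= 0.
template <typename ReadCounter>
std::uint64_t measure_overhead(ReadCounter&& read_counter, int runs) {
    if (runs <= 0) return 0;
    std::vector<std::uint64_t> deltas;
    deltas.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        std::uint64_t before = read_counter();
        std::uint64_t after = read_counter();  // nothing between the reads
        deltas.push_back(after - before);
    }
    auto mid = deltas.begin() + deltas.size() / 2;
    std::nth_element(deltas.begin(), mid, deltas.end());
    return *mid;
}

// Measure fn() and subtract the previously measured overhead,
// clamping at zero so noise cannot produce a wrapped-around value.
template <typename ReadCounter, typename Fn>
std::uint64_t measure_net(ReadCounter&& read_counter, Fn&& fn,
                          std::uint64_t overhead) {
    std::uint64_t before = read_counter();
    fn();
    std::uint64_t after = read_counter();
    std::uint64_t delta = after - before;
    return delta > overhead ? delta - overhead : 0;
}
```

This only helps when the overhead is roughly constant from run to run. If the baseline itself varies by thousands of LLC references, a workload that causes only a handful will stay below the noise even after subtraction.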


opcm · 2022-06-21T07:32:50Z

Collaborator

I think the difference is at noise level here. E.g., for instructions retired it is 0.06%.
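The 0.06% figure can be reproduced from the two runs in this thread: 700409996 instructions with PCM versus 700000187 with the rdpmc-based run. A small helper for that arithmetic is shown below; the function name is made up for illustration.

```cpp
#include <cmath>

// Relative difference of `measured` from `reference`, in percent.
// Assumes reference is non-zero.
inline double relative_diff_percent(double measured, double reference) {
    return 100.0 * std::fabs(measured - reference) / reference;
}

// relative_diff_percent(700409996.0, 700000187.0) is about 0.0585,
// i.e. the 0.06% quoted above for retired instructions.
```

The same helper can be applied to the cycle and LLC counters of the two runs to see how each of them compares.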

