HPCC RandomAccess benchmark added. #286

Open
wants to merge 6 commits into master

Conversation

alexfrolov
Contributor

Hi guys!

I want to compare different HPC runtimes on a set of benchmarks. As a first step, I have implemented HPCC RandomAccess on Grappa. It differs slightly from demo-gups* in that it follows the original HPCC RandomAccess benchmark more closely. If you are interested, you can include it in master.
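
For reference, the heart of HPCC RandomAccess is a stream of XOR updates at pseudo-random indexes into a power-of-two-sized table; a minimal serial sketch of that kernel (for orientation only, not this PR's Grappa code):

#include <cstdint>
#include <vector>

// HPCC's pseudo-random sequence: shift left, XOR with POLY when the top
// bit was set (a Galois LFSR step).
const uint64_t POLY = 0x0000000000000007ULL;

uint64_t next_random(uint64_t x) {
  return (x << 1) ^ (((int64_t)x < 0) ? POLY : 0);
}

// XOR each pseudo-random value into the slot it indexes; table.size()
// must be a power of two for the & (n - 1) masking to be correct.
void random_access(std::vector<uint64_t>& table, uint64_t num_updates) {
  const uint64_t n = table.size();
  uint64_t ran = 1;  // simplified seed; HPCC derives per-stream seeds
  for (uint64_t i = 0; i < num_updates; i++) {
    ran = next_random(ran);
    table[ran & (n - 1)] ^= ran;
  }
}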

Best,
Alex

@bholt
Member

bholt commented Jul 19, 2016

Hi Alex, good work. A couple questions:

  • Does this code synchronize those delegate::call<async>s somewhere? These delegates use the default global synchronizer implicitly; usually in our code they happen inside a forall, which also uses the default global synchronizer, so the forall only finishes when all the delegates have completed. Since this is inside an on_all_cores only, I suspect it never calls wait on the default_gce. This could definitely affect the correctness of your timing measurements (see the sketch after this list).
  • Also, doesn't CMake require that you explicitly add_subdirectory(hpcc) in the root CMakeLists.txt file? Otherwise this won't be compiled.
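
A minimal sketch of the explicit-synchronization pattern (assuming Grappa's GlobalCompletionEvent and async-delegate API; do_updates and the index math are illustrative, not the PR's code):

#include <Grappa.hpp>

using namespace Grappa;

GlobalCompletionEvent gce;  // dedicated completion event for the update phase

// table: global array of n words (n a power of two); every core issues
// updates_per_core pseudo-random XOR updates.
void do_updates(GlobalAddress<int64_t> table, size_t n, size_t updates_per_core) {
  on_all_cores([=]{
    for (size_t i = 0; i < updates_per_core; i++) {
      uint64_t key = i * 0x9e3779b97f4a7c15ULL;  // stand-in for the real RNG
      auto g = table + (key & (n - 1));
      // enroll each async delegate with &gce rather than the implicit default
      delegate::call<async, &gce>(g.core(), [g, key]{
        *g.pointer() ^= key;  // runs on g's home core
      });
    }
    gce.wait();  // returns only after every enrolled delegate has completed
  });
}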

@alexfrolov
Contributor Author

Hi Brandon,

thank you for your comments! I have fixed both issues. Could you please check the synchronization of the delegates?

@alexfrolov
Contributor Author

The test seems to work on small table sizes (and shows good performance), but as the size increases it segfaults. Any guesses on how to track it down?

srun --partition=A --ntasks-per-node=8 --nodes=1 --time=180:00 ./applications/hpcc/hpcc_random_access.exe --iters=4 --scale=24 --node_memsize=53687091200

I0720 17:32:43.431330 29791 Grappa.cpp:647] 
-------------------------
Shared memory breakdown:
  node total:                   50 GB
  locale shared heap total:     25 GB
  locale shared heap per core:  3.125 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         0.78125 GB
  aggregator per core:          0.0650177 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     0.78125 GB
  free per locale:              16.6694 GB
  free per core:                2.08368 GB
-------------------------
I0720 17:32:43.532766 29791 hpcc_random_access.cpp:109]     Global table size   = 2^24 * 128 = -2147483648 words
I0720 17:32:43.532836 29791 hpcc_random_access.cpp:110]     Number of processes = 128
I0720 17:32:43.532857 29791 hpcc_random_access.cpp:111]     Number of updates = 8589934592
I0720 17:32:43.532938 29791 hpcc_random_access.cpp:38] HPCC RandomAccess
Exiting due to signal 11 with siginfo 0x4002157fa430 and payload 0x4002157fa300
slurm_srun: error: A1: task 0: Exited with exit code 1
slurm_srun: Terminating job step 56353.0
slurmstepd: *** STEP 56353.0 ON A1 CANCELLED AT 2016-07-20T17:32:45 ***
slurm_srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurm_srun: error: A10: tasks 60,63: Killed
slurm_srun: error: A1: tasks 3,5-7: Killed
slurm_srun: error: A9: tasks 48,50,54-55: Killed
slurm_srun: error: A4: tasks 24,29-30: Killed
slurm_srun: error: A2: tasks 9,11,13-14: Killed
slurm_srun: error: A19: tasks 99-101: Killed



Grappa::GlobalCompletionEvent randomaccess_gce;

template <Grappa::GlobalCompletionEvent * GCE = &randomaccess_gce >
Member

@bholt bholt Jul 20, 2016

You wouldn't necessarily need to make this a template parameter for run_random_access. Is there a reason not to just use randomaccess_gce directly?
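
i.e., a sketch of the non-template version (update loop elided):

Grappa::GlobalCompletionEvent randomaccess_gce;

void run_random_access() {
  Grappa::on_all_cores([]{
    // ... issue delegate::call<async, &randomaccess_gce>(...) updates here ...
    randomaccess_gce.wait();
  });
}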

@bholt
Member

bholt commented Jul 20, 2016

I can't tell at the moment why it's segfaulting. Can you enable backtraces or attach to it with GDB? (if you don't know how, I can explain or link you to the docs)
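
One way, using the freeze-on-error hook that appears later in this thread: run with GRAPPA_FREEZE_ON_ERROR=1 so the failing process spins in its signal handler, then attach from another shell on that node (the PID is a placeholder):

GRAPPA_FREEZE_ON_ERROR=1 srun ... ./hpcc_random_access.exe ...
gdb -p <pid>
(gdb) bt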

@alexfrolov
Contributor Author

Running in debug mode shows the following:

-------------------------
Shared memory breakdown:
  node total:                   50 GB
  locale shared heap total:     25 GB
  locale shared heap per core:  3.125 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         0.78125 GB
  aggregator per core:          0.0650177 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     0.78125 GB
  free per locale:              16.6695 GB
  free per core:                2.08369 GB
-------------------------
I0721 16:08:11.565477 25243 hpcc_random_access.cpp:106]     Global table size   = 2^24 * 128 = -2147483648 words
I0721 16:08:11.565526 25243 hpcc_random_access.cpp:107]     Number of processes = 128
I0721 16:08:11.565537 25243 hpcc_random_access.cpp:108]     Number of updates = 8589934592
I0721 16:08:11.565543 25243 hpcc_random_access.cpp:37] HPCC RandomAccess
hpcc_random_access.exe: /mnt/lustre/home/frolo/grappa/system/Allocator.hpp:138: std::map<long int, AllocatorChunk>::iterator Allocator::add_to_chunk_map(const AllocatorChunk&): Assertion `inserted' failed.

@bholt
Member

bholt commented Jul 21, 2016

Oh, for one thing, N is only set on core 0. In Grappa we use static variables a lot, but they have to be initialized in an on_all_cores (see the sketch below). If that doesn't fix it, I'd start adding more debug prints to see at which point you get that error. Also maybe add some CHECK asserts to verify that, for example, your indexes are always in range. You do have a power-of-2 total number of cores, right? Otherwise the key & N-1 logic won't work correctly.
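
For example (a sketch; setup is an illustrative name, FLAGS_scale is the PR's gflags flag):

#include <Grappa.hpp>

DEFINE_int64(scale, 24, "log2 of the per-core table size");

int64_t N;  // static data: every core holds its own copy

void setup() {
  Grappa::on_all_cores([]{
    // assign on every core; setting N on core 0 alone leaves the
    // other cores' copies zero
    N = (1LL << FLAGS_scale) * Grappa::cores();
  });
}

// In the update loop, glog-style CHECKs catch bad indexes early:
//   int64_t idx = key & (N - 1);          // only correct if N is a power of two
//   CHECK_GE(idx, 0); CHECK_LT(idx, N);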

@alexfrolov
Contributor Author

alexfrolov commented Jul 21, 2016

That's definitely a bug, thank you! I will check whether it has any influence on the segfault issue. But I think the segfault happens during allocation of the table. I'll check.

The possible bug/problem is in Grappa::memset( hpcc_table, 0, N);. It reproduces even on a single core:

GRAPPA_FREEZE_ON_ERROR=1 srun --partition=B --ntasks-per-node=1 --nodes=1 --time=180:00 ./applications/hpcc/hpcc_random_access.exe --iters=4 --scale=31 --node_memsize=53687091200

gdb$ bt
#0  0x00007fdf518f91e0 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fdf518f901c in sleep () from /lib64/libc.so.6
#2  0x000000000045064a in Grappa::impl::freeze_for_debugger () at /mnt/lustre/home/frolo/grappa/system/Grappa.cpp:251
#3  0x000000000045077e in Grappa::impl::failure_sighandler (signum=0xb, si=0x4001992df7b0, unused=0x4001992df680) at /mnt/lustre/home/frolo/grappa/system/Grappa.cpp:270
#4  <signal handler called>
#5  operator() (__closure=0x4001992dfca0) at /mnt/lustre/home/frolo/grappa/system/Array.hpp:65
#6  Grappa::call_on_all_cores<void Grappa::memset<long, int>(GlobalAddress<long>, int, unsigned long)::{lambda()#1}>(void Grappa::memset<long, int>(GlobalAddress<long>, int, unsigned long)::{lambda()#1}) (work=...) at /mnt/lustre/home/frolo/grappa/system/Collective.hpp:169
#7  0x0000000000440fe7 in memset<long, int> (count=0xffffffff80000000, value=0x0, base=...) at /mnt/lustre/home/frolo/grappa/system/Array.hpp:61
#8  run_random_access () at /mnt/lustre/home/frolo/grappa/applications/hpcc/hpcc_random_access.cpp:42
#9  0x0000000000441584 in operator() (__closure=<optimized out>) at /mnt/lustre/home/frolo/grappa/applications/hpcc/hpcc_random_access.cpp:111
#10 operator() (__closure=<optimized out>) at /mnt/lustre/home/frolo/grappa/system/Tasking.hpp:216
#11 Grappa::impl::task_functor_proxy<Grappa::run(FP) [with FP = main(int, char**)::__lambda90]::__lambda0>(uint64_t, uint64_t, uint64_t) (a0=<optimized out>, a1=<optimized out>, a2=<optimized out>) at /mnt/lustre/home/frolo/grappa/system/Tasking.hpp:92
#12 0x00000000004714c3 in execute (this=0x4001992dff10) at /mnt/lustre/home/frolo/grappa/system/tasks/Task.hpp:93
#13 Grappa::impl::workerLoop (me=me@entry=0x749000, args=args@entry=0x6f2b30) at /mnt/lustre/home/frolo/grappa/system/tasks/TaskingScheduler.cpp:194
#14 0x000000000046d9ce in Grappa::impl::tramp (me=0x749000, arg=0x6f5c20) at /mnt/lustre/home/frolo/grappa/system/Worker.hpp:239
#15 0x000000000047301a in _makestack () at /mnt/lustre/home/frolo/grappa/system/stack.S:219
#16 0x0000000000000000 in ?? ()

UPD: the issue is in the assignment of N: the LL suffix is needed so that the shift of 1 is done in 64-bit arithmetic:
N = (1LL << FLAGS_scale)
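
Concretely, a standalone illustration of the overflow (not the PR's code; note the "-2147483648 words" and the memset count=0xffffffff80000000 in the logs above):

#include <cstdint>
#include <cstdio>

int main() {
  // With a plain int literal the arithmetic is 32-bit, and
  // 2^24 * 128 = 2^31 overflows a signed int.
  int64_t bad  = (1 << 24) * 128;    // typically wraps to -2147483648
  int64_t good = (1LL << 24) * 128;  // 64-bit arithmetic: 2147483648
  printf("bad=%lld good=%lld\n", (long long)bad, (long long)good);
  return 0;
}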

@bholt
Member

bholt commented Jul 21, 2016

Ah, I did notice that, but I thought you were seeing the problem with scale = 24?

@alexfrolov
Contributor Author

Yes, but I multiply it by cores().

@alexfrolov
Contributor Author

Now I am trying to run the HPCC RandomAccess test on large tables (2^28 words per core), which causes problems with allocating memory for the table:

frolo@head:~/grappa/build/Make+Release/applications/hpcc> srun --partition=B --ntasks-per-node=8 --nodes=1 --time=15:00 ./hpcc_random_access.exe --iters=4 --scale=28 --node_memsize=53687091200 --locale_shared_fraction=0.5

I0722 16:31:15.466739 11689 Grappa.cpp:647] 
-------------------------
Shared memory breakdown:
  node total:                   50 GB
  locale shared heap total:     25 GB
  locale shared heap per core:  3.125 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         0.78125 GB
  aggregator per core:          0.00247955 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     0.78125 GB
  free per locale:              17.6048 GB
  free per core:                2.2006 GB
-------------------------
I0722 16:31:15.516224 11689 hpcc_random_access.cpp:107]     Global table size   = 2^28 * 8 = 2147483648 words
I0722 16:31:15.516284 11689 hpcc_random_access.cpp:108]     Number of processes = 8
I0722 16:31:15.516296 11689 hpcc_random_access.cpp:109]     Number of updates = 8589934592
I0722 16:31:15.516304 11689 hpcc_random_access.cpp:37] HPCC RandomAccess
E0722 16:31:15.516326 11689 Allocator.hpp:226] Out of memory in the global heap: couldn't find a chunk of size 17179869184 to hold an allocation of 17179869184 bytes. Can you increase --global_heap_fraction?
terminate called after throwing an instance of 'Allocator::Exception'
  what():  std::exception
slurm_srun: error: B1: task 0: Aborted
slurm_srun: Terminating job step 56969.0
slurmstepd: *** STEP 56969.0 ON B1 CANCELLED AT 2016-07-22T16:31:17 ***
slurm_srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurm_srun: error: B1: tasks 1-7: Killed

Then I tried (as suggested) to increase the locale shared heap total, and it led to an error:

frolo@head:~/grappa/build/Make+Release/applications/hpcc> srun --partition=B --ntasks-per-node=8 --nodes=1 --time=15:00 ./hpcc_random_access.exe --iters=4 --scale=28 --node_memsize=53687091200 --locale_shared_fraction=0.85 
E0722 16:36:30.652633 12833 LocaleSharedMemory.cpp:99] Failed to create locale shared memory of size 45634027520
*** Aborted at 1469194590 (unix time) try "date -d @1469194590" if you are using GNU date ***
PC: @           0x483ad8 google::DumpStackTrace()
    @           0x4506f1 Grappa::impl::failure_function()
    @           0x4546ef Grappa::impl::LocaleSharedMemory::create()
    @           0x45559d Grappa::impl::LocaleSharedMemory::activate()
    @           0x45253b Grappa_activate()
    @           0x43a714 main
    @     0x7f22ee683c36 __libc_start_main
    @           0x440151 (unknown)
I0722 16:36:30.665314 12833 Grappa.cpp:262] Exiting via failure function
slurm_srun: error: B1: task 0: Exited with exit code 1
slurm_srun: Terminating job step 56971.0
slurmstepd: *** STEP 56971.0 ON B1 CANCELLED AT 2016-07-22T16:36:32 ***
slurm_srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurm_srun: error: B1: tasks 1-7: Killed

I saw this same error when I tried to run Grappa without --node_memsize=53687091200. On the cluster we have a 50 GB limit on task virtual memory. The error happens in mmap when the interprocess shared memory is created.

The experiments showed that I can't create more than roughly 25 GB of locale shared memory without violating the virtual memory limit. That seems strange to me, since the program has only just started its execution and the heap should not be large yet... could there be a huge amount of static data?

@bholt
Member

bholt commented Jul 22, 2016

Yeah, the memory allocation is a huge pain. We always meant to fix it for real and get rid of the need for the locale-shared-heap. That message actually told you to bump up --global_heap_fraction, not --locale_shared_fraction. So try that instead.

I don't quite remember what happens when you increase the locale_shared_fraction too far. Could be that there's some stuff being allocated out of the non-locale-shared pool of memory (I was gonna say execution stacks for workers, but I think that's locale-shared). But there's also sometimes a hard limit that you have to configure in the OS for how much SysV shared memory you're allowed to allocate.

But if you increase global_heap_fraction, you should be able to do a reasonably large scale run even with just 25GB per node.

@nelsonje
Member

You probably need both --locale_shared_fraction and --global_heap_fraction (which is allocated out of the locale shared memory).

@nelsonje
Member

(the right choice for locale_shared_fraction will probably be between 0.7 and 0.85, depending on what else you need to allocate)
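
For example (values illustrative and untested; at scale=28 with 8 cores the table alone is 2^28 × 8 cores × 8 bytes = 16 GiB, i.e. 2 GiB per core):

srun --partition=B --ntasks-per-node=8 --nodes=1 --time=15:00 ./hpcc_random_access.exe --iters=4 --scale=28 --node_memsize=53687091200 --locale_shared_fraction=0.8 --global_heap_fraction=0.7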

@nerdismotaku

Do you have the code of the RandomAccess benchmark isolated? I want to compile this benchmark outside of the HPCC suite.
