HPCC RandomAccess benchmark added. #286

Open
wants to merge 6 commits into master

Conversation

alexfrolov
Contributor

Hi guys!

I want to compare different HPC runtimes on a set of benchmarks. As a first step, I have implemented HPCC RandomAccess on Grappa. It differs slightly from demo-gups* in that it follows the original HPCC RandomAccess benchmark more closely. If you are interested, you can include it in master.
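
For reference, the heart of HPCC RandomAccess is a stream of XOR updates at pseudo-random indexes into a power-of-two-sized table; a minimal serial sketch of that kernel (for orientation only, not this PR's Grappa code):

#include <cstdint>
#include <vector>

// HPCC's pseudo-random sequence: shift left, XOR with POLY when the top
// bit was set (a Galois LFSR step).
const uint64_t POLY = 0x0000000000000007ULL;

uint64_t next_random(uint64_t x) {
  return (x << 1) ^ (((int64_t)x < 0) ? POLY : 0);
}

// XOR each pseudo-random value into the slot it indexes; table.size()
// must be a power of two for the & (n - 1) masking to be correct.
void random_access(std::vector<uint64_t>& table, uint64_t num_updates) {
  const uint64_t n = table.size();
  uint64_t ran = 1;  // simplified seed; HPCC derives per-stream seeds
  for (uint64_t i = 0; i < num_updates; i++) {
    ran = next_random(ran);
    table[ran & (n - 1)] ^= ran;
  }
}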

Best,
Alex

@bholt
Member

bholt commented Jul 19, 2016

Hi Alex, good work. A couple questions:

  • Does this code synchronize those delegate::call<async>s somewhere? These delegates use the default global synchronizer implicitly; usually in our code they happen inside a forall, which also uses the default global synchronizer, so the forall only finishes when all the delegates have completed. Since this is inside an on_all_cores only, I suspect it never calls wait on the default_gce. This could definitely affect the correctness of your timing measurements (see the sketch after this list).
  • Also, doesn't CMake require that you explicitly add_subdirectory(hpcc) in the root CMakeLists.txt file? Otherwise this won't be compiled.
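
A minimal sketch of the explicit-synchronization pattern (assuming Grappa's GlobalCompletionEvent and async-delegate API; do_updates and the index math are illustrative, not the PR's code):

#include <Grappa.hpp>

using namespace Grappa;

GlobalCompletionEvent gce;  // dedicated completion event for the update phase

// table: global array of n words (n a power of two); every core issues
// updates_per_core pseudo-random XOR updates.
void do_updates(GlobalAddress<int64_t> table, size_t n, size_t updates_per_core) {
  on_all_cores([=]{
    for (size_t i = 0; i < updates_per_core; i++) {
      uint64_t key = i * 0x9e3779b97f4a7c15ULL;  // stand-in for the real RNG
      auto g = table + (key & (n - 1));
      // enroll each async delegate with &gce rather than the implicit default
      delegate::call<async, &gce>(g.core(), [g, key]{
        *g.pointer() ^= key;  // runs on g's home core
      });
    }
    gce.wait();  // returns only after every enrolled delegate has completed
  });
}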

@alexfrolov
Contributor Author

Hi Brandon,

thank you for your comments! I have fixed both issues. Could you please check the synchronization of the delegates?

@alexfrolov
Contributor Author

The test seems to work on small table sizes (and shows good performance), but as the size increases it segfaults. Any guesses on how to track it down?

srun --partition=A --ntasks-per-node=8 --nodes=1 --time=180:00 ./applications/hpcc/hpcc_random_access.exe --iters=4 --scale=24 --node_memsize=53687091200

I0720 17:32:43.431330 29791 Grappa.cpp:647] 
-------------------------
Shared memory breakdown:
  node total:                   50 GB
  locale shared heap total:     25 GB
  locale shared heap per core:  3.125 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         0.78125 GB
  aggregator per core:          0.0650177 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     0.78125 GB
  free per locale:              16.6694 GB
  free per core:                2.08368 GB
-------------------------
I0720 17:32:43.532766 29791 hpcc_random_access.cpp:109]     Global table size   = 2^24 * 128 = -2147483648 words
I0720 17:32:43.532836 29791 hpcc_random_access.cpp:110]     Number of processes = 128
I0720 17:32:43.532857 29791 hpcc_random_access.cpp:111]     Number of updates = 8589934592
I0720 17:32:43.532938 29791 hpcc_random_access.cpp:38] HPCC RandomAccess
Exiting due to signal 11 with siginfo 0x4002157fa430 and payload 0x4002157fa300
slurm_srun: error: A1: task 0: Exited with exit code 1
slurm_srun: Terminating job step 56353.0
slurmstepd: *** STEP 56353.0 ON A1 CANCELLED AT 2016-07-20T17:32:45 ***
slurm_srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurm_srun: error: A10: tasks 60,63: Killed
slurm_srun: error: A1: tasks 3,5-7: Killed
slurm_srun: error: A9: tasks 48,50,54-55: Killed
slurm_srun: error: A4: tasks 24,29-30: Killed
slurm_srun: error: A2: tasks 9,11,13-14: Killed
slurm_srun: error: A19: tasks 99-101: Killed



Grappa::GlobalCompletionEvent randomaccess_gce;

template <Grappa::GlobalCompletionEvent * GCE = &randomaccess_gce >
Member

@bholt bholt Jul 20, 2016

You wouldn't necessarily need to make this a template parameter for run_random_access. Is there a reason not to just use randomaccess_gce directly?
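
i.e., a sketch of the non-template version (update loop elided):

Grappa::GlobalCompletionEvent randomaccess_gce;

void run_random_access() {
  Grappa::on_all_cores([]{
    // ... issue delegate::call<async, &randomaccess_gce>(...) updates here ...
    randomaccess_gce.wait();
  });
}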

@bholt
Member

bholt commented Jul 20, 2016

I can't tell at the moment why it's segfaulting. Can you enable backtraces or attach to it with GDB? (if you don't know how, I can explain or link you to the docs)
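
One way, using the freeze-on-error hook that appears later in this thread: run with GRAPPA_FREEZE_ON_ERROR=1 so the failing process spins in its signal handler, then attach from another shell on that node (the PID is a placeholder):

GRAPPA_FREEZE_ON_ERROR=1 srun ... ./hpcc_random_access.exe ...
gdb -p <pid>
(gdb) bt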

@alexfrolov
Contributor Author

Running in debug mode shows the following:

-------------------------
Shared memory breakdown:
  node total:                   50 GB
  locale shared heap total:     25 GB
  locale shared heap per core:  3.125 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         0.78125 GB
  aggregator per core:          0.0650177 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     0.78125 GB
  free per locale:              16.6695 GB
  free per core:                2.08369 GB
-------------------------
I0721 16:08:11.565477 25243 hpcc_random_access.cpp:106]     Global table size   = 2^24 * 128 = -2147483648 words
I0721 16:08:11.565526 25243 hpcc_random_access.cpp:107]     Number of processes = 128
I0721 16:08:11.565537 25243 hpcc_random_access.cpp:108]     Number of updates = 8589934592
I0721 16:08:11.565543 25243 hpcc_random_access.cpp:37] HPCC RandomAccess
hpcc_random_access.exe: /mnt/lustre/home/frolo/grappa/system/Allocator.hpp:138: std::map<long int, AllocatorChunk>::iterator Allocator::add_to_chunk_map(const AllocatorChunk&): Assertion `inserted' failed.

@bholt
Member

bholt commented Jul 21, 2016

Oh, for one thing, N is only set on core 0. In Grappa we use static variables a lot, but they have to be initialized in an on_all_cores (see the sketch below). If that doesn't fix it, I'd start adding more debug prints to see at which point you get that error. Also maybe add some CHECK asserts to verify that, for example, your indexes are always in range. You do have a power-of-2 total number of cores, right? Otherwise the key & N-1 logic won't work correctly.
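
For example (a sketch; setup is an illustrative name, FLAGS_scale is the PR's gflags flag):

#include <Grappa.hpp>

DEFINE_int64(scale, 24, "log2 of the per-core table size");

int64_t N;  // static data: every core holds its own copy

void setup() {
  Grappa::on_all_cores([]{
    // assign on every core; setting N on core 0 alone leaves the
    // other cores' copies zero
    N = (1LL << FLAGS_scale) * Grappa::cores();
  });
}

// In the update loop, glog-style CHECKs catch bad indexes early:
//   int64_t idx = key & (N - 1);          // only correct if N is a power of two
//   CHECK_GE(idx, 0); CHECK_LT(idx, N);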

@alexfrolov
Contributor Author

alexfrolov commented Jul 21, 2016

That's definitely a bug, thank you! I will check whether it has any influence on the segfault issue. But I think the segfault happens during allocation of the table. I'll check.

The possible bug/problem is in Grappa::memset( hpcc_table, 0, N);. It reproduces even on a single core:

GRAPPA_FREEZE_ON_ERROR=1 srun --partition=B --ntasks-per-node=1 --nodes=1 --time=180:00 ./applications/hpcc/hpcc_random_access.exe --iters=4 --scale=31 --node_memsize=53687091200

gdb$ bt
#0  0x00007fdf518f91e0 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fdf518f901c in sleep () from /lib64/libc.so.6
#2  0x000000000045064a in Grappa::impl::freeze_for_debugger () at /mnt/lustre/home/frolo/grappa/system/Grappa.cpp:251
#3  0x000000000045077e in Grappa::impl::failure_sighandler (signum=0xb, si=0x4001992df7b0, unused=0x4001992df680) at /mnt/lustre/home/frolo/grappa/system/Grappa.cpp:270
#4  <signal handler called>
#5  operator() (__closure=0x4001992dfca0) at /mnt/lustre/home/frolo/grappa/system/Array.hpp:65
#6  Grappa::call_on_all_cores<void Grappa::memset<long, int>(GlobalAddress<long>, int, unsigned long)::{lambda()#1}>(void Grappa::memset<long, int>(GlobalAddress<long>, int, unsigned long)::{lambda()#1}) (work=...) at /mnt/lustre/home/frolo/grappa/system/Collective.hpp:169
#7  0x0000000000440fe7 in memset<long, int> (count=0xffffffff80000000, value=0x0, base=...) at /mnt/lustre/home/frolo/grappa/system/Array.hpp:61
#8  run_random_access () at /mnt/lustre/home/frolo/grappa/applications/hpcc/hpcc_random_access.cpp:42
#9  0x0000000000441584 in operator() (__closure=<optimized out>) at /mnt/lustre/home/frolo/grappa/applications/hpcc/hpcc_random_access.cpp:111
#10 operator() (__closure=<optimized out>) at /mnt/lustre/home/frolo/grappa/system/Tasking.hpp:216
#11 Grappa::impl::task_functor_proxy<Grappa::run(FP) [with FP = main(int, char**)::__lambda90]::__lambda0>(uint64_t, uint64_t, uint64_t) (a0=<optimized out>, a1=<optimized out>, a2=<optimized out>) at /mnt/lustre/home/frolo/grappa/system/Tasking.hpp:92
#12 0x00000000004714c3 in execute (this=0x4001992dff10) at /mnt/lustre/home/frolo/grappa/system/tasks/Task.hpp:93
#13 Grappa::impl::workerLoop (me=me@entry=0x749000, args=args@entry=0x6f2b30) at /mnt/lustre/home/frolo/grappa/system/tasks/TaskingScheduler.cpp:194
#14 0x000000000046d9ce in Grappa::impl::tramp (me=0x749000, arg=0x6f5c20) at /mnt/lustre/home/frolo/grappa/system/Worker.hpp:239
#15 0x000000000047301a in _makestack () at /mnt/lustre/home/frolo/grappa/system/stack.S:219
#16 0x0000000000000000 in ?? ()

UPD: the issue is in the assignment of N: the LL suffix is needed so that the shift of 1 is done in 64-bit arithmetic:
N = (1LL << FLAGS_scale)
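
Concretely, a standalone illustration of the overflow (not the PR's code; note the "-2147483648 words" and the memset count=0xffffffff80000000 in the logs above):

#include <cstdint>
#include <cstdio>

int main() {
  // With a plain int literal the arithmetic is 32-bit, and
  // 2^24 * 128 = 2^31 overflows a signed int.
  int64_t bad  = (1 << 24) * 128;    // typically wraps to -2147483648
  int64_t good = (1LL << 24) * 128;  // 64-bit arithmetic: 2147483648
  printf("bad=%lld good=%lld\n", (long long)bad, (long long)good);
  return 0;
}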

@bholt
Member

bholt commented Jul 21, 2016

Ah, I did notice that, but I thought you were seeing the problem with scale = 24?

@alexfrolov
Contributor Author

Yes, but I multiply it by cores().

@alexfrolov
Contributor Author

Now I am trying to run the HPCC RandomAccess test on large tables (2^28 words per core), which causes problems with allocating memory for the table:

frolo@head:~/grappa/build/Make+Release/applications/hpcc> srun --partition=B --ntasks-per-node=8 --nodes=1 --time=15:00 ./hpcc_random_access.exe --iters=4 --scale=28 --node_memsize=53687091200 --locale_shared_fraction=0.5

I0722 16:31:15.466739 11689 Grappa.cpp:647] 
-------------------------
Shared memory breakdown:
  node total:                   50 GB
  locale shared heap total:     25 GB
  locale shared heap per core:  3.125 GB
  communicator per core:        0.125 GB
  tasks per core:               0.0156631 GB
  global heap per core:         0.78125 GB
  aggregator per core:          0.00247955 GB
  shared_pool current per core: 4.76837e-07 GB
  shared_pool max per core:     0.78125 GB
  free per locale:              17.6048 GB
  free per core:                2.2006 GB
-------------------------
I0722 16:31:15.516224 11689 hpcc_random_access.cpp:107]     Global table size   = 2^28 * 8 = 2147483648 words
I0722 16:31:15.516284 11689 hpcc_random_access.cpp:108]     Number of processes = 8
I0722 16:31:15.516296 11689 hpcc_random_access.cpp:109]     Number of updates = 8589934592
I0722 16:31:15.516304 11689 hpcc_random_access.cpp:37] HPCC RandomAccess
E0722 16:31:15.516326 11689 Allocator.hpp:226] Out of memory in the global heap: couldn't find a chunk of size 17179869184 to hold an allocation of 17179869184 bytes. Can you increase --global_heap_fraction?
terminate called after throwing an instance of 'Allocator::Exception'
  what():  std::exception
slurm_srun: error: B1: task 0: Aborted
slurm_srun: Terminating job step 56969.0
slurmstepd: *** STEP 56969.0 ON B1 CANCELLED AT 2016-07-22T16:31:17 ***
slurm_srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurm_srun: error: B1: tasks 1-7: Killed

Then I tried (as suggested) to increase the locale shared heap total, and it led to an error:

frolo@head:~/grappa/build/Make+Release/applications/hpcc> srun --partition=B --ntasks-per-node=8 --nodes=1 --time=15:00 ./hpcc_random_access.exe --iters=4 --scale=28 --node_memsize=53687091200 --locale_shared_fraction=0.85 
E0722 16:36:30.652633 12833 LocaleSharedMemory.cpp:99] Failed to create locale shared memory of size 45634027520
*** Aborted at 1469194590 (unix time) try "date -d @1469194590" if you are using GNU date ***
PC: @           0x483ad8 google::DumpStackTrace()
    @           0x4506f1 Grappa::impl::failure_function()
    @           0x4546ef Grappa::impl::LocaleSharedMemory::create()
    @           0x45559d Grappa::impl::LocaleSharedMemory::activate()
    @           0x45253b Grappa_activate()
    @           0x43a714 main
    @     0x7f22ee683c36 __libc_start_main
    @           0x440151 (unknown)
I0722 16:36:30.665314 12833 Grappa.cpp:262] Exiting via failure function
slurm_srun: error: B1: task 0: Exited with exit code 1
slurm_srun: Terminating job step 56971.0
slurmstepd: *** STEP 56971.0 ON B1 CANCELLED AT 2016-07-22T16:36:32 ***
slurm_srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurm_srun: error: B1: tasks 1-7: Killed

I saw this same error when I tried to run Grappa without --node_memsize=53687091200. On the cluster we have a 50 GB limit on task virtual memory. The error happens in mmap when the interprocess shared memory is created.

The experiments showed that I can't create more than roughly 25 GB of locale shared memory without violating the virtual memory limit. That seems strange to me, since the program has only just started its execution and the heap should not be large yet... could there be a huge amount of static data?

@bholt
Member

bholt commented Jul 22, 2016

Yeah, the memory allocation is a huge pain. We always meant to fix it for real and get rid of the need for the locale-shared-heap. That message actually told you to bump up --global_heap_fraction, not --locale_shared_fraction. So try that instead.

I don't quite remember what happens when you increase the locale_shared_fraction too far. Could be that there's some stuff being allocated out of the non-locale-shared pool of memory (I was gonna say execution stacks for workers, but I think that's locale-shared). But there's also sometimes a hard limit that you have to configure in the OS for how much SysV shared memory you're allowed to allocate.

But if you increase global_heap_fraction, you should be able to do a reasonably large scale run even with just 25GB per node.

@nelsonje
Member

You probably need both --locale_shared_fraction and --global_heap_fraction (which is allocated out of the locale shared memory).

@nelsonje
Member

(the right choice for locale_shared_fraction will probably be between 0.7 and 0.85, depending on what else you need to allocate)
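
For example (values illustrative and untested; at scale=28 with 8 cores the table alone is 2^28 × 8 cores × 8 bytes = 16 GiB, i.e. 2 GiB per core):

srun --partition=B --ntasks-per-node=8 --nodes=1 --time=15:00 ./hpcc_random_access.exe --iters=4 --scale=28 --node_memsize=53687091200 --locale_shared_fraction=0.8 --global_heap_fraction=0.7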

@nerdismotaku

Do you have the code of the RandomAccess benchmark isolated? I want to compile this benchmark outside of the HPCC suite.
