Investigate occupancy limitation / calculation on MX150 GPU. #208

denisalevi · 2021-05-27T14:42:31Z

For the following example, the stateupdater doesn't achieve full occupancy on my laptop GPU (MX150). Why? Is this a GPU resource limitation, or is something going wrong in the occupancy calculation?

from brian2 import *

import brian2cuda                # These two lines suffice
set_device('cuda_standalone')    # to run brian2 on a GPU

# Parameters
N = 5000         ; duration = 0.1*second   ; V_r = 10*mV
theta = 20*mV    ; tau = 20*ms             ; delta = 2*ms
tau_ref = 2*ms   ; C = 1000                ; J = 0.1*mV
mu_ext = 25*mV   ; sigma_ext = 1*mV

# Network of N noise-driven leaky integrate-and-fire neurons
model = """
dV/dt = (-V + mu_ext) / tau + sigma_ext / sqrt(tau) * xi : volt
"""
neurons = NeuronGroup(N,
                      model,
                      threshold='V>theta',
                      reset='V=V_r',
                      refractory=tau_ref,
                      method='euler')

# Initialize membrane potential
neurons.V = V_r

run(duration)

This gives

INFO kernel_neurongroup_stateupdater_codeobject
        7 blocks
        768 threads
        36 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        0.750 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_thresholder_codeobject
        5 blocks
        1024 threads
        16 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_resetter_codeobject
        5 blocks
        1024 threads
        14 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)

Why do we use 7 blocks for the stateupdater? How do we get 100% occupancy with only 5 blocks for the thresholder and resetter if the occupancy calculation says that we need 6 blocks?

To get the "(need 6 blocks for 1.000)" part of the message, I printed the min_num_threads variables (which should be called min_num_blocks...).


denisalevi · 2022-02-14T18:55:32Z

See my explanations in #266. The stateupdater kernel uses 36 registers per thread, which means we can't run 2048 threads per SM because of the registers-per-SM limit (that would require at most 32 registers per thread). Hence we use fewer than 1024 threads per block, which leads to a lower theoretical occupancy.
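The register arithmetic can be sketched roughly as follows. This is a simplified estimate, not the calculation brian2cuda or the CUDA runtime actually performs. The per-SM limits are assumed values for compute capability 6.1 (which I take the MX150 to be), and `theoretical_occupancy` is a hypothetical helper that ignores shared memory limits.

```python
import math

# Assumed per-SM limits for compute capability 6.1 (not queried from the device)
REGISTERS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
WARP_SIZE = 32
REG_ALLOC_UNIT = 256  # assumption: registers are allocated per warp in units of 256


def theoretical_occupancy(threads_per_block, registers_per_thread):
    """Simplified per-SM occupancy estimate (hypothetical helper)."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    max_warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE

    # Registers occupied by one warp, rounded up to the allocation unit
    regs_per_warp = (
        math.ceil(registers_per_thread * WARP_SIZE / REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    )

    # Resident blocks per SM, limited by registers and by the thread limit
    blocks_by_registers = (REGISTERS_PER_SM // regs_per_warp) // warps_per_block
    blocks_by_threads = max_warps_per_sm // warps_per_block
    resident_blocks = min(blocks_by_registers, blocks_by_threads, MAX_BLOCKS_PER_SM)

    return resident_blocks * warps_per_block / max_warps_per_sm


print(theoretical_occupancy(768, 36))   # stateupdater: 768 threads, 36 registers
print(theoretical_occupancy(1024, 36))  # same kernel with 1024 threads per block
print(theoretical_occupancy(1024, 16))  # thresholder: 1024 threads, 16 registers
```

Under these assumptions, 36 registers allow only one resident block of 1024 threads per SM but two blocks of 768 threads, which is why the smaller block size gives the higher occupancy for the stateupdater.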

The occupancy value is a theoretical occupancy per SM, so it is completely independent of the number of blocks. But to actually keep all SMs fully busy, one would need 6 blocks here (since the MX150 has 3 SMs that can each run 2 blocks).
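The block counts in the log are consistent with a simple calculation. This sketch assumes the kernels launch one thread per neuron, so the number of blocks is N divided by the threads per block, rounded up. Both helper names are hypothetical and not part of brian2cuda.

```python
import math


def blocks_launched(num_neurons, threads_per_block):
    """Blocks needed to cover all neurons (assumes one thread per neuron)."""
    return math.ceil(num_neurons / threads_per_block)


def blocks_to_fill_device(num_sms, resident_blocks_per_sm):
    """Blocks needed so that every SM holds its maximum number of resident blocks."""
    return num_sms * resident_blocks_per_sm


N = 5000
print(blocks_launched(N, 768))      # stateupdater
print(blocks_launched(N, 1024))     # thresholder and resetter
print(blocks_to_fill_device(3, 2))  # MX150: 3 SMs, 2 resident blocks each
```

With N = 5000 this gives 7 blocks at 768 threads and 5 blocks at 1024 threads, matching the log, while 6 blocks are needed to fill all three SMs.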

TODO: Modify the info message to say "theoretical occupancy per SM", to make this distinction clearer.

denisalevi mentioned this issue Feb 13, 2022

Block dimension calculation can lead to non-optimal occupancy #266
