Investigate occupancy limitation / calculation on MX150 GPU. #208

Open
denisalevi opened this issue May 27, 2021 · 1 comment

denisalevi commented May 27, 2021

For the following example, the stateupdater doesn't reach full theoretical occupancy on my laptop GPU (MX150). Why? Is this a GPU resource limitation, or is something going wrong in the occupancy calculation?

from brian2 import *

import brian2cuda                # These two lines suffice
set_device('cuda_standalone')    # to run brian2 on a GPU

# Parameters
N = 5000         ; duration = 0.1*second   ; V_r = 10*mV
theta = 20*mV    ; tau = 20*ms             ; delta = 2*ms
tau_ref = 2*ms   ; C = 1000                ; J = 0.1*mV
mu_ext = 25*mV   ; sigma_ext = 1*mV

# Network of N noise-driven leaky integrate-and-fire neurons
model = """
dV/dt = (-V + mu_ext) / tau + sigma_ext / sqrt(tau) * xi : volt
"""
neurons = NeuronGroup(N,
                      model,
                      threshold='V>theta',
                      reset='V=V_r',
                      refractory=tau_ref,
                      method='euler')

# Initialize membrane potential
neurons.V = V_r

run(duration)

This gives

INFO kernel_neurongroup_stateupdater_codeobject
        7 blocks
        768 threads
        36 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        0.750 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_thresholder_codeobject
        5 blocks
        1024 threads
        16 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)
INFO kernel_neurongroup_resetter_codeobject
        5 blocks
        1024 threads
        14 registers per block
        0 bytes statically-allocated shared memory per block
        0 bytes local memory per thread
        576 bytes user-allocated constant memory
        1.000 theoretical occupancy (need 6 blocks for 1.000)

Why do we use 7 blocks for the stateupdater? And how do the thresholder and resetter get 100% occupancy with only 5 blocks if the occupancy calculation says that we need 6 blocks?

To get the "(need 6 blocks for 1.000)" part of the message, I printed the min_num_threads variables (which should really be called min_num_blocks).

@denisalevi

See my explanations in #266. The stateupdater kernel uses 36 registers per thread, which means we can't run the full 2048 threads per SM because of the registers-per-SM limit (that would require at most 32 registers per thread). Hence we use fewer than 1024 threads per block, which leads to a lower theoretical occupancy.
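
A rough Python sketch of that register limit, to make the 0.750 reproducible. The hardware constants are my assumptions for a compute capability 6.1 device such as the MX150 (65536 registers and 2048 threads per SM, registers allocated per warp in units of 256, warps allocated in groups of 4); they are not read from the device, and the function names are made up for illustration. This is not the code brian2cuda uses.

```python
import math

# Assumed limits for compute capability 6.1 (not queried from the GPU).
WARP_SIZE = 32
REGS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
REG_ALLOC_UNIT = 256         # registers are allocated per warp in chunks of 256
WARP_ALLOC_GRANULARITY = 4   # resident warps are counted in groups of 4


def max_warps_by_registers(regs_per_thread):
    """Upper bound on resident warps per SM imposed by register usage."""
    regs_per_warp = math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    warps = REGS_PER_SM // regs_per_warp
    return warps // WARP_ALLOC_GRANULARITY * WARP_ALLOC_GRANULARITY


def theoretical_occupancy(threads_per_block, regs_per_thread):
    """Return (resident blocks per SM, theoretical occupancy per SM)."""
    max_warps = MAX_THREADS_PER_SM // WARP_SIZE
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    warp_limit = min(max_warps, max_warps_by_registers(regs_per_thread))
    blocks_per_sm = min(MAX_BLOCKS_PER_SM, warp_limit // warps_per_block)
    return blocks_per_sm, blocks_per_sm * warps_per_block / max_warps


if __name__ == "__main__":
    # (threads per block, registers per thread) taken from the INFO output
    for name, threads, regs in [("stateupdater", 768, 36),
                                ("thresholder", 1024, 16),
                                ("resetter", 1024, 14)]:
        blocks, occ = theoretical_occupancy(threads, regs)
        print(f"{name}: {blocks} blocks/SM, occupancy {occ:.3f}")
```

Under these assumptions, 36 registers per thread cap the SM at 48 of 64 warps. A block of 1024 threads would then fit only once per SM, while two blocks of 768 threads fit, which is consistent with the 768 threads and 0.750 reported for the stateupdater.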

The occupancy value is a theoretical occupancy per SM, so it is completely independent of the number of blocks. But to actually use all SMs fully, one would need 6 blocks here (since the MX150 has 3 SMs that can each run 2 blocks).
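
The block counts in the INFO output can be checked with a few lines of Python. This is only a sketch: it assumes one thread per neuron, so the grid size is N divided by the block size and rounded up, and it takes the 3 SMs and 2 resident blocks per SM from the paragraph above. The helper names are hypothetical.

```python
import math

N = 5000                      # neurons in the example script
NUM_SMS = 3                   # SMs on the MX150, as stated above
RESIDENT_BLOCKS_PER_SM = 2    # blocks one SM can hold at once, as stated above


def blocks_launched(num_neurons, threads_per_block):
    """Grid size when each thread handles one neuron (assumption)."""
    return math.ceil(num_neurons / threads_per_block)


def blocks_to_fill_gpu(num_sms, resident_blocks_per_sm):
    """Blocks needed so that every SM holds its maximum number of blocks."""
    return num_sms * resident_blocks_per_sm


if __name__ == "__main__":
    needed = blocks_to_fill_gpu(NUM_SMS, RESIDENT_BLOCKS_PER_SM)
    for name, threads in [("stateupdater", 768),
                          ("thresholder", 1024),
                          ("resetter", 1024)]:
        launched = blocks_launched(N, threads)
        print(f"{name}: {launched} blocks launched, {needed} needed to fill all SMs")
```

With N = 5000 this gives 7 blocks for 768 threads per block and 5 blocks for 1024 threads per block, matching the output above. So the thresholder and resetter reach 1.000 per-SM occupancy in theory, but with only 5 blocks one of the 3 SMs holds a single block.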

TODO: Modify the info message to say "theoretical occupancy per SM", to make this distinction clearer.
