Memory consumption #262

kerim371 · 2024-07-07T10:05:17Z

Hi,

Recently I ran a computation on a single Ubuntu node with 64 GB RAM and it finished successfully.

Then I tried to run the same computation on a small cluster (5 CentOS 7 nodes) with 128 GB RAM. At some point I noticed a warning along the lines of "not enough memory, starting swapping", with about 110 GB of RAM in use. After some time I always get an error saying the connection was lost or something similar, so I can't complete even a single iteration of FWI.

That means 64 GB RAM was enough on a single Ubuntu node without swapping, while 128 GB is not enough on a small CentOS 7 cluster.

Any ideas of the possible reasons?

Julia's cluster manager is SSH based.
Julia 1.9.3
JUDI v3.3.10

mloubout · 2024-07-11T12:45:14Z

I think there might be some issue with the parallel scheduler that leaves multiple workers alive on the node. I'll try to have a look.

mloubout · 2024-07-24T15:42:45Z

I've made a small patch in the new version that might resolve that issue if you wanna give it a try.

kerim371 · 2024-07-24T15:50:16Z

I will, thank you very much!

kerim371 · 2024-07-24T22:51:16Z

I've checked and now I don't see any excessive memory consumption (leak).

Thank you! The fix works.

kerim371 · 2024-07-25T15:48:11Z

@mloubout need to reopen :)

I now have 11 identical workers with 128 GB RAM.
Workers 8 and 11 say that they need to swap memory.
This is a little bit weird because for a few hours now all the nodes have required about 80 GB, so there should be free memory available.

Do you think there is something that could be optimized (freed) during JUDI computations? The data I use for FWI have regular geometry for every shot, so there should not be any shots with larger source-receiver offsets.

By the way I use the following options:

global jopt = JUDI.Options(
    IC = "fwi",
    limit_m = true,
    buffer_size = 1000f0,
    dt_comp = 7.5,
    optimal_checkpointing=false,
    # subsampling_factor=2,
    free_surface=true,  # free_surface is ON to model multiples as well
    space_order=8)     # increase space order for > 12 Hz source wavelet

JUDI: v3.4.5

      From worker 2:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 3:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.104 
      From worker 3:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 4:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.122 
      From worker 4:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 9:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 10:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.118 
      From worker 10:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.286 
      From worker 12:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.102 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.145 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.102 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.145 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 5:    Operator `gradient` ran in 56.88 s
      From worker 5:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 5:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   Operator `forward` ran in 13.81 s
      From worker 2:    Operator `forward` ran in 15.11 s
      From worker 3:    Operator `forward` ran in 20.94 s
      From worker 7:    Operator `gradient` ran in 54.52 s
      From worker 7:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.107 
      From worker 7:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   Operator `gradient` ran in 20.21 s
      From worker 9:    Operator `forward` ran in 29.82 s
      From worker 4:    Operator `forward` ran in 29.88 s
      From worker 2:    Operator `gradient` ran in 21.83 s
      From worker 3:    Operator `gradient` ran in 30.58 s
      From worker 10:   Operator `forward` ran in 55.22 s
      From worker 4:    Operator `gradient` ran in 44.12 s
      From worker 7:    Operator `forward` ran in 38.50 s
      From worker 9:    Operator `gradient` ran in 45.81 s
      From worker 5:    Operator `forward` ran in 55.74 s
      From worker 7:    Operator `gradient` ran in 58.76 s
      From worker 10:   Operator `gradient` ran in 81.79 s
      From worker 2:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.271 
      From worker 2:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 3:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.122 
      From worker 3:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 10:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 10:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 4:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 4:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 9:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 12:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.13 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.13 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 5:    Operator `gradient` ran in 79.84 s
      From worker 5:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 5:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 7:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 7:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 2:    Operator `forward` ran in 13.92 s
      From worker 2:    Operator `gradient` ran in 20.29 s
      From worker 10:   Operator `forward` ran in 30.02 s
      From worker 12:   Operator `forward` ran in 31.11 s
      From worker 9:    Operator `forward` ran in 32.06 s
      From worker 5:    Operator `forward` ran in 27.94 s
      From worker 3:    Operator `forward` ran in 39.53 s
      From worker 4:    Operator `forward` ran in 38.99 s
      From worker 7:    Operator `forward` ran in 50.90 s
      From worker 10:   Operator `gradient` ran in 43.76 s
      From worker 10:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 10:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   Operator `gradient` ran in 46.35 s
      From worker 5:    Operator `gradient` ran in 41.58 s
      From worker 9:    Operator `gradient` ran in 49.62 s
      From worker 3:    Operator `gradient` ran in 56.87 s
      From worker 2:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.111 
      From worker 2:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 3:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.118 
      From worker 3:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.111 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 4:    Operator `gradient` ran in 57.71 s
      From worker 4:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.117 
      From worker 4:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.111 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.118 
      From worker 9:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 12:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.088 
      From worker 12:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 5:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 5:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.113 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.113 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 2:    Operator `forward` ran in 29.66 s
      From worker 12:   Operator `forward` ran in 27.80 s
      From worker 10:   Operator `forward` ran in 48.04 s
      From worker 7:    Operator `gradient` ran in 74.85 s
      From worker 3:    Operator `forward` ran in 34.17 s
      From worker 7:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.108 
      From worker 7:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    Operator `forward` ran in 35.75 s
      From worker 5:    Operator `forward` ran in 35.27 s
      From worker 4:    Operator `forward` ran in 36.14 s
      From worker 12:   Operator `gradient` ran in 41.04 s
      From worker 2:    Operator `gradient` ran in 43.71 s
      From worker 7:    Operator `forward` ran in 31.60 s
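
To see which workers emit the swapping message and how often, a log like the one above can be tallied. This is a minimal sketch, assuming the output has been saved as text and keeps the "From worker N:" prefix. The names `count_swap_warnings`, `WORKER_LINE` and `SWAP_MARKER` are made up for this illustration and are not part of JUDI or Devito.

```python
import re
from collections import Counter

# Matches the "From worker N:   message" prefix seen in the log above.
WORKER_LINE = re.compile(r"From worker (\d+):\s+(.*)")
# Substring of the swapping warning as it appears in the log.
SWAP_MARKER = "this will start swapping"


def count_swap_warnings(lines):
    """Count swapping warnings per worker id in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = WORKER_LINE.search(line)
        if match and SWAP_MARKER in match.group(2):
            counts[int(match.group(1))] += 1
    return counts


# Three lines copied from the log above.
sample = [
    "      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping",
    "      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping",
    "      From worker 5:    Operator `gradient` ran in 56.88 s",
]
print(count_swap_warnings(sample))
```

For a saved log file, `count_swap_warnings(open(path))` would work the same way, where `path` is wherever the output was written.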

mloubout · 2024-07-25T16:21:17Z

That's quite a high dt_comp; this will only produce NaNs.
It might throw the warning because the new task starts while it's still gathering the result, so there is a tiny bit of overlap that might trigger the warning, but it should clear the memory before it is used.

mloubout · 2024-07-25T16:21:48Z

This is just a warning; it would crash if it was actually allocating too much.

kerim371 · 2024-07-25T17:15:49Z

Yes, the dt_comp is big, but I have to do that because of limited resources. And I do FWI at low frequencies.

> This is just a warning; it would crash if it was actually allocating too much

Finally it crashed :)
But to my shame I haven't read the error message.
Probably I will try to reproduce it.

mloubout · 2024-07-25T17:32:31Z

> is big but I have to do that because of limited resources

It doesn't matter: if you set dt_comp to a larger value it becomes unstable and the results will be garbage. You cannot set it to something higher than the CFL limit, which is why you get all those warnings. If you want the wavefield to be sampled more coarsely for the gradient, you need to use the subsampling_factor option.
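
As a rough illustration of why subsampling_factor is the lever for memory, here is a back-of-the-envelope sketch of the size of a saved forward wavefield. It assumes the whole wavefield is kept in RAM as float32 (no checkpointing) and that a factor of k keeps every k-th time step. The grid shape and recording time are hypothetical, the function name is made up for this example, and this is not how JUDI or Devito compute their allocations.

```python
import math


def saved_wavefield_bytes(shape, t_max_ms, dt_ms, subsampling_factor=1,
                          bytes_per_value=4):
    """Rough size in bytes of a forward wavefield saved at every k-th step.

    shape: grid points per dimension (hypothetical values below)
    t_max_ms, dt_ms: recording time and computational time step
    subsampling_factor: keep every k-th time step (assumed behaviour)
    bytes_per_value: 4 for float32
    """
    n_steps = math.ceil(t_max_ms / dt_ms) + 1
    n_saved = math.ceil(n_steps / subsampling_factor)
    return math.prod(shape) * n_saved * bytes_per_value


shape = (300, 300, 150)   # hypothetical model size
t_max_ms = 4000.0         # hypothetical recording time
dt_ms = 2.085             # stable time step reported in the warnings

full = saved_wavefield_bytes(shape, t_max_ms, dt_ms)
half = saved_wavefield_bytes(shape, t_max_ms, dt_ms, subsampling_factor=2)
print(f"every step:     {full / 1e9:.1f} GB")
print(f"every 2nd step: {half / 1e9:.1f} GB")
```

Under these assumptions the saved wavefield shrinks roughly in proportion to the subsampling factor, while the time step used for propagation stays at the stable value.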

kerim371 · 2024-07-25T18:13:40Z

I can say that the result of the first iteration with dt_comp=5, subsampling_factor=2 is:

So the gradient looks fine except for the linear artefacts.
And the maximum stable dt is about 2, I guess.

So I'm doing some experiments in an attempt to decrease the computational cost at the expense of some quality.

mloubout · 2024-07-25T18:16:00Z

JUDI will ignore dt_comp if the one provided is too high, so it doesn't use the one you give; it uses the 2.085 one.
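
The behaviour described here amounts to clamping the requested time step to the stable one. A minimal sketch of that idea follows, assuming time is in milliseconds and using a hypothetical recording time. The function names are made up for this example and this is not JUDI's actual code.

```python
import math


def effective_dt(requested_dt, max_stable_dt):
    """Time step actually used: never larger than the stable bound."""
    return min(requested_dt, max_stable_dt)


def n_time_steps(t_max, dt):
    """Number of time steps needed to cover [0, t_max] with step dt."""
    return math.ceil(t_max / dt) + 1


requested, stable = 7.5, 2.085   # values taken from the warnings in this thread
t_max = 4000.0                   # hypothetical recording time

dt = effective_dt(requested, stable)
print("dt used:", dt)
print("steps expected with the requested dt:", n_time_steps(t_max, requested))
print("steps actually computed:", n_time_steps(t_max, dt))
```

In other words, asking for dt_comp=7.5 would not reduce the number of computed time steps, so it saves neither run time nor memory.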

kerim371 · 2024-07-25T18:19:50Z

> JUDI will ignore dt_comp if the one provided is too high, so it doesn't use the one you give; it uses the 2.085 one

Good to know, thank you!

kerim371 closed this as completed Jul 24, 2024
