Memory consumption #262

Closed
kerim371 opened this issue Jul 7, 2024 · 12 comments

Comments

@kerim371
Contributor

kerim371 commented Jul 7, 2024

Hi,

Recently I ran a computation on a single Ubuntu node with 64 GB RAM and it finished successfully.

Then I tried to run the same computation on a small cluster (5 CentOS 7 nodes) with 128 GB RAM. At some point I noticed warnings about not enough memory and the start of swapping, with about 110 GB of RAM in use. After some time I always get an error saying the connection was lost (or something similar), so I can't complete even a single FWI iteration.

That means 64 GB RAM was enough on the single Ubuntu node without swapping, while 128 GB is not enough on the small CentOS 7 cluster.

Any ideas of the possible reasons?

Julia's cluster manager is SSH based.
Julia 1.9.3
JUDI v3.3.10

@mloubout
Member

I think there might be some issue with the parallel scheduler that leaves multiple workers alive on the node. I'll try to have a look.

@mloubout
Member

I've made a small patch in the new version that might resolve that issue if you wanna give it a try.

@kerim371
Contributor Author

I will, thank you very much!

@kerim371
Contributor Author

I've checked, and now I don't see any excessive memory consumption (leak).

Thank you! The fix works.

@kerim371
Contributor Author

@mloubout need to reopen :)

I now have 11 identical workers with 128 GB RAM.
Workers 8 and 11 say that they need to swap memory.
This is a little bit weird because for a few hours already all the nodes have needed only about 80 GB, so there should be free memory available.

Do you think there is something that could be optimized (freed) during JUDI computations? The data I use for FWI have regular geometry for every shot, so there should not be any shots with larger source-receiver offsets.

By the way I use the following options:

global jopt = JUDI.Options(
    IC = "fwi",
    limit_m = true,
    buffer_size = 1000f0,
    dt_comp = 7.5,
    optimal_checkpointing=false,
    # subsampling_factor=2,
    free_surface=true,  # free_surface is ON to model multiples as well
    space_order=8)     # increase space order for > 12 Hz source wavelet

JUDI: v3.4.5

image

      From worker 2:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 3:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.104 
      From worker 3:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 4:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.122 
      From worker 4:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 9:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 10:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.118 
      From worker 10:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.286 
      From worker 12:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.102 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.145 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.102 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.145 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 5:    Operator `gradient` ran in 56.88 s
      From worker 5:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 5:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   Operator `forward` ran in 13.81 s
      From worker 2:    Operator `forward` ran in 15.11 s
      From worker 3:    Operator `forward` ran in 20.94 s
      From worker 7:    Operator `gradient` ran in 54.52 s
      From worker 7:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.107 
      From worker 7:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   Operator `gradient` ran in 20.21 s
      From worker 9:    Operator `forward` ran in 29.82 s
      From worker 4:    Operator `forward` ran in 29.88 s
      From worker 2:    Operator `gradient` ran in 21.83 s
      From worker 3:    Operator `gradient` ran in 30.58 s
      From worker 10:   Operator `forward` ran in 55.22 s
      From worker 4:    Operator `gradient` ran in 44.12 s
      From worker 7:    Operator `forward` ran in 38.50 s
      From worker 9:    Operator `gradient` ran in 45.81 s
      From worker 5:    Operator `forward` ran in 55.74 s
      From worker 7:    Operator `gradient` ran in 58.76 s
      From worker 10:   Operator `gradient` ran in 81.79 s
      From worker 2:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.271 
      From worker 2:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 3:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.122 
      From worker 3:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 10:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 10:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 4:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 4:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 9:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 12:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.13 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.13 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 5:    Operator `gradient` ran in 79.84 s
      From worker 5:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 5:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 7:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 7:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 2:    Operator `forward` ran in 13.92 s
      From worker 2:    Operator `gradient` ran in 20.29 s
      From worker 10:   Operator `forward` ran in 30.02 s
      From worker 12:   Operator `forward` ran in 31.11 s
      From worker 9:    Operator `forward` ran in 32.06 s
      From worker 5:    Operator `forward` ran in 27.94 s
      From worker 3:    Operator `forward` ran in 39.53 s
      From worker 4:    Operator `forward` ran in 38.99 s
      From worker 7:    Operator `forward` ran in 50.90 s
      From worker 10:   Operator `gradient` ran in 43.76 s
      From worker 10:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 10:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 12:   Operator `gradient` ran in 46.35 s
      From worker 5:    Operator `gradient` ran in 41.58 s
      From worker 9:    Operator `gradient` ran in 49.62 s
      From worker 3:    Operator `gradient` ran in 56.87 s
      From worker 2:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.111 
      From worker 2:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 3:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.118 
      From worker 3:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.111 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 4:    Operator `gradient` ran in 57.71 s
      From worker 4:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.117 
      From worker 4:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.111 
      From worker 8:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.118 
      From worker 9:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 8:    Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 12:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.088 
      From worker 12:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 5:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.085 
      From worker 5:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.113 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 11:   /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.113 
      From worker 11:     warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 11:   Trying to allocate more memory for symbol u than available on physical device, this will start swapping
      From worker 2:    Operator `forward` ran in 29.66 s
      From worker 12:   Operator `forward` ran in 27.80 s
      From worker 10:   Operator `forward` ran in 48.04 s
      From worker 7:    Operator `gradient` ran in 74.85 s
      From worker 3:    Operator `forward` ran in 34.17 s
      From worker 7:    /home/kerim/.julia/dev/JUDI/src/pysource/models.py:461: UserWarning: Provided dt=7.5 is bigger than maximum stable dt 2.108 
      From worker 7:      warnings.warn("Provided dt=%s is bigger than maximum stable dt %s "
      From worker 9:    Operator `forward` ran in 35.75 s
      From worker 5:    Operator `forward` ran in 35.27 s
      From worker 4:    Operator `forward` ran in 36.14 s
      From worker 12:   Operator `gradient` ran in 41.04 s
      From worker 2:    Operator `gradient` ran in 43.71 s
      From worker 7:    Operator `forward` ran in 31.60 s

@mloubout
Member

That's quite a high dt_comp; this will only produce NaNs.
It might throw the warning because the new task starts while it's gathering the result, so there is a tiny bit of overlap that might lead to the warning, but it should clear the memory before it is used.

@mloubout
Member

This is just a warning; it would crash if it was actually allocating too much.

@kerim371
Contributor Author

Yes, the dt_comp is big, but I have to do that because of limited resources. And I do FWI at low frequencies.

> This is just a warning it would crash if it was actually allocating too much

Finally it crashed :)
But to my shame I haven't read the error message.
Probably I will try to reproduce it.

@mloubout
Member

> is big but I have to do that because of limited resources

It doesn't matter: if you set dt_comp to a larger value it becomes unstable and the results will be garbage. You cannot set it to something higher than the CFL limit, which is why you get all those warnings. If you want the wavefield to be sampled more coarsely for the gradient, you need to use the subsampling_factor option.
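
As a rough sketch of that suggestion, the options posted earlier in this thread could be adjusted as below. It only reuses keywords that already appear in the thread. Leaving dt_comp unset so that JUDI falls back to its own stable time step is an assumption here, and subsampling_factor=2 is purely an illustrative value.

```julia
using JUDI

# Sketch only: same options as posted above, but without forcing dt_comp
# above the stable (CFL) limit. Coarser sampling of the wavefield used for
# the gradient is requested through subsampling_factor instead.
jopt = JUDI.Options(
    IC = "fwi",
    limit_m = true,
    buffer_size = 1000f0,
    optimal_checkpointing = false,
    subsampling_factor = 2,   # illustrative value, not a recommendation
    free_surface = true,      # free surface ON to model multiples as well
    space_order = 8)
```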

@kerim371
Contributor Author

I can say that the result of the first iteration with dt_comp=5, subsampling_factor=2 is:
[image: result of the first iteration]

So the gradient looks fine except for the linear artefacts.
And the maximum stable dt is about 2, I guess.

So I'm doing some experiments in an attempt to decrease the computational cost at the expense of some quality.

@mloubout
Member

JUDI will ignore dt_comp if the one provided is too high, so it doesn't use the one you give; it uses the 2.085 one.
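
The behaviour described here can be pictured with a tiny stand-alone sketch. It is not JUDI's actual code: the function name `effective_dt` is hypothetical, and falling back to the stable value is just one reading of "ignores dt_comp if it is too high".

```julia
# Hypothetical helper (not part of JUDI): return the time step that would
# actually be used, given a requested dt_comp and the maximum stable dt.
function effective_dt(dt_comp::Real, dt_stable::Real)
    if dt_comp > dt_stable
        @warn "Provided dt=$dt_comp is bigger than maximum stable dt $dt_stable"
        return dt_stable      # the requested value is ignored
    end
    return dt_comp            # the requested value is within the stable range
end

effective_dt(7.5, 2.085)   # falls back to the stable step
effective_dt(1.5, 2.085)   # keeps the requested step
```

Under this reading, the dt=7.5 requested in the logs above would have no effect beyond triggering the warning.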

@kerim371
Contributor Author

> JUDI will ignore dt_comp if the one provided is too high so it doesn't use the one you give it uses the 2.085 one

Good to know, thank you!
