NATS process consumes more and more memory over time untill GOMEMLIMIT is reached [v2.10.22] #6163

Open
dved opened this issue Nov 22, 2024 · 6 comments
Labels
defect Suspected defect such as a bug or regression

dved commented Nov 22, 2024

Observed behavior

The NATS process consumes more and more memory during normal operation. The client application has ~40 VMs that communicate with each other via NATS streams.
Just after the service starts, the NATS process uses ~200 MB resident / 1.4 GB virtual; after 2 days it is already at ~2 GB resident / 3.5 GB virtual.

Average stats for Available RAM of the VM:
image

Current usage via htop:
image

I also used a simple bash script, called by crontab every 10 minutes, to collect some basic stats for the process:

# CSV columns: timestamp, PID, RSS (KiB), VSZ (KiB), %CPU, %MEM, command
ps aux --sort=-%mem | awk -v process="$PROCESS_NAME" '
$11 ~ process {
    printf "%s,%d,%d,%d,%.2f,%.2f,%s\n", strftime("%Y-%m-%d %H:%M:%S"), $2, $6, $5, $3, $4, $11
}' >> "$OUTPUT_FILE"

The CSV file is attached.
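
To turn those samples into a growth rate, something like the following could be used. This is only a sketch: the function name `rss_growth` is made up here, it assumes the CSV column order produced by the script above (RSS in KiB in the third column), a single matching process, and evenly spaced samples at the 10-minute cron interval.

```shell
#!/usr/bin/env bash
# Summarise RSS growth from the CSV written by the cron job.
# Assumed columns: timestamp,PID,RSS(KiB),VSZ(KiB),%CPU,%MEM,command
# Assumed sampling interval: 10 minutes (override with the second argument).
rss_growth() {
    local csv="$1" interval_min="${2:-10}"
    awk -F, -v step="$interval_min" '
        NF >= 3 {
            if (n == 0) first = $3   # RSS of the first sample, KiB
            last = $3                # RSS of the most recent sample, KiB
            n++
        }
        END {
            if (n < 2) { print "not enough samples"; exit 1 }
            hours = (n - 1) * step / 60
            printf "samples=%d first_mib=%.1f last_mib=%.1f mib_per_hour=%.2f\n",
                   n, first / 1024, last / 1024, (last - first) / 1024 / hours
        }' "$csv"
}
```

Calling `rss_growth "$OUTPUT_FILE"` prints one summary line. A steady positive `mib_per_hour` over several days points to continuous growth rather than a one-time warm-up.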

I also had a second bash script that takes memory profiler dumps every 10 minutes, using the NATS built-in memory profiler according to this doc. A zip file with all of the dumps is also attached to the issue.
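
For readers who want to collect comparable dumps, a minimal sketch of such a script is shown here. It is not the script used for the attached files. It assumes the server was started with profiling enabled (the `prof_port` setting), that `curl` is installed, and that the placeholder values for host, port and output directory are adjusted; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumptions: nats-server exposes the Go pprof endpoints on PROF_PORT
# (65432 is a placeholder) and curl is available on the host.
PROF_HOST="${PROF_HOST:-localhost}"
PROF_PORT="${PROF_PORT:-65432}"
OUT_DIR="${OUT_DIR:-/tmp/nats-profiles}"

# URL of the Go pprof heap endpoint on the profiling port.
heap_profile_url() {
    printf 'http://%s:%s/debug/pprof/heap\n' "$PROF_HOST" "$PROF_PORT"
}

# Timestamped output path, so successive dumps can be compared later.
heap_profile_path() {
    printf '%s/heap-%s.prof\n' "$OUT_DIR" "$(date +%Y%m%dT%H%M%S)"
}

# Fetch one heap profile; returns non-zero if the download fails.
dump_heap_profile() {
    mkdir -p "$OUT_DIR" || return 1
    curl --silent --show-error --fail \
         --output "$(heap_profile_path)" "$(heap_profile_url)"
}
```

Running `dump_heap_profile` from a crontab entry every 10 minutes yields a series of files. Two of them can then be compared with `go tool pprof -base older.prof newer.prof` to see which allocations grew between dumps.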

I also collected some basic stats from NATS via the nats CLI:
./nats.exe -s nats://*** --user=*** --password='***' stream report
The result file is attached as well.

The NATS configuration is very simple: 1 node. An additional environment variable sets the Go soft memory limit:
Environment="GOMEMLIMIT=14000MiB"

Previously, I had a similar lab where memory usage kept increasing until it reached the level at which the Go garbage collector kicked in (according to GOMEMLIMIT). CPU usage then went to almost 100%, and average memory usage decreased.
image

Unfortunately, I don't have similar CSV stats from that time.

I have the lab ready for additional investigation, and the problem can be reproduced. When I increase the number of consumers, memory consumption grows faster.

If GOMEMLIMIT is not set, the memory used by the NATS process grows over time until the Linux OOM killer usually kills the process.

Expected behavior

It is not expected that the NATS process consumes more and more memory without any visible attempt to free used memory before the soft limit is reached.

Server and client version

server - 2.10.22

Host environment

nats service:
OS - VM on Azure, Linux, Ubuntu 22.04.1, kernel 6.5.0-1025-azure, x64
4 CPU/16GB RAM

Steps to reproduce

No response

dved added the defect label Nov 22, 2024
dved (Author) commented Nov 22, 2024

Adding a zip file containing the CSV and the memory profile dumps:
nats_mem_stats 2.zip

neilalexander (Member) commented Nov 22, 2024

Are you setting Nats-Msg-Id on your publishes? What duplicate window do you have set on your streams?

dved (Author) commented Nov 25, 2024

Hello, yes, we use Nats-Msg-Id in the streams where duplicate detection is enabled. Most streams use it with a 2-minute duplicate window. Some of them have a 1-year duplicate window, but as I understand it there are only a couple of those, and they usually contain a small number of messages.

dved (Author) commented Dec 9, 2024

Hello, did you have time to look into my results?

derekcollison (Member) commented

@neilalexander is OOO through Wed IIRC.

Question: are you a Synadia customer?

wallyqs changed the title from "[2.10.22] Nats process consumes more and more memory over time untill GOMEMLIMIT is reached" to "NATS process consumes more and more memory over time untill GOMEMLIMIT is reached [v2.10.22]" on Dec 11, 2024
neilalexander (Member) commented

Hi @dved, apologies for the delayed response. I do think there's memory usage being held here in duplicate message detection. The duplicate window tracks Nats-Msg-Id for each message in memory, so it's important that the duplicate window is never set higher than it should be. 2 minutes could potentially be high or low depending on the publish rate, 1 year is definitely too high.

I had hoped to improve this in 2.10.23 but ran into some performance regressions with the new code so it was backed out. I will be going back to this hopefully soon but for now please revisit the stream configuration.
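
As a rough illustration of why the window size matters, the number of message IDs held in memory can be estimated as publish rate times window length. The sketch below is an upper-bound estimate only: it assumes every published message carries a unique Nats-Msg-Id and stays tracked for the full window, the helper name `dedup_entries` is made up, and the publish rates are hypothetical values, not measurements from this lab.

```shell
#!/usr/bin/env bash
# Upper-bound estimate of message IDs tracked for duplicate detection:
#   tracked IDs ~= publish rate (msgs/s) * duplicate window (s)
dedup_entries() {
    local rate="$1" window_s="$2"
    echo $(( rate * window_s ))
}

# Hypothetical stream publishing 500 msgs/s with a 2-minute window.
dedup_entries 500 120        # prints 60000

# Hypothetical stream publishing only 1 msg/s with a 1-year window.
dedup_entries 1 31536000     # prints 31536000
```

Even a slow stream can end up tracking far more IDs than a busy one when its window is a year long. With the nats CLI the window can likely be shortened via `nats stream edit <stream> --dupe-window=...`; check `nats stream edit --help` for the exact flag in the installed version.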
