[Performance] How to minimize/restrict memory utilization? #662
Hi, some comments:
It should depend on how many actual connections your users generate.
That is odd. Basically, if inserting a value would overflow the size limit, the cache evicts some random items (https://github.com/dgryski/go-expirecache/blob/master/cache.go#L93-L95). That causes more load on Go's GC, but apart from that it should slow down the RSS growth. But let's keep that thought about GC in mind for now.
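A rough sketch of that eviction behaviour (illustrative only, not the actual go-expirecache code): when an insert would push the total size over the limit, arbitrary entries are dropped until the new value fits.

```go
package main

import "fmt"

// cache is a minimal, hypothetical stand-in for a size-bounded cache
// that evicts arbitrary entries on overflow, similar in spirit to the
// go-expirecache lines linked above.
type cache struct {
	maxSize   uint64
	totalSize uint64
	items     map[string][]byte
}

func (c *cache) set(key string, value []byte) {
	size := uint64(len(value))
	// Evict entries until the new value fits. Go's map iteration order
	// is unspecified, which gives "random enough" victims for a sketch.
	for k, v := range c.items {
		if c.totalSize+size <= c.maxSize {
			break
		}
		c.totalSize -= uint64(len(v))
		delete(c.items, k)
	}
	c.items[key] = value
	c.totalSize += size
}

func main() {
	c := &cache{maxSize: 8, items: map[string][]byte{}}
	c.set("a", []byte("1234"))
	c.set("b", []byte("5678"))
	c.set("c", []byte("90")) // forces eviction of an arbitrary earlier entry
	fmt.Println(len(c.items), c.totalSize)
}
```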
That can actually happen: the more requests you serve, the more garbage there is for Go to collect.
That is extremely weird, as backendv1 is converted internally to backendv2 (there is just an extra step to pre-populate some things there). For each backend section there is a backendv2 section that behaves the same (https://github.com/go-graphite/carbonapi/blob/main/zipper/config/config.go#L128-L155). Basically, your section:
Is equivalent to:
Also, it would override the global setting. That would be more friendly towards Go's GC.
maxBatchSize controls that to some extent, as do a couple of other settings.
With go-carbon, the ways to deal with that are rather limited, unfortunately (it doesn't support sending already pre-aggregated replies and will always do its best to send you all the data it can). I'm not sure if there is any way to limit the size of a reply on the go-carbon side, to be honest, even though that would be the better approach here. Otherwise, you can limit the amount of concurrent queries and actually enable caches (the less you need to go to the backends, the better it will be for you). There were some efforts by @msaf1980 to improve how caching is done in general; you might want to look at his work (it's currently in master, I haven't cut a release yet as I want to fix a few issues first). It might also help if you can collect some heap profiles and share the SVG: https://go.dev/doc/diagnostics#profiling. carbonapi provides a way to enable expvar and pprof on a separate port; you can enable it there and collect profiles.
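Roughly speaking, this is Go's standard net/http/pprof mechanism. A minimal sketch of what enabling it amounts to (the port below is made up, not carbonapi's default):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Expose the profiling endpoints on a separate, hypothetical port.
	// A heap profile can then be collected and rendered as SVG with:
	//   go tool pprof -svg http://localhost:7070/debug/pprof/heap > heap.svg
	log.Fatal(http.ListenAndServe("localhost:7070", nil))
}
```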
Oh, and I forgot to mention: as there is some evidence that it might actually be GC pressure, it would be great to know what Go version you are using, and maybe you can play a bit with the GOGC value (https://pkg.go.dev/runtime). There are numerous articles on how to do that and what it means. So lowering it might help if garbage collection is actually the issue here.
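For reference, a tiny sketch of the two knobs (the value 50 below is only an example, not a recommendation): the GOGC environment variable at process start, or runtime/debug.SetGCPercent from code.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=50: the GC runs when
	// the heap grows 50% over the live set, trading CPU for a lower RSS.
	prev := debug.SetGCPercent(50)
	fmt.Println("previous GOGC:", prev, "GOGC env:", os.Getenv("GOGC"))
}
```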
Currently I am using the image from Docker Hub, but I've also built the tip using Go 1.17, and in both cases the behavior is similar (though I did not measure the time it takes to OOM in both cases). I have changed the setup from running
When you are having backends in

About releasing memory - Go, like most GC-based languages, does not really like to release it, so even if memory is not in use it won't be returned for quite some time. That is expected. That's why you have some metrics exported by carbonapi itself; for actual memory usage you should refer to them.

Overall, for heavy requests I would recommend considering migrating the backend to graphite-clickhouse/carbon-clickhouse and enabling backend-side aggregation. Not only will you not need broadcast mode (if you have replication enabled on the ClickHouse side, it will ensure that data is the same across all your replicas), but it can also pre-aggregate responses based on what Grafana requested. That usually gives a noticeable reduction in the amount of data you need to fetch and process. However, that obviously has its own drawbacks (you'll need to manage a ClickHouse installation, you'll need to migrate the data somehow, and ClickHouse in general is slower for single reads and small amounts of updates, but faster for bulk reads and writes).
No, we don't have copies of the data, it is all split across different servers (using carbon-relay-ng and consistent hashing), as the write throughput of a single server (RAID10 SATA3 SSDs) is not enough anymore (unless we go for oh-so-expensive NVMe storage). But it does merge, since with consistent hashing we can't predict which metrics will go where, and a single dashboard may load metrics from different storage hosts. But that is expected. The unexpected part is how much memory it wants to use. The first setup (which had OOMs) has 128 GB RAM, which carbonapi consumed in a matter of seconds in some instances. The current setup is 2 servers, each with 256 GB RAM, and carbonapi periodically gets close to the max. Comparing it to graphite-web -- the same data requested there uses barely 2 MB of RAM. But it is entirely possible that the merging is done on disk there? (That could be the case given how insanely slow it is to return data from 2 sources.)
We did not have this issue for a while, but recently started hitting it again. I did another round of config tuning, lowering

Perhaps it would be good to create performance tuning documentation on how to tune towards different use-cases?
We just went into production with carbonapi 0.16.0~1 and also quickly ended up having memory issues - carbonapi had maybe 100 MiB of data in cache, but process memory consumption was 15 GiB and increasing. We have two go-carbon carbonserver backends in broadcast mode, but the memory issue can be replicated with a single backend as well. After some testing, we think this is a memory leak related to carbonapi response cache and JSON response format. Here's a small bash test script to run locally on an idle carbonapi server - it keeps requesting the same data in a loop once a second, increasing maxDataPoints by one to force a cache miss every time, and records carbonapi's RSS (resident set size) memory usage and change between requests:

```bash
#!/bin/bash
carbonapi_pid=$(pgrep -u carbon carbonapi)
if [ -z "$carbonapi_pid" ]; then
printf "carbonapi is not running\n"
exit 0
fi
render_url="localhost:8080/render"
target="testing.carbonapi.*.runtime.mem_stats.*"
range="from=-48h&until=now"
format="json"
request="${render_url}/render?target=${target}&${range}&format=${format}"
printf "Teasing carbonapi at $render_url\n"
rss_before=$(ps -q $carbonapi_pid --no-headers -o rss)
for points in {1000..2000}; do
curl --silent --show-error "${request}&maxDataPoints=$points" > /dev/null || break
rss_after=$(ps -q $carbonapi_pid --no-headers -o rss)
printf "%s # carbonapi RSS: %9d bytes (delta %6d bytes)\n" \
"maxDataPoints=$points" $rss_after $(($rss_after - $rss_before))
rss_before=$rss_after
sleep 1
done
```

With carbonapi response cache enabled and backend cache disabled, i.e.:
...and running the script on a small test VM, carbonapi runs out of memory pretty fast:
The selected metrics query and the size of the response affect the memory consumption rate, but the point here is that we hardly ever see the RSS figure going down. Sometimes the delta stays at zero for a few requests, but overall it's an almost linear increase. However, if we switch from using the response cache to the backend cache, or request the data in CSV format instead of JSON, carbonapi's memory consumption stays perfectly in control:
So a workaround for carbonapi's huge memory usage seems to be disabling the response cache and relying on the backend cache instead. We are not Go experts here, but a colleague of mine tried profiling the issue, and all the excess memory seems to be used by
Can you also grab and share a memory profile?
If the cache is enabled, that's a problem: different maxDataPoints values produce different data sets (because a different cache key is used). For example, these are the key-building functions:
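As a hypothetical illustration (not carbonapi's actual key-building code), a key that folds in maxDataPoints means every iteration of the test script above misses the cache and adds a new entry:

```go
package main

import "fmt"

// responseCacheKey is a made-up example of a key that includes
// maxDataPoints; the real key format differs, but the effect is the
// same: a new maxDataPoints value means a new cache entry.
func responseCacheKey(targets, from, until, format string, maxDataPoints int) string {
	return fmt.Sprintf("%s&%s&%s&%s&%d", targets, from, until, format, maxDataPoints)
}

func main() {
	for _, mdp := range []int{1000, 1001, 1002} {
		// Three requests for the same data, three distinct keys.
		fmt.Println(responseCacheKey("testing.carbonapi.*", "-48h", "now", "json", mdp))
	}
}
```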
@easterhanu If you need to protect against it, set a cache size limit.
@Civil here's a heap profile moments before the OOM:

@msaf1980 as far as I can tell, setting the response cache
Direct write to http.ResponseWriter may be a solution. I did some tests earlier, but didn't make a PR - in our environment we don't have memory overload (16 GB is not too costly for huge installations).
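A minimal sketch of the idea (not carbonapi's actual render path): instead of marshaling the whole reply into one byte slice and then writing it, the encoder writes straight to the ResponseWriter, so the handler never holds its own copy of the encoded reply.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// series is a simplified stand-in for a rendered metric.
type series struct {
	Target     string       `json:"target"`
	Datapoints [][2]float64 `json:"datapoints"`
}

// bufferedReply builds the entire encoded response as one byte slice
// before writing it out -- that slice is what a response cache would
// hold on to.
func bufferedReply(w http.ResponseWriter, data []series) error {
	b, err := json.Marshal(data)
	if err != nil {
		return err
	}
	_, err = w.Write(b)
	return err
}

// streamedReply is the "direct write" alternative: encoding goes
// straight to the ResponseWriter and the handler keeps no copy of
// the encoded reply.
func streamedReply(w http.ResponseWriter, data []series) error {
	w.Header().Set("Content-Type", "application/json")
	return json.NewEncoder(w).Encode(data)
}
```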
We did some more testing and debugging, and think the root cause for the response cache's huge memory consumption is this:
Line 123 in cdf42a3
bf0ffdc changed the way the byte slice is created from
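If that is indeed the change in question, the effect is easy to show in isolation (the numbers below are made up): a slice preallocated from a generous size estimate keeps its full capacity even when only a fraction is used, and putting that slice into the response cache keeps the whole backing array alive.

```go
package main

import "fmt"

func main() {
	// Illustrative numbers only -- not carbonapi's actual estimate.
	estimate := 10 << 20 // pretend the reply was estimated at 10 MiB
	buf := make([]byte, 0, estimate)
	buf = append(buf, `[{"target":"a","datapoints":[]}]`...)

	// The unused capacity is still part of the allocation, so caching
	// `buf` pins the whole 10 MiB backing array, not just len(buf) bytes.
	fmt.Printf("len=%d cap=%d\n", len(buf), cap(buf))
}
```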
There are other memory concerns too. During the past weekend we had two production servers running carbonapi with just the backend cache enabled (as a workaround for the response cache issues). Server A's carbonapi had steady memory consumption around ~50 MiB, whereas server B's carbonapi got killed by the kernel OOM killer after reaching almost 30 GiB. The carbonapi settings were the same for both servers, but B was serving some really heavy wildcard requests which would often fail with something like:

```
WARN zipper errors occurred while getting results {"type": "protoV2Group", "name": "http://xxxxx", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{ ... (insert tons of FetchRequests for different metrics) ... "errors": "max tries exceeded", "errorsVerbose": "max tries exceeded\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: error while fetching Response\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:34\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
```

Sometimes we could also see "could not expand globs - context canceled" errors on the go-carbon side. Switching
This needs some research. Can you test a version from a custom branch?
It's an attempt to get an approximated buffer size. Without it there are too many reallocations. You can see a benchmark in PR #729. Maybe the logic needs to change, or we should switch to direct write.
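The trade-off in one sketch (sizes are arbitrary): growing the buffer with plain append reallocates and copies repeatedly, while preallocating from an estimate does a single allocation, at the risk of overshooting as shown in the previous example.

```go
package main

import "fmt"

func main() {
	// Growing a reply buffer with plain append reallocates and copies
	// several times as the slice outgrows its capacity...
	var grown []byte
	reallocs := 0
	for i := 0; i < 1_000_000; i++ {
		if cap(grown) == len(grown) {
			reallocs++ // the next append will have to reallocate
		}
		grown = append(grown, 'x')
	}

	// ...while preallocating from a size estimate allocates once, but
	// may reserve far more capacity than the reply actually needs.
	pre := make([]byte, 0, 1_000_000)
	pre = append(pre, grown...)

	fmt.Println("reallocations with plain append:", reallocs, "preallocated cap:", cap(pre))
}
```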
@msaf1980 which branch would you like us to test?
Tomorrow I'll write the branch names. I'll adapt the branch with direct write to the current master, and maybe a version with an updated buffer preallocation policy.
Hm, if this is true, JSON marshaling is not the problem for the high memory usage. From the documentation:

@Civil I don't use go-carbon under heavy load.
It is niche. To have better results you need to have:
It mostly allows you to save on network and utilize the potential concurrency of the underlying storage. If you have a small number of servers, or slow I/O that does not handle concurrent requests well, I wouldn't recommend it, as I would expect worse performance. Potentially there is room to implement a heuristic to alternate between both, but that would require getting some information from go-carbon and much more performance data than I can gather myself. And I would strongly advise against splitting globs or anything fancy if your backend is a database that can scale by itself (e.g. ClickHouse).
Any settings to try?
Problem description
We just moved our large system from graphite-web to carbonapi -> go-carbon which results in much faster performance and also allows us to scale horizontally easily (with multiple backend servers defined in carbonapi).
My team manages the monitoring system, but not the individual dashboards and how they query - that is done by others. This results in very large queries to carbonapi (both in the number of metrics requested at once and in the date range).
The above causes carbonapi to run out of memory every few minutes on a 128 GB server. This never happened with graphite-web (it was slow, but its memory footprint was pretty small). I've tried to tune the following:
It seems to be related to having multiple backend servers and needing to merge responses, which is done in-memory. So there is no setting that really controls that - at least none documented.
So my question is -- how to restrict memory usage in carbonapi to avoid OOM?
carbonapi's version
v0.15.4
Did this happen before?
N/A - it did not happen before, but that was graphite-web with a single data server, not carbonapi with multiple backends.
carbonapi's config
backend software and config
go-carbon(s):
Query that causes problems
aliasByNode with multiple wildcards.