
Memory leak #22

Open
sevagh opened this issue Feb 20, 2019 · 14 comments

@sevagh
Contributor

sevagh commented Feb 20, 2019

Hello.

I just upgraded to the latest version of this adapter (with Prometheus 2.6 pollers), running with the following ES settings:

ES_WORKERS=4
ES_BATCH_MAX_DOCS=-1
ES_BATCH_MAX_SIZE=-1

The adapter has a memory leak. I added the pprof debug endpoint to dig in.
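Exposing that endpoint is only a few lines with net/http/pprof; this is a hypothetical sketch of the wiring (the adapter's actual setup may differ, I'm only assuming the same :9000 port I query later):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Serve the profiling endpoints on a side listener so they stay separate
    // from the adapter's own write/read handlers.
    go func() {
        log.Println(http.ListenAndServe(":9000", nil))
    }()

    select {} // stand-in for the adapter's real work
}

Here's the pprof/heap -top output: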

Showing nodes accounting for 154.44MB, 98.21% of 157.26MB total
Dropped 48 nodes (cum <= 0.79MB)
      flat  flat%   sum%        cum   cum%
   52.01MB 33.07% 33.07%    80.02MB 50.88%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
   45.51MB 28.94% 62.01%    45.51MB 28.94%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
   28.01MB 17.81% 79.82%    28.01MB 17.81%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
   18.50MB 11.76% 91.59%    18.50MB 11.76%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal

Another one:

File: prometheus-es-adapter
Type: inuse_space
Time: Feb 20, 2019 at 4:14am (PST)
Showing nodes accounting for 357.28MB, 99.08% of 360.59MB total
Dropped 57 nodes (cum <= 1.80MB)
      flat  flat%   sum%        cum   cum%
  110.53MB 30.65% 30.65%   168.04MB 46.60%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
   92.51MB 25.66% 56.31%    93.01MB 25.80%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
   57.51MB 15.95% 72.26%    57.51MB 15.95%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
   39.50MB 10.95% 83.21%    39.50MB 10.95%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal
   25.30MB  7.02% 90.23%    25.30MB  7.02%  bytes.(*Buffer).String (inline)

It's mostly from this method: https://github.com/pwillie/prometheus-es-adapter/blob/master/pkg/elasticsearch/write.go#L69

Moreover, inuse_space doesn't seem to show the source of the leak - I think it's a GC problem (though I'm not sure how much further I can debug this). With pprof -alloc_space, the bulk of the leak is evident:

Dropped 240 nodes (cum <= 0.48GB)
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
   25.93GB 26.98% 26.98%    44.49GB 46.29%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
   11.08GB 11.53% 38.51%    11.08GB 11.53%  bytes.makeSlice
    5.94GB  6.18% 44.69%     9.33GB  9.70%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
    4.20GB  4.37% 49.06%    18.56GB 19.31%  encoding/json.Marshal
    4.08GB  4.25% 53.31%     6.39GB  6.65%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*TimeSeries).Unmarshal
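
A note on why Source dominates alloc_space together with bytes.makeSlice and encoding/json.Marshal: every sample queued for bulk indexing gets serialized into two fresh JSON lines (action metadata plus document body). This is only a paraphrase of what that boils down to, not the library's actual code:

package main

import (
    "encoding/json"
    "fmt"
)

// bulkIndexSource is a hypothetical paraphrase: one JSON line of action
// metadata and one JSON line for the document, both freshly allocated for
// every single sample that goes into the bulk buffer.
func bulkIndexSource(index, docType string, doc interface{}) ([]string, error) {
    action, err := json.Marshal(map[string]interface{}{
        "index": map[string]interface{}{"_index": index, "_type": docType},
    })
    if err != nil {
        return nil, err
    }
    body, err := json.Marshal(doc)
    if err != nil {
        return nil, err
    }
    return []string{string(action), string(body)}, nil
}

func main() {
    lines, _ := bulkIndexSource("some-index", "sample", map[string]interface{}{"value": 1.5})
    fmt.Println(lines[0])
    fmt.Println(lines[1])
}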

Running with pprof list Write:

ROUTINE ======================== github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write in /go/src/github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch/write.go
    6.52GB    10.23GB (flat, cum)  9.73% of Total
         .          .     66:}
         .          .     67:
         .          .     68:// Write will enqueue Prometheus sample data to be batch written to Elasticsearch
         .          .     69:func (svc *WriteService) Write(req []*prompb.TimeSeries) {
         .          .     70:   for _, ts := range req {
  884.04MB   884.04MB     71:           metric := make(model.Metric, len(ts.Labels))
         .          .     72:           for _, l := range ts.Labels {
    5.11GB     5.11GB     73:                   metric[model.LabelName(l.Name)] = model.LabelValue(l.Value)
         .          .     74:           }
         .          .     75:           for _, s := range ts.Samples {
         .          .     76:                   v := float64(s.Value)
         .          .     77:                   if math.IsNaN(v) || math.IsInf(v, 0) {
         .    41.51MB     78:                           svc.logger.Debug(fmt.Sprintf("invalid value %+v, skipping sample %+v", v, s))
         .          .     79:                           continue
         .          .     80:                   }
         .          .     81:                   sample := prometheusSample{
         .          .     82:                           metric,
         .          .     83:                           v,
         .          .     84:                           s.Timestamp,
         .          .     85:                   }
         .          .     86:                   r := elastic.
         .     3.67GB     87:                           NewBulkIndexRequest().
         .          .     88:                           Index(svc.config.Alias).
         .          .     89:                           Type(sampleType).

Here's the Prometheus protobuf unmarshaling code:

ROUTINE ======================== github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*WriteRequest).Unmarshal in /go/src/github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb/remote.pb.go
    1.25GB     8.24GB (flat, cum)  7.83% of Total
         .          .    493:                   }
         .          .    494:                   postIndex := iNdEx + msglen
         .          .    495:                   if postIndex > l {
         .          .    496:                           return io.ErrUnexpectedEOF
         .          .    497:                   }
    1.25GB     1.25GB    498:                   m.Timeseries = append(m.Timeseries, &TimeSeries{})
         .     6.99GB    499:                   if err := m.Timeseries[len(m.Timeseries)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil {
         .          .    500:                           return err
         .          .    501:                   }
         .          .    502:                   iNdEx = postIndex
         .          .    503:           default:
@sevagh
Contributor Author

sevagh commented Feb 20, 2019

Looks like once the metric map is constructed and captured in the BulkIndexRequest's sample doc, it's never freed/released?

@sevagh
Contributor Author

sevagh commented Feb 20, 2019

Inuse space good:

shanssian:prometheus-es-adapter $ go tool pprof -inuse_space -top http://bleep:9000/debug/pprof/heap | head -n10
Fetching profile over HTTP from http://bleep:9000/debug/pprof/heap
Saved profile in /home/shanssian/pprof/pprof.prometheus-es-adapter.alloc_objects.alloc_space.inuse_objects.inuse_space.039.pb.gz
File: prometheus-es-adapter
Type: inuse_space
Time: Feb 20, 2019 at 4:40am (PST)
Showing nodes accounting for 65.88MB, 100% of 65.88MB total
      flat  flat%   sum%        cum   cum%
   25.01MB 37.96% 37.96%    32.51MB 49.34%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
      20MB 30.36% 68.32%       20MB 30.36%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
    7.50MB 11.39% 79.71%     7.50MB 11.39%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
       7MB 10.63% 90.33%        7MB 10.63%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal
    2.40MB  3.64% 93.97%     2.40MB  3.64%  bytes.(*Buffer).String (inline)

Alloc_space bad:

File: prometheus-es-adapter
Type: alloc_space
Time: Feb 20, 2019 at 4:40am (PST)
Showing nodes accounting for 167229.67MB, 98.79% of 169269.66MB total
Dropped 264 nodes (cum <= 846.35MB)
      flat  flat%   sum%        cum   cum%
45555.63MB 26.91% 26.91% 78240.27MB 46.22%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
19662.38MB 11.62% 38.53% 19662.38MB 11.62%  bytes.makeSlice
10465.85MB  6.18% 44.71% 16375.52MB  9.67%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
 7419.20MB  4.38% 49.10% 32684.64MB 19.31%  encoding/json.Marshal

PNG dump of alloc_space:
[graph image: profile006]

@sevagh
Contributor Author

sevagh commented Feb 20, 2019

Inuse_space for good measure (since I may be misunderstanding alloc_space):
[graph image: profile007]
Also, runtime.MemStats:

# runtime.MemStats
# Alloc = 163905888
# TotalAlloc = 197824212744
# Sys = 1208749816
# Lookups = 0
# Mallocs = 2158845849
# Frees = 2157028671
# HeapAlloc = 163905888
# HeapSys = 1136525312
# HeapIdle = 778420224
# HeapInuse = 358105088
# HeapReleased = 0
# HeapObjects = 1817178
# Stack = 4325376 / 4325376
# MSpan = 8442536 / 15089664
# MCache = 55296 / 65536
# BuckHashSys = 1904911
# GCSys = 45088768
# OtherSys = 5750249
# NextGC = 188449376
# LastGC = 1550666769451819398
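
Reading that dump: Mallocs minus Frees matches HeapObjects (about 1.8M live objects), and HeapReleased is 0, so a lot of heap is sitting idle rather than being returned to the OS. To keep watching these numbers over time, a small helper like this could run inside the adapter (hypothetical code, not part of the adapter today):

package main

import (
    "log"
    "runtime"
    "time"
)

// logMemStats is a hypothetical helper: run it as a goroutine to log heap
// behaviour periodically.
func logMemStats(every time.Duration) {
    var m runtime.MemStats
    for range time.Tick(every) {
        runtime.ReadMemStats(&m)
        log.Printf("HeapAlloc=%d HeapInuse=%d HeapIdle=%d HeapReleased=%d HeapObjects=%d NumGC=%d",
            m.HeapAlloc, m.HeapInuse, m.HeapIdle, m.HeapReleased, m.HeapObjects, m.NumGC)
    }
}

func main() {
    go logMemStats(30 * time.Second)
    select {} // stand-in for the adapter's real work
}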

@sevagh
Contributor Author

sevagh commented Feb 20, 2019

I'm still monitoring the situation, but it looks like adding ES_BATCH_MAX_AGE=30 prevents the issue. Perhaps it's related to the GC running?
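
One way that would add up: if ES_BATCH_MAX_DOCS=-1 and ES_BATCH_MAX_SIZE=-1 disable the count and size triggers and there is no max age, the bulk processor may never have a reason to commit, so every queued BulkIndexRequest (and the metric map its doc references) stays reachable and the GC can't collect any of it. Assuming the env vars map onto olivere/elastic's BulkProcessor roughly like this (a hypothetical sketch of the wiring, not the adapter's actual code):

package main

import (
    "context"
    "time"

    elastic "gopkg.in/olivere/elastic.v6"
)

// newBulkProcessor sketches how the batch settings presumably translate; with
// maxDocs = -1, maxSize = -1 and a zero flush interval there is no automatic
// commit trigger at all.
func newBulkProcessor(client *elastic.Client, workers, maxDocs, maxSize int, maxAge time.Duration) (*elastic.BulkProcessor, error) {
    svc := client.BulkProcessor().
        Workers(workers).     // ES_WORKERS
        BulkActions(maxDocs). // ES_BATCH_MAX_DOCS (-1 disables the doc-count trigger)
        BulkSize(maxSize)     // ES_BATCH_MAX_SIZE (-1 disables the byte-size trigger)
    if maxAge > 0 {
        svc = svc.FlushInterval(maxAge) // ES_BATCH_MAX_AGE, e.g. 30s
    }
    return svc.Do(context.Background())
}

func main() {} // sketch only; the adapter builds its own client and processor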

@sevagh
Contributor Author

sevagh commented Feb 21, 2019

I've been comparing the code with the version of this adapter I previously had deployed (which still works fine). Since your wrapper around the BulkWriter is largely the same, just moved around to different files, the only change I can see is this:
616b927#diff-836546cc53507f6b2d581088903b1785
i.e. the Elastic client was changed. I'll file an issue there.

@pwillie
Owner

pwillie commented Feb 27, 2019

Thanks for the detailed report. Not sure when I will get to this but please keep me posted with any further developments.

@Kanshiroron

Do you guys have any update on this? We've been using this adapter and it gets regularly evicted from Kubernetes as it's consuming too much RAM (10GB+ in under 24H)...
Thanks for your support!

@sevagh
Contributor Author

sevagh commented May 4, 2019

I've actually switched to using metricbeat and the Prometheus module. You can use metricbeat to scrape the same targets as Prometheus, rather than go through the remote_* adapters.

@Kanshiroron

Thanks @sevagh for your workaround, sadly that won't work for us as the Prometheus module doesn't seem to be compatible with AWS/Kubernetes service discovery...

@pwillie
Owner

pwillie commented May 13, 2019

@Kanshiroron, I haven't had a chance to dig into this one yet sorry.

@Kanshiroron

Hey @pwillie, I'm actually testing a workaround; I'll keep you posted with the results.
Thanks for following up.

@Kanshiroron

Hey @pwillie, after tweaking the adapter configuration I no longer see the issue. It has been running for 10 days now without any problems.
I changed the default settings to the following:

  ES_WORKERS: "20"
  ES_BATCH_MAX_AGE: "30" # in seconds
  ES_BATCH_MAX_DOCS: "5000"
  ES_BATCH_MAX_SIZE: "16384" # in bytes
  ES_INDEX_SHARDS: "1"
  ES_INDEX_REPLICAS: "2"

I suspect the number of shards and replicas was causing the issue (we used to have 5 index shards and 3 replicas).

@pwillie
Owner

pwillie commented May 28, 2019

Thanks for the update and great to hear of your experience.

@Kanshiroron

You're welcome.
What I think was happening is that the es-adapter was receiving more metrics than it was able to push to Elasticsearch.
It would be nice to have an internal check to make sure the number of metrics in the buffer doesn't keep growing; see the sketch below.
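
Purely as a sketch of what such a check could look like (hypothetical code, not something the adapter has today): count documents in when they are queued, count them out in the bulk processor's After callback, and shed or block once a limit is hit.

package main

import (
    "errors"
    "sync/atomic"

    elastic "gopkg.in/olivere/elastic.v6"
)

// bulkGuard is a hypothetical guard around the bulk buffer.
type bulkGuard struct {
    pending int64 // documents queued but not yet committed
    limit   int64
}

var errBufferFull = errors.New("bulk buffer full, dropping sample")

// Add refuses new documents once the backlog exceeds the limit; a real
// implementation might block or apply backpressure instead of dropping.
func (g *bulkGuard) Add(p *elastic.BulkProcessor, r elastic.BulkableRequest) error {
    if atomic.LoadInt64(&g.pending) >= g.limit {
        return errBufferFull
    }
    atomic.AddInt64(&g.pending, 1)
    p.Add(r)
    return nil
}

// after is meant to be registered via the bulk processor's After(...) hook so
// the counter only drops once Elasticsearch has actually taken the batch.
func (g *bulkGuard) after(_ int64, requests []elastic.BulkableRequest, _ *elastic.BulkResponse, _ error) {
    atomic.AddInt64(&g.pending, -int64(len(requests)))
}

func main() {} // sketch only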
