
Memory leak #22

Open
sevagh opened this issue Feb 20, 2019 · 14 comments

@sevagh
Contributor

sevagh commented Feb 20, 2019

Hello.

I just upgraded to the latest version of this adapter (with Prometheus 2.6 pollers), running with the following ES settings:

ES_WORKERS=4
ES_BATCH_MAX_DOCS=-1
ES_BATCH_MAX_SIZE=-1

The adapter has a memory leak. I added the pprof debug endpoint to dig in.
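Exposing that endpoint is only a few lines with net/http/pprof; this is a hypothetical sketch of the wiring (the adapter's actual setup may differ, I'm only assuming the same :9000 port I query later):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Serve the profiling endpoints on a side listener so they stay separate
    // from the adapter's own write/read handlers.
    go func() {
        log.Println(http.ListenAndServe(":9000", nil))
    }()

    select {} // stand-in for the adapter's real work
}

Here's the pprof/heap -top output: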

Showing nodes accounting for 154.44MB, 98.21% of 157.26MB total
Dropped 48 nodes (cum <= 0.79MB)
      flat  flat%   sum%        cum   cum%
   52.01MB 33.07% 33.07%    80.02MB 50.88%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
   45.51MB 28.94% 62.01%    45.51MB 28.94%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
   28.01MB 17.81% 79.82%    28.01MB 17.81%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
   18.50MB 11.76% 91.59%    18.50MB 11.76%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal

Another one:

File: prometheus-es-adapter
Type: inuse_space
Time: Feb 20, 2019 at 4:14am (PST)
Showing nodes accounting for 357.28MB, 99.08% of 360.59MB total
Dropped 57 nodes (cum <= 1.80MB)
      flat  flat%   sum%        cum   cum%
  110.53MB 30.65% 30.65%   168.04MB 46.60%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
   92.51MB 25.66% 56.31%    93.01MB 25.80%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
   57.51MB 15.95% 72.26%    57.51MB 15.95%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
   39.50MB 10.95% 83.21%    39.50MB 10.95%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal
   25.30MB  7.02% 90.23%    25.30MB  7.02%  bytes.(*Buffer).String (inline)

It's mostly from this method: https://github.com/pwillie/prometheus-es-adapter/blob/master/pkg/elasticsearch/write.go#L69

Moreover, inuse_space doesn't seem to show the source of the leak - I think it's a GC problem (though I'm not sure how much further I can debug this). With pprof -alloc_space, the bulk of the leak is evident:

Dropped 240 nodes (cum <= 0.48GB)
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
   25.93GB 26.98% 26.98%    44.49GB 46.29%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
   11.08GB 11.53% 38.51%    11.08GB 11.53%  bytes.makeSlice
    5.94GB  6.18% 44.69%     9.33GB  9.70%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
    4.20GB  4.37% 49.06%    18.56GB 19.31%  encoding/json.Marshal
    4.08GB  4.25% 53.31%     6.39GB  6.65%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*TimeSeries).Unmarshal
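
A note on why Source dominates alloc_space together with bytes.makeSlice and encoding/json.Marshal: every sample queued for bulk indexing gets serialized into two fresh JSON lines (action metadata plus document body). This is only a paraphrase of what that boils down to, not the library's actual code:

package main

import (
    "encoding/json"
    "fmt"
)

// bulkIndexSource is a hypothetical paraphrase: one JSON line of action
// metadata and one JSON line for the document, both freshly allocated for
// every single sample that goes into the bulk buffer.
func bulkIndexSource(index, docType string, doc interface{}) ([]string, error) {
    action, err := json.Marshal(map[string]interface{}{
        "index": map[string]interface{}{"_index": index, "_type": docType},
    })
    if err != nil {
        return nil, err
    }
    body, err := json.Marshal(doc)
    if err != nil {
        return nil, err
    }
    return []string{string(action), string(body)}, nil
}

func main() {
    lines, _ := bulkIndexSource("some-index", "sample", map[string]interface{}{"value": 1.5})
    fmt.Println(lines[0])
    fmt.Println(lines[1])
}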

Running with pprof list Write:

ROUTINE ======================== github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write in /go/src/github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch/write.go
    6.52GB    10.23GB (flat, cum)  9.73% of Total
         .          .     66:}
         .          .     67:
         .          .     68:// Write will enqueue Prometheus sample data to be batch written to Elasticsearch
         .          .     69:func (svc *WriteService) Write(req []*prompb.TimeSeries) {
         .          .     70:   for _, ts := range req {
  884.04MB   884.04MB     71:           metric := make(model.Metric, len(ts.Labels))
         .          .     72:           for _, l := range ts.Labels {
    5.11GB     5.11GB     73:                   metric[model.LabelName(l.Name)] = model.LabelValue(l.Value)
         .          .     74:           }
         .          .     75:           for _, s := range ts.Samples {
         .          .     76:                   v := float64(s.Value)
         .          .     77:                   if math.IsNaN(v) || math.IsInf(v, 0) {
         .    41.51MB     78:                           svc.logger.Debug(fmt.Sprintf("invalid value %+v, skipping sample %+v", v, s))
         .          .     79:                           continue
         .          .     80:                   }
         .          .     81:                   sample := prometheusSample{
         .          .     82:                           metric,
         .          .     83:                           v,
         .          .     84:                           s.Timestamp,
         .          .     85:                   }
         .          .     86:                   r := elastic.
         .     3.67GB     87:                           NewBulkIndexRequest().
         .          .     88:                           Index(svc.config.Alias).
         .          .     89:                           Type(sampleType).

Here's the Prometheus protobuf unmarshaling code:

ROUTINE ======================== github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*WriteRequest).Unmarshal in /go/src/github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb/remote.pb.go
    1.25GB     8.24GB (flat, cum)  7.83% of Total
         .          .    493:                   }
         .          .    494:                   postIndex := iNdEx + msglen
         .          .    495:                   if postIndex > l {
         .          .    496:                           return io.ErrUnexpectedEOF
         .          .    497:                   }
    1.25GB     1.25GB    498:                   m.Timeseries = append(m.Timeseries, &TimeSeries{})
         .     6.99GB    499:                   if err := m.Timeseries[len(m.Timeseries)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil {
         .          .    500:                           return err
         .          .    501:                   }
         .          .    502:                   iNdEx = postIndex
         .          .    503:           default:
@sevagh
Contributor Author

sevagh commented Feb 20, 2019

Looks like once the metric map is constructed and captured in the BulkIndexRequest's sample doc, it's never freed/released?

@sevagh
Contributor Author

sevagh commented Feb 20, 2019

Inuse space good:

shanssian:prometheus-es-adapter $ go tool pprof -inuse_space -top http://bleep:9000/debug/pprof/heap | head -n10
Fetching profile over HTTP from http://bleep:9000/debug/pprof/heap
Saved profile in /home/shanssian/pprof/pprof.prometheus-es-adapter.alloc_objects.alloc_space.inuse_objects.inuse_space.039.pb.gz
File: prometheus-es-adapter
Type: inuse_space
Time: Feb 20, 2019 at 4:40am (PST)
Showing nodes accounting for 65.88MB, 100% of 65.88MB total
      flat  flat%   sum%        cum   cum%
   25.01MB 37.96% 37.96%    32.51MB 49.34%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
      20MB 30.36% 68.32%       20MB 30.36%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
    7.50MB 11.39% 79.71%     7.50MB 11.39%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.NewBulkIndexRequest (inline)
       7MB 10.63% 90.33%        7MB 10.63%  github.com/pwillie/prometheus-es-adapter/vendor/github.com/prometheus/prometheus/prompb.(*Label).Unmarshal
    2.40MB  3.64% 93.97%     2.40MB  3.64%  bytes.(*Buffer).String (inline)

Alloc_space bad:

File: prometheus-es-adapter
Type: alloc_space
Time: Feb 20, 2019 at 4:40am (PST)
Showing nodes accounting for 167229.67MB, 98.79% of 169269.66MB total
Dropped 264 nodes (cum <= 846.35MB)
      flat  flat%   sum%        cum   cum%
45555.63MB 26.91% 26.91% 78240.27MB 46.22%  github.com/pwillie/prometheus-es-adapter/vendor/gopkg.in/olivere/elastic%2ev6.(*BulkIndexRequest).Source
19662.38MB 11.62% 38.53% 19662.38MB 11.62%  bytes.makeSlice
10465.85MB  6.18% 44.71% 16375.52MB  9.67%  github.com/pwillie/prometheus-es-adapter/pkg/elasticsearch.(*WriteService).Write
 7419.20MB  4.38% 49.10% 32684.64MB 19.31%  encoding/json.Marshal

PNG dump of alloc_space:
[graph image: profile006]

@sevagh
Contributor Author

sevagh commented Feb 20, 2019

Inuse_space for good measure (since I may be misunderstanding alloc_space):
[graph image: profile007]
Also, runtime.MemStats:

# runtime.MemStats
# Alloc = 163905888
# TotalAlloc = 197824212744
# Sys = 1208749816
# Lookups = 0
# Mallocs = 2158845849
# Frees = 2157028671
# HeapAlloc = 163905888
# HeapSys = 1136525312
# HeapIdle = 778420224
# HeapInuse = 358105088
# HeapReleased = 0
# HeapObjects = 1817178
# Stack = 4325376 / 4325376
# MSpan = 8442536 / 15089664
# MCache = 55296 / 65536
# BuckHashSys = 1904911
# GCSys = 45088768
# OtherSys = 5750249
# NextGC = 188449376
# LastGC = 1550666769451819398
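
Reading that dump: Mallocs minus Frees matches HeapObjects (about 1.8M live objects), and HeapReleased is 0, so a lot of heap is sitting idle rather than being returned to the OS. To keep watching these numbers over time, a small helper like this could run inside the adapter (hypothetical code, not part of the adapter today):

package main

import (
    "log"
    "runtime"
    "time"
)

// logMemStats is a hypothetical helper: run it as a goroutine to log heap
// behaviour periodically.
func logMemStats(every time.Duration) {
    var m runtime.MemStats
    for range time.Tick(every) {
        runtime.ReadMemStats(&m)
        log.Printf("HeapAlloc=%d HeapInuse=%d HeapIdle=%d HeapReleased=%d HeapObjects=%d NumGC=%d",
            m.HeapAlloc, m.HeapInuse, m.HeapIdle, m.HeapReleased, m.HeapObjects, m.NumGC)
    }
}

func main() {
    go logMemStats(30 * time.Second)
    select {} // stand-in for the adapter's real work
}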

@sevagh
Contributor Author

sevagh commented Feb 20, 2019

I'm still monitoring the situation, but it looks like adding ES_BATCH_MAX_AGE=30 prevents the issue. Perhaps it's related to the GC running?
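
One way that would add up: if ES_BATCH_MAX_DOCS=-1 and ES_BATCH_MAX_SIZE=-1 disable the count and size triggers and there is no max age, the bulk processor may never have a reason to commit, so every queued BulkIndexRequest (and the metric map its doc references) stays reachable and the GC can't collect any of it. Assuming the env vars map onto olivere/elastic's BulkProcessor roughly like this (a hypothetical sketch of the wiring, not the adapter's actual code):

package main

import (
    "context"
    "time"

    elastic "gopkg.in/olivere/elastic.v6"
)

// newBulkProcessor sketches how the batch settings presumably translate; with
// maxDocs = -1, maxSize = -1 and a zero flush interval there is no automatic
// commit trigger at all.
func newBulkProcessor(client *elastic.Client, workers, maxDocs, maxSize int, maxAge time.Duration) (*elastic.BulkProcessor, error) {
    svc := client.BulkProcessor().
        Workers(workers).     // ES_WORKERS
        BulkActions(maxDocs). // ES_BATCH_MAX_DOCS (-1 disables the doc-count trigger)
        BulkSize(maxSize)     // ES_BATCH_MAX_SIZE (-1 disables the byte-size trigger)
    if maxAge > 0 {
        svc = svc.FlushInterval(maxAge) // ES_BATCH_MAX_AGE, e.g. 30s
    }
    return svc.Do(context.Background())
}

func main() {} // sketch only; the adapter builds its own client and processor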

@sevagh
Contributor Author

sevagh commented Feb 21, 2019

I've been comparing the code with the version of this adapter I previously had deployed (which still works fine). Since your wrapper around the BulkWriter is largely the same, just moved around to different files, the only change I can see is this:
616b927#diff-836546cc53507f6b2d581088903b1785
i.e. the Elastic client was changed. I'll file an issue there.

@pwillie
Owner

pwillie commented Feb 27, 2019

Thanks for the detailed report. Not sure when I will get to this but please keep me posted with any further developments.

@Kanshiroron

Do you guys have any update on this? We've been using this adapter and it gets regularly evicted from Kubernetes as it's consuming too much RAM (10GB+ in under 24H)...
Thanks for your support!

@sevagh
Contributor Author

sevagh commented May 4, 2019

I've actually switched to using metricbeat and the Prometheus module. You can use metricbeat to scrape the same targets as Prometheus, rather than go through the remote_* adapters.

@Kanshiroron

Thanks @sevagh for your workaround, sadly that won't work for us as the Prometheus module doesn't seem to be compatible with AWS/Kubernetes service discovery...

@pwillie
Owner

pwillie commented May 13, 2019

@Kanshiroron, I haven't had a chance to dig into this one yet sorry.

@Kanshiroron

Hey @pwillie, I'm actually testing a workaround; I'll keep you posted with the results.
Thanks for following up.

@Kanshiroron

Hey @pwillie, after tweaking the adapter configuration I no longer see the issue. It has been running for 10 days now without any problems.
I changed the default settings to the following:

  ES_WORKERS: "20"
  ES_BATCH_MAX_AGE: "30" # in seconds
  ES_BATCH_MAX_DOCS: "5000"
  ES_BATCH_MAX_SIZE: "16384" # in bytes
  ES_INDEX_SHARDS: "1"
  ES_INDEX_REPLICAS: "2"

I suspect the number of shards and replicas was causing the issue (we used to have 5 index shards and 3 replicas).

@pwillie
Owner

pwillie commented May 28, 2019

Thanks for the update and great to hear of your experience.

@Kanshiroron

You're welcome.
What I think was happening is that the es-adapter was receiving more metrics than it was able to push to Elasticsearch.
It would be nice to have an internal check to make sure the number of metrics in the buffer doesn't keep growing; see the sketch below.
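
Purely as a sketch of what such a check could look like (hypothetical code, not something the adapter has today): count documents in when they are queued, count them out in the bulk processor's After callback, and shed or block once a limit is hit.

package main

import (
    "errors"
    "sync/atomic"

    elastic "gopkg.in/olivere/elastic.v6"
)

// bulkGuard is a hypothetical guard around the bulk buffer.
type bulkGuard struct {
    pending int64 // documents queued but not yet committed
    limit   int64
}

var errBufferFull = errors.New("bulk buffer full, dropping sample")

// Add refuses new documents once the backlog exceeds the limit; a real
// implementation might block or apply backpressure instead of dropping.
func (g *bulkGuard) Add(p *elastic.BulkProcessor, r elastic.BulkableRequest) error {
    if atomic.LoadInt64(&g.pending) >= g.limit {
        return errBufferFull
    }
    atomic.AddInt64(&g.pending, 1)
    p.Add(r)
    return nil
}

// after is meant to be registered via the bulk processor's After(...) hook so
// the counter only drops once Elasticsearch has actually taken the batch.
func (g *bulkGuard) after(_ int64, requests []elastic.BulkableRequest, _ *elastic.BulkResponse, _ error) {
    atomic.AddInt64(&g.pending, -int64(len(requests)))
}

func main() {} // sketch only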
