Cached query/filter performance degradation from 1.7.3 to 2.3.2 #18680


Closed

jfenc91 opened this issue Jun 1, 2016 · 19 comments
Labels
feedback_needed :Search/Search Search-related issues that do not fall into other categories

Comments

@jfenc91

jfenc91 commented Jun 1, 2016

Elasticsearch version: 2.3.2

JVM version: 1.8.0_91-b14

OS version: 3.13.0-62-generic #102-Ubuntu

I have a set of 300 queries that are submitted concurrently. When tested on a 1.7.3 cluster the queries complete in less than 2 seconds, whereas on a 2.3.2 cluster with the same hardware they take 60+ seconds to complete.

Here are the hot threads captured during the queries on the 2.3.2 cluster:

Using the default store:
https://gist.github.com/jfenc91/6c8aec1dcca232be533743528aa544a2

Using the mmap store (I have enough RAM for everything to be in memory):
https://gist.github.com/jfenc91/85d30a17a5684a2a2d7b0e9e8924ba0f

Before the filter cache is checked, there is apparently a call to query.createWeight that can get pretty expensive: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L904-L907

Hiding the complexity of queries was part of the awesomeness of cached filters/queries. It seems that with these changes, some of the query evaluation, where terms are looked up against the index, still happens even though the query results are cached, because of that call to createWeight: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/PhraseQuery.java#L390-L395
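
To make the order of operations concrete, here is a rough Lucene 5.x sketch (index path, field names and cache sizes are made up, and it only mirrors, not reproduces, what ES does internally): even when the cache already holds an entry for this query, the searcher still builds the Weight up front, and for a phrase query that is exactly where the terms dictionary gets seeked.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;
import org.apache.lucene.store.FSDirectory;

public class CachedPhraseQueryDemo {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Roughly how the Lucene query cache gets wired up (sizes are illustrative).
      searcher.setQueryCache(new LRUQueryCache(10_000, 250 * 1024 * 1024));
      searcher.setQueryCachingPolicy(new UsageTrackingQueryCachingPolicy());

      Query phrase = new PhraseQuery("body", "quick", "brown", "fox");
      Query query = new ConstantScoreQuery(phrase); // non-scoring, so it is cacheable

      searcher.search(query, 10); // may populate the cache after enough reuse
      searcher.search(query, 10); // even on a cache hit, createWeight() runs first,
                                  // and PhraseQuery.createWeight seeks the terms dict
    }
  }
}
```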

I feel like extremely complex queries that change only occasionally were one of the best use cases for caching, but that doesn't really seem to work with the cache rework that came with ES 2.x. If there are intended alternative ways to handle this, I would love to know. Thanks for reading!

@jimczi
Contributor

jimczi commented Jun 1, 2016

Can you share the queries you're running? The createWeight call you are mentioning is supposed to be very fast, but it seems that you have phrase queries with a lot of terms. What is the average number of terms in your phrase queries?

@clintongormley clintongormley added the feedback_needed, :Search/Search and :Cache labels Jun 1, 2016
@clintongormley
Contributor

And is this a real-world test case? Caching is based on query expense and reuse. Does the test reflect that?

@jfenc91
Author

jfenc91 commented Jun 1, 2016

Thanks for taking a look @jimferenczi @clintongormley !

This is a real world use case. The traffic against this cluster was replicated from a production setting.

I have confirmed that caching is being used to some extent by checking the cluster metrics. Additionally, I previously had to debug an issue where I set indices.queries.cache.count too high, which caused slow cache lookups; currently I have it set at 500,000, which does not seem to cause any issues. If the cache were not being used, I would expect to see many more stack traces related to query evaluation (which I did see the first few times the queries were executed on the cluster).

Unfortunately the exact query is not mine to share, but here is an anonymized sample version. Admittedly it is still structured for 1.7.3 caching. However, when I was profiling with hot threads I did account for the fact that a query needs a certain amount of reuse before it ends up in the LRU cache.

https://gist.github.com/jfenc91/240616ff625bf4d60957c42a6f7fc0f2

In total the query has somewhere between 1k and 1.5k terms. The phrase queries range from 1 to 12 tokens and are typically no more than 5. The parts of this query that are intended for caching get reused a few hundred times a day and tend to change anywhere from a few times a week to a few times a month.

Some other stats:
Number of indices: 7
Shards per index: 5
Total segments at time of test: ~350
Available cores on data nodes: 160

Cheers!

@jpountz
Contributor

jpountz commented Jun 1, 2016

I think you are right: we should try to avoid seeking the terms dict when we are going to use a cached entry anyway.

@jpountz
Contributor

jpountz commented Jun 2, 2016

I opened https://issues.apache.org/jira/browse/LUCENE-7311 for discussion.

@jfenc91
Author

jfenc91 commented Jun 2, 2016

Thanks for jumping in here @jpountz!

So I did a quick hack to make the call to createWeight lazier:
jfenc91/lucene-solr@release_5.5.0...jfenc91:feature_queryCacheHack
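
The idea, as a rough standalone sketch (this is not the code in that branch; the HashMap here is just a stand-in for LRUQueryCache): only build the real Weight, and therefore only seek the terms dictionary, when a segment actually misses the cache.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Weight;

// Illustrative only: defer the expensive createWeight until a segment misses the cache.
final class LazyCachedQuery {
  private final Query query;
  private final IndexSearcher searcher;
  private final Map<Object, DocIdSet> cache = new HashMap<>(); // stand-in for LRUQueryCache
  private Weight delegate; // built lazily, on the first cache miss

  LazyCachedQuery(Query query, IndexSearcher searcher) {
    this.query = query;
    this.searcher = searcher;
  }

  private Weight delegate() throws IOException {
    if (delegate == null) {
      // The expensive part: for a PhraseQuery this is where the terms dictionary is seeked.
      delegate = searcher.createNormalizedWeight(query, false);
    }
    return delegate;
  }

  DocIdSetIterator iterator(LeafReaderContext ctx) throws IOException {
    DocIdSet cached = cache.get(ctx.reader().getCoreCacheKey());
    if (cached != null) {
      return cached.iterator(); // cache hit: createWeight never runs for this segment
    }
    // Cache miss: pay for createWeight now and iterate the segment
    // (in Lucene 5.x a Scorer is itself a DocIdSetIterator; populating the cache is omitted).
    return delegate().scorer(ctx);
  }
}
```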

Unfortunately, even with that change, the caching is not competitive with the 1.7.3 version. Here is what I am now getting out of hot threads:
https://gist.github.com/jfenc91/88a3179cf3c3136964c1a8167fbf8b73

So I am left with a few questions/thoughts

  1. I am seeing a bunch of LRUQueryCache.java:643, which in my fork happens when shouldCache(context) == false. So for some reason some parts of the query are not being cached. Since I am fairly certain I am meeting the LRU usage requirement, I am less sure about this line: ReaderUtil.getTopLevelContext(context).reader().maxDoc(). Is that basically the number of docs that the query is filtering against, or the number of docs we would get if the query were “*” instead of some other term? (See the sketch after this list.)

  2. In 1.7.3 I specified what needed to be cached, so my query with 1.5k terms resulted in one cached result. Here it seems like Lucene is decomposing the query and building a cache entry for individual portions of it? Should the top-level boolean query be cached as well? If so, is the idea that the other portions of the query will eventually get LRU-ed out of the cache and we will just be left with the bitset for the top-level boolean query?
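
For reference on question 1, my current reading of the 5.5 source, paraphrased from memory (so constants and method shape are approximate, not the exact Lucene code): that maxDoc is the document count of the whole top-level (shard-level) reader, i.e. roughly what a match_all would see, not the number of hits for the query, and the guard compares the segment's size against it.

```java
import org.apache.lucene.index.IndexReaderContext;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;

// Paraphrase of the segment-size guard as I read it; treat the constants as approximate.
final class SegmentSizeGuard {
  static boolean shouldCacheOnThisSegment(LeafReaderContext context) {
    IndexReaderContext top = ReaderUtil.getTopLevelContext(context);
    int indexMaxDoc = top.reader().maxDoc();       // total docs in the shard-level reader (a "*"-style count)
    int segmentMaxDoc = context.reader().maxDoc(); // docs in this one segment
    // Skip tiny indices entirely, and skip segments holding only a small fraction of the index.
    return indexMaxDoc >= 10_000 && (float) segmentMaxDoc / indexMaxDoc >= 0.03f;
  }
}
```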

Here are some of my cache stats:
• _size_in_bytes: 18428944
• total_count: 5121219796
• hit_count: 48890920
• miss_count: 5072328876
• cache_size: 10147
• cache_count: 232641
• evictions: 22249

@jfenc91
Author

jfenc91 commented Jun 5, 2016

After looking into this more, my problem is an ineffective query caching policy.

Things get a bit better with the following:

  1. Using an XUsageTrackingQueryCachingPolicy with a configurable history size. It seems that boolean queries composed of more than 256 sub-queries simply can't be cached right now because of the length of the history (see the toy example after this list):

https://github.com/elastic/elasticsearch/blob/v2.3.2/core/src/main/java/org/apache/lucene/search/XUsageTrackingQueryCachingPolicy.java#L83

  2. Removing the guard against caching document sets that are very small. My complex queries tend to match very few results in the index, and it looks like there is a minimum-size requirement (smaller than .03% of the segment/shard/index, I'm not sure which). Regardless, I am failing that check:

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/QueryCachingPolicy.java#L95

  3. Removing the guard against caching on large segments/shards (not sure which). This seems to be another one of those guards that really prevents caching from working at the moments when it is needed most. Especially with roaring bitmaps, shards with tons of documents combined with queries that have very few hits are exactly where caching is at its best. Admittedly I understand this check the least, so maybe I am missing something and it wasn't really necessary to remove it.

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L613
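
To illustrate point 1 with a toy example (a plain deque standing in for the real usage history, with sizes shrunk for readability): if a single request pushes more distinct sub-queries through the history than it can hold, no sub-query is ever observed frequently enough to qualify for caching.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration of why a >256-clause boolean query starves a 256-entry usage history.
public class HistoryStarvationDemo {
  public static void main(String[] args) {
    final int historySize = 4;        // stands in for the 256-entry history
    final int distinctSubQueries = 5; // one more sub-query than the history can hold
    Deque<Integer> history = new ArrayDeque<>();

    for (int pass = 0; pass < 100; pass++) {          // the same request repeated 100 times
      for (int q = 0; q < distinctSubQueries; q++) {  // each pass touches every sub-query once
        if (history.size() == historySize) {
          history.removeFirst();                      // the oldest entry falls out of the history
        }
        history.addLast(q);
      }
    }
    // Sub-query 0 has already been evicted before each pass finishes, so its observed
    // frequency is 0 and a frequency-based policy will never decide to cache it.
    long freq = history.stream().filter(q -> q == 0).count();
    System.out.println("observed frequency of sub-query 0: " + freq);
  }
}
```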

Even after addressing these three issues, I found my cache was being swamped by simpler queries, and the stuff that was really important to cache wasn't surviving evictions. I could implement and propose changes to the existing query cache policy to better fit my use case, but at the end of the day it is really hard to make sure those changes won't end up hurting some other use case. I think this highlights the need for pluggable query cache policies.
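
For example, a pluggable policy at the Lucene layer could be as simple as this (Lucene 5.x signatures from memory; the field-name marker is a made-up convention for illustration, and ES does not expose a hook for this today):

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;

// Sketch of a custom caching policy: cache only the filters we know are expensive and
// heavily reused, and skip the usage-history heuristics entirely.
final class CacheMarkedFiltersPolicy implements QueryCachingPolicy {
  @Override
  public void onUse(Query query) {
    // no usage tracking needed: caching is decided purely by the marker below
  }

  @Override
  public boolean shouldCache(Query query, LeafReaderContext context) throws IOException {
    // "expensive_filter_field" is a hypothetical marker field used only in the big reusable filters
    return query.toString().contains("expensive_filter_field");
  }
}
```

Setting something like this via IndexSearcher.setQueryCachingPolicy would give roughly the kind of control I had in 1.7 with explicit _cache flags.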

What are your thoughts @clintongormley @jpountz?

More Notes:
After eliminating some locking in the LRU cache (I changed it to a ReentrantReadWriteLock) and replacing the logic of shouldCache() with something that better fits my use case, the 3-second latency spikes I see with real traffic on 1.7 now correspond to 14-second latency spikes on 2.3. The query equality checks are possibly slowing things down compared to the previous cache keys. In any case, it would be really awesome if there were a path to making future versions of Elasticsearch performant for my use case.

@longliveenduro

Very interesting. Our latency has also at least doubled since moving to ES 2.3.3, and we are using simpler filters than in your query. I hope there will be some progress.

@ben-manes

#16802 might be helpful here.

@longliveenduro

Thanks for the hint! Looks promising!

@amazoyer

amazoyer commented Feb 3, 2017

I have the same concern with the caching policy. It would be great if the caching policy were pluggable.
@jfenc91: did you find a workaround for your problem?

@jfenc91
Author

jfenc91 commented Feb 3, 2017

@amazoyer I'm still on 1.7.3. The best option I have come across is indexing (and then reindexing + updating) the query result with a query id. That said, ES is ridiculously fast and doesn't need query caching for the typical use case.
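
Roughly, the shape of that workaround with the Java client looks like this (index, field and id names are made up; the point is that the heavy filter is precomputed offline and the live query only does a cheap term lookup):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class PrecomputedFilterExample {
  // Offline, a batch job evaluates the heavy filter and writes its id onto every matching
  // document (e.g. into a "matched_query_ids" field), re-running whenever the filter
  // definition changes. At search time the heavy filter collapses to a term lookup.
  public static SearchResponse search(Client client, String userTerm) {
    return client.prepareSearch("myindex")
        .setQuery(QueryBuilders.boolQuery()
            .must(QueryBuilders.termQuery("matched_query_ids", "query_42")) // cheap precomputed part
            .must(QueryBuilders.matchQuery("body", userTerm)))              // the live part
        .execute()
        .actionGet();
  }
}
```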

@kainosnoema

kainosnoema commented Mar 15, 2017

We were also just bitten by the caching changes in 2.x+ on a production workload. We're still trying to find a solution, but it's looking like ES just can't do what it used to here. Queries that include about a half-dozen cached terms filters with up to hundreds of terms each went from ~80ms to 8-9 seconds, which is no longer fast enough to run in real time.

Are there any plans to bring some level of caching control back?

@setaou

setaou commented Mar 15, 2017

We hit what seemed to be the same bug at the end of 2016. Having found no solution, we tried to fork ES and Lucene to bring back the _cache query parameter in order to better control the caching (see https://github.com/geneanet/elasticsearch/commits/geneanet-cachetuning and https://github.com/geneanet/lucene-solr/commits/geneanet-cachetuning for reference). It did not improve the situation, so either our patch was not really effective or the problem is more complex than just adjusting what should or should not be cached.

Currently, our only solution is to keep using ES 1.7, as ES 5 has, for our use cases, slightly lower performance and much greater resource usage (about 3x more CPU).

@jpountz
Contributor

jpountz commented Mar 15, 2017

Are there any plans to bring some level of caching control back?

We made and will keep making changes that improve caching; however, we have no plans to bring back cache control options.

It did not improve the situation, so either our patch was not really effective or the problem was more complex than just adjusting what should be cached or not.

This is very likely indeed. ES 1.7 relied heavily on caching filters into bitsets in order to run conjunctions using bitwise operations, but in a way that did not fit well with how Lucene performs query planning. In ES 2.0 we changed cache entries to use a sparse representation and conjunctions to run more consistently, in a way that is easier to reason about but does not use bitwise operations. As a consequence, some queries might indeed perform slower (while other queries perform faster), and just changing the query caching heuristics is not enough to restore the 1.7 behaviour.
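
To sketch the difference with toy doc-id sets (illustrative only, not the actual Lucene internals):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.util.FixedBitSet;

public class ConjunctionStylesDemo {
  public static void main(String[] args) {
    // 1.7-style: both filters are cached as bitsets, so the conjunction is a bitwise AND.
    FixedBitSet a = new FixedBitSet(1_000_000);
    FixedBitSet b = new FixedBitSet(1_000_000);
    a.set(3); a.set(70_000); a.set(900_001);
    b.set(3); b.set(900_001);
    a.and(b); // word-at-a-time intersection of the two cached bitsets

    // 2.x-style: sorted doc-id streams are intersected by leapfrogging (sketched with arrays).
    int[] left = {3, 70_000, 900_001};
    int[] right = {3, 900_001};
    List<Integer> matches = new ArrayList<>();
    int i = 0, j = 0;
    while (i < left.length && j < right.length) {
      if (left[i] == right[j]) { matches.add(left[i]); i++; j++; }
      else if (left[i] < right[j]) i++; // advance whichever iterator is behind
      else j++;
    }
    System.out.println("bitset cardinality: " + a.cardinality() + ", leapfrog matches: " + matches);
  }
}
```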

@jfenc91
Author

jfenc91 commented Mar 16, 2017

I got 75% of the way back to the old performance when I ensured that the cache was used optimally, as if I were caching explicitly. The remaining bit may be obtainable by optimizing out the query comparison (since that is very expensive for the use case where caching matters) and instead only comparing provided cache keys.

Honestly, I believe the situations ES now targets for caching are ones where things didn't need to be cached in the first place. I would love to see ES go back in the direction of at least offering explicit caching, because automatic caching is really hard to get right and it is probably not possible to arrive at an effective automatic caching solution by only leveraging the Lucene layer.

@clintongormley
Contributor

Keep upgrading: things have improved greatly in 5.x, and we have more optimizations that will be released shortly.

@kainosnoema

@clintongormley thanks, that's actually our current strategy. We're rewriting all our queries to be 5.x compatible and will be upgrading later this week or early next. I'll update here with the results.

@jpountz
Contributor

jpountz commented Mar 31, 2017

There have been a number of changes that should make things better in 5.3/5.4, so I'm going to close this issue. I'm not saying it will now perform on par with 1.7; some queries are still going to be slower and others faster. When facing slow queries, it is also important to look at whether documents are modeled in a good way: the best speedups are obtained by modeling documents in a way that makes queries simpler. Here is a subset of changes that have helped filter caching over the last months:

In addition to that, there have been efforts to make queries faster when they are not cached, in particular range (https://issues.apache.org/jira/browse/LUCENE-7643) and nested (https://issues.apache.org/jira/browse/LUCENE-7654).

@jpountz jpountz closed this as completed Mar 31, 2017