Cached query/filter performance degradation from 1.7.3 to 2.3.2 #18680


Closed

jfenc91 opened this issue Jun 1, 2016 · 19 comments
Labels
feedback_needed :Search/Search Search-related issues that do not fall into other categories

Comments

@jfenc91

jfenc91 commented Jun 1, 2016

Elasticsearch version: 2.3.2

JVM version: 1.8.0_91-b14

OS version: 3.13.0-62-generic #102-Ubuntu

I have a set of 300 queries that are submitted concurrently. When tested on a 1.7.3 cluster the queries complete in less than 2 seconds, whereas on a 2.3.2 cluster with the same hardware they take 60+ seconds to complete.

Here are the hot threads captured during the queries on the 2.3.2 cluster:

Using the default store:
https://gist.github.com/jfenc91/6c8aec1dcca232be533743528aa544a2

Using the mmap store (I have enough RAM for everything to be in memory):
https://gist.github.com/jfenc91/85d30a17a5684a2a2d7b0e9e8924ba0f

Before the filter cache is checked, there is apparently a call to query.createWeight that can get pretty expensive: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L904-L907

Hiding the complexity of queries was part of the awesomeness of cached filters/queries. It seems that with these changes, some of the query evaluation, where terms are looked up against the index, still happens even though the query results are cached, because of that call to createWeight: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/PhraseQuery.java#L390-L395
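
To make the order of operations concrete, here is a rough Lucene 5.x sketch (index path, field names and cache sizes are made up, and it only mirrors, not reproduces, what ES does internally): even when the cache already holds an entry for this query, the searcher still builds the Weight up front, and for a phrase query that is exactly where the terms dictionary gets seeked.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;
import org.apache.lucene.store.FSDirectory;

public class CachedPhraseQueryDemo {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Roughly how the Lucene query cache gets wired up (sizes are illustrative).
      searcher.setQueryCache(new LRUQueryCache(10_000, 250 * 1024 * 1024));
      searcher.setQueryCachingPolicy(new UsageTrackingQueryCachingPolicy());

      Query phrase = new PhraseQuery("body", "quick", "brown", "fox");
      Query query = new ConstantScoreQuery(phrase); // non-scoring, so it is cacheable

      searcher.search(query, 10); // may populate the cache after enough reuse
      searcher.search(query, 10); // even on a cache hit, createWeight() runs first,
                                  // and PhraseQuery.createWeight seeks the terms dict
    }
  }
}
```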

I feel like extremely complex queries that change only occasionally were one of the best use cases for caching, but that doesn't really seem to work with the cache rework that came with ES 2.x. If there are intended alternative ways to handle this, I would love to know. Thanks for reading!

@jimczi
Contributor

jimczi commented Jun 1, 2016

Can you share the queries you're running? The createWeight call you are mentioning is supposed to be very fast, but it seems that you have phrase queries with a lot of terms. What is the average number of terms in your phrase queries?

@clintongormley clintongormley added the feedback_needed, :Search/Search and :Cache labels Jun 1, 2016
@clintongormley
Contributor

And is this a real-world test case? Caching is based on query expense and reuse. Does the test reflect that?

@jfenc91
Author

jfenc91 commented Jun 1, 2016

Thanks for taking a look @jimferenczi @clintongormley !

This is a real world use case. The traffic against this cluster was replicated from a production setting.

I have confirmed that caching is being used to some extent by checking the cluster metrics. Additionally, I previously had to debug an issue where I set indices.queries.cache.count too high, which caused slow cache lookups; currently I have it set at 500,000, which does not seem to cause any issues. If the cache were not being used, I would expect to see many more stack traces related to query evaluation (which I did see the first few times the queries were executed on the cluster).

Unfortunately the exact query is not mine to share, but here is an anonymized sample version. Admittedly it is still structured for 1.7.3 caching. However, when I was profiling with hot threads I did account for the fact that a query needs a certain amount of reuse before it ends up in the LRU cache.

https://gist.github.com/jfenc91/240616ff625bf4d60957c42a6f7fc0f2

In total the query has somewhere between 1k and 1.5k terms. The phrase queries range from 1 to 12 tokens and are typically no more than 5. The parts of this query that are intended for caching get reused a few hundred times a day and tend to change anywhere from a few times a week to a few times a month.

Some other stats:
Number of indices: 7
Shards per index: 5
Total segments at time of test: ~350
Available cores on data nodes: 160

Cheers!

@jpountz
Contributor

jpountz commented Jun 1, 2016

I think you are right: we should try to avoid seeking the terms dict when we are going to use a cached entry anyway.

@jpountz
Contributor

jpountz commented Jun 2, 2016

I opened https://issues.apache.org/jira/browse/LUCENE-7311 for discussion.

@jfenc91
Author

jfenc91 commented Jun 2, 2016

Thanks for jumping in here @jpountz!

So I did a quick hack to make the call to createWeight lazier:
jfenc91/lucene-solr@release_5.5.0...jfenc91:feature_queryCacheHack
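
The idea, as a rough standalone sketch (this is not the code in that branch; the HashMap here is just a stand-in for LRUQueryCache): only build the real Weight, and therefore only seek the terms dictionary, when a segment actually misses the cache.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Weight;

// Illustrative only: defer the expensive createWeight until a segment misses the cache.
final class LazyCachedQuery {
  private final Query query;
  private final IndexSearcher searcher;
  private final Map<Object, DocIdSet> cache = new HashMap<>(); // stand-in for LRUQueryCache
  private Weight delegate; // built lazily, on the first cache miss

  LazyCachedQuery(Query query, IndexSearcher searcher) {
    this.query = query;
    this.searcher = searcher;
  }

  private Weight delegate() throws IOException {
    if (delegate == null) {
      // The expensive part: for a PhraseQuery this is where the terms dictionary is seeked.
      delegate = searcher.createNormalizedWeight(query, false);
    }
    return delegate;
  }

  DocIdSetIterator iterator(LeafReaderContext ctx) throws IOException {
    DocIdSet cached = cache.get(ctx.reader().getCoreCacheKey());
    if (cached != null) {
      return cached.iterator(); // cache hit: createWeight never runs for this segment
    }
    // Cache miss: pay for createWeight now and iterate the segment
    // (in Lucene 5.x a Scorer is itself a DocIdSetIterator; populating the cache is omitted).
    return delegate().scorer(ctx);
  }
}
```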

Unfortunately, even with that change, the caching is not competitive with the 1.7.3 version. Here is what I am now getting out of hot threads:
https://gist.github.com/jfenc91/88a3179cf3c3136964c1a8167fbf8b73

So I am left with a few questions/thoughts

  1. I am seeing a bunch of LRUQueryCache.java:643, which in my fork happens when shouldCache(context) == false. So for some reason some parts of the query are not being cached. Since I am fairly certain I am meeting the LRU usage requirement, I am less sure about this line: ReaderUtil.getTopLevelContext(context).reader().maxDoc(). Is that basically the number of docs that the query is filtering against, or the number of docs we would get if the query were “*” instead of some other term? (See the sketch after this list.)

  2. In 1.7.3 I specified what needed to be cached, so my query with 1.5k terms resulted in one cached result. Here it seems like Lucene is decomposing the query and building a cache entry for individual portions of it? Should the top-level boolean query be cached as well? If so, is the idea that the other portions of the query will eventually get LRU-ed out of the cache and we will just be left with the bitset for the top-level boolean query?
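
For reference on question 1, my current reading of the 5.5 source, paraphrased from memory (so constants and method shape are approximate, not the exact Lucene code): that maxDoc is the document count of the whole top-level (shard-level) reader, i.e. roughly what a match_all would see, not the number of hits for the query, and the guard compares the segment's size against it.

```java
import org.apache.lucene.index.IndexReaderContext;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;

// Paraphrase of the segment-size guard as I read it; treat the constants as approximate.
final class SegmentSizeGuard {
  static boolean shouldCacheOnThisSegment(LeafReaderContext context) {
    IndexReaderContext top = ReaderUtil.getTopLevelContext(context);
    int indexMaxDoc = top.reader().maxDoc();       // total docs in the shard-level reader (a "*"-style count)
    int segmentMaxDoc = context.reader().maxDoc(); // docs in this one segment
    // Skip tiny indices entirely, and skip segments holding only a small fraction of the index.
    return indexMaxDoc >= 10_000 && (float) segmentMaxDoc / indexMaxDoc >= 0.03f;
  }
}
```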

Here are some of my cache stats:
• _size_in_bytes: 18428944
• total_count: 5121219796
• hit_count: 48890920
• miss_count: 5072328876
• cache_size: 10147
• cache_count: 232641
• evictions: 22249

@jfenc91
Author

jfenc91 commented Jun 5, 2016

After looking into this more, my problem is an ineffective query caching policy.

Things get a bit better with the following:

  1. Using an XUsageTrackingQueryCachingPolicy with a configurable history size. It seems that boolean queries composed of more than 256 sub-queries simply can't be cached right now because of the length of the history (see the toy example after this list):

https://github.com/elastic/elasticsearch/blob/v2.3.2/core/src/main/java/org/apache/lucene/search/XUsageTrackingQueryCachingPolicy.java#L83

  2. Removing the guard against caching document sets that are very small. My complex queries tend to match very few results in the index, and it looks like there is a minimum-size requirement (smaller than .03% of the segment/shard/index, I'm not sure which). Regardless, I am failing that check:

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/QueryCachingPolicy.java#L95

  3. Removing the guard against caching on large segments/shards (not sure which). This seems to be another one of those guards that really prevents caching from working at the moments when it is needed most. Especially with roaring bitmaps, shards with tons of documents combined with queries that have very few hits are exactly where caching is at its best. Admittedly I understand this check the least, so maybe I am missing something and it wasn't really necessary to remove it.

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L613
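
To illustrate point 1 with a toy example (a plain deque standing in for the real usage history, with sizes shrunk for readability): if a single request pushes more distinct sub-queries through the history than it can hold, no sub-query is ever observed frequently enough to qualify for caching.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy illustration of why a >256-clause boolean query starves a 256-entry usage history.
public class HistoryStarvationDemo {
  public static void main(String[] args) {
    final int historySize = 4;        // stands in for the 256-entry history
    final int distinctSubQueries = 5; // one more sub-query than the history can hold
    Deque<Integer> history = new ArrayDeque<>();

    for (int pass = 0; pass < 100; pass++) {          // the same request repeated 100 times
      for (int q = 0; q < distinctSubQueries; q++) {  // each pass touches every sub-query once
        if (history.size() == historySize) {
          history.removeFirst();                      // the oldest entry falls out of the history
        }
        history.addLast(q);
      }
    }
    // Sub-query 0 has already been evicted before each pass finishes, so its observed
    // frequency is 0 and a frequency-based policy will never decide to cache it.
    long freq = history.stream().filter(q -> q == 0).count();
    System.out.println("observed frequency of sub-query 0: " + freq);
  }
}
```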

Even after addressing these three issues, I found my cache was being swamped by simpler queries, and the stuff that was really important to cache wasn't surviving evictions. I could implement and propose changes to the existing query cache policy to better fit my use case, but at the end of the day it is really hard to make sure those changes won't end up hurting some other use case. I think this highlights the need for pluggable query cache policies.
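
For example, a pluggable policy at the Lucene layer could be as simple as this (Lucene 5.x signatures from memory; the field-name marker is a made-up convention for illustration, and ES does not expose a hook for this today):

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;

// Sketch of a custom caching policy: cache only the filters we know are expensive and
// heavily reused, and skip the usage-history heuristics entirely.
final class CacheMarkedFiltersPolicy implements QueryCachingPolicy {
  @Override
  public void onUse(Query query) {
    // no usage tracking needed: caching is decided purely by the marker below
  }

  @Override
  public boolean shouldCache(Query query, LeafReaderContext context) throws IOException {
    // "expensive_filter_field" is a hypothetical marker field used only in the big reusable filters
    return query.toString().contains("expensive_filter_field");
  }
}
```

Setting something like this via IndexSearcher.setQueryCachingPolicy would give roughly the kind of control I had in 1.7 with explicit _cache flags.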

What are your thoughts @clintongormley @jpountz?

More Notes:
After eliminating some locking in the LRU cache (I changed it to a ReentrantReadWriteLock) and replacing the logic of shouldCache() with something that better fits my use case, the 3-second latency spikes I see with real traffic on 1.7 now correspond to 14-second latency spikes on 2.3. The query equality checks are possibly slowing things down compared to the previous cache keys. In any case, it would be really awesome if there were a path to making future versions of Elasticsearch performant for my use case.

@longliveenduro

Very interesting. Our latency has also at least doubled since moving to ES 2.3.3, and we are using simpler filters than in your query. I hope there will be some progress.

@ben-manes

#16802 might be helpful here.

@longliveenduro

Thanks for the hint! Looks promising!

@amazoyer

amazoyer commented Feb 3, 2017

I have the same concern with the caching policy. It would be great if the caching policy were pluggable.
@jfenc91: did you find a workaround for your problem?

@jfenc91
Author

jfenc91 commented Feb 3, 2017

@amazoyer I'm still on 1.7.3. The best option I have come across is indexing (and then reindexing + updating) the query result with a query id. That said, ES is ridiculously fast and doesn't need query caching for the typical use case.
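
Roughly, the shape of that workaround with the Java client looks like this (index, field and id names are made up; the point is that the heavy filter is precomputed offline and the live query only does a cheap term lookup):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class PrecomputedFilterExample {
  // Offline, a batch job evaluates the heavy filter and writes its id onto every matching
  // document (e.g. into a "matched_query_ids" field), re-running whenever the filter
  // definition changes. At search time the heavy filter collapses to a term lookup.
  public static SearchResponse search(Client client, String userTerm) {
    return client.prepareSearch("myindex")
        .setQuery(QueryBuilders.boolQuery()
            .must(QueryBuilders.termQuery("matched_query_ids", "query_42")) // cheap precomputed part
            .must(QueryBuilders.matchQuery("body", userTerm)))              // the live part
        .execute()
        .actionGet();
  }
}
```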

@kainosnoema

kainosnoema commented Mar 15, 2017

We were also just bitten by the caching changes in 2.x+ on a production workload. We're still trying to find a solution, but it's looking like ES just can't do what it used to here. Queries that include about a half-dozen cached terms filters with up to hundreds of terms each went from ~80ms to 8-9 seconds, which is no longer fast enough to run in real time.

Are there any plans to bring some level of caching control back?

@setaou

setaou commented Mar 15, 2017

We hit what seemed to be the same bug at the end of 2016. Having found no solution, we tried to fork ES and Lucene to bring back the _cache query parameter in order to better control the caching (see https://github.com/geneanet/elasticsearch/commits/geneanet-cachetuning and https://github.com/geneanet/lucene-solr/commits/geneanet-cachetuning for reference). It did not improve the situation, so either our patch was not really effective or the problem is more complex than just adjusting what should or should not be cached.

Currently, our only solution is to keep using ES 1.7, as ES 5 has, for our use cases, slightly lower performance and much greater resource usage (about 3x more CPU).

@jpountz
Contributor

jpountz commented Mar 15, 2017

Are there any plans to bring some level of caching control back?

We made and will keep making changes that improve caching; however, we have no plans to bring back cache control options.

It did not improve the situation, so either our patch was not really effective or the problem was more complex than just adjusting what should be cached or not.

This is very likely indeed. ES 1.7 relied heavily on caching filters into bitsets in order to run conjunctions using bitwise operations, but in a way that did not fit well with how Lucene performs query planning. In ES 2.0 we changed cache entries to use a sparse representation and conjunctions to run more consistently, in a way that is easier to reason about but does not use bitwise operations. As a consequence, some queries might indeed perform slower (while other queries perform faster), and just changing the query caching heuristics is not enough to restore the 1.7 behaviour.
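
To sketch the difference with toy doc-id sets (illustrative only, not the actual Lucene internals):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.util.FixedBitSet;

public class ConjunctionStylesDemo {
  public static void main(String[] args) {
    // 1.7-style: both filters are cached as bitsets, so the conjunction is a bitwise AND.
    FixedBitSet a = new FixedBitSet(1_000_000);
    FixedBitSet b = new FixedBitSet(1_000_000);
    a.set(3); a.set(70_000); a.set(900_001);
    b.set(3); b.set(900_001);
    a.and(b); // word-at-a-time intersection of the two cached bitsets

    // 2.x-style: sorted doc-id streams are intersected by leapfrogging (sketched with arrays).
    int[] left = {3, 70_000, 900_001};
    int[] right = {3, 900_001};
    List<Integer> matches = new ArrayList<>();
    int i = 0, j = 0;
    while (i < left.length && j < right.length) {
      if (left[i] == right[j]) { matches.add(left[i]); i++; j++; }
      else if (left[i] < right[j]) i++; // advance whichever iterator is behind
      else j++;
    }
    System.out.println("bitset cardinality: " + a.cardinality() + ", leapfrog matches: " + matches);
  }
}
```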

@jfenc91
Author

jfenc91 commented Mar 16, 2017

I got 75% of the way back to the old performance when I ensured that the cache was used optimally, as if I were caching explicitly. The remaining bit may be obtainable by optimizing out the query comparison (since that is very expensive for the use case where caching matters) and instead only comparing provided cache keys.

Honestly, I believe the situations ES now targets for caching are ones where things didn't need to be cached in the first place. I would love to see ES go back in the direction of at least offering explicit caching, because automatic caching is really hard to get right and it is probably not possible to arrive at an effective automatic caching solution by only leveraging the Lucene layer.

@clintongormley
Contributor

Keep upgrading: things have improved greatly in 5.x, and we have more optimizations that will be released shortly.

@kainosnoema

@clintongormley thanks, that's actually our current strategy. We're rewriting all our queries to be 5.x compatible and will be upgrading later this week or early next. I'll update here with the results.

@jpountz
Contributor

jpountz commented Mar 31, 2017

There have been a number of changes that should make things better in 5.3/5.4, so I'm going to close this issue. I'm not saying it will now perform on par with 1.7; some queries are still going to be slower and others faster. When facing slow queries, it is also important to look at whether documents are modeled in a good way: the best speedups are obtained by modeling documents in a way that makes queries simpler. Here is a subset of changes that have helped filter caching over the last months:

In addition to that, there have been efforts to make queries faster when they are not cached, in particular range (https://issues.apache.org/jira/browse/LUCENE-7643) and nested (https://issues.apache.org/jira/browse/LUCENE-7654).

@jpountz jpountz closed this as completed Mar 31, 2017