Cached query/filter performance degradation from 1.7.3 to 2.3.2 #18680
Can you share the queries you're running? The createWeight you are mentioning is supposed to be very fast, but it seems that you have phrase queries with a lot of terms. What is the average number of terms in your phrase queries? |
And is this a real-world test case? Caching is based on query expense and reuse. Does the test reflect that? |
Thanks for taking a look @jimferenczi @clintongormley! This is a real-world use case: the traffic against this cluster was replicated from a production setting. I have confirmed through the cluster metrics that caching is being used to some extent. Additionally, I had to debug an issue where I set indices.queries.cache.count too high and caused slow cache lookups; currently I have it set at 500,000, which does not seem to cause any issues. If this were not using the cache, I would expect to see many more stack traces involving query evaluation (which I did see the first few times the query was executed on the cluster). Unfortunately the exact query is not mine to share, but here is an anonymized sample version. Admittedly it is still structured for 1.7.3 caching. However, when I was profiling with hot threads I did account for the fact that a query must be reused a certain number of times before it ends up in the LRU cache. https://gist.github.com/jfenc91/240616ff625bf4d60957c42a6f7fc0f2 In total the query has somewhere between 1k and 1.5k terms. The phrase queries range from 1 to 12 tokens and are typically no more than 5. The parts of this query that are indicated for caching get reused a few hundred times a day and tend to change somewhere between a few times a week and a few times a month. Also some other stats. Cheers! |
I think you are right: we should try to avoid seeking the terms dict when we are going to use a cached entry anyway. |
I opened https://issues.apache.org/jira/browse/LUCENE-7311 for discussion. |
Thanks for jumping in here @jpountz! So I did a quick hack to make the call to createWeight more lazy. Unfortunately, even with that change, the caching is not competitive with the 1.7.3 version. Here is what I am now getting out of hot threads: So I am left with a few questions/thoughts
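[Editor's note: a minimal sketch of the lazy-initialization idea described above, in plain Java rather than against the real Lucene Weight API. The Lazy class and its names are illustrative only; the point is that the expensive factory (createWeight in the real code) runs on first use, so a cache hit never pays for it.]

```java
import java.util.function.Supplier;

// Sketch of deferring an expensive step until first use. In the real
// scenario the wrapped factory would be the createWeight call, which a
// cache hit would then never need to invoke.
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> factory; // expensive step, deferred
    private T value;

    Lazy(Supplier<T> factory) { this.factory = factory; }

    @Override
    public synchronized T get() {
        if (factory != null) {   // first call pays the cost, exactly once
            value = factory.get();
            factory = null;
        }
        return value;
    }
}
```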
Here are some of my cache stats: |
After looking into this more, my problem is an ineffective query caching policy. Things get a bit better with the following:
Even after addressing these three issues, I found my cache was being swamped by simpler queries, and the things that were really important to cache weren't surviving evictions. I could implement and propose changes to the existing query cache policy to better fit my use case, but at the end of the day it is really hard to make sure those changes won't end up hurting some other use case. I think this highlights the need for pluggable query cache policies. What are your thoughts @clintongormley @jpountz? More notes: |
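[Editor's note: for illustration, here is a minimal sketch of what a pluggable policy could look like: only cache a filter after it has been used some tunable number of times. The class and method names are hypothetical and do not implement Lucene's actual QueryCachingPolicy interface.]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical pluggable caching policy: cache a filter only once it has
// been used `minUses` times, so rarely-reused simple queries cannot swamp
// the cache. Names are illustrative, not a Lucene API.
final class ThresholdCachingPolicy {
    private final int minUses;
    private final Map<String, Integer> useCounts = new ConcurrentHashMap<>();

    ThresholdCachingPolicy(int minUses) { this.minUses = minUses; }

    void onUse(String queryKey) {
        useCounts.merge(queryKey, 1, Integer::sum);
    }

    boolean shouldCache(String queryKey) {
        return useCounts.getOrDefault(queryKey, 0) >= minUses;
    }
}
```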
Very interesting. Our latency has also at least doubled since moving to ES 2.3.3, and we are using simpler filters than in your query. I hope there will be some progress. |
#16802 might be helpful here. |
Thanks for the hint! Looks promising! |
I have the same concern with the caching policy. It would be great if the caching policy were pluggable. |
@amazoyer I'm still on 1.7.3. The best option I have come across is indexing (and then reindexing + updating) the query results with a query id, although ES is ridiculously fast and doesn't need query caching for the typical use case. |
We were also just bitten by the caching changes in 2.x+ on a production workload. We are still trying to find a solution, but it's looking like ES just can't do what it used to here. Queries that include about a half-dozen cached terms filters with up to hundreds of terms each went from ~80ms to 8-9 seconds, so they can no longer be run in real time. Are there any plans to bring some level of caching control back? |
We hit what seemed to be the same bug at the end of 2016. Having found no solution, we tried forking ES + Lucene to bring back the _cache query parameter in order to better control the caching (see https://github.com/geneanet/elasticsearch/commits/geneanet-cachetuning and https://github.com/geneanet/lucene-solr/commits/geneanet-cachetuning for reference). It did not improve the situation, so either our patch was not really effective or the problem was more complex than just adjusting what should and should not be cached. Currently, our only solution is to keep using ES 1.7, as for our use cases ES 5 has slightly lower performance and much greater resource usage (about 3x more CPU). |
We have made, and will keep making, changes that improve caching; however, we have no plans to bring back cache-control options.
This is very likely indeed. ES 1.7 relied heavily on caching filters into bitsets so that conjunctions could be run using bitwise operations, but in a way that did not fit well with how Lucene performs query planning. In ES 2.0 we changed cache entries to use a sparse representation and conjunctions to run more consistently, in a way that is easier to reason about but does not use bitwise operations. As a consequence, some queries might indeed be slower (while others are faster), and just changing the query caching heuristics is not enough to restore the 1.7 behaviour. |
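[Editor's note: to make the contrast concrete, here is a small illustrative sketch in plain Java, not Lucene code, of the two conjunction strategies being described: intersecting dense bitsets with bitwise AND, 64 documents per operation, versus leapfrogging two sorted doc-id lists, which touches only candidate documents.]

```java
// Plain-Java sketch of the two conjunction strategies discussed above.
final class ConjunctionSketch {
    // Dense (1.x-style cache entries): one long covers 64 doc ids,
    // so intersection is a bitwise AND, word by word.
    static long[] andBitsets(long[] a, long[] b) {
        long[] out = new long[Math.min(a.length, b.length)];
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] & b[i];
        }
        return out;
    }

    // Sparse (2.x-style cache entries): leapfrog intersection of two
    // sorted doc-id arrays, writing matches into `out`.
    static int intersect(int[] a, int[] b, int[] out) {
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return n; // number of matching doc ids
    }
}
```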
I got 75% of the way back to the old performance when I ensured that the cache was used optimally, as if I were explicitly caching. The remaining gap may be closable by optimizing out the query comparison (which is very expensive in exactly the use case where caching matters) and instead comparing only user-provided cache keys. Honestly, I believe the situations ES now targets for caching are ones where things didn't need to be cached in the first place. I would love to see ES go back in the direction of at least offering explicit caching, because automatic caching is really hard to get right, and it is probably not possible to reach an effective automatic caching solution by leveraging only the Lucene layer. |
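[Editor's note: a sketch of the explicit-cache-key idea mentioned above, using hypothetical names (KeyedFilterCache is not an Elasticsearch or Lucene class): lookups hash a short caller-supplied key instead of recursively comparing a query tree with ~1k terms, and the expensive filter is only built on a miss.]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical keyed filter cache: O(1) on a short string key, so the
// large query object is never hashed or equality-checked on a hit.
final class KeyedFilterCache<V> {
    private final Map<String, V> entries = new ConcurrentHashMap<>();

    V computeIfAbsent(String cacheKey, Supplier<V> buildFilter) {
        // The expensive filter construction only runs on a cache miss.
        return entries.computeIfAbsent(cacheKey, k -> buildFilter.get());
    }
}
```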
Keep upgrading - things have improved greatly in 5.x, and we have more optimizations which will be released shortly. |
@clintongormley thanks, that's actually our current strategy. We're re-writing all our queries to be 5.x compatible and will be upgrading later this week or early next. I'll update here with the results. |
There have been a number of changes that should make things better in 5.3/5.4, so I'm going to close this issue. I'm not saying it will now perform on par with 1.7; some queries are still going to be slower and others faster. When facing slow queries, it is also important to look at whether documents are modeled in a good way: the best speedups are obtained by modeling documents in a way that makes queries simpler. Here is a subset of changes that have helped filter caching over the last months:
In addition to that, there have been efforts to make queries faster when they are not cached, in particular |
Elasticsearch version: 2.3.2
JVM version: 1.8.0_91-b14
OS version: 3.13.0-62-generic #102-Ubuntu
I have a set of 300 queries that are submitted concurrently. When tested on a 1.7.3 cluster, the queries complete in less than 2 seconds, whereas on a 2.3.2 cluster with the same hardware they take 60+ seconds to complete.
Here are the hot threads captured during the queries on the 2.3.2 cluster:
Using the default store:
https://gist.github.com/jfenc91/6c8aec1dcca232be533743528aa544a2
Using the mmap store (I have enough RAM for everything to fit in memory):
https://gist.github.com/jfenc91/85d30a17a5684a2a2d7b0e9e8924ba0f
Before the filter cache is checked, there is apparently a call to query.createWeight that can get pretty expensive: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L904-L907
Hiding the complexity of queries was part of the awesomeness of cached filters/queries. It seems that, with these changes, some of the evaluation of the query, where terms are looked up against the index, happens even when the query results are cached, because of that call to createWeight: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/PhraseQuery.java#L390-L395
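[Editor's note: restating the linked code as a simplified sketch, based on the Lucene 5.5 sources linked above and not a drop-in replacement: the weight, including any terms-dictionary seeks done inside createWeight, is built before the query cache gets a chance to wrap it.]

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCache;
import org.apache.lucene.search.QueryCachingPolicy;
import org.apache.lucene.search.Weight;

// Simplified restatement of the Lucene 5.5 flow linked above: createWeight
// (where a PhraseQuery seeks the terms dictionary) runs before the cache
// wraps the weight, so even a query with a warm cache pays for it up front.
final class CreateWeightFlow {
    static Weight weightFor(IndexSearcher searcher, Query query, QueryCache cache,
                            QueryCachingPolicy policy, boolean needsScores) throws IOException {
        Weight weight = query.createWeight(searcher, needsScores); // terms dict seeks happen here
        if (needsScores == false && cache != null) {
            weight = cache.doCache(weight, policy); // caching only wraps the already-built weight
        }
        return weight;
    }
}
```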
I feel like extremely complex queries that change only occasionally were one of the best use cases for caching, but that doesn't really seem to work with the cache rework that came with ES 2.x. If there are intended alternative ways to achieve this, I would love to know. Thanks for reading!