Do not cache term queries. #21566
Conversation
There have been reports that the query cache did not manage to speed up search requests when the query includes a large number of different sub queries, since a single request may manage to exhaust the whole history (256 queries) while the query cache only starts caching queries once they appear multiple times in the history (#16031). On the other hand, increasing the size of the query cache is a bit controversial (#20116), so this pull request proposes a different approach: never cache term queries, and do not add them to the history of queries either. The reasoning is that these queries should be fast anyway, regardless of caching, so taking them out of the equation should not cause any slowdown. Meanwhile, the fact that they are no longer added to the cache history means that other queries have greater chances of being cached.
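The behaviour described above can be modelled with a minimal Python sketch. This is not the actual Lucene `UsageTrackingQueryCachingPolicy`, just an illustration under simplifying assumptions: queries are plain strings, the minimum frequency is a fixed constant (in the real policy it depends on the query type), and term queries are kept out of both the cache and the history.

```python
from collections import deque

HISTORY_SIZE = 256   # size of the recently-seen-queries ring buffer
MIN_FREQUENCY = 2    # illustrative threshold before a query is cached

class CachingPolicy:
    """Toy frequency-based caching policy modelling the PR's behaviour."""

    def __init__(self, cache_term_queries=False):
        # Old entries fall off automatically once the history is full.
        self.history = deque(maxlen=HISTORY_SIZE)
        self.cache_term_queries = cache_term_queries

    def on_use(self, query):
        # Term queries are cheap to re-execute, so they are neither cached
        # nor recorded: they no longer push other queries out of the history.
        if query.startswith("term:") and not self.cache_term_queries:
            return
        self.history.append(query)

    def should_cache(self, query):
        if query.startswith("term:") and not self.cache_term_queries:
            return False
        return self.history.count(query) >= MIN_FREQUENCY

policy = CachingPolicy()
policy.on_use("term:user_id=42")
policy.on_use("term:user_id=42")
print(policy.should_cache("term:user_id=42"))   # False: term queries are skipped

policy.on_use("range:price<100")
policy.on_use("range:price<100")
print(policy.should_cache("range:price<100"))   # True: seen twice in the history
```

The escape-hatch setting below corresponds to the `cache_term_queries` flag in this sketch: flipping it restores the old behaviour where term queries compete for history slots like any other query.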
Can we get a simple unit test that exercises our policy wrapper?
```java
// This setting is an escape hatch in case not caching term queries would slow some users down
// Do not document.
public static final Setting<Boolean> INDEX_QUERY_CACHE_TERM_QUERIES_SETTING =
    Setting.boolSetting("index.queries.cache.term_queries", true, Property.IndexScope);
```
It seems like it would make more sense if the setting being `false` meant that term queries were not cached. Right now, just looking at the name, it seems like setting it to `true` would mean that term queries are cached.
which is the right thing from a BWC perspective?
> which is the right thing from a BWC perspective?

I don't understand: not caching term queries is correct from the BWC perspective?
LGTM
```java
public void testDoesNotPutTermQueriesIntoTheHistory() {
    boolean[] used = new boolean[1];
```
Why the boolean array here?
I needed a wrapper around a boolean so that it could be modified from inside the anonymous class below. I used to use AtomicBoolean for that, but recently I have seen others do it with a one-element array, which I liked more, so I adopted the habit.
LGTM 2
@jpountz and I discussed backporting this to 2.4 without changing the default. It would give us some flexibility if users hit problems, and it's very, very low risk. It will also allow for more feedback regarding the setting. /cc @clintongormley
OK, I pushed the new setting to the 6.x, 5.x, 5.0 and 2.4 branches, the default being to cache term queries on 2.4 and 5.0 and to not cache them on 5.x and 6.x.
* master: (22 commits)
  * Add proper toString() method to UpdateTask (#21582)
  * Fix `InternalEngine#isThrottled` to not always return `false`. (#21592)
  * add `ignore_missing` option to SplitProcessor (#20982)
  * fix trace_match behavior for when there is only one grok pattern (#21413)
  * Remove dead code from GetResponse.java
  * Fixes date range query using epoch with timezone (#21542)
  * Do not cache term queries. (#21566)
  * Updated dynamic mapper section
  * Docs: Clarify date_histogram bucket sizes for DST time zones
  * Handle release of 5.0.1
  * Fix skip reason for stats API parameters test
  * Reduce skip version for stats API parameter tests
  * Strict level parsing for indices stats
  * Remove cluster update task when task times out (#21578)
  * [DOCS] Mention "all-fields" mode doesn't search across nested documents
  * InternalTestCluster: when restarting a node we should validate the cluster is formed via the node we just restarted
  * Fixed bad asciidoc in boolean mapping docs
  * Fixed bad asciidoc ID in node stats
  * Be strict when parsing values searching for booleans (#21555)
  * Fix time zone rounding edge case for DST overlaps
  * ...
OK, so just double-checking: the default is not to have this optimization enabled for 2.4? The release notes say something else...
Indeed, the 2.4 branch got the setting but kept the same default behaviour. I suspect the issue you mention with the release notes is that they use the title of this PR as a description?
@jpountz Likely so. Also, a bit of context on why I care about this one: see my comments in one of the original issues. In addition, we are currently seeing the following with 2.3.2 on a 15GB machine (7GB heap, default query cache settings):

```json
"query_cache": {
  "memory_size": "20.8mb",
  "memory_size_in_bytes": 21869016,
  "total_count": 1828019410,
  "hit_count": 66825483,
  "miss_count": 1761193927,
  "cache_size": 3666,
  "cache_count": 2960592,
  "evictions": 2956926
}
```

The high number of evictions and the low percentage of used memory suggest an issue, and I'm wondering whether we are experiencing what you've been describing here or something else.
The number of evictions might not be an issue: evictions happen naturally on read-write indices because of merges. So if your cluster has some read-write indices, this might just indicate that few filters meet the criteria required to be considered worth caching. If your indices are all read-only, however, then this is indeed concerning. The low memory usage, on the other hand, might indeed be related to this issue.
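As a quick sanity check, the ratios implied by the stats quoted earlier in the thread can be computed directly (where to draw the line for "concerning" is of course a judgment call):

```python
# Arithmetic on the query_cache stats quoted above (2.3.2, 7GB heap).
stats = {
    "total_count": 1828019410,
    "hit_count": 66825483,
    "miss_count": 1761193927,
    "cache_count": 2960592,
    "evictions": 2956926,
}

# total_count is the sum of hits and misses, so the hit ratio is well defined.
hit_ratio = stats["hit_count"] / stats["total_count"]
# Fraction of entries ever cached that have since been evicted.
eviction_ratio = stats["evictions"] / stats["cache_count"]

print(f"hit ratio:      {hit_ratio:.1%}")       # roughly 3.7%
print(f"eviction ratio: {eviction_ratio:.1%}")  # roughly 99.9%
```

A hit ratio under 4% with nearly every cached entry evicted is consistent with the scenario this PR describes: the history churns so fast that few queries ever reach the frequency threshold.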
@jpountz I'm using Elasticsearch 5.2.0. I see in your commit that one can turn term query caching back on via the `index.queries.cache.term_queries` setting, which I tried as shown here: https://gist.github.com/sundarv85/5aec5d8a50958bee5891e10d11b87fae But in the end I do not see the query_cache increase. Am I missing something here?