Do not cache term queries. #21566


Merged: 2 commits into elastic:master on Nov 16, 2016
Conversation

@jpountz (Contributor) commented Nov 15, 2016

There have been reports that the query cache did not manage to speed up search
requests when the query includes a large number of different sub-queries, since
a single request may exhaust the whole history (256 queries) while the query
cache only starts caching queries once they appear multiple times in the
history (#16031). On the other hand, increasing the size of the query cache is
a bit controversial (#20116), so this pull request proposes a different
approach: never cache term queries, and do not add them to the history of
queries either. The reasoning is that these queries should be fast anyway,
regardless of caching, so taking them out of the equation should not cause any
slowdown. At the same time, because they are no longer added to the cache
history, other queries have a greater chance of being cached.

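The approach described above can be sketched as a minimal plain-Java model (illustrative only; these are stand-in classes, not the actual Lucene `QueryCachingPolicy` or Elasticsearch API): a usage-tracking policy only caches queries it has seen repeatedly in its history, and the wrapper short-circuits term queries so that they are neither cached nor recorded.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins for the real Lucene query classes, for illustration only.
interface Query {}

final class TermQuery implements Query {
    final String term;
    TermQuery(String term) { this.term = term; }
    @Override public boolean equals(Object o) {
        return o instanceof TermQuery && ((TermQuery) o).term.equals(term);
    }
    @Override public int hashCode() { return term.hashCode(); }
}

// Stands in for any more expensive query type.
final class PhraseQuery implements Query {}

// Caches a query only once it has appeared at least twice in the history.
class UsageTrackingPolicy {
    final Map<Query, Integer> history = new HashMap<>();
    boolean shouldCache(Query query) {
        int timesSeen = history.merge(query, 1, Integer::sum);
        return timesSeen >= 2;
    }
}

// The policy from this PR: term queries are fast anyway, so they are neither
// cached nor added to the history, leaving history slots for costlier queries.
class TermQuerySkippingPolicy extends UsageTrackingPolicy {
    @Override boolean shouldCache(Query query) {
        if (query instanceof TermQuery) {
            return false; // skip entirely: no cache entry, no history entry
        }
        return super.shouldCache(query);
    }
}
```

Note the design point: because skipped term queries never enter the 256-slot history, they cannot evict the counts that other, more expensive queries need to accumulate before becoming cacheable.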
@s1monw (Contributor) commented Nov 15, 2016

Can we get a simple unit test that tests our policy wrapper?

// This setting is an escape hatch in case not caching term queries would slow some users down.
// Do not document.
public static final Setting<Boolean> INDEX_QUERY_CACHE_TERM_QUERIES_SETTING =
    Setting.boolSetting("index.queries.cache.term_queries", true, Property.IndexScope);
Member:

It seems like it would make more sense if the setting being false meant that term queries were not cached. Right now just looking at the name, it seems like setting it to true would mean that term queries are cached.

Contributor:

which is the right thing from a BWC perspective?

Member:

> which is the right thing from a BWC perspective?

I don't understand, not caching term queries is correct from the BWC perspective?

@jpountz (Contributor, Author) commented Nov 15, 2016

@s1monw @dakrone I pushed a commit.

@dakrone (Member) left a comment:

LGTM

}

public void testDoesNotPutTermQueriesIntoTheHistory() {
boolean[] used = new boolean[1];
Member:

Why the boolean array here?

Contributor (Author):

I needed a wrapper around a boolean so that it could be modified from inside the anonymous class below. I used to use AtomicBoolean for this, but I recently saw others do it with a one-element array, which I liked more, so I stole the habit.
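For readers unfamiliar with the trick: a local variable captured by an anonymous class (or lambda) must be effectively final in Java, so a one-element array works as a cheap mutable holder; `java.util.concurrent.atomic.AtomicBoolean` would do the same job. A minimal sketch (class and method names are illustrative):

```java
class BooleanHolderDemo {
    // Returns true: the anonymous class mutates the array cell, while the
    // `used` reference itself stays final, as Java requires for locals
    // captured by anonymous classes.
    static boolean flagWasSet() {
        final boolean[] used = new boolean[1];
        Runnable callback = new Runnable() {
            @Override public void run() { used[0] = true; }
        };
        callback.run();
        return used[0];
    }
}
```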

@s1monw (Contributor) left a comment:

LGTM 2

@s1monw (Contributor) commented Nov 16, 2016

@jpountz and I discussed backporting this to 2.4 without changing the default. It would give us some flexibility if users hit problems, and it's very, very low risk. It will also allow for more feedback regarding the setting. /cc @clintongormley

@jpountz jpountz merged commit 00de8e0 into elastic:master Nov 16, 2016
@jpountz jpountz deleted the do_not_cache_term_queries branch November 16, 2016 09:02
jpountz added 3 commits referencing this pull request Nov 16, 2016; each commit message repeats the PR description above.
@jpountz (Contributor, Author) commented Nov 16, 2016

OK, I pushed the new setting to the 6.x, 5.x, 5.0, and 2.4 branches; the default is to cache term queries on 2.4 and 5.0 and not to cache them on 5.x and 6.x.

jasontedor added a commit that referenced this pull request Nov 16, 2016
* master: (22 commits)
  Add proper toString() method to UpdateTask (#21582)
  Fix `InternalEngine#isThrottled` to not always return `false`. (#21592)
  add `ignore_missing` option to SplitProcessor (#20982)
  fix trace_match behavior for when there is only one grok pattern (#21413)
  Remove dead code from GetResponse.java
  Fixes date range query using epoch with timezone (#21542)
  Do not cache term queries. (#21566)
  Updated dynamic mapper section
  Docs: Clarify date_histogram bucket sizes for DST time zones
  Handle release of 5.0.1
  Fix skip reason for stats API parameters test
  Reduce skip version for stats API parameter tests
  Strict level parsing for indices stats
  Remove cluster update task when task times out (#21578)
  [DOCS] Mention "all-fields" mode doesn't search across nested documents
  InternalTestCluster: when restarting a node we should validate the cluster is formed via the node we just restarted
  Fixed bad asciidoc in boolean mapping docs
  Fixed bad asciidoc ID in node stats
  Be strict when parsing values searching for booleans (#21555)
  Fix time zone rounding edge case for DST overlaps
  ...
@synhershko (Contributor) commented

OK, so just double-checking that the default is not to have this optimization enabled for 2.4, as the release notes say something else...

@jpountz (Contributor, Author) commented Nov 28, 2016

Indeed, the 2.4 branch got the setting but kept the same default behaviour. I suspect the issue you mention with the release notes is that they use the title of this PR as a description?

@synhershko (Contributor) commented Nov 28, 2016

@jpountz Likely so.

Also, a bit of context on why I care about this one: see my comments in one of the original issues. In addition, we are currently seeing the following with 2.3.2 on a 15 GB machine (7 GB heap, default query cache settings):

    "query_cache": {
      "memory_size": "20.8mb",
      "memory_size_in_bytes": 21869016,
      "total_count": 1828019410,
      "hit_count": 66825483,
      "miss_count": 1761193927,
      "cache_size": 3666,
      "cache_count": 2960592,
      "evictions": 2956926
    }

The high number of evictions and the low percentage of used memory suggest an issue, and I'm wondering whether we are experiencing what you've described here, or whether it could be something else?
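To put rough numbers on those stats, the ratios implied by the JSON above can be computed directly (an illustrative calculation; the variable names mirror the stat fields):

```java
class QueryCacheStatsMath {
    public static void main(String[] args) {
        long totalCount = 1_828_019_410L;  // "total_count": cache lookups
        long hitCount   = 66_825_483L;     // "hit_count"
        long cacheCount = 2_960_592L;      // "cache_count": entries ever cached
        long evictions  = 2_956_926L;      // "evictions"

        // ~3.7% of lookups hit the cache; everything else misses.
        double hitRatio = (double) hitCount / totalCount;

        // ~99.9% of entries that were ever cached have been evicted.
        double evictionRatio = (double) evictions / cacheCount;

        System.out.printf("hit ratio: %.1f%%%n", 100 * hitRatio);
        System.out.printf("eviction ratio: %.1f%%%n", 100 * evictionRatio);
    }
}
```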

@jpountz (Contributor, Author) commented Nov 28, 2016

> The high number of evictions and low percentage of used memory suggests an issue

The number of evictions might not be an issue: evictions happen naturally with read-write indices because of merges. So if your cluster has some read-write indices, this might just indicate that few filters meet the required criteria to be considered worth caching. If your indices are all read-only however, then this is indeed concerning.

The low memory usage, however, might indeed be related to this issue.

@sundarv85 commented

@jpountz I'm using Elasticsearch 5.2.0. I see in your commit that one can re-enable caching of term queries by setting index.queries.cache.term_queries to true in the index settings. In our case we only have terms in our queries, so we want term queries to be cached too. I tried the following gist:

https://gist.github.com/sundarv85/5aec5d8a50958bee5891e10d11b87fae

But afterwards I do not see the query_cache grow. Am I missing something here?

RickyLau added a commit to ElasticsearchClub/elasticsearch that referenced this pull request Mar 11, 2019
6 participants