
Fix auto date histogram rounding assertion bug #17023

Open · wants to merge 14 commits into main

Conversation

@finnegancarroll (Contributor) commented Jan 14, 2025

Description

The auto date histogram agg contains an optimization which skips traditional doc collection and instead examines the "pre-aggregated" doc counts contained in the BKD tree of each segment. Several conditions need to be true before this optimization can execute for a given segment. One such condition is that we must be able to determine a set of ranges (or a rounding) for the segment under consideration before optimizing.

Normally, ranges for the auto date histogram aggregation are updated as needed over the course of collecting documents. The filter rewrite optimization, however, skips regular doc collection, so it updates the preparedRounding of the agg ahead of time according to the min and max values of the segment under consideration.

As a result, it is possible for our preparedRounding to shrink, since the rounding built from a single segment can easily be smaller than the rounding previously used by our shard level aggregator.

This usually does not pose a problem, as the preparedRounding will be updated accordingly when we collect our next document or reduce our shard level aggs into a single top level agg.

The specific case where this becomes problematic is when our preparedRounding is delegating to a "bounded" structure. When we prepare a rounding, we do so for the min and max epoch time of our shard, since this allows us to optimize the structure we delegate rounding to.

For some ranges of epoch time and some time zones, rounding is little more than a modulo operation. However, if our min and max epoch times cross "transitions" such as daylight saving time, we may instead delegate rounding to a linked list or array structure to quickly look up these transitions. This is why the specific occurrence of this bug linked in the initial issue only appears when "time_zone":"America/New_York".
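
As a rough sketch of what preparing a bounded rounding looks like (the API shapes below follow org.opensearch.common.Rounding as quoted later in this thread, but the timestamps and the wrapper method are illustrative, not the aggregator's actual code):

import java.time.ZoneId;
import org.opensearch.common.Rounding;

class PreparedRoundingSketch {
    // Prepare a day-of-month rounding for one segment's [min, max] range and
    // round a single doc timestamp with it.
    static long roundForSegment(long segMinMillis, long segMaxMillis, long docMillis) {
        Rounding rounding = Rounding.builder(Rounding.DateTimeUnit.DAY_OF_MONTH)
            .timeZone(ZoneId.of("America/New_York")) // DST transitions force a lookup structure
            .build();
        // Prepared only for [segMinMillis, segMaxMillis]; if that span crosses a
        // DST transition, the prepared rounding can delegate to a precomputed
        // array/list of transition points valid strictly inside those bounds.
        Rounding.Prepared prepared = rounding.prepare(segMinMillis, segMaxMillis);
        return prepared.round(docMillis); // keys outside the prepared bounds can trip assertions
    }
}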

The combination of delegating rounding to these strictly bounded structures and the filter rewrite optimization "replaying" our previous bucket keys fails an assertion within our preparedRounding: our previous bucket keys are not guaranteed to fit within the strict bounds of the rounding prepared for the current segment being collected.

The changes in this PR resolve this by ensuring the filter rewrite optimization only ever increases the granularity of our preparedRounding.
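
To make the failure mode concrete, here is a minimal, self-contained illustration; BoundedRounding is a hypothetical stand-in for the array-backed structure a prepared rounding can delegate to, not the real class:

class BoundedRoundingDemo {
    // Hypothetical stand-in for a rounding prepared for a fixed [min, max] range.
    static final class BoundedRounding {
        private final long min, max;
        BoundedRounding(long min, long max) { this.min = min; this.max = max; }

        long round(long utcMillis) {
            // The real bounded structures carry a similar assertion: inputs must
            // fall inside the range the rounding was prepared for.
            assert utcMillis >= min && utcMillis <= max : "key outside prepared bounds";
            return utcMillis - (utcMillis % 3_600_000L); // e.g. hourly rounding
        }
    }

    public static void main(String[] args) {
        // A rounding prepared from the current segment's narrower [min, max] ...
        BoundedRounding segmentRounding = new BoundedRounding(2_000_000L, 3_000_000L);
        // ... then replaying a bucket key produced earlier under a wider shard-level
        // rounding lands outside those bounds and fails the assertion (with -ea).
        segmentRounding.round(1_000_000L);
    }
}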

Related Issues

Resolves #16932

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions bot added the bug (Something isn't working) and Search (Search query, autocomplete ...etc) labels on Jan 14, 2025

❕ Gradle check result for 0ecdf31: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.41%. Comparing base (c6dfc65) to head (dc88032).

Files with missing lines Patch % Lines
.../src/main/java/org/opensearch/common/Rounding.java 75.00% 1 Missing ⚠️
...ain/java/org/opensearch/search/DocValueFormat.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17023      +/-   ##
============================================
+ Coverage     72.24%   72.41%   +0.16%     
- Complexity    65421    65503      +82     
============================================
  Files          5306     5306              
  Lines        304215   304222       +7     
  Branches      44118    44119       +1     
============================================
+ Hits         219788   220294     +506     
+ Misses        66414    65903     -511     
- Partials      18013    18025      +12     


@finnegancarroll (Contributor, Author)

@bowenlan-amzn can you take a look when you have a chance? Thank you!

@bowenlan-amzn (Member)

I remember the optimization is not applied if the aggregation defines a timezone, so this bug is kind of a surprise.
But the real story is that the timezone check happens right after the point where the bug occurs 😔

final Rounding rounding = getRounding(bounds[0], bounds[1]);
final OptionalLong intervalOpt = Rounding.getInterval(rounding);

The timezone check is inside getInterval:

if (!isUTCTimeZone(((TimeUnitRounding) rounding).timeZone)) {
// Fast filter aggregation cannot be used if it needs time zone rounding
return OptionalLong.empty();
}

The timezone is part of the Rounding object, so we need to have the Rounding first and then check the timezone. However, in auto date histogram, getting the rounding also updates the prepared rounding, which leads to this bug.

I recommend we add a simple timezone check right before:

final Rounding rounding = getRounding(bounds[0], bounds[1]);
final OptionalLong intervalOpt = Rounding.getInterval(rounding);

It would purely take in a Rounding (for auto date histogram, this can just be the first Rounding in RoundingInfos, since every Rounding has the same timezone information) and do the check:
if (!isUTCTimeZone(((TimeUnitRounding) rounding).timeZone)) {
// Fast filter aggregation cannot be used if it needs time zone rounding
return OptionalLong.empty();
}

If the check doesn't pass, we don't even bother going inside the getRounding method.
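
Putting the suggestion together, a sketch of the reordered flow might look like the following (the enclosing method and its bail-out value are assumptions; roundingInfos[0].rounding stands in for "the first Rounding in RoundingInfos" mentioned above):

// Hoist the timezone check so getRounding() is never called (and the
// prepared rounding never mutated) for non-UTC timezones.
Rounding first = roundingInfos[0].rounding; // every RoundingInfo shares the same timezone
if (first instanceof Rounding.TimeUnitRounding
    && !isUTCTimeZone(((Rounding.TimeUnitRounding) first).timeZone)) {
    return OptionalLong.empty(); // same "cannot optimize" result as getInterval
}
final Rounding rounding = getRounding(bounds[0], bounds[1]);
final OptionalLong intervalOpt = Rounding.getInterval(rounding);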

The check you added here (not shrinking the rounding) is still meaningful. I agree the rounding should be ever increasing, not shrinking depending on the next segment processed.


✅ Gradle check result for 569502d: SUCCESS

@finnegancarroll (Contributor, Author)

@jainankitk can you take a look when you have a chance? Thanks!

@bowenlan-amzn (Member)

bowenlan-amzn commented Jan 22, 2025

It's hard to describe my suggested change, so I created a commit; please take a look.
bowenlan-amzn@e0de5d0

@bowenlan-amzn (Member)

Another thing I'm wondering is why we cannot support timezones in this optimization.

When a timezone is present, the aggregation's default method first converts the date values to local values, does the rounding, then converts the results back to UTC.
We can imagine that with DST, for a daily aggregation, some days may have 23 hours and some 25 hours.

To support that in our optimization, we would need to figure out the right ranges for a given timezone, represent them in UTC, and then use those ranges to intersect the index structure of the date field, which stores UTC values.
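
As a self-contained check of the 23/25-hour point, using plain java.time rather than any OpenSearch code, the daily bucket boundaries for America/New_York can be computed in UTC like this:

import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

class DstDayLengths {
    public static void main(String[] args) {
        ZoneId tz = ZoneId.of("America/New_York");
        printDay(LocalDate.of(2024, 3, 10), tz);  // spring forward: 23-hour day
        printDay(LocalDate.of(2024, 11, 3), tz);  // fall back: 25-hour day
    }

    static void printDay(LocalDate day, ZoneId tz) {
        Instant start = day.atStartOfDay(tz).toInstant();
        Instant end = day.plusDays(1).atStartOfDay(tz).toInstant();
        // These UTC instants are the range endpoints the optimization would
        // need in order to intersect the BKD tree's UTC-encoded values.
        System.out.printf("%s: [%d, %d) = %d hours%n",
            day, start.toEpochMilli(), end.toEpochMilli(),
            Duration.between(start, end).toHours());
    }
}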

@finnegancarroll (Contributor, Author)

It's hard to describe my suggested change, so I created a commit; please take a look.

I see what you mean, this is much cleaner. Cherry-picked this commit.


❌ Gradle check result for d4f5dc1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


✅ Gradle check result for d4f5dc1: SUCCESS

finnegancarroll and others added 14 commits January 23, 2025 12:02
… preparedRounding of agg.

Signed-off-by: Finn Carroll <[email protected]>
Signed-off-by: Finn Carroll <[email protected]>
Signed-off-by: Finn Carroll <[email protected]>
Signed-off-by: Finn Carroll <[email protected]>
Signed-off-by: Finn Carroll <[email protected]>
Signed-off-by: Finn Carroll <[email protected]>
Signed-off-by: bowenlan-amzn <[email protected]>

✅ Gradle check result for dc88032: SUCCESS

@@ -489,7 +489,7 @@ public double roundingSize(long utcMillis, DateTimeUnit timeUnit) {
      *
      * @opensearch.internal
      */
-    static class TimeUnitRounding extends Rounding {
+    public static class TimeUnitRounding extends Rounding {
Collaborator:

This is not needed

Comment on lines +205 to +208
roundingIdx = Math.max(prevRoundingIdx, roundingIdx);
if (roundingIdx != prevRoundingIdx) {
preparedRounding = prepareRounding(roundingIdx);
}
Collaborator:

slightly more readable?

if (roundingIdx < prevRoundingIdx) {
    roundingIdx = prevRoundingIdx;
} else if (roundingIdx > prevRoundingIdx) {
    preparedRounding = prepareRounding(roundingIdx);
}

Comment on lines +438 to +453
/**
* Quick explanation of the two below conditions:
*
* 1. [targetBuckets * roundingInfos[roundingIdx].getMaximumInnerInterval()]
* Represents the total bucket count possible before we will exceed targetBuckets
* even if we use the maximum inner interval of our current rounding. For example, consider the
* DAYS_OF_MONTH rounding where the maximum inner interval is 7 days (i.e. 1 week buckets).
* targetBuckets * roundingInfos[roundingIdx].getMaximumInnerInterval() would then be the number of
* 1 day buckets possible such that if we re-bucket to 1 week buckets we will have more 1 week buckets
* than our targetBuckets limit. If the current count of buckets exceeds this limit we must update
* our rounding.
*
* 2. [targetBuckets * roundingInfos[roundingIdx].getMaximumRoughEstimateDurationMillis()]
* The total duration of ms covered by our current rounding. In the case of MINUTES_OF_HOUR rounding
* getMaximumRoughEstimateDurationMillis is 60000. If our current total range in millis (max - min)
* exceeds this range we must update our rounding.
Collaborator:

Thank you for adding detailed documentation!
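
For concreteness, plugging illustrative numbers (not taken from the PR) into the two documented conditions:

class RoundingThresholdsExample {
    public static void main(String[] args) {
        int targetBuckets = 10; // hypothetical target bucket count

        // Condition 1: DAYS_OF_MONTH, maximumInnerInterval = 7 (weekly inner buckets).
        int maxInnerInterval = 7;
        int bucketLimit = targetBuckets * maxInnerInterval; // 70
        // Past 70 one-day buckets, even re-bucketing into 7-day buckets produces
        // more than targetBuckets buckets, so the rounding must move up.

        // Condition 2: MINUTES_OF_HOUR, maximumRoughEstimateDurationMillis = 60000.
        long maxDurationMillis = 60_000L;
        long rangeLimitMillis = targetBuckets * maxDurationMillis; // 600,000 ms = 10 min
        // A (max - min) span wider than this cannot be covered by minute-level
        // rounding within targetBuckets buckets, so the rounding must move up.

        System.out.println("bucket limit = " + bucketLimit);
        System.out.println("range limit (ms) = " + rangeLimitMillis);
    }
}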

@jainankitk added the backport 2.x (Backport to 2.x branch) label on Jan 23, 2025
@jainankitk (Collaborator)

Another thing I'm wondering is why we cannot support timezones in this optimization.

I agree with @bowenlan-amzn. We should try to support the optimization with timezones as well for wider coverage. @bowenlan-amzn - can you please create an issue for the same?

Successfully merging this pull request may close these issues:

[BUG] auto_date_histogram query with time_zone can throw AssertionError