Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Feature Filtering in Model Validation #1258

Merged
merged 1 commit into from
Jul 8, 2024

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented Jul 2, 2024

Description

Previously, we used date histogram aggregation for feature sparsity validation and interval recommendation. But it is possible a feature itself has filters besides the detector's filter. This can cause incorrect interval recommendations. We may fail to root cause feature definition itself caused data sparsity. This PR improves feature sparsity validation and interval recommendation by incorporating feature aggregation. This prevents incorrect interval recommendations and helps identify if feature definitions are causing data sparsity. The same date range aggregation used in the cold start is now employed to retrieve data and verify sparsity.

Detailed Changes

  • Feature Sparsity Validation: Integrated feature aggregation in the sparsity validation and interval recommendation processes. When a feature is defined, we utilize date range aggregation, consistent with cold start methodology, to ensure accurate data retrieval and sparsity verification.
  • Coverly report: To fix [BUG] CodeCov is failing project-tools#87, I followed [CI] CodeCov no longer works after upgrade to v4 of the action flow-framework#490 to upgrade codecov GitHub Action dependency from v3 to v4 and and add the standard-named token.
  • Avoid redundant coverly report: Limited coverage upload to the macOS environment to avoid redundant uploads across Linux, macOS, and Windows builds.
  • Error Message Refactoring: Transferred shareable error messages from ADCommonMessages to CommonMessages.
  • Code Refactoring: Refactored ModelColdStart and SearchFeatureDao for reusability in model validation and interval recommendation.
  • Moved HistogramAggregationHelper to AggregationPrep, which now includes both histogram and date_range aggregations.
  • Top Entities Retrieval Logic: Improved LatestTimeRetriever logic: Previously in LatestTimeRetriever we retrieve top entities first before retrieving latest time based on current time. But it is possible that the window delay is high and we could get no top entities. We now retrieve the latest time first, then fetches top entities based on this time. The change increases the likelihood of obtaining top entities, especially when window delay is high.
  • Bug Fix in ModelValidationActionHandler: Fixed an issue in checkFeatureQueryDelegate where missing colons in messages caused an index out of bounds exception in BaseValidateConfigTransportAction.getFeatureSubIssuesFromErrorMessage.
  • Previously, in HistogramAggregationHelper.getNumberOfSamples, we queried up to 24 hours of data. This approach could result in an incorrect bucket coverage rate, especially when the historical data is limited. Even if the limited history provides 100% coverage, the bucket coverage rate might still appear low within the one day time frame. This PR updates HistogramAggregationHelper.getNumberOfSamples to limit data queries to config.getHistoryIntervals(), ensuring that the queried interval is at least the shingle size plus 32. This adjustment improves the accuracy of bucket coverage rate calculations.
  • Improved password generation in security tests: Current implementation may cause "Password is similar to user name". This PR improve the randomness and ensure the generated password is not similar to the username by using a Secure Random Generator and ensure that the password includes a mix of uppercase, lowercase, digits, and special characters.

Testing done:

  1. added various integ and unit tests. Test coverage increased from 71.83% to 77.05%.

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@opensearch-trigger-bot opensearch-trigger-bot bot added infra Changes to infrastructure, testing, CI/CD, pipelines, etc. backport 2.x labels Jul 2, 2024
@kaituo kaituo force-pushed the bash5 branch 2 times, most recently from df241c4 to be78b86 Compare July 2, 2024 22:16
@kaituo kaituo changed the title Consider feature filter in model validation Add Feature Filtering in Model Validation Jul 2, 2024
Copy link

codecov bot commented Jul 2, 2024

Codecov Report

Attention: Patch coverage is 68.12227% with 73 lines in your changes missing coverage. Please review.

Project coverage is 77.05%. Comparing base (3f0fc8c) to head (1e8be88).
Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##               main    #1258      +/-   ##
============================================
+ Coverage     71.83%   77.05%   +5.22%     
- Complexity     4898     5287     +389     
============================================
  Files           518      517       -1     
  Lines         22879    22890      +11     
  Branches       2245     2239       -6     
============================================
+ Hits          16434    17639    +1205     
+ Misses         5410     4216    -1194     
  Partials       1035     1035              
Flag Coverage Δ
plugin 77.05% <68.12%> (+5.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...a/org/opensearch/ad/constant/ADCommonMessages.java 95.83% <ø> (ø)
...d/rest/handler/ADModelValidationActionHandler.java 100.00% <ø> (+100.00%) ⬆️
...nsport/ValidateAnomalyDetectorTransportAction.java 100.00% <ø> (ø)
.../handler/ForecastModelValidationActionHandler.java 100.00% <ø> (+100.00%) ⬆️
...t/transport/ValidateForecasterTransportAction.java 100.00% <ø> (+70.00%) ⬆️
...opensearch/timeseries/constant/CommonMessages.java 97.43% <ø> (ø)
...a/org/opensearch/timeseries/ml/ModelColdStart.java 79.25% <100.00%> (-0.95%) ⬇️
...ch/timeseries/model/IntervalTimeConfiguration.java 92.85% <ø> (+7.14%) ⬆️
...opensearch/timeseries/model/TimeConfiguration.java 100.00% <ø> (ø)
...ansport/BaseSuggestConfigParamTransportAction.java 68.51% <100.00%> (+44.98%) ⬆️
... and 8 more

... and 88 files with indirect coverage changes

@kaituo
Copy link
Collaborator Author

kaituo commented Jul 3, 2024

WhiteSource Security Check is failing. Same for other repo (e.g., KNN) as well. Error:

Oops! An error occurred while running the Security Check.The Scanner was unable to connect to the repository or branch and clone it.
ref: https://github.com/opensearch-project/anomaly-detection/pull/1258/checks?check_run_id=26966747402

Reached out to infra team to get this resolved.

@kaituo kaituo merged commit 63dacaa into opensearch-project:main Jul 8, 2024
16 of 17 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 8, 2024
Signed-off-by: Kaituo Li <[email protected]>
(cherry picked from commit 63dacaa)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kaituo added a commit to kaituo/anomaly-detection-1 that referenced this pull request Jul 8, 2024
kaituo added a commit that referenced this pull request Jul 8, 2024
* Consider feature filter in model validation (#1258)

Signed-off-by: Kaituo Li <[email protected]>

* follow opensearch-project/k-NN#1795

Signed-off-by: Kaituo Li <[email protected]>

---------

Signed-off-by: Kaituo Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x infra Changes to infrastructure, testing, CI/CD, pipelines, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] CodeCov is failing
2 participants