
Checking feature queries as part of Validate API recommendation #445

Closed
amitgalitz opened this issue Mar 15, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@amitgalitz
Member

Is your feature request related to a problem? Please describe.
One of the non-blocking goals of the Validate API is to determine an optimal detector interval length to recommend, based on checking data sparsity with all configurations applied. In the original PR (#384), feature queries were taken into account by checking whether all fields referenced by the feature queries exist in a single document during a given interval. However, this check was incorrect: the feature fields don't all have to exist in the same document; they can be spread across separate documents as long as they fall within the same interval. A follow-up PR (#412) removed this check from the interval recommendation, so we are no longer over-validating, but we are also no longer taking feature queries into account when recommending an interval.
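To make the corrected semantics concrete, here is a toy illustration (document contents and field names are hypothetical): feature fields need not co-occur in one document; an interval counts as covered if the union of fields across all of its documents includes every feature field.

```python
from collections import defaultdict

def intervals_covered(docs, feature_fields, interval_minutes=10):
    """docs: list of (epoch_minutes, {field: value}) pairs.

    Returns {interval_bucket: True/False} where True means every
    feature field appeared in *some* document within that interval.
    """
    seen = defaultdict(set)  # interval bucket -> set of fields observed
    for minute, fields in docs:
        seen[minute // interval_minutes].update(fields)
    return {bucket: set(feature_fields) <= fields for bucket, fields in seen.items()}

docs = [
    (0, {"cpu_usage": 0.4}),  # minute 0: only cpu_usage
    (5, {"mem_usage": 0.7}),  # minute 5: only mem_usage, same 10-minute interval
]
# Both fields appear within interval bucket 0, so the interval is covered,
# even though the old per-document check would have failed here.
```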

Describe the solution you'd like
Possible solutions:

  1. To find an optimal interval and further identify whether a feature query is the root cause of sparse data, we can try multiple different intervals for each feature query the detector has.
    1. This means we might find a different interval suggestion for each feature query, in which case we can recommend the longest interval across all feature queries.
    2. We then need to decide how to handle the case where some feature queries lead to an interval recommendation while others don't have enough data at any interval (either change the response type so we can both provide an interval recommendation and list the problematic features, or simply list the features for which no interval was found).
  2. Add a sub-aggregation that looks for the feature fields inside each interval bucket of the date histogram aggregation that is currently implemented.
    1. This will need further performance testing, since it would mean running a sub-aggregation within up to 1,440 interval buckets, and this call already occurs multiple times for different intervals.
@amitgalitz amitgalitz added the enhancement New feature or request label Mar 15, 2022
@elfisher

This looks really interesting! From an end-user perspective, does this improve AD accuracy, performance, or something else?

@amitgalitz
Member Author

@elfisher From the user perspective it means the Validation API is more accurate. In OpenSearch 1.3 we added a new Validation API that runs during the last step of creating an anomaly detector and validates whether the given configuration will likely create a detector that successfully initializes and completes model training. Users can also call the Validation API directly through the backend: https://opensearch.org/docs/latest/monitoring-plugins/ad/api/#validate-detector.


This enhancement will give the Validation API the ability to tell users whether a specific feature field is causing sparse data, with a call-out that they should probably change that feature field or expect potentially longer initialization times, or no initialization at all. Currently the Validation API doesn't fully take the specific feature fields into account, only the other configurations.


Basically, before creating a detector, users will be even better informed about whether their configuration has any issues.

@elfisher

Thanks for the clarification @amitgalitz! If you don't mind, can you create an issue in the doc repo to track this update for 2.1? Since this is an improvement to the API we should make sure we get it documented.

@amitgalitz amitgalitz added v2.2.0 and removed v2.1.0 labels Jul 7, 2022
@ohltyler
Member

@amitgalitz should this be re-labeled as 2.3?

@amitgalitz
Member Author

> @amitgalitz should this be re-labeled as 2.3?

Good point, I'll actually remove the version label right now and discuss with Sean on priority

@amitgalitz amitgalitz removed the v2.2.0 label Aug 10, 2022
@ohltyler
Member

> @amitgalitz should this be re-labeled as 2.3?
>
> Good point, I'll actually remove the version label right now and discuss with Sean on priority

Sounds good - I'll set it as 2.3 tentatively

@ohltyler ohltyler added v2.4.0 and removed v2.3.0 labels Sep 7, 2022
@amitgalitz amitgalitz removed the v2.4.0 label Oct 21, 2022
@kaituo
Collaborator

kaituo commented Jul 8, 2024

Sent the PR to fix the issue: #1258

My solution is a little different:

  • When suggesting an interval, I am using cold start queries if features exist (it is possible users haven't defined features when invoking the suggest interval API). If any one feature is missing, I regard the whole sample (which might include multiple features) as missing.
  • When finding feature sparsity, I now use the feature aggregation instead of an exists query. I found people can write complicated queries in a feature, and the current logic won't be able to find the feature field name. Also, as you mentioned in the issue, not all fields might exist in the same documents, and they may use a runtime field too, which is even more complex.
  • If a feature aggregation returns enough non-empty values for cold start, I assume it won't cause sparsity.
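The density check described in these bullets could be sketched roughly as follows: treat an interval's sample as missing if any feature aggregation returned no value, then require enough non-missing samples for cold start. The threshold and the bucket shape here are illustrative assumptions, not the actual code in #1258.

```python
def enough_for_cold_start(buckets, feature_names, min_samples=32):
    """buckets: list of dicts mapping feature name -> aggregated value or None.

    A sample counts only if every feature produced a non-empty value;
    returns True when the number of complete samples meets the
    (assumed) cold start minimum.
    """
    complete = sum(
        1
        for bucket in buckets
        if all(bucket.get(name) is not None for name in feature_names)
    )
    return complete >= min_samples
```

For example, 40 intervals where `mem_usage` always aggregated to nothing would fail the check even if `cpu_usage` was dense, matching the "any one feature missing makes the sample missing" rule above.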

@kaituo
Collaborator

kaituo commented Jul 8, 2024

Closing the issue for now. @amitgalitz feel free to reopen if you find my PR needs improvement.

@kaituo kaituo closed this as completed Jul 8, 2024