
Checking feature queries as part of Validate API recommendation #445

Closed
amitgalitz opened this issue Mar 15, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@amitgalitz
Member

Is your feature request related to a problem? Please describe.
One of the non-blocking goals of the Validate API is to determine an optimal detector interval length to recommend, based on checking data sparsity with all configurations applied. In the original PR (#384), feature queries were taken into account by checking whether all fields referenced by the feature queries exist in a single document during a given interval. However, this check was incorrect: the feature fields don't all have to exist in the same document; they can be spread across separate documents as long as they fall within the same interval. A follow-up PR (#412) removed this check from the interval recommendation, so we are no longer over-validating, but we are also no longer taking feature queries into account when recommending an interval.
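To make the corrected semantics concrete, here is a toy illustration (document contents and field names are hypothetical): feature fields need not co-occur in one document; an interval counts as covered if the union of fields across all of its documents includes every feature field.

```python
from collections import defaultdict

def intervals_covered(docs, feature_fields, interval_minutes=10):
    """docs: list of (epoch_minutes, {field: value}) pairs.

    Returns {interval_bucket: True/False} where True means every
    feature field appeared in *some* document within that interval.
    """
    seen = defaultdict(set)  # interval bucket -> set of fields observed
    for minute, fields in docs:
        seen[minute // interval_minutes].update(fields)
    return {bucket: set(feature_fields) <= fields for bucket, fields in seen.items()}

docs = [
    (0, {"cpu_usage": 0.4}),  # minute 0: only cpu_usage
    (5, {"mem_usage": 0.7}),  # minute 5: only mem_usage, same 10-minute interval
]
# Both fields appear within interval bucket 0, so the interval is covered,
# even though the old per-document check would have failed here.
```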

Describe the solution you'd like
Possible solutions:

  1. To find an optimal interval and further identify whether a feature query is the root cause of sparse data, we can try multiple different intervals for each feature query the detector has.
    1. This means we might find a different interval suggestion for each feature query, in which case we can recommend the longest interval across all feature queries.
    2. We then need to decide how to handle the case where some feature queries lead to an interval recommendation while others don't have enough data at any interval (either change the response type so we can both provide an interval recommendation and list the problematic features, or simply list the features for which no interval was found).
  2. Add a sub-aggregation that looks for the feature fields inside each interval bucket of the date histogram aggregation that is currently implemented.
    1. This will need further performance testing, since it would mean running a sub-aggregation within up to 1,440 interval buckets, and this call already occurs multiple times for different intervals.
@amitgalitz amitgalitz added the enhancement New feature or request label Mar 15, 2022
@elfisher

This looks really interesting! From an end-user perspective, does this improve AD accuracy, performance, or something else?

@amitgalitz
Member Author

@elfisher From the user perspective it means the Validation API is more accurate. In OpenSearch 1.3 we added a new Validation API that runs during the last step of creating an anomaly detector and validates whether the given configuration will likely create a detector that successfully initializes and completes model training. Users can also call the Validation API directly through the backend: https://opensearch.org/docs/latest/monitoring-plugins/ad/api/#validate-detector.


This enhancement will give the Validation API the ability to tell users whether a specific feature field is causing sparse data, with a call-out that they should probably change that feature field or expect potentially longer initialization times, or no initialization at all. Currently the Validation API doesn't fully take the specific feature fields into account, only the other configurations.


Basically, before creating a detector, users will be even better informed about whether their configuration has any issues.

@elfisher

Thanks for the clarification @amitgalitz! If you don't mind, can you create an issue in the doc repo to track this update for 2.1? Since this is an improvement to the API we should make sure we get it documented.

@amitgalitz amitgalitz added v2.2.0 and removed v2.1.0 labels Jul 7, 2022
@ohltyler
Member

@amitgalitz should this be re-labeled as 2.3?

@amitgalitz
Member Author

> @amitgalitz should this be re-labeled as 2.3?

Good point, I'll actually remove the version label right now and discuss with Sean on priority

@amitgalitz amitgalitz removed the v2.2.0 label Aug 10, 2022
@ohltyler
Member

> @amitgalitz should this be re-labeled as 2.3?
>
> Good point, I'll actually remove the version label right now and discuss with Sean on priority

Sounds good - I'll set it as 2.3 tentatively

@ohltyler ohltyler added v2.4.0 and removed v2.3.0 labels Sep 7, 2022
@amitgalitz amitgalitz removed the v2.4.0 label Oct 21, 2022
@kaituo
Collaborator

kaituo commented Jul 8, 2024

Sent the PR to fix the issue: #1258

My solution is a little different:

  • When suggesting an interval, I am using cold start queries if features exist (it is possible users haven't defined features when invoking the suggest interval API). If any one feature is missing, I regard the whole sample (which might include multiple features) as missing.
  • When finding feature sparsity, I now use the feature aggregation instead of an exists query. I found people can write complicated queries in a feature, and the current logic won't be able to find the feature field name. Also, as you mentioned in the issue, not all fields might exist in the same documents, and they may use a runtime field too, which is even more complex.
  • If a feature aggregation returns enough non-empty values for cold start, I assume it won't cause sparsity.
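The density check described in these bullets could be sketched roughly as follows: treat an interval's sample as missing if any feature aggregation returned no value, then require enough non-missing samples for cold start. The threshold and the bucket shape here are illustrative assumptions, not the actual code in #1258.

```python
def enough_for_cold_start(buckets, feature_names, min_samples=32):
    """buckets: list of dicts mapping feature name -> aggregated value or None.

    A sample counts only if every feature produced a non-empty value;
    returns True when the number of complete samples meets the
    (assumed) cold start minimum.
    """
    complete = sum(
        1
        for bucket in buckets
        if all(bucket.get(name) is not None for name in feature_names)
    )
    return complete >= min_samples
```

For example, 40 intervals where `mem_usage` always aggregated to nothing would fail the check even if `cpu_usage` was dense, matching the "any one feature missing makes the sample missing" rule above.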

@kaituo
Collaborator

kaituo commented Jul 8, 2024

Closing the issue for now. @amitgalitz feel free to reopen if you find my PR needs improvement.

@kaituo kaituo closed this as completed Jul 8, 2024