Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for interval field for Date histogram #2037

Open
fmassot opened this issue May 10, 2023 · 11 comments
Open

Add support for interval field for Date histogram #2037

fmassot opened this issue May 10, 2023 · 11 comments
Assignees

Comments

@fmassot
Copy link
Contributor

fmassot commented May 10, 2023

As described in quickwit-oss/quickwit#3250, we need to support the OpenSearch API too...

To reach that for the date histogram, we need to support the interval field in addition to the fixed_interval field (there is a calendar_interval field but we don't support it yet).

We discussed this issue with @PSeitz, we have 2 ways of doing it:

  1. having separate structs to deserialize the JSON formatted date histogram aggregation
  2. adding a new field interval on the DateHistogramAggregationReq and adding some validation/preprocessing so that we:
  • raise an error if both interval and fixed_interval or interval and calendar_interval are present
  • if interval is present, use it to populate fixed_interval as we only support fixed_interval for now.
@fulmicoton
Copy link
Collaborator

@PSeitz Can you take care of this?

We can also discuss long term plans to ditch the JSON representation of aggregations in tantivy, and just have agregation definition "objects" in there. We would then have a JSON -> Aggregation definition object translation layer in quickwit instead.
However we need to support this for airmail.

@PSeitz
Copy link
Contributor

PSeitz commented May 10, 2023

Elastic search 7.2:

[7.2] Deprecated in 7.2. interval field is deprecated Historically both calendar and fixed intervals were configured in a single interval field, which led to confusing semantics. Specifying 1d would be assumed as a calendar-aware time, whereas 2d would be interpreted as fixed time. To get "one day" of fixed time, the user would need to specify the next smaller unit (in this case, 24h).

Mapping interval to fixed_interval is only an option for values that are fixed interval in es, otherwise that would break elastic search compatibility.

Elastic search and OpenSearch should not be mixed in the same API endpoint, if they are not compatible, which seems to be the the case for date histogram. So I think we need an explicit OpenSearch endpoint in quickwit.

We don't need separate structs, it could be just a union which then gets validated for elastic search or open search compatibility.

As of now, I don't think we gain something by ditching the JSON representation of aggregations in tantivy, but it would cause a lot of code duplication.

@fmassot
Copy link
Contributor Author

fmassot commented May 10, 2023

oh yes sorry for my lack of accuracy, @PSeitz you indeed mentioned the solution of having a dedicated endpoint for OpenSearch.

Mapping interval to fixed_interval is only an option for values that are fixed interval in es, otherwise that would break elastic search compatibility.

Agreed. But we only need to handle fixed intervals in interval for airmail. We could support only that.

Elastic search and OpenSearch should not be mixed in the same API endpoint, if they are not compatible, which seems to be the the case for date histogram. So I think we need an explicit OpenSearch endpoint in quickwit.

Looking at the history of Elasticsearch, I don't see this as an incompatibility but as a support for older elasticsearch version. And when I look at this opensearch issue I would expect them to remove the interval support in the future (though it can take some time).

I think it's overkill to distinguish OpenSearch and Elasticsearch API just for this interval field. That being said, I acknowledge that it would be good to check all endpoints needed for airmail and check if we have other issues like this one. I'm going to do that tonight.

@fmassot
Copy link
Contributor Author

fmassot commented May 11, 2023

@PSeitz Just to confirm that I did not find other queries with an API difference between Elasticsearch en OpenSearch for the airmail project.

I had a look at the following queries: query_string, exists, term, percentiles, multi_match.

@fmassot
Copy link
Contributor Author

fmassot commented May 16, 2023

@PSeitz do you need more info on your side?

@PSeitz
Copy link
Contributor

PSeitz commented May 16, 2023

Looking at the OpenSearch docs, it's unclear if they handle calendar aware intervals or not.
So should we implement the interval halfway, only with values higher than 1?

@fmassot
Copy link
Contributor Author

fmassot commented May 16, 2023

For airmail, we need to support: \d+s, \d+m, \d+h.
I think we can only support what we support currently for the fixed_interval. On OpenSearch, you can have a look at the code here.

@PSeitz
Copy link
Contributor

PSeitz commented May 16, 2023

1m is date-aware in elastic search, this is incompatible with the fixed interval handling we have. Most users are familiar with elastic search and they would all run into the same API issue (probably unknowingly for some time).
2m etc. is fixed interval on the interval field.

@fmassot
Copy link
Contributor Author

fmassot commented May 16, 2023

you mean 1d was date-aware in elasticsearch? So to avoid issues, we could only support ms, s, m, h and avoid the problematic d.

@PSeitz
Copy link
Contributor

PSeitz commented May 16, 2023

1m, 1h, 1d, 1w, 1M, 1q, 1y. They are all date aware. All other values are not, e.g. 24h, 2w etc.

So we could implement the interval halfway, only for values higher than 1.
It's a weird API, and will probably cause some user issues, but at least not returning wrong results.

I think it would be better if airmail could call the new API.

@fmassot
Copy link
Contributor Author

fmassot commented May 16, 2023

Ok, I have just understood how this was broken before... I thought the issue was only on 1d when reading the comment you quoted previously.
Moreover, I saw yesterday in the OpenSearch code that it should be already possible to use the fixed_interval parameter. It's just not documented. This old spec is crazier than I thought... Let's try to push back this on the client side.

(discussion ref)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants