add optional values field to response filtering terms #160

gsfk · 2024-08-20T16:48:45Z

PR for issue #79, where I suggested adding a field to filtering terms, so here is the change. ~~Targeting the main branch since dev is trailing main.~~

The proposal is to bring FilteringTerm in the request into better alignment with FilteringTerm in the response from various /filtering_terms endpoints. I ignore a third, similar definition of filters available at /framework/src/configuration/filteringTermSchema, this file is orphaned and is falling behind the rest of the spec.

This PR proposes an optional addition to the response; this is a non-breaking change (see this example) which among other selling points reduces the verbosity of /filtering_terms responses. See these slides for further discussion.

Motivation / use cases

An alphanumeric query has required fields id, operator, value; an example query (from the docs) is

{
    "id": "geographicOrigin",
    "operator": "=",
    "value": "England"
}

But the response filtering term discards both the operator and value fields. The expected workaround among some beacon users appears to be to package the id, operator and value into a single string so it can be returned as an id, eg:

id: geographicOrigin=England

or something similar. But this is unsupported by the specification and presents other limitations, such as verbose unstructured responses from /filtering_terms endpoints, see slides linked above for discussion. The proposed solution is to not discard the "value" field.

The proposed change was motivated by the addition of beacon v2 to the Bento platform, where we encountered a few issues with alphanumeric search:

(1) String fields not covered by a specification: In order to handle data from the Quebec covid bioank (Biobanque québécoise de la COVID-19), we added an extra property field for subject domicile, with values matching the clinical data collected:

[
    "At home",
    "In residence for the elderly (RPA)",
    "Nursing home (CHSLD)",
    "Homeless",

    ... etc...
]

Some fields (such as "homeless") map well to existing ontology terms, but others involve Quebec-specific terminology such as "CHSLD"; domicile data is stored as a string from an enum of acceptable values. We anticipate similar issues handling data from the upcoming Pan-Canadian Genome Library.

(2) Binned Search: Bento beacon is focused on public anonymous search, so some fields are searchable by predefined bins only, to inhibit re-identification attacks. Here are some example bins for BMI, again from the covid biobank:

[
    "< 20",
    "[20, 25)",
    "[25, 27)",
    "[27, 30)",
    "[30, 35)",
    "[35, 40)",
    "≥ 40"
]

but these bins are not discoverable without workarounds. Some fields may have a large number of bins, which makes the expected workaround above needlessly verbose.

Discussion

The goal here is to improve alignment between request and response filtering terms, since only the latter is discoverable.

The problem is not simply that value and operator go missing when we make a /filtering_term response, but that there is also an ambiguity in how different queries treat the id field. For OntologyFilter, id represents a value to be found in the data, but for AlphanumericFilter, id is a field to be queried. The design of response filtering terms seems to have considered only the first kind of id. Consequently, only OntologyFilter is truly discoverable, other kinds of query require workarounds outside of the specification.

What if I don't like this PR?

The discoverability problem for alphanumeric filters still exists! There are alternative solutions:

codify the expected workaround (squashing everything into a single string)
adopt the "target" proposal mentioned in issue 115: this goes some way to preserving the connection between the values being searched for and the fields where you can expect to find them.

Both of these alternatives leave us with a flattened, verbose /filtering_terms response which this PR eliminates, but these are better than changing nothing. Or, as I've heard suggested, do a complete redesign of filters. But that looks like Beacon v3, which is too far in the future.

tb143 · 2024-10-28T17:24:14Z

This solution and PR has been presented to the Beacon Filters Scout and we are asking for votes on whether to accept. During the Filters Scout meeting today, there was a vote of 5 in favour of accepting and 0 against.

We want to give those who couldn't attend the meeting, or need more time to consider things, an opportunity to vote. Therefore, leave +1 or -1 before the next Filters Scout meeting on 25th November 2024. A final decision will be made then.

redmitry · 2024-10-28T18:13:32Z

I think, the proposed schema should be more strict.
I would add "minItems": 1 into "values" - we must provide at least one value.
If we want to use enumeration, we should consider "enumeration" type of the filter and put a condition it:

"dependentSchemas": {
    "values": {
        "properties": {
            "type": { "const": "enumeration" } 
        }
    }
}

So basically allow "values" only when "type" is "enumeration"

Cheers,

Dmitry

costero-e · 2024-10-29T09:28:14Z

Hi @gsfk and thanks for this PR. I'm 100% in favor of accepting this proposal, but I would first direct it towards dev and then, when we decide which milestone this fits in, merge it to main with the corresponding announcement. I know this is not a breaking change but this is not an urgent fix but a feature, so we should follow the branch management rules. Why do you want to make the request to main directly?
Regarding @redmitry's suggestion, I agree that the "minItems": 1 is useful to not let the beacons return empty arrays of values but for the "type": { "const": "enumeration" } this would restrict the queries just to the values that appear for each filtering term, which means that if you have values, then what you type in value for the filtering query is restricted to the values themselves, which wouldn't let the user make a "free search query". This would clash with "like queries" and I wouldn't do it as enumeration.

+1

redmitry · 2024-10-29T10:42:06Z

for the "type": { "const": "enumeration" } this would restrict the queries just to the values that appear for each filtering term, which means that if you have values, then what you type in value for the filtering query is restricted to the values themselves, which wouldn't let the user make a "free search query". This would clash with "like queries" and I wouldn't do it as enumeration.

Didn't get it... Do we need explicit string values for the "ontology" or "alphanumeric" filters?
I think "enumeration" is the right word for the case. If we want regex textual filters (queries) - see no much sense in a list of "possible" values. Anyway we may always postpone this for other discussion.

+1

D.

gsfk · 2024-10-29T15:05:02Z

Regarding the target branch, no trouble to merge to develop first. When I wrote this pr, develop was trailing main (rather than the other way around)

mbaudis · 2024-10-29T22:32:09Z

+1

That being said: In the long run the same structure could remodel the current structure of filtering terms since it is basically turning the current structure on its head:

currently: a list of individual filtering terms with a (new) path indicating where they apply
this mode: a list of "1 term per target model path" with embedded values supported by the endpoint

When turning this around for the ontology terms one could do the same... e.g. instead of ~800 objects for histological_diagnosis.id we would have one object w/ a list of 800 CURIEs. Fewer bytes but more prescriptive...

add optional values field to response filtering terms

7f0f518

tb143 added the Filters scout label Oct 28, 2024

gsfk changed the base branch from main to develop October 29, 2024 15:01

Merge branch 'develop' into filtering-term-values

ad61afa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add optional values field to response filtering terms #160

add optional values field to response filtering terms #160

gsfk commented Aug 20, 2024 •

edited

Loading

tb143 commented Oct 28, 2024

redmitry commented Oct 28, 2024

costero-e commented Oct 29, 2024 •

edited

Loading

redmitry commented Oct 29, 2024

gsfk commented Oct 29, 2024

mbaudis commented Oct 29, 2024 •

edited

Loading

add optional values field to response filtering terms #160

Are you sure you want to change the base?

add optional values field to response filtering terms #160

Conversation

gsfk commented Aug 20, 2024 • edited Loading

Motivation / use cases

Discussion

What if I don't like this PR?

tb143 commented Oct 28, 2024

redmitry commented Oct 28, 2024

costero-e commented Oct 29, 2024 • edited Loading

redmitry commented Oct 29, 2024

gsfk commented Oct 29, 2024

mbaudis commented Oct 29, 2024 • edited Loading

gsfk commented Aug 20, 2024 •

edited

Loading

costero-e commented Oct 29, 2024 •

edited

Loading

mbaudis commented Oct 29, 2024 •

edited

Loading