jmespath: new 500 error scenario? #368

Open
colleenXu opened this issue Dec 25, 2024 · 11 comments
@colleenXu

colleenXu commented Dec 25, 2024

@DylanWelzel @ctrl-schaff (based on discussion in this Slack thread)

While writing queries and testing x-bte annotation in biothings/biothings_explorer#904, I found a query that returns an error: {"code":500,"success":false,"error":"Internal Server Error","details":"bioactivity"}. Similar queries worked fine (different uniprot ID used in body and jmespath parameter).

click to see problematic query

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral.bioactivity%7C[%3F!action_type%20%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%27P29274%27])%20%3E%20%600%60]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["P29274"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'
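The percent-encoded URL above is hard to read and easy to mistype. As a minimal sketch (standard library only, parameter values copied from the query above, with a single space after `!action_type`, which should be equivalent), the query string can be built from the readable JMESPath expression:

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Readable form of the jmespath parameter used in the problematic query.
JMESPATH = (
    "drugcentral.bioactivity|"
    "[?!action_type && length(uniprot[?uniprot_id=='P29274']) > `0`]"
)

params = {
    "size": 1000,
    "fields": "drugcentral.bioactivity,drugcentral.xrefs.umlscui,drugcentral.synonyms",
    "jmespath_exclude_empty": "true",
    "always_list": "drugcentral.bioactivity",
    "jmespath": JMESPATH,
}

# quote_via=quote encodes spaces as %20 instead of "+".
url = "https://mychem.info/v1/query?" + urlencode(params, quote_via=quote)

# Decoding the URL again recovers the original expression unchanged.
decoded = parse_qs(urlsplit(url).query)["jmespath"][0]
print(url)
print(decoded)
```

The POST body (`q` and `scopes`) is sent separately as JSON, exactly as in the curl command.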

Johnathan confirmed that this error could be reproduced locally, and a more specific error message was "KeyError: 'bioactivity'"

[ERROR tornado.application:1875] Uncaught exception POST /v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral.bioactivity%7C[%3F!action_type%20%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%27P29274%27])%20%3E%20%600%60] (127.0.0.1)
    HTTPServerRequest(protocol='http', host='localhost:8000', method='POST', uri='/v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral.bioactivity%7C[%3F!action_type%20%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%27P29274%27])%20%3E%20%600%60]', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/tornado/web.py", line 1790, in _execute
        result = await result
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/handlers/query.py", line 204, in _method
        return await coro(*args, **kwargs)
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/handlers/query.py", line 264, in post
        result = await ensure_awaitable(self.pipeline.search(**self.args))
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/handlers/query.py", line 197, in ensure_awaitable
        return await obj
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/pipeline.py", line 103, in _
        return await func(*args, **kwargs)
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/pipeline.py", line 176, in search
        result = self.formatter.transform(response, **options)
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/formatter.py", line 200, in transform
        responses = [self.transform(res, **options) for res in response]
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/formatter.py", line 200, in <listcomp>
        responses = [self.transform(res, **options) for res in response]
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/formatter.py", line 253, in transform
        self._transform_hit(hit, options)
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/formatter.py", line 303, in _transform_hit
        self.trasform_jmespath(path, obj, doc, options)
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/formatter.py", line 469, in trasform_jmespath
        idx_to_remove = [i for i, _obj in enumerate(obj) if not _obj[target_field]]
      File "/home/jschaff/workspace/biothings/lib/python3.10/site-packages/biothings/web/query/formatter.py", line 469, in <listcomp>
        idx_to_remove = [i for i, _obj in enumerate(obj) if not _obj[target_field]]
    KeyError: 'bioactivity'

Based on our digging so far, I think this error occurs when the query meets all of these criteria:

  • retrieves a document with a specific structure (we think the problematic query gets stuck on this retrieved doc)
    1. drugcentral's value is an array of objects, rather than an object. Johnathan found <50 more documents with this structure.
    2. Some of those drugcentral objects have the bioactivity field and others don't.
  • the query has the parameter jmespath_exclude_empty=true. If you take it out, the query returns without an error, but it keeps the hits that didn't pass the criteria specified in jmespath (bioactivity is [] or null after jmespath processing). For BTE/retriever use, it's important that those non-matching hits are removed, and that's what jmespath_exclude_empty=true was supposed to do.
    • Notice that in the problematic document OEYIOHPDSNJKLS-UHFFFAOYSA-N, each drugcentral element's bioactivity is either [] or missing. No single bioactivity object met both jmespath criteria, so we want jmespath_exclude_empty to remove this hit entirely.
    • Use P05186 from extra info E as a positive control: the problematic doc will have a bioactivity object that meets the jmespath criteria, so it should be kept in the response.
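The failing condition can be reproduced outside the API with a few lines of Python. This is only a sketch: the document below is a made-up, heavily simplified stand-in for the real hit, and the list comprehension is copied from the traceback line in formatter.py.

```python
# Hypothetical, simplified hit: "drugcentral" is a list, and only the first
# element has a "bioactivity" key (the identifiers are placeholders).
hit = {
    "_id": "EXAMPLE-DOC",
    "drugcentral": [
        {"bioactivity": [], "xrefs": {"umlscui": "C0000001"}},
        {"xrefs": {"umlscui": "C0000002"}},
    ],
}

target_field = "bioactivity"
obj = hit["drugcentral"]

try:
    # Same expression as the line shown in the traceback.
    idx_to_remove = [i for i, _obj in enumerate(obj) if not _obj[target_field]]
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('bioactivity')
```

The first element is handled fine (its bioactivity is an empty list); the lookup fails on the second element, which has no bioactivity key at all.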

I think the next steps are to:

  • review some more cases where drugcentral is an array. Is this data structured correctly (should it all be organized into 1 document)?
  • investigate how jmespath_exclude_empty=true behavior currently works and perhaps change it.
  • is there a way to remove individual drugcentral objects that don't have the bioactivity field? I don't know how to do this. filter=_exists_:drugcentral.bioactivity doesn't work (it removes entire hits only if the entire document lacks the bioactivity field)
@colleenXu

colleenXu commented Dec 25, 2024

The scope of this problem is not clear: the error will happen when...

  • querying for <=39 chemicals whose documents have that structure (drugcentral array where some objects have the bioactivity field and others don't)
  • querying for any uniprot IDs (and action_type value or absence) that end up retrieving >= 1 of those chemical documents

@colleenXu

colleenXu commented Dec 25, 2024

Extra info from our Slack discussion:

A: Trying to set all drugcentral values to an array of objects

Dylan recommended turning all drugcentral values into arrays by setting jmespath to drugcentral[].bioactivity[] | [? !action_type && length(uniprot[?uniprot_id=='P29274']) > `0`]. This doesn't seem to work: the query does return without error, but the jmespath logic doesn't seem to have been applied (there are bioactivity objects that have the action_type field or a uniprot_id other than P29274).

drugcentral[].bioactivity[] query

curl --location --globoff \
  'https://mychem.info/v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral%5B%5D.bioactivity%5B%5D%7C%5B%3F%20%21action_type%20%20%26%26%20length(uniprot%5B%3Funiprot_id%3D%3D%27P29274%27%5D)%20%3E%20%600%60%5D' \
  --header 'Content-Type: application/json' \
  --data '{
    "q": ["P29274"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'

I then tried doing always_list:drugcentral,drugcentral.bioactivity instead. It does turn all drugcentral values into arrays in the response, but the 500 error still happens.

B: this happens when querying with a chem ID too.

Compare with and without jmespath_exclude_empty.

curl --location --globoff 'https://mychem.info/v1/query?fields=drugcentral.bioactivity%2Cdrugcentral.xrefs%2Cdrugcentral.synonyms&size=1000&with_total=true&jmespath=drugcentral.bioactivity%7C[%3F!action_type]&always_list=drugcentral.bioactivity' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["C0055578"],
    "scopes": "drugcentral.xrefs.umlscui"
}'

C: If the document's `drugcentral` field is an object that lacks the `bioactivity` field, no error happens

Retrieves this document.
dummy example that lacks filter=_exists_:drugcentral.bioactivity

Example query:

curl --location --globoff 'https://mychem.info/v1/query?fields=drugcentral.bioactivity%2Cdrugcentral.xrefs%2Cdrugcentral.synonyms&size=1000&with_total=true&jmespath=drugcentral.bioactivity%7C[%3F!action_type]&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["C0071066"],
    "scopes": "drugcentral.xrefs.umlscui"
}'

D: if I change up the query (set action_type with jmespath/filter) but still retrieve the problematic document, the error will still happen

This query doesn't retrieve the problematic document. I set action_type value to AGONIST, and none of the problematic document's bioactivity objects have that value (so it doesn't pass the filter).

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&filter=drugcentral.bioactivity.action_type%3A(%22AGONIST%22)&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral.bioactivity%7C[%3Faction_type%3D%3D%60AGONIST%60%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%60P29274%60])%20%3E%20%600%60]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["P29274"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'

This query DOES retrieve the problematic document and errors out. I set action_type value to ANTAGONIST, and some of the problematic document's bioactivity objects have that value (so it passes the filter).

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&filter=drugcentral.bioactivity.action_type%3A(%22ANTAGONIST%22)&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral.bioactivity%7C[%3Faction_type%3D%3D%60ANTAGONIST%60%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%60P29274%60])%20%3E%20%600%60]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["P29274"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'

E: if I change up the uniprot_id but still retrieve the problematic document, the error will still happen

Uses P05186; its bioactivity object for the problematic document actually lacks the action_type field (first drugcentral element, bioactivity object idx 4). So the query should return the problematic document as a hit if the error didn't happen.

curl --location --globoff 'https://mychem.info/v1/query?fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&size=1000&always_list=drugcentral.bioactivity%2Cdrugcentral&jmespath_exclude_empty=true&jmespath=drugcentral.bioactivity%7C[%3F!action_type%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%27P05186%27])%20%3E%20%600%60]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["P05186"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'

@rjawesome

rjawesome commented Dec 27, 2024

@colleenXu

try this jmespath: drugcentral | to_array(@)[] | merge(@, {"bioactivity":bioactivity|[? !action_type && length(uniprot[?uniprot_id=='P29274']) > `0`]}) | [? !!bioactivity]

(The target field here is "drugcentral". The query converts it to an array, then edits the bioactivity field to remove noncompliant entries, and finally removes individual elements of drugcentral that do not have any bioactivity entries.) This works for me combined with the jmespath_exclude_empty option.

note: drugcentral[] doesn't convert to an array; it just flattens the array if it exists and otherwise returns null
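To make the difference concrete, here is a plain-Python sketch of the two behaviors as the JMESPath spec describes them. The helper names `to_array` and `flatten` are just illustrative stand-ins (no jmespath library involved), and `flatten` is simplified: it ignores the null-dropping that a real projection does.

```python
def to_array(value):
    """Mimic JMESPath to_array(): wrap anything that is not already a list."""
    return value if isinstance(value, list) else [value]


def flatten(value):
    """Mimic the JMESPath flatten operator `[]` (simplified).

    Non-arrays give None (null); arrays get one level of nesting merged.
    """
    if not isinstance(value, list):
        return None
    flat = []
    for item in value:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


# Hypothetical drugcentral values: one plain object, one list of objects.
as_object = {"bioactivity": [{"action_type": "ANTAGONIST"}]}
as_list = [{"bioactivity": []}, {"xrefs": {}}]

print(to_array(as_object))  # wrapped in a one-element list
print(flatten(as_object))   # None: the object is not an array
print(to_array(as_list) == flatten(as_list))  # True: lists pass through
```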

@colleenXu

colleenXu commented Dec 28, 2024

@rjawesome

Interesting. Did you use certain websites/resources to figure this out? If so, it'd really help me if you linked them (I know very little syntax >.< but I'm listing what I know for future documentation-writing).

Even if there's a jmespath-built-in solution for this, I think it's still helpful to do the "next steps" I put in the opening issue (check the API documents, understand BioThings APIs' behavior for jmespath and jmespath_exclude_empty...)

@rjawesome

@colleenXu
I used the JMESPath spec (https://jmespath.org/specification.html) with a bit of help from ChatGPT and also read some of the BioThings API code.

The main useful aspects from the spec: @ refers to the current object, merge can combine two objects, to_array is the correct way to convert anything to an array if it isn't already.

I'll try to send a summary of my investigations into jmespath/exclude_empty soon.

@rjawesome

my understanding of jmespath behavior based on reading the code (someone else can correct me if I am wrong, code)

jmespath format: path.to.parent.field.target_field|jmespathquery
jmespath query is run on the target_field, in this example the bioactivity field inside drugcentral
NOTE: target field & parent field are not in jmespath syntax; they are just a dot-separated path (i.e. drugcentral[].bioactivity[] as part of the parent or target fields literally looks for the fields drugcentral[] and bioactivity[] rather than doing array conversion/flattening)
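As a rough illustration of that format (my reading of the description above, not the actual BioThings code; `split_jmespath_param` is a hypothetical helper), the parameter can be split like this:

```python
def split_jmespath_param(value):
    """Split 'parent.path.target|query' into (parent_path, target_field, query).

    Hypothetical helper following the description above; it is not the
    BioThings implementation.
    """
    # Everything before the first "|" is the plain dot-separated field path.
    path, sep, query = value.partition("|")
    if not sep:
        raise ValueError("expected '<field path>|<jmespath query>'")
    # The last path segment is the target field, the rest is the parent path.
    parent_path, _, target_field = path.rpartition(".")
    return parent_path, target_field, query


print(split_jmespath_param("drugcentral.bioactivity|[?!action_type]"))
print(split_jmespath_param("drugcentral[].bioactivity[]|[?!action_type]"))
print(split_jmespath_param("drugcentral|to_array(@)|[?!!bioactivity]"))
```

Under this reading, `drugcentral[]` and `bioactivity[]` in the second call come back as literal field names, and in the third call the target field is `drugcentral` with an empty parent path.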

jmespath exclude behavior (code):

  1. if the parent_field (drugcentral) is an array, then it will filter that array to remove any elements with empty target_field (bioactivity), and only remove the document from the result if the parent_field (drugcentral) is empty after this filtering
  2. if the parent_field (drugcentral) is not an array, then the document will be excluded from the result if the target_field (bioactivity) is empty/null

my assumption of what the patch is (I don't have a local setup):

  • the code _obj[target_field] mentioned in the error (line), should probably be changed to _obj.get(target_field)
  • this would not throw an error if target_field doesn't exist, rather it would return None (a falsy value in python);
  • this fix would exclude any values from the result where the target_field (bioactivity) is missing (if jmespath_exclude_empty is enabled)
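A sketch of what that could look like, combining the two exclude rules above with the `.get()` change. `exclude_empty` is a hypothetical standalone function written for illustration (the real logic lives inside the formatter), and dropping documents whose parent field is missing entirely is my own assumption.

```python
def exclude_empty(doc, parent_field, target_field):
    """Return True if the document should be dropped from the results."""
    parent = doc.get(parent_field)
    if isinstance(parent, list):
        # .get() returns None for a missing key instead of raising KeyError,
        # so elements without the target field are treated as empty.
        kept = [item for item in parent if item.get(target_field)]
        doc[parent_field] = kept
        return not kept
    if isinstance(parent, dict):
        return not parent.get(target_field)
    # Assumption: a document without the parent field has nothing to keep.
    return True


# Hypothetical documents covering the cases discussed in this issue.
mixed = {"drugcentral": [{"bioactivity": [{"uniprot": []}]}, {"xrefs": {}}]}
all_empty = {"drugcentral": [{"bioactivity": []}, {"xrefs": {}}]}
plain_object = {"drugcentral": {"xrefs": {}}}

print(exclude_empty(mixed, "drugcentral", "bioactivity"))         # False
print(exclude_empty(all_empty, "drugcentral", "bioactivity"))     # True
print(exclude_empty(plain_object, "drugcentral", "bioactivity"))  # True
```

In the `mixed` case the document is kept, but the drugcentral element without bioactivity is removed from it.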

@colleenXu

colleenXu commented Jan 1, 2025

Update

I reviewed a list of documents where drugcentral's value may be an array (from @ctrl-schaff, lab Slack).

Here are my conclusions:

I also found some documents that would be good for testing.

Documents where some drugcentral items have bioactivity, some don't

All drugcentral items have bioactivity (but the contents are slightly different), so we can see how BTE behaves with them

@colleenXu

@rjawesome and co...

I tested the jmespath string and it didn't seem to work. I used the query from the opening post and made a small change to fields (=drugcentral in case the full object was needed)...and then adjusted the jmespath value.

When I directly copied Rohan's jmespath string, it looked like jmespath didn't run on the response at all. The response had bioactivity objects with action_type fields and with uniprot IDs that weren't P29274.

Direct-copy query

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=drugcentral&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral%20%7C%20to_array(%40)[]%20%7C%20merge(%40%2C%20{%22bioactivity%22%3Abioactivity%7C[%3F%20!action_type%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%27P29274%27])%20%3E%20%600%60]})%20%7C%20[%3F%20!!bioactivity]' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["P29274"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'

I then tried removing spaces that may be interfering with execution (just there for readability). If the space between drugcentral and | is removed, a status 400 JMESPathTypeError occurs.

query, plus saved response with error message

curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=drugcentral&jmespath_exclude_empty=true&always_list=drugcentral.bioactivity&jmespath=drugcentral%7Cto_array(%40)[]%7Cmerge(%40%2C{%22bioactivity%22%3Abioactivity%7C[%3F!action_type%20%26%26%20length(uniprot[%3Funiprot_id%3D%3D%27P29274%27])%3E%600%60]})%7C[%3F!!bioactivity]%20%20' \
--header 'Content-Type: application/json' \
--data '{
    "q": ["P29274"],
    "scopes": "drugcentral.bioactivity.uniprot.uniprot_id"
}'

jmespath_error_response.json

@colleenXu

@DylanWelzel @ctrl-schaff

I saw the linked PR, and I'm wondering what the next steps are.

If the next steps involve running queries on a dev server that uses that PR, maybe it'd be helpful to do that in a meeting together (so I can give feedback on responses/we can adjust queries immediately)?

@rjawesome

rjawesome commented Jan 1, 2025

whoops, I sent the wrong query. This was the one that was working for me (the merge function is mapped on all the drugcentral elements)
drugcentral|to_array(@)|map(&merge(@, {"bioactivity": bioactivity|[? !action_type && length(uniprot[?uniprot_id=='P29274']) > `0`]}), @)|[? !!bioactivity]

(command: curl --location --globoff 'https://mychem.info/v1/query?size=1000&fields=drugcentral.bioactivity%2Cdrugcentral.xrefs.umlscui%2Cdrugcentral.synonyms&jmespath=drugcentral%7Cto_array%28%40%29%7Cmap%28%26merge%28%40%2C%20%7B%22bioactivity%22%3A%20bioactivity%7C%5B%3F%20%21action_type%20%26%26%20length%28uniprot%5B%3Funiprot_id%3D%3D%27P29274%27%5D%29%20%3E%20%600%60%5D%7D%29%2C%20%40%29%7C%5B%3F%20%21%21bioactivity%5D&jmespath_exclude_empty=true' --header 'Content-Type: application/json' --data '{ "q": ["P29274"], "scopes": "drugcentral.bioactivity.uniprot.uniprot_id" }')
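For anyone who finds the expression hard to parse, here is a plain-Python approximation of what each stage does. It is only a sketch: the sample document is invented, `filter_drugcentral` and `keep_entry` are hypothetical names, Python truthiness is only roughly the same as JMESPath truthiness, and it assumes uniprot is always a list.

```python
def keep_entry(entry, uniprot_id):
    """[? !action_type && length(uniprot[?uniprot_id=='...']) > `0`]"""
    matches = [u for u in entry.get("uniprot", []) if u.get("uniprot_id") == uniprot_id]
    return not entry.get("action_type") and len(matches) > 0


def filter_drugcentral(doc, uniprot_id):
    drugcentral = doc["drugcentral"]
    # to_array(@): wrap a single object in a list.
    elements = drugcentral if isinstance(drugcentral, list) else [drugcentral]
    result = []
    for element in elements:
        # map(&merge(@, {"bioactivity": <filtered list>}), @)
        bioactivity = element.get("bioactivity")
        if isinstance(bioactivity, list):
            filtered = [b for b in bioactivity if keep_entry(b, uniprot_id)]
        else:
            filtered = None  # a missing field becomes null
        merged = {**element, "bioactivity": filtered}
        # [? !!bioactivity]: drop elements with null or empty bioactivity.
        if merged["bioactivity"]:
            result.append(merged)
    return result


# Invented document: one element with mixed entries, one without bioactivity.
doc = {
    "drugcentral": [
        {
            "bioactivity": [
                {"action_type": "ANTAGONIST", "uniprot": [{"uniprot_id": "P29274"}]},
                {"uniprot": [{"uniprot_id": "P29274"}]},
                {"uniprot": [{"uniprot_id": "P05186"}]},
            ]
        },
        {"xrefs": {"umlscui": "C0000002"}},
    ]
}

result = filter_drugcentral(doc, "P29274")
print(result)
```

Only the first drugcentral element survives, holding the single bioactivity entry that has no action_type and matches P29274.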

@colleenXu

colleenXu commented Jan 3, 2025

Update!

According to @DylanWelzel, MyChem has been updated to include #372. The other APIs haven't been updated yet.

I tested the following documents/queries and all worked as-expected (no 500 errors, response data is correct)! So I think the PR fully addresses this problem.

(I also tested x-bte annotation/integration with BTE and things looked mostly good there. Noticed one issue which I think is more biothings/mychem.info#191, so I'll describe it there.)
