stringId values are not checked when a DwC export is constructed, resulting in server halting queries #5321

Open
grantfitzsimmons opened this issue Oct 9, 2024 · 0 comments
Labels
1 - Bug: Incorrect behavior of the product
2 - Exporting Data: Issues that are related to exporting data to DwC, GBIF, IPT, Web Portal, etc.

Background

SAIAB created a DwCA definition and metadata pair for their National Fish Collection.

This is their mapping:
original_FC_export.xml.zip

Running this export (or requesting it through the RSS feed) caused the container to run out of memory and storage space and eventually error out. This was surprising, because the finished export should be less than 14.2 MB compressed and under 150 MB uncompressed, but I was able to reproduce it. SAIAB had even increased the RAM available to the server to 32 GB.

After examining the SQL queries being run, I found that the definition SAIAB provided had an invalid stringId defined:

<field stringId="7.accession.text4" oper="12" value="" isNot="false" isRelFld="false"/>

This should be

<field stringId="1,7.accession.text4" oper="12" value="" isNot="false" isRelFld="false"/>

Without the leading "1," in the stringId, Specify constructs the query incorrectly: it attempts to pull data from Accession without using the join from Collection Object.

When SAIAB reported this, their export mapping was producing an export with 61,336,980 rows. Each row was duplicated approximately once per Accession record in the database, so the export grew far beyond what should have been included. We could have kept feeding it more and more memory, but those two missing characters alone were enough to break the export.
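To illustrate why a missing join multiplies the row count, here is a minimal sketch using an in-memory SQLite database. The table and column names are simplified stand-ins, not the actual Specify schema, and the SQL is not the query Specify generates; it only shows the general mechanism of selecting from two tables with and without a join condition.

```python
import sqlite3

# Hypothetical, simplified tables; not the real Specify schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accession (accessionid INTEGER PRIMARY KEY, text4 TEXT);
    CREATE TABLE collectionobject (
        collectionobjectid INTEGER PRIMARY KEY,
        accessionid INTEGER REFERENCES accession(accessionid)
    );
""")
conn.executemany(
    "INSERT INTO accession VALUES (?, ?)",
    [(i, f"acc-{i}") for i in range(1, 5)],       # 4 accessions
)
conn.executemany(
    "INSERT INTO collectionobject VALUES (?, ?)",
    [(i, (i % 4) + 1) for i in range(1, 4)],      # 3 collection objects
)

# With the join condition: one row per collection object.
joined = conn.execute(
    "SELECT COUNT(*) FROM collectionobject co "
    "JOIN accession a ON a.accessionid = co.accessionid"
).fetchone()[0]

# Without the join condition: every collection object is paired
# with every accession (a Cartesian product).
unjoined = conn.execute(
    "SELECT COUNT(*) FROM collectionobject co, accession a"
).fetchone()[0]

print(joined, unjoined)
```

With 3 collection objects and 4 accessions, the joined query returns 3 rows and the unjoined query returns 3 × 4 = 12. The same multiplication at the scale of a real collection is consistent with the tens of millions of rows reported above.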

To Reproduce
Steps to reproduce the behavior:

  1. Create a new XML resource to capture a DwCA mapping
  2. Place the following into the mapping resource:
    <?xml version="1.0" encoding="utf-8"?>
    <archive>
      <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
        <queries>
          <query name="occurrence.csv" contextTableId="1">
             <field stringId="7.accession.text4" oper="12" value="" isNot="false" isRelFld="false"/>
          </query>
        </queries>
      </core>
    </archive>
  3. Create a new XML resource to capture the DwCA metadata
  4. Place the following into the metadata resource:
    <?xml version="1.0"?>
    <eml:eml
        packageId="doi:10.xxxx/eml.1.1" system="https://doi.org"
        xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
        xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 xsd/eml.xsd">
        
        <dataset>
            <title>Primary production of algal species from Southeast Alaska, 1990-2002</title>
            <creator id="https://orcid.org/0000-0003-0077-4738">
                <individualName>
                    <givenName>Matthew</givenName>
                    <givenName>B.</givenName>
                    <surName>Jones</surName>
                </individualName>
                <electronicMailAddress>[email protected]</electronicMailAddress>
                <userId directory="https://orcid.org">https://orcid.org/0000-0003-0077-4738</userId>
            </creator>
            <keywordSet>
                <keyword>biomass</keyword>
                <keyword>productivity</keyword>
            </keywordSet>
            <contact>
                <references>https://orcid.org/0000-0003-0077-4738</references>
            </contact>
        </dataset>
    </eml:eml>
  5. Click on your username to access the User Tools and click on Create DwC Archive
  6. Select first the DwCA mapping resource and then the metadata resource
  7. Watch as your Docker/server resources are absorbed until nothing is left. This can be observed in real time by running docker stats:
CONTAINER ID   NAME                         CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
04ba654b7153   specify7-specify7-1          63.41%    437.7MiB / 13.15GiB   3.25%     50.4GB / 129MB    32.8GB / 48GB     5

This will eat all available storage space and RAM as well as hog the CPU.

Expected behavior
It should say "Hey, that stringId is invalid!"

Ultimately, once validation is added, #285 should be implemented.
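As a rough idea of what such a check could look like, here is a sketch that scans a mapping and flags any stringId whose table path does not start at the query's contextTableId. It is not Specify's implementation: the function names are hypothetical, and the stringId format (a comma-separated table path, then a dot, then the table and field names, with an optional "-relationship" suffix on path elements) is an assumption inferred from the two examples in this issue.

```python
import xml.etree.ElementTree as ET


def first_table_id(string_id):
    """Return the first table id in a stringId's path, or None if absent.

    Assumed format: "<id>[,<id>...].<table>.<field>", where a path element
    may carry a "-relationship" suffix (an assumption, not confirmed here).
    """
    path = string_id.split(".", 1)[0]
    head = path.split(",")[0].split("-")[0]
    return int(head) if head.isdigit() else None


def find_invalid_string_ids(mapping_xml):
    """List a message for each field whose path does not start at the context table."""
    root = ET.fromstring(mapping_xml)
    problems = []
    for query in root.iter("query"):
        context = int(query.get("contextTableId"))
        for field in query.iter("field"):
            string_id = field.get("stringId", "")
            if first_table_id(string_id) != context:
                problems.append(
                    f"{query.get('name')}: stringId {string_id!r} "
                    f"does not start at table {context}"
                )
    return problems


# The mapping from the reproduction steps (XML declaration omitted).
MAPPING = """
<archive>
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <queries>
      <query name="occurrence.csv" contextTableId="1">
        <field stringId="7.accession.text4" oper="12" value="" isNot="false" isRelFld="false"/>
      </query>
    </queries>
  </core>
</archive>
"""

problems = find_invalid_string_ids(MAPPING)
for message in problems:
    print(message)
```

Run against the mapping from the reproduction steps, this reports the one bad field. A complete check would also need to confirm that each hop in the path is a real relationship in the data model, which this sketch does not attempt.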

Crash Report
In some cases the container simply exits when it runs out of space. When the export fails but the container stays alive, the log shows:

OSError: [Errno 28] No space left on device

calhost/specify/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36" "-"
specify7-1         | [08/Oct/2024 19:02:19] "GET /notifications/messages/?since=2024-10-8+14%3A2%3A18 HTTP/1.0" 200 2
specify7-1         | [08/Oct/2024 19:39:39] [ERROR] [specifyweb.export.views:136] make_dwca failed: Traceback (most recent call last):
specify7-1         |   File "/opt/specify7/specifyweb/export/views.py", line 133, in do_export
specify7-1         |     make_dwca(collection, user, definition, path, eml=eml)
specify7-1         |   File "/opt/specify7/specifyweb/export/dwca.py", line 214, in make_dwca
specify7-1         |     query_to_csv(session, collection, user, query.tableid, query.get_field_specs(), path,
specify7-1         |   File "/opt/specify7/specifyweb/stored_queries/execution.py", line 219, in query_to_csv
specify7-1         |     csv_writer.writerow(encoded)
specify7-1         | OSError: [Errno 28] No space left on device

Reported By
Willem and Wesley at SAIAB

Additional context
queries_and_logs.zip

grantfitzsimmons added the "1 - Bug" and "2 - Exporting Data" labels on Oct 9, 2024
grantfitzsimmons changed the title from "stringId values are not checked when a DwC export is constructed" to "stringId values are not checked when a DwC export is constructed, resulting in server halting queries" on Oct 9, 2024
grantfitzsimmons added this to the "Grant's issue list" milestone on Oct 10, 2024