Articles id parsing issue #22

nleguillarme · 2019-07-31T09:31:15Z

While iterating over the articles returned by a PubMed query, I noticed that some article IDs are parsed incorrectly.

For instance:
Query: ((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))

Returns (when printing the first 10 results):
pubmed_id = '22822430\n18959310\n21310968\n21295371\n20439737'
abstract = ('Bald eagles (Haliaeetus leucocephalus) are recovering from severe population declines...

M0rtenB · 2019-08-04T19:10:49Z

I too ran into this. The article titled "Premorbid IQ varies across different definitions of schizophrenia" returns .pubmed_id '17342225\n10435610\n1638332\n15474902\n14302768\n9403903\n16297601\n5009428\n6382590\n12597613\n3292568\n16221995\n10986554\n16946869\n1182406\n12414070\n16330717\n15066893\n16484093\n1931805\n10678506\n9223148\n16639153\n4752222\n10442433\n12379446'

mbullmanFHCRC · 2019-08-12T17:04:00Z

This is due to how getContent parses the XML. Looking at @M0rtenB's example in XML, the authors of "Premorbid IQ ..." seem to have included the PubMed IDs of all their citations.
<ReferenceList>
  <Reference>
    <Citation>Arch Gen Psychiatry. 1999 Aug;56(8):749-54</Citation>
    <ArticleIdList>
      <ArticleId IdType="pubmed">10435610</ArticleId>
    </ArticleIdList>
  </Reference>
  <Reference>
    <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
    <ArticleIdList>
      <ArticleId IdType="pubmed">1638332</ArticleId>
    </ArticleIdList>
  </Reference>
  <Reference>
    <Citation>Schizophr Res. 2004 Dec 1;71(2-3):323-30</Citation>
    <ArticleIdList>
      <ArticleId IdType="pubmed">15474902</ArticleId>
    </ArticleIdList>

Most articles only have a small ArticleId snippet (without an article ID for every citation), which looks like this:
<ArticleIdList> <ArticleId IdType="pubmed">17342225</ArticleId> <ArticleId IdType="pmc">PMC1805734</ArticleId> </ArticleIdList>

article.py uses getContent() from helpers.py to grab this from the XML. getContent uses element.findall(path) to collect the results, then joins them into a single string separated by newlines (which is what you're seeing).
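To make the mechanism concrete, here is a minimal sketch that reproduces the behaviour without pymed. `get_content_sketch` is a hypothetical stand-in for getContent (find every match, join with newlines), not the library's actual code, and the record is hand-trimmed from the snippets in this thread with the surrounding PubmedArticle/PubmedData nesting assumed.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed record: one ID list for the article itself,
# plus a ReferenceList holding the IDs of two cited papers.
SAMPLE = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
      <ArticleId IdType="pmc">PMC1805734</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <Citation>Arch Gen Psychiatry. 1999 Aug;56(8):749-54</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">10435610</ArticleId>
        </ArticleIdList>
      </Reference>
      <Reference>
        <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
        <ArticleIdList>
          <ArticleId IdType="pubmed">1638332</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""


def get_content_sketch(element, path, separator="\n"):
    """Hypothetical stand-in for getContent: find all matches, join their text."""
    matches = element.findall(path)
    return separator.join(m.text for m in matches if m.text is not None)


root = ET.fromstring(SAMPLE.strip())

# ".//" searches the whole subtree, so the cited papers' IDs match too.
pubmed_id = get_content_sketch(root, ".//ArticleId[@IdType='pubmed']")
print(repr(pubmed_id))  # '17342225\n10435610\n1638332'
```

The article's own ID comes first only because its ArticleIdList precedes the ReferenceList in document order.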

We could probably change _extractPubMedID to use
path = ".//PMID" instead of path = ".//ArticleId[@IdType='pubmed']", and I think that would work. I'm not sure whether there are other gotchas in that solution, though.
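A rough comparison of the two paths, as a sketch only. The record below is hand-written, and placing the PMID element under MedlineCitation is an assumption about the record layout that is not shown in this thread. The example does not settle whether other PMID elements can appear elsewhere in a real record, which is exactly the kind of gotcha to check first.

```python
import xml.etree.ElementTree as ET

# Hand-written record; the MedlineCitation/PMID placement is an assumption.
RECORD = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>17342225</PMID>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <ArticleIdList>
          <ArticleId IdType="pubmed">10435610</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""

root = ET.fromstring(RECORD.strip())

# Current path: also picks up the ID of the cited paper.
old_ids = [e.text for e in root.findall(".//ArticleId[@IdType='pubmed']")]

# Proposed path: only PMID elements are matched.
new_ids = [e.text for e in root.findall(".//PMID")]

print(old_ids)  # ['17342225', '10435610']
print(new_ids)  # ['17342225']
```

Because `.//PMID` is still a subtree-wide search, using `root.find(".//PMID")` to take only the first match would be a more defensive variant.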

mbullmanFHCRC · 2019-08-12T17:07:18Z

@nleguillarme, your example also picks up the article IDs of citations.

iacopy · 2020-03-14T16:34:04Z

I too ran into this.

This fix avoids also returning the IDs of cited papers (they are within the ReferenceList element of the XML). Fixes gijswobben#22. An alternative XPath to use: path = ".//PubmedData/ArticleIdList/ArticleId[@IdType='pubmed']"
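A small sketch of how the anchored XPath behaves with Python's xml.etree.ElementTree, assuming the element being searched is the PubmedArticle element (the record is hand-written for illustration). Note that the attribute is spelled `IdType` in the XML and attribute names are case-sensitive, so a lowercase `idtype` predicate matches nothing.

```python
import xml.etree.ElementTree as ET

# Hand-written record for illustration only.
RECORD = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
      <ArticleId IdType="pmc">PMC1805734</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <ArticleIdList>
          <ArticleId IdType="pubmed">10435610</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""

root = ET.fromstring(RECORD.strip())

# Only an ArticleIdList that is a direct child of PubmedData is considered,
# so the lists nested under ReferenceList/Reference are skipped.
anchored = ".//PubmedData/ArticleIdList/ArticleId[@IdType='pubmed']"
own_ids = [e.text for e in root.findall(anchored)]
print(own_ids)  # ['17342225']

# Attribute names are case-sensitive: this predicate finds nothing.
wrong_case = root.findall(".//ArticleId[@idtype='pubmed']")
print(wrong_case)  # []
```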

iacopy · 2020-03-22T21:30:25Z

@gijswobben @nleguillarme I made a pull request for this issue. It basically follows @mbullmanFHCRC's suggestions.

Multiple PMIDs are getting parsed. The other IDs are likely the PMIDs of articles cited by the article under consideration. Resolves gijswobben#22

iacopy linked a pull request Mar 22, 2020 that will close this issue

Bugfix/article ID parsing issue #30

Open

7 tasks

This was referenced May 7, 2020

BugFix/pmid-parsing #31

Open

PMID parsing issue ritvikvipra/pymed#1

Merged

iacopy mentioned this issue Feb 25, 2022

Article ID parsing issue. iacopy/pymed#1

Closed
