This repository has been archived by the owner on Jun 26, 2020. It is now read-only.

Articles id parsing issue #22

Open
nleguillarme opened this issue Jul 31, 2019 · 5 comments · Fixed by ritvikvipra/pymed#1 · May be fixed by #30

Comments

@nleguillarme

While iterating over the articles returned by a PubMed query, I noticed that some article IDs are parsed incorrectly.

For instance:
Query: ((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))

Returns (when printing the first 10 results):
pubmed_id = '22822430\n18959310\n21310968\n21295371\n20439737'
abstract = ('Bald eagles (Haliaeetus leucocephalus) are recovering from severe population declines...

@M0rtenB

M0rtenB commented Aug 4, 2019

I too ran into this. The article titled "Premorbid IQ varies across different definitions of schizophrenia" returns .pubmed_id '17342225\n10435610\n1638332\n15474902\n14302768\n9403903\n16297601\n5009428\n6382590\n12597613\n3292568\n16221995\n10986554\n16946869\n1182406\n12414070\n16330717\n15066893\n16484093\n1931805\n10678506\n9223148\n16639153\n4752222\n10442433\n12379446'

@mbullmanFHCRC

This is due to how getContent parses the XML. Looking at @M0rtenB's example in XML, the authors of "Premorbid IQ ..." seem to have included the PubMed IDs of all their citations:
<ReferenceList>
  <Reference>
    <Citation>Arch Gen Psychiatry. 1999 Aug;56(8):749-54</Citation>
    <ArticleIdList>
      <ArticleId IdType="pubmed">10435610</ArticleId>
    </ArticleIdList>
  </Reference>
  <Reference>
    <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
    <ArticleIdList>
      <ArticleId IdType="pubmed">1638332</ArticleId>
    </ArticleIdList>
  </Reference>
  <Reference>
    <Citation>Schizophr Res. 2004 Dec 1;71(2-3):323-30</Citation>
    <ArticleIdList>
      <ArticleId IdType="pubmed">15474902</ArticleId>
    </ArticleIdList>

Most articles will have only a small ArticleIdList snippet (without an ID for every cited article), which looks like this:
<ArticleIdList>
  <ArticleId IdType="pubmed">17342225</ArticleId>
  <ArticleId IdType="pmc">PMC1805734</ArticleId>
</ArticleIdList>

article.py uses getContent() from helpers.py to grab this from the XML. getContent calls element.findall(path) to collect the matches, then joins them into a single string separated by newlines (which is what you're seeing).
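
A minimal sketch of that behaviour, using only the standard library. The sample record is hand-trimmed from the snippets quoted above, its PubmedData wrapper is an assumption about the surrounding layout, and get_content is a hypothetical stand-in for the helper, not pymed's actual code:

```python
import xml.etree.ElementTree as ET

# Hand-trimmed record built from the snippets quoted in this thread.
# The enclosing <PubmedData> element is an assumption about the layout.
SAMPLE = """
<PubmedData>
  <ArticleIdList>
    <ArticleId IdType="pubmed">17342225</ArticleId>
    <ArticleId IdType="pmc">PMC1805734</ArticleId>
  </ArticleIdList>
  <ReferenceList>
    <Reference>
      <Citation>Arch Gen Psychiatry. 1999 Aug;56(8):749-54</Citation>
      <ArticleIdList>
        <ArticleId IdType="pubmed">10435610</ArticleId>
      </ArticleIdList>
    </Reference>
    <Reference>
      <Citation>Br J Psychiatry. 1992 Jul;161:69-74</Citation>
      <ArticleIdList>
        <ArticleId IdType="pubmed">1638332</ArticleId>
      </ArticleIdList>
    </Reference>
  </ReferenceList>
</PubmedData>
"""


def get_content(element, path, separator="\n"):
    """Hypothetical stand-in: join the text of every element matching path."""
    matches = element.findall(path)
    return separator.join(m.text for m in matches if m.text is not None)


root = ET.fromstring(SAMPLE)

# The path matches the article's own ID and every cited ID, in document order.
joined = get_content(root, ".//ArticleId[@IdType='pubmed']")
print(repr(joined))  # '17342225\n10435610\n1638332'
```

The first ID in the joined string is the article's own; everything after it comes from the ReferenceList.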

We could probably change _extractPubMedID to use
path = ".//PMID" instead of path = ".//ArticleId[@IdType='pubmed']", and I think that would work. I'm not sure whether there are other gotchas in that solution, though.
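
A rough comparison of the two paths, as a sketch only. It assumes the record carries its own ID in a PMID element under MedlineCitation, which is not shown in the snippets in this thread, and it reads only the first match rather than joining all of them:

```python
import xml.etree.ElementTree as ET

# Assumed layout: the article's own ID sits in <MedlineCitation><PMID>.
# That element is not quoted in this thread; it is an assumption here.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID Version="1">17342225</PMID>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <ArticleIdList>
          <ArticleId IdType="pubmed">10435610</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""

root = ET.fromstring(SAMPLE)

# Current path: also picks up the ID of the cited paper.
old_matches = [e.text for e in root.findall(".//ArticleId[@IdType='pubmed']")]

# Proposed path: findtext returns the text of the first match only.
new_pmid = root.findtext(".//PMID")

print(old_matches)  # ['17342225', '10435610']
print(new_pmid)     # 17342225
```

If a record happened to contain further PMID elements elsewhere, joining every match would run into the same problem again, so taking only the first match may be the safer choice.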

@mbullmanFHCRC

@nleguillarme your example also picks up the ArticleIds of cited papers.

@iacopy

iacopy commented Mar 14, 2020

I too ran into this.

iacopy added a commit to iacopy/pymed that referenced this issue Mar 22, 2020
This fix avoids also returning the IDs of cited papers
(they are within the ReferenceList element of the XML).

Fixes gijswobben#22

An alternative XPath to be used:
path = ".//PubmedData/ArticleIdList/ArticleId[@IdType='pubmed']"
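
A short sketch of how the scoped path behaves, assuming the article's own ArticleIdList and the ReferenceList are siblings under PubmedData (the sample is hand-written for illustration). XML attribute names are case-sensitive, so the predicate has to be spelled IdType, exactly as in the XML:

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the sibling layout under <PubmedData> is an assumption.
SAMPLE = """
<PubmedArticle>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">17342225</ArticleId>
      <ArticleId IdType="pmc">PMC1805734</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <ArticleIdList>
          <ArticleId IdType="pubmed">10435610</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""

root = ET.fromstring(SAMPLE)

# Only an ArticleIdList that is a direct child of PubmedData is searched,
# so IDs nested inside ReferenceList/Reference are skipped.
scoped_path = ".//PubmedData/ArticleIdList/ArticleId[@IdType='pubmed']"
scoped = [e.text for e in root.findall(scoped_path)]
print(scoped)  # ['17342225']

# A lowercase attribute name matches nothing.
lowercase = root.findall(".//PubmedData/ArticleIdList/ArticleId[@idtype='pubmed']")
print(len(lowercase))  # 0
```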
@iacopy iacopy linked a pull request Mar 22, 2020 that will close this issue
@iacopy

iacopy commented Mar 22, 2020

@gijswobben @nleguillarme I made a pull request for this issue. Basically following @mbullmanFHCRC suggestions, actually.

ritvikvipra pushed a commit to ritvikvipra/pymed that referenced this issue May 7, 2020
Multiple PMIDs are getting parsed. The other IDs are likely the PMIDs
of articles cited by the article under consideration.
resolves gijswobben#22