A Python parser for the PubMed Open-Access (OA) subset and the MEDLINE XML repository. See the wiki page for how to download and process the datasets using this repository.
Note:
- The path provided to a function can point to a compressed or uncompressed XML file. We provide example files in the data folder.
- For the website parsers, you should scrape with pauses between requests. Please see the copyright notice, because your IP can get blocked if you try to download in bulk.
The available parsers are described below.
We created a simple parser for the PubMed Open Access Subset. Give an XML path or string to the function parse_pubmed_xml, and it returns a dictionary with the following information:
- full_title: article's title
- abstract: abstract
- journal: journal name
- pmid: PubMed ID
- pmc: PubMed Central ID
- doi: DOI of the article
- publisher_id: publisher ID
- author_list: list of authors with affiliation keys, in the following format:
  [['last_name_1', 'first_name_1', 'aff_key_1'],
   ['last_name_1', 'first_name_1', 'aff_key_2'],
   ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
- affiliation_list: list of affiliation keys and affiliation strings, in the following format:
  [['aff_key_1', 'affiliation_1'],
   ['aff_key_2', 'affiliation_2'], ...]
- publication_year: publication year
- subjects: list of subjects listed in the article, separated by semicolons. Sometimes it contains only the type of article, such as research article, review, or proceedings.
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
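Because an author with several affiliations appears once per affiliation key, it can help to group the affiliation strings under each author. The following is a minimal sketch that relies only on the documented shapes of author_list and affiliation_list; the helper name authors_with_affiliations and the sample data are hypothetical and not part of the library.

```python
def authors_with_affiliations(author_list, affiliation_list):
    """Group affiliation strings under each (last_name, first_name) pair.

    Assumes the documented shapes:
      author_list      -> [[last_name, first_name, aff_key], ...]
      affiliation_list -> [[aff_key, affiliation], ...]
    """
    aff_lookup = {key: text for key, text in affiliation_list}
    grouped = {}  # dicts keep insertion order, so authors stay in article order
    for last_name, first_name, aff_key in author_list:
        name = (last_name, first_name)
        # Keys with no matching affiliation are kept as None rather than dropped.
        grouped.setdefault(name, []).append(aff_lookup.get(aff_key))
    return grouped


# Hand-written sample in the documented format (not real parser output).
sample_authors = [
    ["Dennehy", "John J", "I1"],
    ["Dennehy", "John J", "I2"],
    ["Wang", "Ing-Nang", "I1"],
]
sample_affiliations = [
    ["I1", "Department of Biological Sciences"],
    ["I2", "Biology Department, Queens College"],
]

for name, affs in authors_with_affiliations(sample_authors, sample_affiliations).items():
    print(name, affs)
```

With real data, the two arguments would be dict_out['author_list'] and dict_out['affiliation_list'].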
The function parse_pubmed_references processes a PubMed Open Access XML file and returns a list of dictionaries, one for each reference the article cites. Each dictionary has the following keys:
- pmid: PubMed ID of the article
- pmc: PubMed Central ID of the article
- article_title: title of the cited article
- journal: journal name
- journal_type: type of journal
- pmid_cited: PubMed ID of the article that the article cites
- doi_cited: DOI of the article that the article cites
- year: publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)
dicts_out = pp.parse_pubmed_references(path) # returns a list of dictionaries
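A common follow-up is to reduce the reference dictionaries to the distinct PMIDs an article cites. The sketch below assumes that references without a known PMID carry an empty string or None in pmid_cited; that assumption, the helper name cited_pmids, and the sample data are illustrative and not confirmed by the library.

```python
def cited_pmids(references):
    """Return distinct, non-empty pmid_cited values in first-seen order."""
    seen = set()
    ordered = []
    for ref in references or []:  # treat None as "no references"
        # Assumption: a missing PMID shows up as '' or None.
        pmid = (ref.get("pmid_cited") or "").strip()
        if pmid and pmid not in seen:
            seen.add(pmid)
            ordered.append(pmid)
    return ordered


# Hand-written sample shaped like the documented output (not real data).
sample_refs = [
    {"pmid": "1", "pmid_cited": "111", "year": "2007a"},
    {"pmid": "1", "pmid_cited": "", "year": "1999"},
    {"pmid": "1", "pmid_cited": "222", "year": "2010"},
    {"pmid": "1", "pmid_cited": "111", "year": "2007a"},
]
print(cited_pmids(sample_refs))
```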
The function parse_pubmed_caption parses figure captions from a given path to an XML file. It also returns reference identifiers that you can use to refer back to the actual images. The function returns a list of dictionaries with the following keys:
- pmid: PubMed ID
- pmc: PubMed Central ID
- fig_caption: string of the caption
- fig_id: reference ID for the figure (used to refer to it in the XML article)
- fig_label: label of the figure
- graphic_ref: reference to the image file name provided by PubMed OA
dicts_out = pp.parse_pubmed_caption(path) # returns a list of dictionaries
For anyone interested in parsing the text surrounding a citation, the library also provides that functionality. You can use parse_pubmed_paragraph to parse the text and the reference PMIDs. This function returns a list of dictionaries, where each entry has the following keys:
- pmid: PubMed ID
- pmc: PubMed Central ID
- text: full text of the paragraph
- reference_ids: list of reference codes within that paragraph. These IDs can be merged with the output from parse_pubmed_references.
- section: section of the paragraph (e.g. Background, Discussion, Appendix, etc.)
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
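Merging paragraph output with reference output could look like the sketch below. It assumes that each reference dictionary carries a ref_id key matching the codes in reference_ids; that key is not listed in the reference keys above, so treat it as an assumption and adjust ref_key to whatever your parsed output contains. The helper name attach_cited_pmids and the sample data are hypothetical.

```python
def attach_cited_pmids(paragraphs, references, ref_key="ref_id"):
    """Return copies of the paragraph dicts with a 'pmids_cited' list added.

    Assumption: each reference dict has `ref_key` (default 'ref_id') holding
    the same code that appears in a paragraph's 'reference_ids'.
    """
    lookup = {
        ref[ref_key]: ref.get("pmid_cited")
        for ref in references
        if ref_key in ref
    }
    merged = []
    for para in paragraphs:
        # Skip codes with no match or with an empty PMID.
        pmids = [lookup[r] for r in para.get("reference_ids", []) if lookup.get(r)]
        merged.append({**para, "pmids_cited": pmids})
    return merged


# Hand-written samples (not real parser output).
sample_paragraphs = [
    {"pmid": "1", "text": "As shown before ...", "reference_ids": ["B1", "B2", "B9"]},
    {"pmid": "1", "text": "No citations here.", "reference_ids": []},
]
sample_references = [
    {"ref_id": "B1", "pmid_cited": "100"},
    {"ref_id": "B2", "pmid_cited": ""},
]
for row in attach_cited_pmids(sample_paragraphs, sample_references):
    print(row["reference_ids"], "->", row["pmids_cited"])
```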
You can use parse_pubmed_table to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:
- pmid: PubMed ID
- pmc: PubMed Central ID
- caption: caption of the table
- label: label of the table
- table_columns: list of column names
- table_values: list of values inside the table
- table_xml: raw XML text of the table (returned if return_xml=True)
dicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)
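To work with a parsed table row by row, the column names can be paired with each row of values. This sketch assumes table_values is a list of rows, each a list aligned with table_columns; the text above only says "list of values", so verify the shape on your own data. The helper name table_to_records and the sample table are hypothetical.

```python
def table_to_records(table):
    """Turn one parsed table dict into a list of {column: value} dicts.

    Assumption: table['table_values'] is a list of rows, and each row is a
    list in the same order as table['table_columns'].
    """
    columns = table["table_columns"]
    # zip() stops at the shorter sequence, so extra cells in a row are ignored.
    return [dict(zip(columns, row)) for row in table["table_values"]]


# Hand-written sample shaped like the documented output (not real data).
sample_table = {
    "label": "Table 1",
    "caption": "Example counts",
    "table_columns": ["gene", "count"],
    "table_values": [["abc", "3"], ["xyz", "7"]],
}
for record in table_to_records(sample_table):
    print(record)
```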
MEDLINE NLM XML has a different format from PubMed Open Access XML. The structure of the XML files is described in the MEDLINE/PubMed DTD. You can use the function parse_medline_xml to parse that format. This function returns a list of dictionaries, where each element contains:
- pmid: PubMed ID
- pmc: PubMed Central ID
- doi: DOI
- other_id: other IDs found, each separated by ;
- title: title of the article
- abstract: abstract of the article
- authors: authors, each separated by ;
- mesh_terms: list of MeSH terms with the corresponding MeSH ID, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
- publication_types: list of publication types, each separated by ; e.g. 'D016428:Journal Article'
- keywords: list of keywords, each separated by ;
- chemical_list: list of chemical terms, each separated by ;
- pubdate: publication date. Defaults to year information only.
- journal: journal of the given paper
- medline_ta: abbreviation of the journal name
- nlm_unique_id: NLM unique identification
- issn_linking: ISSN linkage, typically used to link with the Web of Science dataset
- country: country extracted from the journal information field
- delete: boolean; if False, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of the deleted paper because it got updated.
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False) # returns a list of dictionaries
To extract month and day information from PubDate, set year_info_only=False; by default only the year is returned. Structured abstracts can also be parsed, and the nlm_category argument controls how each section or label is displayed.
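Several MEDLINE fields are returned as single strings separated by semicolons, so they usually need splitting before analysis. The sketch below splits such a field and breaks mesh_terms items into (MeSH ID, term) pairs, following the 'D000161:Acoustic Stimulation; D000328:Adult' format shown in the key list. The helper names split_field and split_mesh_terms are hypothetical and not part of the library.

```python
def split_field(value):
    """Split a semicolon-separated field into a list of trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(";") if item.strip()]


def split_mesh_terms(mesh_terms):
    """Split a mesh_terms string into (mesh_id, term) tuples.

    Each item is expected to look like 'D000161:Acoustic Stimulation'.
    An item without a colon yields (item, '').
    """
    pairs = []
    for item in split_field(mesh_terms):
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs


print(split_mesh_terms("D000161:Acoustic Stimulation; D000328:Adult"))
print(split_field("keyword one; keyword two"))
```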
Use parse_medline_grant_id to parse MEDLINE grant IDs from an XML file. It returns a list of dictionaries, each containing:
- pmid: PubMed ID
- grant_id: grant ID
- grant_acronym: acronym of the grant
- country: country the grant funding comes from
- agency: grant agency
If no grant ID is found, it returns None.
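Since the result can be None rather than an empty list, downstream code should guard against it before iterating. A minimal sketch that counts grants per agency while tolerating None is shown here; the helper name grants_per_agency, the "unknown" placeholder, and the sample data are illustrative assumptions.

```python
from collections import Counter


def grants_per_agency(grants):
    """Count grant entries by agency, treating a None result as no grants."""
    # A missing or empty agency is counted under the placeholder 'unknown'.
    return Counter((g.get("agency") or "unknown") for g in (grants or []))


# Hand-written sample shaped like the documented output (not real data).
sample_grants = [
    {"pmid": "1", "grant_id": "G-001", "agency": "NIH"},
    {"pmid": "2", "grant_id": "G-002", "agency": "NIH"},
    {"pmid": "3", "grant_id": "G-003", "agency": "Wellcome Trust"},
]
print(grants_per_agency(sample_grants))
print(grants_per_agency(None))
```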
You can use PubMed parser to parse an XML file from E-Utilities using parse_xml_web. For this function, you provide a single pmid as input and get a dictionary with the following keys:
- title: title
- abstract: abstract
- journal: journal
- affiliation: affiliation of the first author
- authors: string of authors, separated by ;
- year: publication year
- keywords: keywords or MeSH terms of the article
dict_out = pp.parse_xml_web(pmid, save_xml=False)
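Because the web parsers hit a remote service once per identifier, looping over many PMIDs should include a pause between requests. The following sketch wraps any fetch function with a fixed delay; the helper name fetch_with_pause and the one-second default are assumptions chosen for illustration, not values recommended by the library or by NCBI.

```python
import time


def fetch_with_pause(ids, fetch, pause_seconds=1.0):
    """Call fetch(doc_id) for each identifier, sleeping between calls.

    `fetch` is any callable taking one identifier; results are returned as
    a dict keyed by identifier, in the order the identifiers were given.
    """
    results = {}
    for index, doc_id in enumerate(ids):
        if index > 0:
            time.sleep(pause_seconds)  # pause before every call except the first
        results[doc_id] = fetch(doc_id)
    return results


# Demonstration with a stand-in fetch function and no delay.
demo = fetch_with_pause(["1", "2"], lambda doc_id: {"pmid": doc_id}, pause_seconds=0)
print(demo)
```

With the library installed, the fetch argument could be lambda pmid: pp.parse_xml_web(pmid, save_xml=False).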
The function parse_citation_web lets you get the citations to a given PubMed ID or PubMed Central ID. It returns a dictionary with the following keys:
- pmc: PubMed Central ID
- pmid: PubMed ID
- doi: DOI of the article
- n_citations: number of citations for the given article
- pmc_cited: list of PMCs that cite the given PMC
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')
The function parse_outgoing_citation_web lets you get the articles that a given article cites, given a PubMed ID or PubMed Central ID. It returns a dictionary with the following keys:
- n_citations: number of cited articles
- doc_id: the document identifier given
- id_type: the type of identifier given, either 'PMID' or 'PMC'
- pmid_cited: list of PMIDs cited by the article
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')
Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed as strings without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, the function returns None.
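Two small helpers can make these conventions easier to follow in practice: one strips an optional 'PMC' prefix before the identifier is passed in, and one converts a None result into an empty list. Both helper names (strip_pmc_prefix, pmids_or_empty) are hypothetical; this is a sketch of the conventions described above, not part of the library.

```python
def strip_pmc_prefix(doc_id):
    """Return the identifier as a string without a leading 'PMC' (any case)."""
    doc_id = str(doc_id).strip()
    if doc_id.upper().startswith("PMC"):
        return doc_id[3:]
    return doc_id


def pmids_or_empty(result):
    """Return the pmid_cited list from a result dict, or [] when result is None."""
    if result is None:
        return []
    return list(result.get("pmid_cited", []))


print(strip_pmc_prefix("PMC3166277"))
print(pmids_or_empty(None))
```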
Clone the repository and install it using pip.
$ git clone https://github.com/titipata/pubmed_parser
$ pip install ./pubmed_parser
Example usage:
import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)
{'abstract': u"Background Despite identical genotypes and ...",
'affiliation_list':
[['I1', 'Department of Biological Sciences, ...'],
['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
'author_list':
[['Dennehy', 'John J', 'I1'],
['Dennehy', 'John J', 'I2'],
['Wang', 'Ing-Nang', 'I1']],
'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
'journal': 'BMC Microbiology',
'pmc': '3166277',
'pmid': '21810267',
'publication_year': '2011',
'publisher_id': '1471-2180-11-174',
'subjects': 'Research Article'}
This snippet parses the entire PubMed Open Access subset using PySpark 2.1:
import os
import pubmed_parser as pp
from pyspark.sql import Row, SparkSession

# Reuse the active session (the pyspark shell already defines `spark`) or create one
spark = SparkSession.builder.getOrCreate()

path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
# Parse each file on the workers and keep the file name next to the parsed fields
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'file_name', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe
See the scripts folder for more information.
If you use this package, please cite it as follows:
Titipat Achakulvisut, Daniel E. Acuna (2015) "Pubmed Parser" http://github.com/titipata/pubmed_parser. http://doi.org/10.5281/zenodo.159504
The package is developed in Konrad Kording's Lab at the University of Pennsylvania.
MIT License Copyright (c) 2015-2018 Titipat Achakulvisut, Daniel E. Acuna