Add medical corpora + pretrained models #5

piskvorky · 2017-12-02T16:27:33Z

The National Library of Medicine (NLM) released a corpus of more than 27 million records with medical article metadata: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/.

Each record contains the article's abstract (a short paragraph summarizing the article, typically ~1k characters), its authors, title, affiliation, a list of article topics including keywords and chemical formulas, year of publication, etc.

Add this PubMed corpus to gensim-data, including pre-trained semantic models on this data.

License instructions are here: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt (read carefully), along with the full metadata schema (DTD).
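To illustrate the record structure described above, here is a minimal sketch of pulling the title, abstract and authors out of PubMed-style XML using only the Python standard library. The sample record is hand-written and heavily simplified, and the element names (`PubmedArticle`, `MedlineCitation`, `ArticleTitle`, `AbstractText`, `Author`) are assumptions that should be checked against the DTD. The helper name `parse_records` is hypothetical. The real files are large, so a streaming parser such as `ET.iterparse` would likely be a better fit there.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; NOT a real PubMed record.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract>
          <AbstractText>Example abstract text.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_records(xml_text):
    """Return one dict per PubmedArticle element found in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        # An abstract may be split into several AbstractText sections.
        abstract_parts = [
            "".join(node.itertext()).strip()
            for node in article.iter("AbstractText")
        ]
        # Build "ForeName LastName", tolerating a missing part.
        authors = [
            " ".join(
                part
                for part in (
                    author.findtext("ForeName", default=""),
                    author.findtext("LastName", default=""),
                )
                if part
            )
            for author in article.iter("Author")
        ]
        records.append(
            {
                "pmid": article.findtext("MedlineCitation/PMID"),
                "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
                "abstract": " ".join(abstract_parts),
                "authors": authors,
            }
        )
    return records


for record in parse_records(SAMPLE_XML):
    print(record["pmid"], record["title"], record["authors"])
```

The abstracts extracted this way could then be tokenized and fed to a gensim model as a stream of documents.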


piskvorky · 2017-12-02T17:08:21Z

Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
(found via http://deepdive.stanford.edu/opendata/)

Unlike the metadata above, this (smaller) dataset also contains the article full texts.

Around 360,000 medical articles with full text in total.

piskvorky · 2017-12-02T17:12:40Z

Another related free (non-commercial use) biomedical corpus, including full text: https://old.biomedcentral.com/about/datamining

philgooch · 2018-03-07T09:58:18Z

There's a bunch of word2vec models trained on PubMed data here, and these work well in gensim:

These are all unigram models though iirc

menshikh-iv · 2018-03-07T10:48:40Z

@philgooch thanks for the links! Do you have any license information for them (can we add them to gensim-data and "re-distribute" them)?

Imshepherd · 2018-03-12T16:01:41Z

training in R.
https://github.com/Imshepherd/wordVectors-R-PubMed-Resourse

philgooch · 2018-03-12T16:18:47Z

@menshikh-iv The first set of models at http://evexdb.org/pmresources/vec-space-models/ are CC-BY (see http://bio.nlplab.org/#license)

I'm waiting to hear back from the authors about the license for the other ones, I'll let you know as soon as I hear.

menshikh-iv · 2018-03-12T16:26:23Z

@philgooch great, we'll wait too 👍

philgooch · 2018-03-13T07:34:09Z

@menshikh-iv I just heard back from Billy Chiu who developed the models at

https://github.com/cambridgeltl/BioNLP-2016

He's just updated the ReadMe there to confirm that the models at https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8 are also made available under CC BY

piskvorky changed the title ~~Add PubMed medical corpus + pretrained models~~ Add medical corpora + pretrained models Dec 2, 2017
