Add medical corpora + pretrained models #5

Open
piskvorky opened this issue Dec 2, 2017 · 8 comments

@piskvorky
Owner

piskvorky commented Dec 2, 2017

The National Library of Medicine (NLM) has released a corpus of more than 27 million records with medical article metadata: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/.

Each record contains the article's abstract (a short paragraph summarizing the article, typically ~1k characters), its authors, title, affiliation, a list of article topics including keywords and chemical formulas, year of publication, etc.
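To make the record layout concrete, here is a minimal sketch of pulling those fields out of PubMed-style XML with the Python standard library. The element names follow the PubMed DTD as I understand it (`PubmedArticle`, `MedlineCitation`, `ArticleTitle`, `AbstractText`, and so on); the sample record and the `parse_records` helper are made up for illustration and are not real PubMed data or part of gensim-data.

```python
import xml.etree.ElementTree as ET

# Made-up sample record for illustration only; not real PubMed data.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <Journal>
          <JournalIssue>
            <PubDate><Year>2017</Year></PubDate>
          </JournalIssue>
        </Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText>An example abstract.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
      <ChemicalList>
        <Chemical><NameOfSubstance>Example Substance</NameOfSubstance></Chemical>
      </ChemicalList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Yield one dict per PubmedArticle element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        yield {
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            # An abstract may be split into several AbstractText sections.
            "abstract": " ".join(
                "".join(node.itertext()).strip()
                for node in citation.iterfind("Article/Abstract/AbstractText")
            ),
            "authors": [
                " ".join(filter(None, (a.findtext("ForeName"), a.findtext("LastName"))))
                for a in citation.iterfind("Article/AuthorList/Author")
            ],
            "year": citation.findtext("Article/Journal/JournalIssue/PubDate/Year"),
            "chemicals": [
                c.findtext("NameOfSubstance")
                for c in citation.iterfind("ChemicalList/Chemical")
            ],
        }


if __name__ == "__main__":
    for record in parse_records(SAMPLE_XML):
        print(record)
```

The baseline files are gzipped and large, so for the real corpus a streaming parse (`gzip.open` plus `ET.iterparse`) would probably be a better fit than loading a whole file into memory as this sketch does.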

Add this PubMed corpus to gensim-data, including pre-trained semantic models on this data.

License instructions are here: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt (read carefully), along with the full metadata schema (DTD).

@piskvorky
Owner Author

piskvorky commented Dec 2, 2017

Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
(found via http://deepdive.stanford.edu/opendata/)

Unlike the metadata above, this (smaller) dataset also contains the article full texts.

Around 360,000 medical articles with full text in total.

@piskvorky piskvorky changed the title Add PubMed medical corpus + pretrained models Add medical corpora + pretrained models Dec 2, 2017
@piskvorky
Owner Author

Another related free (non-commercial use) biomedical corpus, including full text: https://old.biomedcentral.com/about/datamining

@philgooch

philgooch commented Mar 7, 2018

There's a bunch of word2vec models trained on PubMed data here, and these work well in gensim:

These are all unigram models though iirc

@menshikh-iv
Contributor

menshikh-iv commented Mar 7, 2018

@philgooch thanks for the links! Have you any license information about it (can we add it to gensim-data and "re-distribute")?

@Imshepherd

@philgooch

@menshikh-iv The first set of models at http://evexdb.org/pmresources/vec-space-models/ are CC-BY (see http://bio.nlplab.org/#license)

I'm waiting to hear back from the authors about the license for the other ones, I'll let you know as soon as I hear.

@menshikh-iv
Contributor

@philgooch great, we'll wait too 👍

@philgooch

@menshikh-iv I just heard back from Billy Chiu who developed the models at

https://github.com/cambridgeltl/BioNLP-2016

He's just updated the README there to confirm that the models at https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8 are also made available under CC BY.
