Add medical corpora + pretrained models #5
Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ Unlike the metadata above, this (smaller) dataset also contains the article full texts. It has around 360,000 medical articles with full text in total.
Another related free (non-commercial use) biomedical corpus, including full text: https://old.biomedcentral.com/about/datamining
There's a bunch of word2vec models trained on PubMed data here, and these work well in gensim:
These are all unigram models though, iirc.
@philgooch thanks for the links! Do you have any license information about them (can we add them to gensim-data and "re-distribute")?
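For anyone who wants to try such models, here is a minimal sketch of loading vectors stored in the word2vec format with gensim and querying them. The file path you pass in is up to you (the thread does not name a specific file), `load_pubmed_vectors` and `nearest_terms` are hypothetical helper names, and `binary=True` is an assumption about how the downloaded file is stored.

```python
def load_pubmed_vectors(path, binary=True):
    """Load word vectors stored in the word2vec format.

    `path` is whatever file you downloaded; `binary=True` is an assumption
    about its storage format and may need to be False for text files.
    """
    # Imported lazily so that gensim is only required when vectors are loaded.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=binary)


def nearest_terms(vectors, term, topn=5):
    """Return (word, similarity) pairs, or [] if the term is out of vocabulary."""
    if term not in vectors:
        return []
    return vectors.most_similar(term, topn=topn)
```

Because these are unigram models, a multi-word term would probably not be in the vocabulary, which is why `nearest_terms` checks membership before querying.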
training in R.
@menshikh-iv The first set of models at http://evexdb.org/pmresources/vec-space-models/ are CC-BY (see http://bio.nlplab.org/#license). I'm waiting to hear back from the authors about the license for the other ones; I'll let you know as soon as I hear.
@philgooch great, we'll wait too 👍
@menshikh-iv I just heard back from Billy Chiu, who developed the models at https://github.com/cambridgeltl/BioNLP-2016. He's just updated the ReadMe there to confirm that the models at https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8 are also made available under CC BY.
The National Library of Medicine (NLM) released a corpus of more than 27 million records with medical article metadata: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/.
Each record contains the article's abstract (a short paragraph summarizing the article, typically ~1k characters), its authors, title, affiliation, a list of article topics including keywords and chemical formulas, year of publication, etc.
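To make the record layout concrete, here is a minimal sketch of pulling the title, abstract and authors out of PubMed-style XML with the Python standard library. The sample record is hand-written for illustration and is not real data, `parse_records` is a hypothetical helper, and the sketch covers only a few fields; the DTD remains the authoritative schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written record that mimics the PubMed XML layout.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText>A short example abstract.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Extract a few fields from each PubmedArticle element in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        base = "MedlineCitation/Article"
        # An abstract may be split into several AbstractText sections.
        abstract = " ".join(
            "".join(part.itertext()).strip()
            for part in article.iterfind(base + "/Abstract/AbstractText")
        )
        authors = []
        for author in article.iterfind(base + "/AuthorList/Author"):
            fore = author.findtext("ForeName", default="")
            last = author.findtext("LastName", default="")
            name = " ".join(p for p in (fore, last) if p)
            if name:
                authors.append(name)
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext(base + "/ArticleTitle"),
            "abstract": abstract,
            "authors": authors,
        })
    return records
```

For full-size files, a streaming parser such as `ET.iterparse` would likely be a better fit than loading a whole document into memory at once.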
Add this PubMed corpus to gensim-data, including pre-trained semantic models on this data. License instructions are here: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt (read carefully), along with the full metadata schema (DTD).