ChemTextMining

Short description of repository

This repository contains code and additional resources for experiments on disease entity extraction using conditional random fields (CRF). The structure is as follows:

webmd_corpus.json - the full corpus.
classification_pipe.py - entry point of the program.
vocabularies - directory with the vocabularies used, in txt format; each line contains one vocabulary entity.
clustered_words - directory with words clustered using the Brown clustering algorithm.
corpus_json - directory with the datasets used in the experiments.
word2vec - directory with word embeddings.
The remaining files are utility code.
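
As an illustration of the vocabulary format described above (one entity per line), here is a minimal sketch of loading such a file into a set and using it as a lookup feature. The function names `load_vocabulary` and `in_vocabulary_feature` and the sample entries are hypothetical and are not taken from this repository; lowercasing is an assumption.

```python
import io


def load_vocabulary(lines):
    """Build a set of lowercased entities from an iterable of lines.

    Each non-empty line is treated as one vocabulary entity.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def in_vocabulary_feature(token, vocabulary):
    """Return True if the (lowercased) token is a vocabulary entity."""
    return token.lower() in vocabulary


# Hypothetical file content; a real file would be opened with open(path).
sample_file = io.StringIO("headache\nnausea\n\nDizziness\n")
vocab = load_vocabulary(sample_file)
print(in_vocabulary_feature("Nausea", vocab))
```

The same function accepts a real file object, for example `load_vocabulary(open('vocabularies/some_file.txt'))`, where the file name is a placeholder.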

Usage:

How to load the vectors (to use them in your application):

     from gensim.models import KeyedVectors

     w2v = KeyedVectors.load_word2vec_format('path/to/vectors', binary=True)

Install all the requirements.
```
     pip install -r requirements.txt
```
Perl also needs to be installed.
In the "features" dictionary, specify the features of the current token, chosen from the list of available features. You also need to define the context size by setting k_prev (the number of tokens to look back) and k_next (the number of tokens to look ahead), and the features for each context token (in prev_features and next_features).
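
To make the role of k_prev, k_next, prev_features and next_features more concrete, here is a minimal sketch of context-window feature extraction for a CRF. It is not the repository's implementation: the feature names (`lower`, `is_title`, `suffix3`), the helper functions, and the use of plain lists of feature names are all assumptions made for illustration.

```python
# Hypothetical feature functions; the repository's available features may differ.
AVAILABLE_FEATURES = {
    "lower": lambda t: t.lower(),
    "is_title": lambda t: t.istitle(),
    "suffix3": lambda t: t[-3:],
}


def token_features(token, feature_names):
    """Compute the selected features for a single token."""
    return {name: AVAILABLE_FEATURES[name](token) for name in feature_names}


def sentence_to_features(tokens, features, prev_features, next_features,
                         k_prev, k_next):
    """Build one feature dict per token, including its context window."""
    result = []
    for i, token in enumerate(tokens):
        feats = token_features(token, features)
        # Look back up to k_prev tokens.
        for offset in range(1, k_prev + 1):
            j = i - offset
            if j >= 0:
                for name, value in token_features(tokens[j], prev_features).items():
                    feats[f"-{offset}:{name}"] = value
        # Look ahead up to k_next tokens.
        for offset in range(1, k_next + 1):
            j = i + offset
            if j < len(tokens):
                for name, value in token_features(tokens[j], next_features).items():
                    feats[f"+{offset}:{name}"] = value
        result.append(feats)
    return result


tokens = ["Severe", "headache", "today"]
feats = sentence_to_features(tokens, ["lower", "suffix3"], ["lower"], ["lower"],
                             k_prev=1, k_next=1)
print(feats[1])
```

Prefixing context features with their offset keeps the features of neighbouring tokens distinct from those of the current token.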
Run code:
```
   python classification_pipe.py
```

The output has the following format:

 ...
 Average exact score:
   precision	recall	fscore

 Average weak score:
   precision	recall	fscore
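
For readers unfamiliar with the two scores, the sketch below shows one common way to compute them for entity spans. It assumes that "exact" means a predicted span has the same boundaries as a gold span and that "weak" means a predicted span overlaps a gold span at all; the repository's evaluation code may define or weight these differently. Spans are hypothetical (start, end) token offsets with an exclusive end.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_score(gold, pred):
    """Precision, recall, F-score counting only identical spans as correct."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall, f_score(precision, recall)


def overlaps(a, b):
    """True if half-open spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def weak_score(gold, pred):
    """Precision, recall, F-score counting any overlap as correct."""
    pred_hits = sum(any(overlaps(p, g) for g in gold) for p in pred)
    gold_hits = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = pred_hits / len(pred) if pred else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


gold_spans = [(0, 2), (5, 7)]
pred_spans = [(0, 2), (5, 6), (9, 10)]
exact = exact_score(gold_spans, pred_spans)
weak = weak_score(gold_spans, pred_spans)
print("exact:", exact)
print("weak:", weak)
```

In this example the prediction (5, 6) counts as correct only under weak matching, so the weak scores are higher than the exact ones.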

Note: To use different word embedding vectors, specify them in the load_w2v_model function in crf/features.py.

Citing:

Miftahutdinov, Z., Tutubalina, E., Tropsha, A.: Identifying Disease-related Expressions in Reviews using Conditional Random Fields.

http://www.dialog-21.ru/media/3932/miftahutdinovzshetal.pdf

BibTex:

@inproceedings{miftahutdinov2017,
              title={Identifying Disease-related Expressions in Reviews using Conditional Random Fields},
              author={Miftahutdinov, Zulfat and Tutubalina, Elena and Tropsha, Alexander},
              booktitle={Proceedings of International Conference Dialog},
              volume={1},
              pages={155-167},
              year={2017}
}

Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE: Using semantic analysis of texts for the identification of drugs with similar therapeutic effects.

link to paper

BibTex:

@article{tutubalina2017using,
        title={Using semantic analysis of texts for the identification of drugs with similar therapeutic effects},
        author={Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE},
        journal={Russian Chemical Bulletin},
        volume={66},
        number={11},
        pages={2180--2189},
        year={2017},
        publisher={Springer}
}
