- Word embeddings are an important component of many language models capable of producing state-of-the-art results in several NLP tasks;
- SparseNLP proposes an alternative approach for deriving word embeddings. In contrast with traditional dense vector representations, it creates a sparse distributed representation (SDR) for each word; the representation shares the main idea behind word embeddings, formulated by Firth in 1957: "a word is characterized by the company it keeps";
- The whole methodology is validated by using these word embeddings as language models in several Natural Language Processing tasks;
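To make the SDR idea concrete, here is a minimal illustrative sketch (the vector length and number of active bits are arbitrary choices, not SparseNLP's actual parameters): a fingerprint is a long, mostly-zero binary vector, and two words are similar when their active bits overlap.

```python
# Illustrative only: an SDR is a long, mostly-zero binary vector;
# similarity is measured by the overlap of active bits.
import numpy as np

n = 1024                        # fingerprint length (arbitrary)
rng = np.random.default_rng(0)

def random_sdr(n_active=20):
    """Return a binary vector with n_active randomly chosen active bits."""
    sdr = np.zeros(n, dtype=np.uint8)
    sdr[rng.choice(n, size=n_active, replace=False)] = 1
    return sdr

a, b = random_sdr(), random_sdr()
print("active bits:", a.sum(), b.sum(), "| overlap:", int((a & b).sum()))
```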
sudo apt-get update
sudo apt-get install zip
sudo apt remove python3
sudo apt remove python
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
rm Anaconda3-5.1.0-Linux-x86_64.sh
git clone https://github.com/avsilva/sparse-nlp.git
mkdir sparse-nlp/serializations
mkdir sparse-nlp/serializations/sentences
mkdir sparse-nlp/images
mkdir sparse-nlp/logs
mkdir sparse-nlp/datasets
mkdir sparse-nlp/datasets/analogy
mkdir sparse-nlp/datasets/similarity
mkdir sparse-nlp/embeddings
mkdir sparse-nlp/embeddings/glove.6B
mkdir wikiextractor
mkdir wikiextractor/jsonfiles
mkdir wikiextractor/jsonfiles/articles3
cd sparse-nlp
pip install -r requirements.txt
conda install --yes --file requirements2.txt
python -m spacy download en
To use SparseNLP you need your data stored in a database table with 2 columns (see the example after this list):
- id (int): primary key
- cleaned_text (str): text tokens for each sentence
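A minimal sketch of such a table, using SQLite purely for illustration (the database and table names are hypothetical; SparseNLP may target a different database engine):

```python
# Hypothetical example: create and fill the two-column table described above.
import sqlite3

conn = sqlite3.connect("corpus.db")  # illustrative database name
conn.execute("""
    CREATE TABLE IF NOT EXISTS sentences (  -- illustrative table name
        id INTEGER PRIMARY KEY,             -- sentence id
        cleaned_text TEXT                   -- text tokens for the sentence
    )
""")
conn.execute("INSERT INTO sentences (id, cleaned_text) VALUES (?, ?)",
             (1, "the cat sat on the mat"))
conn.commit()
conn.close()
```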
To run the tests:
- nosetests --cover-package=.\sparsenlp --with-coverage --nologcapture -x (run the full suite with coverage)
- python -m unittest -v (run all tests)
- python -m unittest -q tests.test_datacleaner.TestDataClean (run just one class test)
- py.test -q -s tests/test_datacleaner.py::TestDataClean (run just one class test with pytest)
- python -m unittest -q tests.test_datacleaner.TestDataClean.test_ingestfiles_json_to_dict (run just one functional test)
Useful reading on speeding up pandas:
- https://realpython.com/fast-flexible-pandas/
- https://towardsdatascience.com/stop-using-pandas-and-start-using-spark-with-scala-f7364077c2e0
- https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
- evaluate using word-embeddings-benchmarks
- train word2vec on wikidumps (see the gensim sketch below)
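A minimal sketch of the word2vec baseline with gensim (parameter names follow gensim 3.x, where the vector dimension is passed as size; newer gensim renamed it to vector_size; the toy corpus is a placeholder):

```python
# Sketch only: train a word2vec baseline on tokenized sentences.
from gensim.models import Word2Vec

# Placeholder corpus; in practice, stream the tokenized Wikipedia sentences.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["dogs", "and", "cats", "are", "pets"]]

model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("word2vec.vec")  # text format usable by the benchmarks
```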
- Extracting data
- Training Corpora Definition
- Corpora pre-processing
- Sentence tokenization
- Sentence vectorization
- Word to sentence database
- Cluster sentences
- Word fingerprint (see the toy sketch after this list)
- Text fingerprint
- Evaluation
- Training Word2Vec word embeddings
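A toy sketch of the fingerprinting steps above, with scikit-learn standing in for the project's own vectorization and clustering code (KMeans, the unit count, and the toy corpus are illustrative substitutes, not the pipeline's actual components):

```python
# Toy illustration of: sentence vectorization -> cluster sentences ->
# word fingerprint as a binary vector over cluster units.
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat",
             "dogs and cats are pets",
             "stock markets fell sharply today",
             "investors sold shares in the market"]

vecs = TfidfVectorizer().fit_transform(sentences)  # sentence vectorization
n_units = 2                                        # number of units (toy-sized)
labels = KMeans(n_clusters=n_units, n_init=10,
                random_state=0).fit_predict(vecs)  # cluster sentences

def fingerprint(word, top_k=1):
    """Activate the units where sentences containing `word` cluster most often."""
    counts = Counter(labels[i] for i, s in enumerate(sentences)
                     if word in s.split())
    sdr = np.zeros(n_units, dtype=np.uint8)
    for unit, _ in counts.most_common(top_k):
        sdr[unit] = 1
    return sdr

print(fingerprint("cat"), fingerprint("market"))
```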
Wikipedia dumps from Wikimedia, 2018-01-01
github - attardi/wikiextractor
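A typical invocation looks like the following (the dump filename is a placeholder for the 2018-01-01 dump; check the wikiextractor README for the flags supported by your version):

python WikiExtractor.py -o wikiextractor/jsonfiles enwiki-20180101-pages-articles.xml.bz2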
Each document file contains a series of Wikipedia articles, each represented by an XML doc element:
<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>
The doc element has the following attributes:
- id, which identifies the document by means of a unique serial number
- url, which provides the URL of the original Wikipedia page

The content of a doc element consists of pure text, one paragraph per line.
Example:
<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
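A small sketch for iterating over these doc elements in an extracted file (the file path is a placeholder based on the directories created above; a regex scan suffices because wikiextractor writes each article as one such element):

```python
# Sketch: scan one extracted file and print (id, url, paragraph count) per article.
import re

DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]+)"[^>]*>\n(?P<text>.*?)</doc>',
    re.S)

with open("wikiextractor/jsonfiles/articles3/wiki_00", encoding="utf-8") as f:
    for match in DOC_RE.finditer(f.read()):
        print(match.group("id"), match.group("url"),
              len(match.group("text").splitlines()), "paragraphs")
```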
Evaluation code repository: github - kudkudak/word-embeddings-benchmarks
Evaluation methods: arxiv.org/abs/1702.02170
Alternative evaluation methods: github - mfaruqui/eval-word-vectors
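A minimal evaluation sketch following that repository's README (the API names fetch_MEN, fetch_GloVe and evaluate_similarity are as documented there and may have changed since):

```python
# Sketch: score an embedding on one word-similarity benchmark.
from web.datasets.similarity import fetch_MEN
from web.embeddings import fetch_GloVe
from web.evaluate import evaluate_similarity

embedding = fetch_GloVe(corpus="wiki-6B", dim=300)  # or load SparseNLP fingerprints
data = fetch_MEN()
print("Spearman correlation on MEN:", evaluate_similarity(embedding, data.X, data.y))
```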