Stopwords are very common words in a language that are often filtered out in natural language processing (NLP) and text mining tasks because they carry little meaning on their own. Examples of stopwords in English include “the,” “is,” “in,” “and,” “a,” and “an.”
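As a minimal sketch of what filtering stopwords looks like in practice, the snippet below drops the example words above from a sentence. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names used only for illustration, and the whitespace tokenization is an assumption, not necessarily what the scripts in this repository do.

```python
# Illustrative stopword set taken from the examples above; real lists are far longer.
STOPWORDS = {"the", "is", "in", "and", "a", "an"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Return the tokens of `text` that are not stopwords.

    Tokenization is a lowercase whitespace split, an assumption made
    for this sketch.
    """
    return [tok for tok in text.lower().split() if tok not in stopwords]


print(remove_stopwords("The cat is in the garden and a dog is outside"))
# ['cat', 'garden', 'dog', 'outside']
```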
Several Python packages, such as ISO stopwords, spaCy, and NLTK, provide stopword lists for various languages. The Venn diagram below shows where the lists from these three sources overlap and where they differ.
The ISO stopwords collection is a comprehensive set of stopwords for multiple languages, following the ISO 639-1 language codes.
spaCy has built-in stopword lists for several languages.
NLTK provides a comprehensive list of stopwords for multiple languages.
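The sketch below shows one way the three English lists could be loaded as Python sets. It assumes the `stopwordsiso` package for the ISO collection, spaCy's bundled `STOP_WORDS` constant, and NLTK's `stopwords` corpus; the function names are hypothetical, and the scripts in this repository may load the lists differently (for example, through a downloaded spaCy model).

```python
def load_iso_stopwords(lang="en"):
    """Load the ISO list; assumes the `stopwordsiso` package is installed."""
    import stopwordsiso

    return set(stopwordsiso.stopwords(lang))


def load_spacy_stopwords():
    """Load spaCy's English list; this import needs spaCy but no model download."""
    from spacy.lang.en.stop_words import STOP_WORDS

    return set(STOP_WORDS)


def load_nltk_stopwords(lang="english"):
    """Load NLTK's list; the stopwords corpus is downloaded if it is missing."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    return set(stopwords.words(lang))
```

With all three libraries installed, calling `len(load_iso_stopwords())`, `len(load_spacy_stopwords())`, and `len(load_nltk_stopwords())` gives a quick view of how much the lists differ in size.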
The following statistics describe the compression rates of Reuters texts under three different sets of stopwords: ISO, spaCy, and NLTK. The compression rate is calculated by removing stopwords from a text and taking the ratio of the number of words in the filtered text to the number of words in the original text, so a lower rate means more words were removed.
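The calculation described above can be sketched as follows. This is an illustration under stated assumptions: `compression_rate` is a hypothetical helper, tokens come from a lowercase whitespace split, and the rate is taken as filtered word count divided by original word count.

```python
def compression_rate(text, stopwords):
    """Return (words left after stopword removal) / (words in the original text)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    if not tokens:
        return 0.0  # assumption: an empty text is given a rate of 0.0
    kept = [tok for tok in tokens if tok not in stopwords]
    return len(kept) / len(tokens)


sample = "the market is up and the dollar is down"
rate = compression_rate(sample, {"the", "is", "and"})
print(round(rate, 3))  # 4 of 9 words remain -> 0.444
```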
- ISO Stopwords
Mean: 0.46848585024735256
Median: 0.45714285714285713
Standard Deviation: 0.09172824822316185
Shapiro-Wilk test statistic: 0.8993187595609815
p-value: 2.5232375570871764e-49
The compression rates are not normally distributed.
- spaCy Stopwords
Mean: 0.5834552313052037
Median: 0.5585585585585585
Standard Deviation: 0.10965667465841214
Shapiro-Wilk test statistic: 0.9216185461552716
p-value: 3.3622028531080556e-45
The compression rates are not normally distributed.
- NLTK Stopwords
Mean: 0.6127916769823796
Median: 0.5876288659793815
Standard Deviation: 0.1080245729937558
Shapiro-Wilk test statistic: 0.905384897890509
p-value: 2.773996451055017e-48
The compression rates are not normally distributed.
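For readers who want to reproduce this kind of summary, here is a minimal sketch. It assumes the sample standard deviation and uses SciPy's `scipy.stats.shapiro` for the normality test; the helper names and the sample rates are made up for illustration and are not the Reuters results.

```python
import statistics


def summarize(rates):
    """Return mean, median, and sample standard deviation of the rates."""
    return {
        "mean": statistics.mean(rates),
        "median": statistics.median(rates),
        "stdev": statistics.stdev(rates),  # assumption: sample (n - 1) deviation
    }


def shapiro_wilk(rates):
    """Return the Shapiro-Wilk statistic and p-value; requires SciPy."""
    from scipy.stats import shapiro

    statistic, p_value = shapiro(rates)
    return statistic, p_value


rates = [0.40, 0.45, 0.50, 0.55, 0.60]  # made-up rates, for illustration only
summary = summarize(rates)
print(summary)
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting the hypothesis that the rates are normally distributed.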
To run the script, first install the dependencies.
poetry install
Download the en_core_web_sm model for spaCy stopwords.
poetry run python -m spacy download en_core_web_sm
Run the script as follows.
poetry run py .\stopwords\stopwords.py
For statistics, run the scripts as follows.
- ISO Stopwords
poetry run py .\stopwords\iso-stopwords.py
- spaCy Stopwords
poetry run py .\stopwords\spacy-stopwords.py
- NLTK Stopwords
poetry run py .\stopwords\nltk-stopwords.py
# Keys encode membership in (ISO, spaCy, NLTK): '1' = in that list, '0' = not in it.
sets = {
    '100': iso_stopwords - spacy_stopwords - nltk_stopwords,
    '010': spacy_stopwords - iso_stopwords - nltk_stopwords,
    '001': nltk_stopwords - iso_stopwords - spacy_stopwords,
    '110': (iso_stopwords & spacy_stopwords) - nltk_stopwords,
    '101': (iso_stopwords & nltk_stopwords) - spacy_stopwords,
    '011': (spacy_stopwords & nltk_stopwords) - iso_stopwords,
    '111': iso_stopwords & spacy_stopwords & nltk_stopwords,
}
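To see how these region keys behave, the sketch below applies the same set expressions to three tiny made-up sets. The words are placeholders rather than the real lists; only the set arithmetic matches the snippet above.

```python
# Toy stand-ins for the real stopword lists (assumed values, for illustration).
iso_stopwords = {"a", "the", "is", "hence"}
spacy_stopwords = {"a", "the", "is", "whereas"}
nltk_stopwords = {"a", "the", "ain"}

sets = {
    "100": iso_stopwords - spacy_stopwords - nltk_stopwords,
    "010": spacy_stopwords - iso_stopwords - nltk_stopwords,
    "001": nltk_stopwords - iso_stopwords - spacy_stopwords,
    "110": (iso_stopwords & spacy_stopwords) - nltk_stopwords,
    "101": (iso_stopwords & nltk_stopwords) - spacy_stopwords,
    "011": (spacy_stopwords & nltk_stopwords) - iso_stopwords,
    "111": iso_stopwords & spacy_stopwords & nltk_stopwords,
}

# The seven regions are disjoint, so their sizes add up to the size of the union.
sizes = {key: len(members) for key, members in sets.items()}
print(sizes)
# {'100': 1, '010': 1, '001': 1, '110': 1, '101': 0, '011': 0, '111': 2}
```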
To run the unit tests, run the following command in your terminal.
poetry run py -m unittest