The goal of this library is to streamline the process of defining natural language pre-processing pipelines, particularly in the context of educational resources.
This library can be installed either as a Nix Flake input or as a Python library.
To install this package through pip, it should be sufficient to run
pip install git+https://github.com/openeduhub/its-prep.git@main
Note that you will have to manually ensure that de_core_news_lg is installed, e.g. through
python -m spacy download de_core_news_lg
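To verify that the model is available, a quick check with plain spaCy (nothing specific to this library) is:
import spacy

# spacy.load raises an OSError with installation instructions if the model is missing
spacy.load("de_core_news_lg")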
Alternatively, to use this package through Nix, include this Flake in the inputs of your Flake:
{
  inputs = {
    its-prep.url = "github:openeduhub/its-prep";
  };
}
Then, simply apply the provided nixpkgs overlay:
{
  outputs = { self, nixpkgs, ... }:
    let
      system = "...";
      pkgs = import nixpkgs {
        inherit system;
        overlays = [ self.inputs.its-prep.overlays.default ];
      };
      python-packages = py-pkgs: [ py-pkgs.its-prep ];
      local-python = pkgs.python3.withPackages python-packages;
    in
    {...};
}
Core to this library is the abstraction of natural language pre-processing as an orchestration of multiple steps that each take a tokenized document and return a subset of its tokens, based on some internal logic. We call these steps filter functions (of type its_prep.types.Filter). A pre-processing pipeline is simply a collection of filters.
To ensure that during filtering, no necessary information is discarded (e.g. for filters that act on sentences), documents are represented through an internal, immutable format (its_prep.types.Document) that always contains both the current subset of tokens and the full contents of the original document.
Thus, a filter is simply any function that takes a document of type Document and returns another Document.
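As a purely conceptual illustration, using plain Python stand-ins rather than the library's actual classes, a filter and a pipeline can be pictured like this:
from typing import Callable

# Stand-ins for its_prep.types.Document and its_prep.types.Filter; the real
# Document is an immutable type that also retains the original document contents.
Doc = frozenset[str]
Filter = Callable[[Doc], Doc]

def keep_short_tokens(doc: Doc) -> Doc:
    # toy filter: keep only tokens shorter than 15 characters
    return frozenset(tok for tok in doc if len(tok) < 15)

pipeline: list[Filter] = [keep_short_tokens]  # a pipeline is just a collection of filters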
- its_prep.core contains core functionality, such as the application of a pre-processing pipeline onto a document corpus.
- its_prep.specs contains pre-defined pipelines in its_prep.specs.pipelines and functionality to easily define custom filter functions in its_prep.specs.filters.
- its_prep.spacy.props contains various concrete NLP processing tasks, such as lemmatization or the determination of POS tags. These are mostly used to compute the properties that the filters defined with its_prep.specs.filters are based on.
In this section, we demonstrate the functionality by applying both a pre-defined processing pipeline and multiple custom ones.
Some example texts, taken from the German Wikipedia page on Germany:
raw_docs = [
"Deutschland ist ein Bundesstaat in Mitteleuropa. Er hat 16 Bundesländer und ist als freiheitlich-demokratischer und sozialer Rechtsstaat verfasst. Die 1949 gegründete Bundesrepublik Deutschland stellt die jüngste Ausprägung des 1871 erstmals begründeten deutschen Nationalstaates dar. Bundeshauptstadt und Regierungssitz ist Berlin. Deutschland grenzt an neun Staaten, es hat Anteil an der Nord- und Ostsee im Norden sowie dem Bodensee und den Alpen im Süden. Es liegt in der gemäßigten Klimazone und verfügt über 16 National- und mehr als 100 Naturparks.",
"Das heutige Deutschland hat circa 84,4 Millionen Einwohner und zählt bei einer Fläche von 357.588 Quadratkilometern mit durchschnittlich 236 Einwohnern pro Quadratkilometer zu den dicht besiedelten Flächenstaaten. Die bevölkerungsreichste deutsche Stadt ist Berlin; weitere Metropolen mit mehr als einer Million Einwohnern sind Hamburg, München und Köln; der größte Ballungsraum ist das Ruhrgebiet. Frankfurt am Main ist als europäisches Finanzzentrum von globaler Bedeutung. Die Geburtenrate liegt bei 1,58 Kindern pro Frau (2021).",
]
In order to be able to process these texts, we first need to tokenize them in some way. This is necessary because all filters and pipelines essentially just decide which tokens to keep or discard. Note that, as a corollary, filters cannot modify tokens or remove parts of them.
We tokenize the documents using spaCy, displaying the tokens in their lemmatized form.
import its_prep.spacy as nlp
from its_prep import tokenize_documents
(docs := list(tokenize_documents(raw_docs, tokenize_fun=nlp.tokenize_as_lemmas)))
['Deutschland', 'sein', 'ein', 'Bundesstaat', 'in', 'Mitteleuropa', '--', 'er', 'haben', '16', 'Bundesland', 'und', 'sein', 'als', 'freiheitlich-demokratischer', 'und', 'sozial', 'Rechtsstaat', 'verfassen', '--', 'der', '1949', 'gegründet', 'Bundesrepublik', 'Deutschland', 'stellen', 'der', 'jung', 'Ausprägung', 'der', '1871', 'erstmals', 'begründet', 'deutsch', 'Nationalstaat', 'dar', '--', 'Bundeshauptstadt', 'und', 'Regierungssitz', 'sein', 'Berlin', '--', 'Deutschland', 'grenzen', 'an', 'neun', 'Staat', '--', 'es', 'haben', 'Anteil', 'an', 'der', 'Nord', 'und', 'Ostsee', 'in', 'Norden', 'sowie', 'der', 'Bodensee', 'und', 'der', 'Alpen', 'in', 'Süden', '--', 'es', 'liegen', 'in', 'der', 'gemäßigt', 'Klimazone', 'und', 'verfügen', 'über', '16', 'National', 'und', 'mehr', 'als', '100', 'Naturpark', '--'] ['der', 'heutig', 'Deutschland', 'haben', 'circa', '84,4', 'Million', 'Einwohner', 'und', 'zählen', 'bei', 'ein', 'Fläche', 'von', '357.588', 'Quadratkilometer', 'mit', 'durchschnittlich', '236', 'Einwohner', 'pro', 'Quadratkilometer', 'zu', 'der', 'dicht', 'besiedelt', 'Flächenstaate', '--', 'der', 'bevölkerungsreichste', 'deutsch', 'Stadt', 'sein', 'Berlin', '--', 'weit', 'Metropole', 'mit', 'mehr', 'als', 'ein', 'Million', 'Einwohner', 'sein', 'Hamburg', '--', 'München', 'und', 'Köln', '--', 'der', 'groß', 'Ballungsraum', 'sein', 'der', 'Ruhrgebiet', '--', 'Frankfurt', 'an', 'Main', 'sein', 'als', 'europäisch', 'Finanzzentrum', 'von', 'global', 'Bedeutung', '--', 'der', 'Geburtenrate', 'liegen', 'bei', '1,58', 'Kind', 'pro', 'Frau', '--', '2021', '--', '--']
A collection of common pipelines can be found in the specs.pipelines sub-module, and a collection of common NLP steps, implemented in spaCy, can be found in spacy.props.
import its_prep.spacy as nlp
from its_prep import pipelines
Next, we apply the poc_topic_modeling pipeline, which aims to extract only the data that is relevant to the semantic context of the given document. This is done by
- filtering based on unwanted universal POS tags (punctuation and white-space)
- filtering out stop words (as determined by spaCy)
- filtering out lemmatized tokens which we expect to have no impact on the semantic context of the document in the context of learning resources
- filtering out particularly rare or frequent lemmatized tokens
Since we are only dealing with two documents here, we adjust the required interval for the document frequency in the last step to be unbounded in both directions, thus skipping this step.
pipelines.apply_poc_topic_modeling(docs, required_df_interval={})
['Deutschland', 'Bundesstaat', 'Mitteleuropa', '16', 'Bundesland', 'freiheitlich-demokratischer', 'sozial', 'Rechtsstaat', 'verfassen', '1949', 'gegründet', 'Bundesrepublik', 'Deutschland', 'stellen', 'jung', 'Ausprägung', '1871', 'erstmals', 'begründet', 'deutsch', 'Nationalstaat', 'dar', 'Bundeshauptstadt', 'Regierungssitz', 'Berlin', 'Deutschland', 'grenzen', 'Staat', 'Anteil', 'Nord', 'Ostsee', 'Norden', 'Bodensee', 'Alpen', 'Süden', 'liegen', 'gemäßigt', 'Klimazone', 'verfügen', '16', 'National', '100', 'Naturpark'] ['heutig', 'Deutschland', 'circa', '84,4', 'Million', 'Einwohner', 'zählen', 'Fläche', '357.588', 'Quadratkilometer', 'durchschnittlich', '236', 'Einwohner', 'pro', 'Quadratkilometer', 'dicht', 'besiedelt', 'Flächenstaate', 'bevölkerungsreichste', 'deutsch', 'Stadt', 'Berlin', 'Metropole', 'Million', 'Einwohner', 'Hamburg', 'München', 'Köln', 'groß', 'Ballungsraum', 'Ruhrgebiet', 'Frankfurt', 'Main', 'europäisch', 'Finanzzentrum', 'global', 'Bedeutung', 'Geburtenrate', 'liegen', '1,58', 'Kind', 'pro', 'Frau', '2021']
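For comparison, the first two of these steps could also be assembled by hand from the filter factories introduced in the next section. The following sketch reuses the tokenized docs from above and relies only on helpers demonstrated below; whether poc_topic_modeling uses exactly these POS tags internally is an assumption:
import its_prep.spacy as nlp
from its_prep import filters, apply_filters

manual_pipeline = [
    # remove punctuation and white-space tokens by universal POS tag
    filters.negated(filters.get_filter_by_property(nlp.get_upos, {"PUNCT", "SPACE"})),
    # remove stop words
    filters.negated(filters.get_filter_by_bool_fun(nlp.is_stop)),
]
list(apply_filters(docs, manual_pipeline))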
A pipeline is defined simply as a sequence of filtering functions that take a document as their argument and return a subset of that document. Thus, defining a custom pipeline is equivalent to defining a number of such filtering functions.
In the filters sub-module, we have defined multiple factory functions that should make it much easier to define filters from NLP processing steps (e.g. those defined in spacy.props).
We also import the apply_filters helper function, which is a convenient way to apply a pipeline to a document corpus.
from its_prep import filters, apply_filters
Say we wanted to return only the verbs in the given documents. This could be achieved through
only_verbs_pipeline = [filters.get_filter_by_property(nlp.get_upos, {"VERB"})]
We can apply the filters from our pipeline using the apply_filters function:
list(apply_filters(docs, only_verbs_pipeline))
['liegen', 'verfügen', 'grenzen', 'verfassen', 'stellen'] ['zählen', 'liegen']
Maybe we also want to filter out stop words. For this, we utilize filters.negated, which modifies a given filter function such that its results will be removed, rather than kept:
non_stop_verbs_pipeline = only_verbs_pipeline + [
filters.negated(filters.get_filter_by_bool_fun(nlp.is_stop))
]
list(apply_filters(docs, non_stop_verbs_pipeline))
['liegen', 'verfügen', 'grenzen', 'verfassen', 'stellen'] ['zählen', 'liegen']
Next, we could include only sentences that are at least 20 tokens long:
long_sents_pipeline = [filters.get_filter_by_subset_len(nlp.into_sentences, min_len=20)]
list(apply_filters(docs, long_sents_pipeline))
['Deutschland', 'grenzen', 'an', 'neun', 'Staat', '--', 'es', 'haben', 'Anteil', 'an', 'der', 'Nord', 'und', 'Ostsee', 'in', 'Norden', 'sowie', 'der', 'Bodensee', 'und', 'der', 'Alpen', 'in', 'Süden', '--'] ['der', 'heutig', 'Deutschland', 'haben', 'circa', '84,4', 'Million', 'Einwohner', 'und', 'zählen', 'bei', 'ein', 'Fläche', 'von', '357.588', 'Quadratkilometer', 'mit', 'durchschnittlich', '236', 'Einwohner', 'pro', 'Quadratkilometer', 'zu', 'der', 'dicht', 'besiedelt', 'Flächenstaate', '--']
And then only consider the non-stop verbs of those sentences:
list(apply_filters(docs, long_sents_pipeline + non_stop_verbs_pipeline))
['grenzen'] ['zählen']
Note that due to the internal document representation and the implementation of the processing steps with spaCy, the order of these filters does not matter here; we could also first filter by non-stop verbs and then by long sentences, and still get the same result.
list(apply_filters(docs, non_stop_verbs_pipeline + long_sents_pipeline))
['grenzen'] ['zählen']
Finally, we could return the tokens as word embeddings:
from its_prep import selected_properties
processed_docs = apply_filters(docs, non_stop_verbs_pipeline + long_sents_pipeline)
list(selected_properties(processed_docs, nlp.get_word_vectors))
[(array([ 8.0732e-01, 1.3723e+00, -5.3698e-01, -2.5742e+00, -3.2238e+00, 1.0097e+00, 4.4086e-01, -4.7211e-01, -4.0149e-01, -9.0866e-02, -5.7923e-01, -2.3407e-01, -1.3298e-01, -3.3136e-01, 5.9846e-01, 1.1656e+00, -9.2490e-02, 4.6380e-01, -9.3929e-01, -1.4071e+00, -9.8949e-01, 6.5838e-01, -6.9155e-02, -8.6060e-01, -5.1545e-01, 2.5759e+00, 1.4891e+00, 4.9595e-01, -9.8937e-01, -2.6891e+00, 2.5979e+00, -1.2663e-02, 1.0032e+00, -1.3726e+00, -2.8285e+00, -1.3964e+00, 1.2408e+00, -1.2980e+00, 2.4175e+00, -1.1108e-01, -2.7270e+00, 2.2405e+00, -1.1174e+00, -4.5039e-01, 2.0047e-01, -1.0562e+00, 9.7447e-01, -5.0296e-01, 3.5433e-01, 2.5035e-01, -1.3689e-03, -4.3711e-01, 1.7272e+00, 9.9553e-02, -1.7881e+00, 3.5289e+00, -1.9781e+00, 1.6410e+00, -6.6308e-01, -7.7131e-01, 2.2469e-01, -2.1614e+00, 1.5713e+00, -6.6373e-02, 2.2343e+00, 1.8830e-01, 3.1996e+00, -1.4663e+00, 2.7145e+00, 6.6561e-02, -1.8314e+00, 7.0784e-01, 3.4304e-01, -9.4780e-01, -2.0042e+00, 9.8858e-01, -1.6424e+00, -1.9604e+00, 4.6556e-01, -2.2256e-01, -6.6066e-01, -1.6972e-01, 2.1931e+00, 2.0852e+00, -4.3692e-01, -1.4755e-01, 6.1204e-01, 1.6330e+00, -1.2510e+00, -1.5489e+00, -1.7016e-01, -2.1122e+00, -1.6820e+00, -1.5734e+00, 1.3483e+00, -2.0566e+00, -1.5619e+00, -1.2128e+00, -5.0591e-01, 1.8785e+00, -7.6197e-01, 2.5904e+00, 4.3581e-02, -7.4048e-01, -1.7516e+00, 4.0088e-02, -3.1468e+00, 5.5322e-01, 9.1766e-01, 5.9854e-01, 8.9359e-02, 1.2874e+00, -3.8819e+00, 7.1359e-01, 1.8177e-02, 4.1450e-01, -5.0730e-03, -7.8042e-01, -4.2483e-01, 5.5403e-01, 4.7233e-01, -1.4523e+00, 4.0898e+00, 1.5544e+00, 1.0186e+00, 5.8939e-01, 9.7547e-01, 3.7954e-01, 2.1430e-01, 8.3903e-01, 2.4668e+00, -1.5191e+00, -8.0581e-01, -8.9338e-02, 8.9130e-01, 5.8869e-01, 3.1070e-01, 1.0650e+00, 7.1601e-01, 1.6895e+00, 3.2277e+00, -1.5874e+00, 5.5910e-01, -1.3833e+00, -1.4956e+00, -1.7087e+00, 1.4052e+00, -1.5890e-01, 6.7932e-01, -2.1383e+00, 4.0109e-01, 1.1472e+00, -1.9935e+00, 6.4681e-01, 4.7015e-01, -1.9115e+00, -1.3163e+00, -7.0171e-01, 9.4038e-01, -1.4081e+00, 4.1968e-01, 1.1788e-01, 2.3435e+00, -1.2764e+00, 1.8269e+00, 3.2451e+00, 2.3452e+00, 1.4542e+00, 3.1509e+00, 6.8695e-02, 8.3664e-01, 4.6488e-01, -1.9003e-02, -1.6531e+00, 1.8218e+00, -2.0497e-01, 1.9792e+00, -5.8036e-01, 7.1097e-01, 4.4142e-01, -1.6841e+00, 8.4975e-01, 7.5711e-01, -1.8469e+00, 1.0284e+00, -2.3016e+00, 1.1380e+00, 3.5043e-01, -1.0574e+00, 1.2962e+00, -5.5338e-01, -9.9845e-01, -9.5507e-01, -2.0559e+00, -9.4054e-01, -8.8928e-01, 4.3253e-01, -1.5722e+00, 1.2591e+00, -1.4304e+00, 2.5843e+00, -6.2330e-01, -8.9084e-01, -1.6198e+00, -1.2893e-01, -2.5204e-02, 1.2725e+00, 4.1957e+00, 3.9992e-01, 2.3022e-02, -6.6594e-01, 9.8386e-02, -5.3294e-01, -3.2918e+00, 2.7032e+00, -1.7077e+00, 2.2714e+00, -4.8674e-01, -8.1132e-01, 1.6060e-02, 1.1888e-01, -5.0967e-01, -9.2476e-02, 1.1875e+00, 2.1304e-01, 4.5259e-01, 2.2360e+00, -3.7786e-01, 5.5660e-01, 1.6240e+00, -2.1650e+00, 2.8050e-01, -2.5134e+00, -1.5938e+00, 4.9367e-01, 1.1155e+00, -1.6824e+00, 7.3233e-01, -1.1471e-01, 4.8059e-02, 7.3429e-01, -2.6168e+00, 1.9935e+00, -3.5876e-01, -4.9866e-01, -5.4019e-01, 2.5438e+00, -2.2357e+00, -1.7544e+00, 7.0646e-01, -2.5509e+00, -2.7299e+00, -1.0948e+00, 8.3585e-01, 2.4147e+00, -1.8732e-01, -1.4595e+00, -1.1231e+00, 1.0216e+00, 2.3719e+00, 2.0988e-02, -1.0418e+00, -7.4748e-01, -7.1275e-01, -1.2673e+00, -4.4957e-01, -1.3182e-01, 1.1091e+00, 1.1875e+00, -1.2288e+00, 3.0913e+00, -1.5496e-01, 1.3854e+00, 4.7156e-01, 5.2295e-01, 3.8921e+00, 1.9631e+00, -4.3928e-01, 7.1788e-01, 2.5766e+00, -1.8608e+00, 2.0194e+00, 1.5293e+00, 
5.6542e-01, 1.6145e+00, -1.4601e+00, 1.1504e+00, -1.3161e+00, -1.4911e+00, -1.2753e+00, 9.3347e-01, -1.4765e+00, 1.4941e+00, -1.6196e+00, 2.9735e+00, 1.7396e+00, 1.2293e+00, -1.2182e+00, 4.1059e+00, 3.4958e+00], dtype=float32),), (array([ -0.65404 , 1.7489 , 1.3238 , 10.689 , -3.8354 , -3.4176 , 2.8458 , 6.9475 , 8.3795 , -9.7266 , 1.8059 , -7.0959 , 0.65468 , 4.8942 , 0.11935 , 4.2329 , 6.4581 , -10.205 , -2.4653 , 5.1304 , -1.5249 , -0.33113 , -8.9162 , 0.68117 , 0.82788 , 4.4263 , 2.7534 , -2.8395 , -1.9882 , -0.074579, 0.8485 , 0.32367 , 1.7496 , 0.53606 , -9.8885 , -6.2691 , -5.5006 , -14.412 , -3.3748 , -0.25372 , -1.9325 , 2.0588 , -3.4462 , 3.3861 , 0.77796 , 2.1809 , -5.0384 , -9.1702 , -6.5893 , 1.7999 , 3.087 , -1.6311 , 2.0203 , -3.5758 , 2.8109 , 1.7471 , 1.273 , 4.2303 , 2.2095 , -4.1308 , 0.43882 , -0.025106, 11.237 , 7.2951 , -3.6524 , -2.4187 , -5.1483 , 4.5378 , 4.4823 , 4.6328 , -6.2347 , 8.1855 , 1.1755 , 4.665 , 3.1928 , 7.0662 , -5.9556 , -0.049054, 4.5369 , 7.3604 , 3.2199 , 2.9817 , -0.91019 , -0.10366 , 3.7303 , 1.3639 , -6.3505 , 3.2573 , -3.0636 , -1.2138 , -0.55653 , 0.65678 , 1.386 , -4.523 , -4.4014 , -11.258 , -7.1273 , -3.1885 , -10.099 , -1.9632 , -3.6254 , 2.8138 , 1.5372 , 1.0629 , 9.1253 , -7.4737 , 1.6548 , -3.5463 , 1.1836 , -6.5327 , -0.79816 , -2.8768 , 11.611 , 8.4142 , 2.2678 , -2.2952 , 5.5559 , 3.1747 , -0.31059 , 6.462 , -7.1135 , 8.5853 , 2.5396 , -2.6323 , 4.646 , 3.4337 , 2.3632 , 1.4226 , -1.6031 , 0.80204 , 5.7058 , 0.503 , 3.568 , -3.7796 , 7.1354 , 2.3961 , -13.556 , 1.9282 , 5.6909 , 2.0506 , -3.1402 , 3.6114 , -8.8778 , 1.9274 , -3.2288 , 3.7178 , -4.4902 , 2.5428 , 6.5181 , 1.5672 , -6.4375 , -4.24 , -9.6795 , 6.336 , -0.92769 , -2.2876 , -1.8932 , 4.4791 , 7.1149 , 0.16114 , 6.8786 , 7.05 , -2.2451 , -1.6941 , -8.1168 , -3.1198 , -2.0878 , -0.96782 , -1.0722 , 3.6036 , -3.9225 , 3.4282 , -4.831 , -7.046 , 3.6809 , 3.26 , 1.1511 , 5.5712 , 0.46504 , -7.4492 , 3.6167 , 3.6889 , -2.4359 , 4.101 , -0.64437 , 1.0575 , 5.4622 , -2.3978 , -7.6296 , -1.5451 , -1.6866 , -3.224 , 1.8545 , -6.3787 , 6.178 , -4.2001 , 1.5448 , 10.733 , 5.1482 , 10.758 , 2.1271 , -3.1391 , -3.886 , -3.0535 , 4.441 , -8.5508 , -2.5373 , -0.55043 , -12.688 , 3.4997 , 5.4011 , 0.04654 , 5.4789 , 3.9713 , -0.91285 , 5.9462 , 4.0507 , -1.0129 , 2.4831 , -1.5431 , 1.6657 , -3.8428 , -7.2476 , -4.0296 , 0.45018 , 7.5467 , 2.2629 , 3.8569 , 5.4011 , -4.5573 , -4.4017 , -8.499 , -2.7771 , 3.5199 , -2.0077 , -3.9201 , 0.10954 , 0.49075 , 1.3402 , 0.1557 , 0.14562 , 0.24715 , -2.6996 , 0.63908 , 7.2157 , -5.6317 , 1.1188 , -0.55071 , -6.0426 , 1.4444 , -3.999 , 3.4556 , -1.7449 , -12.314 , -4.0322 , 8.7758 , 8.5515 , -1.6411 , 0.41293 , 10.413 , 9.4247 , -3.6229 , -4.8699 , -5.9972 , 2.8773 , 11.198 , -7.8787 , -4.5297 , -4.3373 , 5.0899 , 1.7842 , -0.24692 , -6.2276 , -3.3438 , -7.1623 , -8.7322 , -0.40021 , -1.9681 , 2.585 , 6.8649 , -7.3689 , 0.12819 , -5.8111 , -3.2717 , 5.4081 , -13.504 , 4.009 , 3.9505 , -3.2977 , -8.1697 , 1.6729 , 6.9818 , 0.14446 , -8.1925 , -0.92981 , 5.5912 , 5.5417 , -4.2281 , 4.0914 , 5.4836 ], dtype=float32),)]
Because the text analysis part of the spaCy module can take a very long time, especially for large corpora, it can be helpful to store the results for later analyses (e.g. re-running the pipeline at a later date, modifying the pipeline, etc.). To do this, the its_prep.spacy.utils sub-module offers two helper functions: save_caches and load_caches.
With save_caches, we can efficiently store all of the analyzed texts for later use. The optional parameter file_prefix lets us more easily identify the automatically created files.
from pathlib import Path
# save the caches to /tmp
nlp.utils.save_caches(Path("/tmp/"), file_prefix="its-prep-demo")
The example above created the following files:
ls /tmp | grep "its-prep-demo"
its-prep-demo_text_to_doc_cache_docs its-prep-demo_text_to_doc_cache_keys its-prep-demo_tokens_to_doc_cache_docs its-prep-demo_tokens_to_doc_cache_keys
At a later date, we can load these cached intermediary results through the load_caches function:
import its_prep.spacy as nlp
from pathlib import Path
# load the caches from /tmp
nlp.utils.load_caches(Path("/tmp/"), file_prefix="its-prep-demo")
The tokenize_as_words / tokenize_as_lemmas functions provide optional functionality to merge named entities or noun chunks by setting the corresponding argument (merge_named_entities and merge_noun_chunks, respectively). These can be passed on to the functions within the tokenize_documents helper:
list(tokenize_documents(raw_docs, tokenize_fun=nlp.tokenize_as_words, merge_noun_chunks=True))
['Deutschland', 'ist', 'ein Bundesstaat', 'in', 'Mitteleuropa', '.', 'Er', 'hat', '16 Bundesländer', 'und', 'ist', 'als', 'freiheitlich-demokratischer', 'und', 'sozialer', 'Rechtsstaat', 'verfasst', '.', 'Die 1949 gegründete Bundesrepublik Deutschland', 'stellt', 'die jüngste Ausprägung', 'des 1871 erstmals begründeten deutschen Nationalstaates', 'dar', '.', 'Bundeshauptstadt', 'und', 'Regierungssitz', 'ist', 'Berlin', '.', 'Deutschland', 'grenzt', 'an', 'neun Staaten', ',', 'es', 'hat', 'Anteil', 'an', 'der Nord- und Ostsee', 'im', 'Norden', 'sowie', 'dem Bodensee', 'und', 'den Alpen', 'im', 'Süden', '.', 'Es', 'liegt', 'in', 'der gemäßigten Klimazone', 'und', 'verfügt', 'über 16 National- und mehr als 100 Naturparks', '.'] ['Das heutige Deutschland', 'hat', 'circa 84,4 Millionen Einwohner', 'und', 'zählt', 'bei', 'einer Fläche', 'von', '357.588', 'Quadratkilometern', 'mit', 'durchschnittlich 236 Einwohnern', 'pro', 'Quadratkilometer', 'zu', 'den dicht besiedelten Flächenstaaten', '.', 'Die bevölkerungsreichste deutsche Stadt', 'ist', 'Berlin', ';', 'weitere Metropolen', 'mit', 'mehr als einer Million Einwohnern', 'sind', 'Hamburg', ',', 'München', 'und', 'Köln', ';', 'der größte Ballungsraum', 'ist', 'das Ruhrgebiet', '.', 'Frankfurt', 'am', 'Main', 'ist', 'als', 'europäisches Finanzzentrum', 'von', 'globaler Bedeutung', '.', 'Die Geburtenrate', 'liegt', 'bei', '1,58 Kindern', 'pro', 'Frau', '(', '2021', ')', '.']
list(tokenize_documents(raw_docs, tokenize_fun=nlp.tokenize_as_words, merge_named_entities=True))
['Deutschland', 'ist', 'ein', 'Bundesstaat', 'in', 'Mitteleuropa', '.', 'Er', 'hat', '16', 'Bundesländer', 'und', 'ist', 'als', 'freiheitlich-demokratischer', 'und', 'sozialer', 'Rechtsstaat', 'verfasst', '.', 'Die', '1949', 'gegründete', 'Bundesrepublik Deutschland', 'stellt', 'die', 'jüngste', 'Ausprägung', 'des', '1871', 'erstmals', 'begründeten', 'deutschen', 'Nationalstaates', 'dar', '.', 'Bundeshauptstadt', 'und', 'Regierungssitz', 'ist', 'Berlin', '.', 'Deutschland', 'grenzt', 'an', 'neun', 'Staaten', ',', 'es', 'hat', 'Anteil', 'an', 'der', 'Nord-', 'und', 'Ostsee', 'im', 'Norden', 'sowie', 'dem', 'Bodensee', 'und', 'den', 'Alpen', 'im', 'Süden', '.', 'Es', 'liegt', 'in', 'der', 'gemäßigten', 'Klimazone', 'und', 'verfügt', 'über', '16', 'National-', 'und', 'mehr', 'als', '100', 'Naturparks', '.'] ['Das', 'heutige', 'Deutschland', 'hat', 'circa', '84,4', 'Millionen', 'Einwohner', 'und', 'zählt', 'bei', 'einer', 'Fläche', 'von', '357.588', 'Quadratkilometern', 'mit', 'durchschnittlich', '236', 'Einwohnern', 'pro', 'Quadratkilometer', 'zu', 'den', 'dicht', 'besiedelten', 'Flächenstaaten', '.', 'Die', 'bevölkerungsreichste', 'deutsche', 'Stadt', 'ist', 'Berlin', ';', 'weitere', 'Metropolen', 'mit', 'mehr', 'als', 'einer', 'Million', 'Einwohnern', 'sind', 'Hamburg', ',', 'München', 'und', 'Köln', ';', 'der', 'größte', 'Ballungsraum', 'ist', 'das', 'Ruhrgebiet', '.', 'Frankfurt am Main', 'ist', 'als', 'europäisches', 'Finanzzentrum', 'von', 'globaler', 'Bedeutung', '.', 'Die', 'Geburtenrate', 'liegt', 'bei', '1,58', 'Kindern', 'pro', 'Frau', '(', '2021', ')', '.']
- Create additional filters:
  - URLs
  - numbers and years
  - word-repetitions (already implemented in its-data)
  - by word length
  - named entities
- Allow for the use of different spaCy models.
- Improve the document caching mechanism, as it is currently a bit unwieldy to use (see load cache, save cache). Additional problems with the current approach:
  - If the cache is large, the overhead becomes very expensive (an actual database like sqlite would probably help).
  - Because the entire cache is stored in-memory, long-running sessions may run into issues.
- Split up composite words into smaller pieces (especially relevant for German, e.g. “Hochschulfach” -> “Hochschule” + “Fach”).
- Use spaCy’s Language.pipe in order to process batches of documents, as this is more efficient than processing them individually (see the sketch below).
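For reference, batched processing with spaCy’s Language.pipe, using plain spaCy and not yet integrated into this library, would look roughly like this (reusing raw_docs from the example above):
import spacy

spacy_model = spacy.load("de_core_news_lg")
# analyze the corpus in batches instead of issuing one spaCy call per document
for doc in spacy_model.pipe(raw_docs, batch_size=32):
    print([token.lemma_ for token in doc])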