The goal of this library is to streamline the process of defining natural language pre-processing pipelines, particularly in the context of educational resources.
This library can be installed either as a Nix Flake input or as a Python library.
To install this package through pip, it should be sufficient to run
pip install git+https://github.com/openeduhub/its-prep.git@main
Note that you will have to manually ensure that de_core_news_lg is installed, e.g. through
python -m spacy download de_core_news_lg
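To verify that the model is available, a quick check with plain spaCy (nothing specific to this library) is:
import spacy

# spacy.load raises an OSError with installation instructions if the model is missing
spacy.load("de_core_news_lg")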
Alternatively, to use this package through Nix, include this Flake in the inputs of your Flake:
{
  inputs = {
    its-prep.url = "github:openeduhub/its-prep";
  };
}
Then, simply apply the provided nixpkgs overlay:
{
  outputs = { self, nixpkgs, ... }:
    let
      system = "...";
      pkgs = import nixpkgs {
        inherit system;
        overlays = [ self.inputs.its-prep.overlays.default ];
      };
      python-packages = py-pkgs: [ py-pkgs.its-prep ];
      local-python = pkgs.python3.withPackages python-packages;
    in
    {...};
}
Core to this library is the abstraction of natural language pre-processing as an orchestration of multiple steps that each take a tokenized document and return a subset of its tokens, based on some internal logic. We call these steps filter functions (of type its_prep.types.Filter). A pre-processing pipeline is simply a collection of filters.
To ensure that during filtering, no necessary information is discarded (e.g. for filters that act on sentences), documents are represented through an internal, immutable format (its_prep.types.Document) that always contains both the current subset of tokens and the full contents of the original document.
Thus, a filter is simply any function that takes a document of type Document and returns another Document.
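As a purely conceptual illustration, using plain Python stand-ins rather than the library's actual classes, a filter and a pipeline can be pictured like this:
from typing import Callable

# Stand-ins for its_prep.types.Document and its_prep.types.Filter; the real
# Document is an immutable type that also retains the original document contents.
Doc = frozenset[str]
Filter = Callable[[Doc], Doc]

def keep_short_tokens(doc: Doc) -> Doc:
    # toy filter: keep only tokens shorter than 15 characters
    return frozenset(tok for tok in doc if len(tok) < 15)

pipeline: list[Filter] = [keep_short_tokens]  # a pipeline is just a collection of filters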
- its_prep.core contains core functionality, such as the application of a pre-processing pipeline onto a document corpus.
- its_prep.specs contains pre-defined pipelines in its_prep.specs.pipelines and functionality to easily define custom filter functions in its_prep.specs.filters.
- its_prep.spacy.props contains various concrete NLP processing tasks, such as lemmatization or the determination of POS tags. These are mostly used to compute the properties that the filters defined with its_prep.specs.filters are based on.
In this section, we demonstrate the functionality by applying both a pre-defined processing pipeline and multiple custom ones.
Some example texts, taken from the German Wikipedia page on Germany:
raw_docs = [
"Deutschland ist ein Bundesstaat in Mitteleuropa. Er hat 16 Bundesländer und ist als freiheitlich-demokratischer und sozialer Rechtsstaat verfasst. Die 1949 gegründete Bundesrepublik Deutschland stellt die jüngste Ausprägung des 1871 erstmals begründeten deutschen Nationalstaates dar. Bundeshauptstadt und Regierungssitz ist Berlin. Deutschland grenzt an neun Staaten, es hat Anteil an der Nord- und Ostsee im Norden sowie dem Bodensee und den Alpen im Süden. Es liegt in der gemäßigten Klimazone und verfügt über 16 National- und mehr als 100 Naturparks.",
"Das heutige Deutschland hat circa 84,4 Millionen Einwohner und zählt bei einer Fläche von 357.588 Quadratkilometern mit durchschnittlich 236 Einwohnern pro Quadratkilometer zu den dicht besiedelten Flächenstaaten. Die bevölkerungsreichste deutsche Stadt ist Berlin; weitere Metropolen mit mehr als einer Million Einwohnern sind Hamburg, München und Köln; der größte Ballungsraum ist das Ruhrgebiet. Frankfurt am Main ist als europäisches Finanzzentrum von globaler Bedeutung. Die Geburtenrate liegt bei 1,58 Kindern pro Frau (2021).",
]
In order to be able to process these texts, we first need to tokenize them in some way. This is necessary because all filters and pipelines essentially just decide which tokens to keep or discard. Note that, as a corollary, filters cannot modify tokens or remove parts of them.
We tokenize the documents using spaCy, displaying the tokens in their lemmatized form.
import its_prep.spacy as nlp
from its_prep import tokenize_documents
(docs := list(tokenize_documents(raw_docs, tokenize_fun=nlp.tokenize_as_lemmas)))
['Deutschland', 'sein', 'ein', 'Bundesstaat', 'in', 'Mitteleuropa', '--', 'er', 'haben', '16', 'Bundesland', 'und', 'sein', 'als', 'freiheitlich-demokratischer', 'und', 'sozial', 'Rechtsstaat', 'verfassen', '--', 'der', '1949', 'gegründet', 'Bundesrepublik', 'Deutschland', 'stellen', 'der', 'jung', 'Ausprägung', 'der', '1871', 'erstmals', 'begründet', 'deutsch', 'Nationalstaat', 'dar', '--', 'Bundeshauptstadt', 'und', 'Regierungssitz', 'sein', 'Berlin', '--', 'Deutschland', 'grenzen', 'an', 'neun', 'Staat', '--', 'es', 'haben', 'Anteil', 'an', 'der', 'Nord', 'und', 'Ostsee', 'in', 'Norden', 'sowie', 'der', 'Bodensee', 'und', 'der', 'Alpen', 'in', 'Süden', '--', 'es', 'liegen', 'in', 'der', 'gemäßigt', 'Klimazone', 'und', 'verfügen', 'über', '16', 'National', 'und', 'mehr', 'als', '100', 'Naturpark', '--'] ['der', 'heutig', 'Deutschland', 'haben', 'circa', '84,4', 'Million', 'Einwohner', 'und', 'zählen', 'bei', 'ein', 'Fläche', 'von', '357.588', 'Quadratkilometer', 'mit', 'durchschnittlich', '236', 'Einwohner', 'pro', 'Quadratkilometer', 'zu', 'der', 'dicht', 'besiedelt', 'Flächenstaate', '--', 'der', 'bevölkerungsreichste', 'deutsch', 'Stadt', 'sein', 'Berlin', '--', 'weit', 'Metropole', 'mit', 'mehr', 'als', 'ein', 'Million', 'Einwohner', 'sein', 'Hamburg', '--', 'München', 'und', 'Köln', '--', 'der', 'groß', 'Ballungsraum', 'sein', 'der', 'Ruhrgebiet', '--', 'Frankfurt', 'an', 'Main', 'sein', 'als', 'europäisch', 'Finanzzentrum', 'von', 'global', 'Bedeutung', '--', 'der', 'Geburtenrate', 'liegen', 'bei', '1,58', 'Kind', 'pro', 'Frau', '--', '2021', '--', '--']
A collection of common pipelines can be found in the specs.pipelines sub-module, and a collection of common NLP steps, implemented in spaCy, can be found in spacy.props.
import its_prep.spacy as nlp
from its_prep import pipelines
Next, we apply the poc_topic_modeling pipeline, which aims to extract only the data that is relevant to the semantic context of the given document. This is done by
- filtering based on unwanted universal POS tags (punctuation and white-space)
- filtering out stop words (as determined by spaCy)
- filtering out lemmatized tokens which we expect to have no impact on the semantic context of the document in the context of learning resources
- filtering out particularly rare or frequent lemmatized tokens
Since we are only dealing with two documents here, we adjust the required interval for the document frequency in the last step to be unbounded in both directions, thus skipping this step.
pipelines.apply_poc_topic_modeling(docs, required_df_interval={})
['Deutschland', 'Bundesstaat', 'Mitteleuropa', '16', 'Bundesland', 'freiheitlich-demokratischer', 'sozial', 'Rechtsstaat', 'verfassen', '1949', 'gegründet', 'Bundesrepublik', 'Deutschland', 'stellen', 'jung', 'Ausprägung', '1871', 'erstmals', 'begründet', 'deutsch', 'Nationalstaat', 'dar', 'Bundeshauptstadt', 'Regierungssitz', 'Berlin', 'Deutschland', 'grenzen', 'Staat', 'Anteil', 'Nord', 'Ostsee', 'Norden', 'Bodensee', 'Alpen', 'Süden', 'liegen', 'gemäßigt', 'Klimazone', 'verfügen', '16', 'National', '100', 'Naturpark'] ['heutig', 'Deutschland', 'circa', '84,4', 'Million', 'Einwohner', 'zählen', 'Fläche', '357.588', 'Quadratkilometer', 'durchschnittlich', '236', 'Einwohner', 'pro', 'Quadratkilometer', 'dicht', 'besiedelt', 'Flächenstaate', 'bevölkerungsreichste', 'deutsch', 'Stadt', 'Berlin', 'Metropole', 'Million', 'Einwohner', 'Hamburg', 'München', 'Köln', 'groß', 'Ballungsraum', 'Ruhrgebiet', 'Frankfurt', 'Main', 'europäisch', 'Finanzzentrum', 'global', 'Bedeutung', 'Geburtenrate', 'liegen', '1,58', 'Kind', 'pro', 'Frau', '2021']
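For comparison, the first two of these steps could also be assembled by hand from the filter factories introduced in the next section. The following sketch reuses the tokenized docs from above and relies only on helpers demonstrated below; whether poc_topic_modeling uses exactly these POS tags internally is an assumption:
import its_prep.spacy as nlp
from its_prep import filters, apply_filters

manual_pipeline = [
    # remove punctuation and white-space tokens by universal POS tag
    filters.negated(filters.get_filter_by_property(nlp.get_upos, {"PUNCT", "SPACE"})),
    # remove stop words
    filters.negated(filters.get_filter_by_bool_fun(nlp.is_stop)),
]
list(apply_filters(docs, manual_pipeline))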
A pipeline is defined simply as a sequence of filtering functions that take a document as their argument and return a subset of that document. Thus, defining a custom pipeline is equivalent to defining a number of such filtering functions.
In the filters sub-module, we have defined multiple factory functions that should make it much easier to define filters from NLP processing steps (e.g. those defined in spacy.props).
We also import the apply_filters helper function, which is a convenient way to apply a pipeline to a document corpus.
from its_prep import filters, apply_filters
Say we wanted to return only the verbs in the given documents. This could be achieved through
only_verbs_pipeline = [filters.get_filter_by_property(nlp.get_upos, {"VERB"})]
We can apply the filters from our pipeline using the apply_filters function:
list(apply_filters(docs, only_verbs_pipeline))
['liegen', 'verfügen', 'grenzen', 'verfassen', 'stellen'] ['zählen', 'liegen']
Maybe we also want to filter out stop words. For this, we utilize filters.negated, which modifies a given filter function such that its results will be removed, rather than kept:
non_stop_verbs_pipeline = only_verbs_pipeline + [
filters.negated(filters.get_filter_by_bool_fun(nlp.is_stop))
]
list(apply_filters(docs, non_stop_verbs_pipeline))
['liegen', 'verfügen', 'grenzen', 'verfassen', 'stellen'] ['zählen', 'liegen']
Next, we could include only sentences that are at least 20 tokens long:
long_sents_pipeline = [filters.get_filter_by_subset_len(nlp.into_sentences, min_len=20)]
list(apply_filters(docs, long_sents_pipeline))
['Deutschland', 'grenzen', 'an', 'neun', 'Staat', '--', 'es', 'haben', 'Anteil', 'an', 'der', 'Nord', 'und', 'Ostsee', 'in', 'Norden', 'sowie', 'der', 'Bodensee', 'und', 'der', 'Alpen', 'in', 'Süden', '--'] ['der', 'heutig', 'Deutschland', 'haben', 'circa', '84,4', 'Million', 'Einwohner', 'und', 'zählen', 'bei', 'ein', 'Fläche', 'von', '357.588', 'Quadratkilometer', 'mit', 'durchschnittlich', '236', 'Einwohner', 'pro', 'Quadratkilometer', 'zu', 'der', 'dicht', 'besiedelt', 'Flächenstaate', '--']
And then only consider the non-stop verbs of those sentences:
list(apply_filters(docs, long_sents_pipeline + non_stop_verbs_pipeline))
['grenzen'] ['zählen']
Note that due to the internal document representation and the implementation of the processing steps with spaCy, the order of these filters does not matter here; we could also first filter by non-stop verbs and then by long sentences, and still get the same result.
list(apply_filters(docs, non_stop_verbs_pipeline + long_sents_pipeline))
['grenzen'] ['zählen']
Finally, we could return the tokens as word embeddings:
from its_prep import selected_properties
processed_docs = apply_filters(docs, non_stop_verbs_pipeline + long_sents_pipeline)
list(selected_properties(processed_docs, nlp.get_word_vectors))
[(array([ 8.0732e-01, 1.3723e+00, -5.3698e-01, -2.5742e+00, -3.2238e+00, 1.0097e+00, 4.4086e-01, -4.7211e-01, -4.0149e-01, -9.0866e-02, -5.7923e-01, -2.3407e-01, -1.3298e-01, -3.3136e-01, 5.9846e-01, 1.1656e+00, -9.2490e-02, 4.6380e-01, -9.3929e-01, -1.4071e+00, -9.8949e-01, 6.5838e-01, -6.9155e-02, -8.6060e-01, -5.1545e-01, 2.5759e+00, 1.4891e+00, 4.9595e-01, -9.8937e-01, -2.6891e+00, 2.5979e+00, -1.2663e-02, 1.0032e+00, -1.3726e+00, -2.8285e+00, -1.3964e+00, 1.2408e+00, -1.2980e+00, 2.4175e+00, -1.1108e-01, -2.7270e+00, 2.2405e+00, -1.1174e+00, -4.5039e-01, 2.0047e-01, -1.0562e+00, 9.7447e-01, -5.0296e-01, 3.5433e-01, 2.5035e-01, -1.3689e-03, -4.3711e-01, 1.7272e+00, 9.9553e-02, -1.7881e+00, 3.5289e+00, -1.9781e+00, 1.6410e+00, -6.6308e-01, -7.7131e-01, 2.2469e-01, -2.1614e+00, 1.5713e+00, -6.6373e-02, 2.2343e+00, 1.8830e-01, 3.1996e+00, -1.4663e+00, 2.7145e+00, 6.6561e-02, -1.8314e+00, 7.0784e-01, 3.4304e-01, -9.4780e-01, -2.0042e+00, 9.8858e-01, -1.6424e+00, -1.9604e+00, 4.6556e-01, -2.2256e-01, -6.6066e-01, -1.6972e-01, 2.1931e+00, 2.0852e+00, -4.3692e-01, -1.4755e-01, 6.1204e-01, 1.6330e+00, -1.2510e+00, -1.5489e+00, -1.7016e-01, -2.1122e+00, -1.6820e+00, -1.5734e+00, 1.3483e+00, -2.0566e+00, -1.5619e+00, -1.2128e+00, -5.0591e-01, 1.8785e+00, -7.6197e-01, 2.5904e+00, 4.3581e-02, -7.4048e-01, -1.7516e+00, 4.0088e-02, -3.1468e+00, 5.5322e-01, 9.1766e-01, 5.9854e-01, 8.9359e-02, 1.2874e+00, -3.8819e+00, 7.1359e-01, 1.8177e-02, 4.1450e-01, -5.0730e-03, -7.8042e-01, -4.2483e-01, 5.5403e-01, 4.7233e-01, -1.4523e+00, 4.0898e+00, 1.5544e+00, 1.0186e+00, 5.8939e-01, 9.7547e-01, 3.7954e-01, 2.1430e-01, 8.3903e-01, 2.4668e+00, -1.5191e+00, -8.0581e-01, -8.9338e-02, 8.9130e-01, 5.8869e-01, 3.1070e-01, 1.0650e+00, 7.1601e-01, 1.6895e+00, 3.2277e+00, -1.5874e+00, 5.5910e-01, -1.3833e+00, -1.4956e+00, -1.7087e+00, 1.4052e+00, -1.5890e-01, 6.7932e-01, -2.1383e+00, 4.0109e-01, 1.1472e+00, -1.9935e+00, 6.4681e-01, 4.7015e-01, -1.9115e+00, -1.3163e+00, -7.0171e-01, 9.4038e-01, -1.4081e+00, 4.1968e-01, 1.1788e-01, 2.3435e+00, -1.2764e+00, 1.8269e+00, 3.2451e+00, 2.3452e+00, 1.4542e+00, 3.1509e+00, 6.8695e-02, 8.3664e-01, 4.6488e-01, -1.9003e-02, -1.6531e+00, 1.8218e+00, -2.0497e-01, 1.9792e+00, -5.8036e-01, 7.1097e-01, 4.4142e-01, -1.6841e+00, 8.4975e-01, 7.5711e-01, -1.8469e+00, 1.0284e+00, -2.3016e+00, 1.1380e+00, 3.5043e-01, -1.0574e+00, 1.2962e+00, -5.5338e-01, -9.9845e-01, -9.5507e-01, -2.0559e+00, -9.4054e-01, -8.8928e-01, 4.3253e-01, -1.5722e+00, 1.2591e+00, -1.4304e+00, 2.5843e+00, -6.2330e-01, -8.9084e-01, -1.6198e+00, -1.2893e-01, -2.5204e-02, 1.2725e+00, 4.1957e+00, 3.9992e-01, 2.3022e-02, -6.6594e-01, 9.8386e-02, -5.3294e-01, -3.2918e+00, 2.7032e+00, -1.7077e+00, 2.2714e+00, -4.8674e-01, -8.1132e-01, 1.6060e-02, 1.1888e-01, -5.0967e-01, -9.2476e-02, 1.1875e+00, 2.1304e-01, 4.5259e-01, 2.2360e+00, -3.7786e-01, 5.5660e-01, 1.6240e+00, -2.1650e+00, 2.8050e-01, -2.5134e+00, -1.5938e+00, 4.9367e-01, 1.1155e+00, -1.6824e+00, 7.3233e-01, -1.1471e-01, 4.8059e-02, 7.3429e-01, -2.6168e+00, 1.9935e+00, -3.5876e-01, -4.9866e-01, -5.4019e-01, 2.5438e+00, -2.2357e+00, -1.7544e+00, 7.0646e-01, -2.5509e+00, -2.7299e+00, -1.0948e+00, 8.3585e-01, 2.4147e+00, -1.8732e-01, -1.4595e+00, -1.1231e+00, 1.0216e+00, 2.3719e+00, 2.0988e-02, -1.0418e+00, -7.4748e-01, -7.1275e-01, -1.2673e+00, -4.4957e-01, -1.3182e-01, 1.1091e+00, 1.1875e+00, -1.2288e+00, 3.0913e+00, -1.5496e-01, 1.3854e+00, 4.7156e-01, 5.2295e-01, 3.8921e+00, 1.9631e+00, -4.3928e-01, 7.1788e-01, 2.5766e+00, -1.8608e+00, 2.0194e+00, 1.5293e+00, 
5.6542e-01, 1.6145e+00, -1.4601e+00, 1.1504e+00, -1.3161e+00, -1.4911e+00, -1.2753e+00, 9.3347e-01, -1.4765e+00, 1.4941e+00, -1.6196e+00, 2.9735e+00, 1.7396e+00, 1.2293e+00, -1.2182e+00, 4.1059e+00, 3.4958e+00], dtype=float32),), (array([ -0.65404 , 1.7489 , 1.3238 , 10.689 , -3.8354 , -3.4176 , 2.8458 , 6.9475 , 8.3795 , -9.7266 , 1.8059 , -7.0959 , 0.65468 , 4.8942 , 0.11935 , 4.2329 , 6.4581 , -10.205 , -2.4653 , 5.1304 , -1.5249 , -0.33113 , -8.9162 , 0.68117 , 0.82788 , 4.4263 , 2.7534 , -2.8395 , -1.9882 , -0.074579, 0.8485 , 0.32367 , 1.7496 , 0.53606 , -9.8885 , -6.2691 , -5.5006 , -14.412 , -3.3748 , -0.25372 , -1.9325 , 2.0588 , -3.4462 , 3.3861 , 0.77796 , 2.1809 , -5.0384 , -9.1702 , -6.5893 , 1.7999 , 3.087 , -1.6311 , 2.0203 , -3.5758 , 2.8109 , 1.7471 , 1.273 , 4.2303 , 2.2095 , -4.1308 , 0.43882 , -0.025106, 11.237 , 7.2951 , -3.6524 , -2.4187 , -5.1483 , 4.5378 , 4.4823 , 4.6328 , -6.2347 , 8.1855 , 1.1755 , 4.665 , 3.1928 , 7.0662 , -5.9556 , -0.049054, 4.5369 , 7.3604 , 3.2199 , 2.9817 , -0.91019 , -0.10366 , 3.7303 , 1.3639 , -6.3505 , 3.2573 , -3.0636 , -1.2138 , -0.55653 , 0.65678 , 1.386 , -4.523 , -4.4014 , -11.258 , -7.1273 , -3.1885 , -10.099 , -1.9632 , -3.6254 , 2.8138 , 1.5372 , 1.0629 , 9.1253 , -7.4737 , 1.6548 , -3.5463 , 1.1836 , -6.5327 , -0.79816 , -2.8768 , 11.611 , 8.4142 , 2.2678 , -2.2952 , 5.5559 , 3.1747 , -0.31059 , 6.462 , -7.1135 , 8.5853 , 2.5396 , -2.6323 , 4.646 , 3.4337 , 2.3632 , 1.4226 , -1.6031 , 0.80204 , 5.7058 , 0.503 , 3.568 , -3.7796 , 7.1354 , 2.3961 , -13.556 , 1.9282 , 5.6909 , 2.0506 , -3.1402 , 3.6114 , -8.8778 , 1.9274 , -3.2288 , 3.7178 , -4.4902 , 2.5428 , 6.5181 , 1.5672 , -6.4375 , -4.24 , -9.6795 , 6.336 , -0.92769 , -2.2876 , -1.8932 , 4.4791 , 7.1149 , 0.16114 , 6.8786 , 7.05 , -2.2451 , -1.6941 , -8.1168 , -3.1198 , -2.0878 , -0.96782 , -1.0722 , 3.6036 , -3.9225 , 3.4282 , -4.831 , -7.046 , 3.6809 , 3.26 , 1.1511 , 5.5712 , 0.46504 , -7.4492 , 3.6167 , 3.6889 , -2.4359 , 4.101 , -0.64437 , 1.0575 , 5.4622 , -2.3978 , -7.6296 , -1.5451 , -1.6866 , -3.224 , 1.8545 , -6.3787 , 6.178 , -4.2001 , 1.5448 , 10.733 , 5.1482 , 10.758 , 2.1271 , -3.1391 , -3.886 , -3.0535 , 4.441 , -8.5508 , -2.5373 , -0.55043 , -12.688 , 3.4997 , 5.4011 , 0.04654 , 5.4789 , 3.9713 , -0.91285 , 5.9462 , 4.0507 , -1.0129 , 2.4831 , -1.5431 , 1.6657 , -3.8428 , -7.2476 , -4.0296 , 0.45018 , 7.5467 , 2.2629 , 3.8569 , 5.4011 , -4.5573 , -4.4017 , -8.499 , -2.7771 , 3.5199 , -2.0077 , -3.9201 , 0.10954 , 0.49075 , 1.3402 , 0.1557 , 0.14562 , 0.24715 , -2.6996 , 0.63908 , 7.2157 , -5.6317 , 1.1188 , -0.55071 , -6.0426 , 1.4444 , -3.999 , 3.4556 , -1.7449 , -12.314 , -4.0322 , 8.7758 , 8.5515 , -1.6411 , 0.41293 , 10.413 , 9.4247 , -3.6229 , -4.8699 , -5.9972 , 2.8773 , 11.198 , -7.8787 , -4.5297 , -4.3373 , 5.0899 , 1.7842 , -0.24692 , -6.2276 , -3.3438 , -7.1623 , -8.7322 , -0.40021 , -1.9681 , 2.585 , 6.8649 , -7.3689 , 0.12819 , -5.8111 , -3.2717 , 5.4081 , -13.504 , 4.009 , 3.9505 , -3.2977 , -8.1697 , 1.6729 , 6.9818 , 0.14446 , -8.1925 , -0.92981 , 5.5912 , 5.5417 , -4.2281 , 4.0914 , 5.4836 ], dtype=float32),)]
Because the text analysis part of the spaCy module can take a very long time, especially for large corpora, it can be helpful to store the results for later analyses (e.g. re-running the pipeline at a later date, modifying the pipeline, etc.). To do this, the its_prep.spacy.utils sub-module offers two helper functions: save_caches and load_caches.
With save_caches, we can efficiently store all of the analyzed texts for later use. The optional parameter file_prefix lets us more easily identify the automatically created files.
from pathlib import Path
# save the caches to /tmp
nlp.utils.save_caches(Path("/tmp/"), file_prefix="its-prep-demo")
The example above created the following files:
ls /tmp | grep "its-prep-demo"
its-prep-demo_text_to_doc_cache_docs its-prep-demo_text_to_doc_cache_keys its-prep-demo_tokens_to_doc_cache_docs its-prep-demo_tokens_to_doc_cache_keys
At a later date, we can load these cached intermediary results through the load_caches function:
import its_prep.spacy as nlp
from pathlib import Path
# load the caches from /tmp
nlp.utils.load_caches(Path("/tmp/"), file_prefix="its-prep-demo")
The tokenize_as_words / tokenize_as_lemmas functions provide optional functionality to merge named entities or noun chunks by setting the corresponding argument (merge_named_entities and merge_noun_chunks, respectively). These can be passed on to the functions within the tokenize_documents helper:
list(tokenize_documents(raw_docs, tokenize_fun=nlp.tokenize_as_words, merge_noun_chunks=True))
['Deutschland', 'ist', 'ein Bundesstaat', 'in', 'Mitteleuropa', '.', 'Er', 'hat', '16 Bundesländer', 'und', 'ist', 'als', 'freiheitlich-demokratischer', 'und', 'sozialer', 'Rechtsstaat', 'verfasst', '.', 'Die 1949 gegründete Bundesrepublik Deutschland', 'stellt', 'die jüngste Ausprägung', 'des 1871 erstmals begründeten deutschen Nationalstaates', 'dar', '.', 'Bundeshauptstadt', 'und', 'Regierungssitz', 'ist', 'Berlin', '.', 'Deutschland', 'grenzt', 'an', 'neun Staaten', ',', 'es', 'hat', 'Anteil', 'an', 'der Nord- und Ostsee', 'im', 'Norden', 'sowie', 'dem Bodensee', 'und', 'den Alpen', 'im', 'Süden', '.', 'Es', 'liegt', 'in', 'der gemäßigten Klimazone', 'und', 'verfügt', 'über 16 National- und mehr als 100 Naturparks', '.'] ['Das heutige Deutschland', 'hat', 'circa 84,4 Millionen Einwohner', 'und', 'zählt', 'bei', 'einer Fläche', 'von', '357.588', 'Quadratkilometern', 'mit', 'durchschnittlich 236 Einwohnern', 'pro', 'Quadratkilometer', 'zu', 'den dicht besiedelten Flächenstaaten', '.', 'Die bevölkerungsreichste deutsche Stadt', 'ist', 'Berlin', ';', 'weitere Metropolen', 'mit', 'mehr als einer Million Einwohnern', 'sind', 'Hamburg', ',', 'München', 'und', 'Köln', ';', 'der größte Ballungsraum', 'ist', 'das Ruhrgebiet', '.', 'Frankfurt', 'am', 'Main', 'ist', 'als', 'europäisches Finanzzentrum', 'von', 'globaler Bedeutung', '.', 'Die Geburtenrate', 'liegt', 'bei', '1,58 Kindern', 'pro', 'Frau', '(', '2021', ')', '.']
list(tokenize_documents(raw_docs, tokenize_fun=nlp.tokenize_as_words, merge_named_entities=True))
['Deutschland', 'ist', 'ein', 'Bundesstaat', 'in', 'Mitteleuropa', '.', 'Er', 'hat', '16', 'Bundesländer', 'und', 'ist', 'als', 'freiheitlich-demokratischer', 'und', 'sozialer', 'Rechtsstaat', 'verfasst', '.', 'Die', '1949', 'gegründete', 'Bundesrepublik Deutschland', 'stellt', 'die', 'jüngste', 'Ausprägung', 'des', '1871', 'erstmals', 'begründeten', 'deutschen', 'Nationalstaates', 'dar', '.', 'Bundeshauptstadt', 'und', 'Regierungssitz', 'ist', 'Berlin', '.', 'Deutschland', 'grenzt', 'an', 'neun', 'Staaten', ',', 'es', 'hat', 'Anteil', 'an', 'der', 'Nord-', 'und', 'Ostsee', 'im', 'Norden', 'sowie', 'dem', 'Bodensee', 'und', 'den', 'Alpen', 'im', 'Süden', '.', 'Es', 'liegt', 'in', 'der', 'gemäßigten', 'Klimazone', 'und', 'verfügt', 'über', '16', 'National-', 'und', 'mehr', 'als', '100', 'Naturparks', '.'] ['Das', 'heutige', 'Deutschland', 'hat', 'circa', '84,4', 'Millionen', 'Einwohner', 'und', 'zählt', 'bei', 'einer', 'Fläche', 'von', '357.588', 'Quadratkilometern', 'mit', 'durchschnittlich', '236', 'Einwohnern', 'pro', 'Quadratkilometer', 'zu', 'den', 'dicht', 'besiedelten', 'Flächenstaaten', '.', 'Die', 'bevölkerungsreichste', 'deutsche', 'Stadt', 'ist', 'Berlin', ';', 'weitere', 'Metropolen', 'mit', 'mehr', 'als', 'einer', 'Million', 'Einwohnern', 'sind', 'Hamburg', ',', 'München', 'und', 'Köln', ';', 'der', 'größte', 'Ballungsraum', 'ist', 'das', 'Ruhrgebiet', '.', 'Frankfurt am Main', 'ist', 'als', 'europäisches', 'Finanzzentrum', 'von', 'globaler', 'Bedeutung', '.', 'Die', 'Geburtenrate', 'liegt', 'bei', '1,58', 'Kindern', 'pro', 'Frau', '(', '2021', ')', '.']
- Create additional filters:
  - URLs
  - numbers and years
  - word-repetitions (already implemented in its-data)
  - by word length
  - named entities
- Allow for the use of different spaCy models.
- Improve the document caching mechanism, as it is currently a bit unwieldy to use (see load cache, save cache). Additional problems with the current approach:
  - If the cache is large, the overhead becomes very expensive (an actual database like sqlite would probably help).
  - Because the entire cache is stored in-memory, long-running sessions may run into issues.
- Split up composite words into smaller pieces (especially relevant for German, e.g. “Hochschulfach” -> “Hochschule” + “Fach”).
- Use spaCy’s Language.pipe in order to process batches of documents, as this is more efficient than processing them individually (see the sketch below).
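For reference, batched processing with spaCy’s Language.pipe, using plain spaCy and not yet integrated into this library, would look roughly like this (reusing raw_docs from the example above):
import spacy

spacy_model = spacy.load("de_core_news_lg")
# analyze the corpus in batches instead of issuing one spaCy call per document
for doc in spacy_model.pipe(raw_docs, batch_size=32):
    print([token.lemma_ for token in doc])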