A Bayesian approach to metadata prediction in education

openeduhub/its-jointprobability

its-jointprobability

A Bayesian approach to generating metadata for educational materials.

This project is primarily intended to be used as a microservice through the Nix package. It also includes some CLI utilities for (re-)training the model on your own data (data not included).

The metadata fields that are currently being predicted are:

  • School discipline
  • University discipline
  • Resource type
  • Target audience
  • Educational context
  • Topic (following the WLO topic tree)

Usage

Service

With Nix, no further installation is required to run the microservice. Simply run the following command:

nix run github:openeduhub/its-jointprobability

or optionally, with CUDA support:

nix run "github:openeduhub/its-jointprobability#with-cuda"

If the package has been installed locally, the service is also available as its-jointprobability from the command line.

For more information on configuration options, see

nix run github:openeduhub/its-jointprobability -- --help

Once the service has started, see the Swagger UI for documentation on the service. By default, it is located at http://localhost:8080/docs.

Model Training

To retrain the model on your own data, use the included retrain-model CLI tool, e.g. through

nix run "github:openeduhub/its-jointprobability#retrain-model" -- <path/to/data-dir>

or, highly recommended, with CUDA:

nix run "github:openeduhub/its-jointprobability#retrain-model-with-cuda" -- <path/to/data-dir>

The utility will look for train_data_labeled and train_targets, which are assumed to be files that can be loaded through torch.load. These should be (float-type) torch.Tensor objects with the following content:

train_data_labeled
a two-dimensional Tensor where the first dimension corresponds to the individual documents to use for training and the second dimension contains each document’s content, encoded through its bag-of-words representation.
train_targets
a two-dimensional Tensor where the first dimension corresponds to the individual documents to use for training and the second dimension encodes whether each document belongs to each discipline (1.0 if it does, 0.0 otherwise).
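
As a rough illustration of the two layouts described above, the following sketch builds a bag-of-words matrix and a 0.0/1.0 target matrix from tokenized documents. The helper names (bag_of_words, multi_hot, save_as_tensors), the toy vocabulary, and the use of raw word counts are assumptions made for this example; the project's own preprocessing and vocabulary may differ.

```python
from collections import Counter


def bag_of_words(docs, vocab):
    """Encode each tokenized document as a row of word counts over vocab."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for tokens in docs:
        counts = Counter(t for t in tokens if t in index)
        row = [0.0] * len(vocab)
        for word, n in counts.items():
            row[index[word]] = float(n)
        rows.append(row)
    return rows


def multi_hot(doc_labels, categories):
    """Encode label sets as rows of 1.0 (assigned) / 0.0 (not assigned)."""
    return [[1.0 if c in labels else 0.0 for c in categories] for labels in doc_labels]


def save_as_tensors(data_dir, bow_rows, target_rows):
    """Hypothetical helper: store both matrices so that torch.load can read them."""
    from pathlib import Path

    import torch  # imported lazily so the encoding helpers work without torch

    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    torch.save(torch.tensor(bow_rows, dtype=torch.float), data_dir / "train_data_labeled")
    torch.save(torch.tensor(target_rows, dtype=torch.float), data_dir / "train_targets")


# toy data: two documents, a three-word vocabulary, two categories
vocab = ["dreieck", "hypotenuse", "zelle"]
docs = [["dreieck", "hypotenuse", "dreieck"], ["zelle"]]
bow = bag_of_words(docs, vocab)
targets = multi_hot([{"Mathematik"}, {"Biologie"}], ["Biologie", "Mathematik"])
```

Calling save_as_tensors("my-data-dir", bow, targets) would then write the two files; whether the training tool needs further files in the directory is not covered by this sketch.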

Once the data has been loaded, the topic model is trained (this will take a long time) and saved in the given data directory under prodslda. If this file already exists, this step is skipped.

Finally, the Bayesian classification model is trained and saved under classification. At this point, some quality metrics will be computed for the model on the training data. If test_data_labeled and test_targets are present in the given directory (analogous to the training data), these quality metrics will also be computed for this testing data.

Multiple data directories

It is possible to provide multiple data directories to the training process. The data will then be concatenated automatically (shared target fields are combined if they have exactly the same name).

Note, however, that any missing fields will be assumed to be filled with non-assignments. While this is no issue for training, as such documents will not be “observed” by the model, it can be highly problematic for evaluation on a separate data set. This is because, for the separate data set, any non-assignment (even if all categories are missing) is considered to be correct. We therefore recommend running the evaluation (with the --eval-only flag of the retrain-model executable) separately for each data source.
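
To make the evaluation caveat concrete, here is a small sketch of concatenation with zero-filled missing fields. It illustrates the behaviour described above and is not the project's implementation; the dictionary layout and the function names are assumptions.

```python
def concatenate_targets(sources):
    """Stack per-field target rows from several sources, zero-filling missing fields."""
    # number of categories for every field that occurs in at least one source
    widths = {}
    for source in sources:
        for field, rows in source["targets"].items():
            widths[field] = len(rows[0])

    combined = {field: [] for field in widths}
    for source in sources:
        for field, width in widths.items():
            rows = source["targets"].get(field)
            if rows is None:
                # the source does not provide this field: fill with non-assignments
                rows = [[0.0] * width for _ in range(source["n_docs"])]
            combined[field].extend(rows)
    return combined


def all_zero_accuracy(rows):
    """Share of documents on which a predictor that never assigns anything is 'correct'."""
    return sum(1 for row in rows if not any(row)) / len(rows)


source_a = {"n_docs": 2, "targets": {"discipline": [[1.0, 0.0], [0.0, 1.0]]}}
source_b = {"n_docs": 1, "targets": {"audience": [[1.0]]}}
combined = concatenate_targets([source_a, source_b])
```

In this toy case, a predictor that never assigns an audience would be counted as correct on both documents from source_a, simply because that source carries no audience labels. Evaluating each source on its own avoids this inflation.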

As a Python Library

For larger-scale analyses, using the model through a REST API may not be very convenient, especially because batches are not properly parallelized, so the hardware is utilized far less than would otherwise be possible.

For such use-cases, using the Python library directly is recommended. See Python Library for details on how to install the library through the provided nixpkgs overlay. Alternatively, using pip may also work.

For Inference

To load a pre-trained model, e.g. from https://gitlab.gwdg.de/jopitz/its-jointprobability-model, load the corresponding ProdSLDA_{kwargs|pyro|state}.pt files using its_jointprobability.data.load_model:

from pathlib import Path
import torch
from its_jointprobability.models.prodslda import ProdSLDA
from its_jointprobability.data import load_model

model_path = Path("../its-jointprobability-model")
# use CUDA if it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_model(ProdSLDA, model_path, device=device)

print(model)

Now, we can run inference on arbitrary texts by simply using the predict_from_texts method of the model:

texts = [
    "Der Satz des Pythagoras lautet: a^2 + b^2 = c^2",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.",
]

predictions = list(
    model.predict_from_texts(*texts, tokens=model.vocab, num_samples=1000)
)
posterior sample: 100% 4/4 [00:03<00:00,  1.16it/s]
posterior sample: 100% 19/19 [00:16<00:00,  1.15it/s]
from pprint import pprint

# print the most relevant predictions for the university discipline
print("Most relevant")
print("-------------")
for text, prediction in zip(texts, predictions):
    print(text)
    pprint(
        sorted(
            prediction["properties.ccm:oeh_taxonid_university"],
            key=lambda x: x.baseline_diff,
            reverse=True,
        )[:5]
    )
    print()
    
# print the least relevant predictions for the university discipline
print("Least relevant")
print("--------------")
for text, prediction in zip(texts, predictions):
    print(text)
    pprint(
        sorted(
            prediction["properties.ccm:oeh_taxonid_university"],
            key=lambda x: x.baseline_diff,
            reverse=False,
        )[:5]
    )
    print()
Most relevant
-------------
Der Satz des Pythagoras lautet: a^2 + b^2 = c^2
[Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n4', name='Mathematik, Naturwissenschaften', mean_prob=0.3017423450946808, median_prob=0.22666868567466736, baseline_diff=0.20117199420928955, prob_interval=[9.771261102287099e-05, 0.5759228467941284]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n42', name='Studienbereich Biologie', mean_prob=0.4470081627368927, median_prob=0.39806926250457764, baseline_diff=0.16034385561943054, prob_interval=[1.2577129382407293e-07, 0.9183201789855957]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n78', name='Musik, Musikwissenschaft', mean_prob=0.17592881619930267, median_prob=0.021890051662921906, baseline_diff=0.1290142834186554, prob_interval=[5.838738204033689e-08, 0.35911470651626587]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n3', name='Rechts-, Wirtschafts- und Sozialwissenschaften', mean_prob=0.3752068281173706, median_prob=0.24832701683044434, baseline_diff=0.08998405933380127, prob_interval=[4.34091225542943e-06, 0.8183690309524536]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n181', name='Wirtschaftspädagogik', mean_prob=0.11366435885429382, median_prob=1.3311614566191565e-07, baseline_diff=0.0890774130821228, prob_interval=[1.2732709377409786e-14, 0.05412125959992409])]

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
[Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n78', name='Musik, Musikwissenschaft', mean_prob=0.34734925627708435, median_prob=0.1575605869293213, baseline_diff=0.3004347085952759, prob_interval=[2.159974201276782e-06, 0.8068367838859558]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n181', name='Wirtschaftspädagogik', mean_prob=0.1526513695716858, median_prob=8.964435416203287e-09, baseline_diff=0.12806442379951477, prob_interval=[1.6043836342400915e-16, 0.20163895189762115]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n4', name='Mathematik, Naturwissenschaften', mean_prob=0.21768686175346375, median_prob=0.1302599310874939, baseline_diff=0.11711650341749191, prob_interval=[7.24967912901775e-06, 0.40643513202667236]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n42', name='Studienbereich Biologie', mean_prob=0.39756515622138977, median_prob=0.31484803557395935, baseline_diff=0.11090084910392761, prob_interval=[3.0698060982103925e-07, 0.8499296307563782]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n41', name='Studienbereich Pharmazie', mean_prob=0.05953460931777954, median_prob=7.371593405822697e-17, baseline_diff=0.057155806571245193, prob_interval=[0.0, 1.3084762031212449e-05])]

Least relevant
--------------
Der Satz des Pythagoras lautet: a^2 + b^2 = c^2
[Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n270', name='Berufs- und Wirtschaftspädagogik', mean_prob=0.06220154091715813, median_prob=1.6106645489344373e-05, baseline_diff=-0.14374108612537384, prob_interval=[8.124513973910286e-13, 0.017431309446692467]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n33', name='Erziehungswissenschaften', mean_prob=0.03038620389997959, median_prob=0.005670872982591391, baseline_diff=-0.04712875187397003, prob_interval=[2.174835032064948e-07, 0.03624638915061951]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n9', name='Kunst, Kunstwissenschaft', mean_prob=0.0051702880300581455, median_prob=0.000635757518466562, baseline_diff=-0.025345630943775177, prob_interval=[1.3144643844498205e-06, 0.003312920220196247]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n1', name='Geisteswissenschaften', mean_prob=0.0008881543180905282, median_prob=0.00011942853598156944, baseline_diff=-0.02490326389670372, prob_interval=[6.228675353270319e-09, 0.0005768005503341556]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n021', name='Betriebswirtschaftslehre', mean_prob=8.811481961856771e-07, median_prob=1.423402801875584e-11, baseline_diff=-0.005961479153484106, prob_interval=[5.207006093644107e-23, 1.990181575095562e-09])]

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
[Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n270', name='Berufs- und Wirtschaftspädagogik', mean_prob=0.022455664351582527, median_prob=7.832729352230672e-06, baseline_diff=-0.1834869682788849, prob_interval=[5.903206395929672e-11, 0.0027998590376228094]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n33', name='Erziehungswissenschaften', mean_prob=0.03551540523767471, median_prob=0.007764234207570553, baseline_diff=-0.04199954867362976, prob_interval=[9.803572709188302e-08, 0.048915356397628784]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n1', name='Geisteswissenschaften', mean_prob=0.00694312946870923, median_prob=0.0003854742681141943, baseline_diff=-0.01884828880429268, prob_interval=[1.6843619832229706e-08, 0.004069851711392403]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n9', name='Kunst, Kunstwissenschaft', mean_prob=0.013417928479611874, median_prob=0.002003240864723921, baseline_diff=-0.017097990959882736, prob_interval=[1.8533414447574614e-07, 0.01231613103300333]),
 Prediction_Score(id='http://w3id.org/openeduhub/vocabs/hochschulfaechersystematik/n30', name='Studienbereich Wirtschaftswissenschaften', mean_prob=0.0024260585196316242, median_prob=9.023374150274321e-05, baseline_diff=-0.00681944889947772, prob_interval=[3.9053535516586635e-08, 0.0010033001890406013])]
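
The printed scores can be filtered further before use. The sketch below keeps only categories that clearly exceed their baseline and whose median probability is not negligible. The Score class is a stand-in that mirrors three of the fields printed above, the thresholds are arbitrary, and the numbers are rounded from the first example output; none of this is part of the library.

```python
from typing import NamedTuple


class Score(NamedTuple):
    """Stand-in for the fields of the printed Prediction_Score objects used here."""

    name: str
    median_prob: float
    baseline_diff: float


def confident_matches(scores, min_diff=0.1, min_median=0.1):
    """Keep categories above the baseline whose median probability is not negligible."""
    kept = [s for s in scores if s.baseline_diff >= min_diff and s.median_prob >= min_median]
    return sorted(kept, key=lambda s: s.baseline_diff, reverse=True)


# rounded values for the Pythagoras text from the output above
scores = [
    Score("Mathematik, Naturwissenschaften", 0.227, 0.201),
    Score("Studienbereich Biologie", 0.398, 0.160),
    Score("Musik, Musikwissenschaft", 0.022, 0.129),
    Score("Wirtschaftspädagogik", 0.000, 0.089),
]
matches = confident_matches(scores)
```

A large gap between mean and median, as for "Musik, Musikwissenschaft", suggests a strongly skewed posterior, which is why this sketch filters on the median in addition to the baseline difference.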

REST API

Ping

Once the service has started, we can ping it to check that it is responding to requests:

curl -i -X GET http://localhost:8080/_ping
HTTP/1.1 200 OK
date: Mon, 15 Jan 2024 15:34:05 GMT
server: uvicorn
content-length: 4
content-type: application/json

null

Predictions

With the /predict endpoint, we can send a text to the model. For readability, we only ask for the seven most relevant categories for each metadata field.

In addition to the identifiers of the predicted metadata, we also get some diagnostics that help us judge whether a category is a relevant match (in principle, all categories are always returned). Specifically, we get two point estimates (mean and median) of the probability that the category applies to the given text, according to the model. We also get the difference to the “baseline” (i.e. the prediction for an empty text) and a credible interval (80% by default) for said probability.
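
The following sketch shows how such diagnostics could be derived from a list of posterior samples for a single category. It assumes an equal-tailed 80% interval and a baseline difference taken with respect to the mean (the latter is consistent with the example outputs in this document); the service may compute its interval differently, for instance as a highest-density interval. The function name and the toy samples are made up for illustration.

```python
import statistics


def summarize(samples, baseline_mean):
    """Summarize posterior probability samples for one category."""
    # deciles[0] is the 10% quantile, deciles[-1] the 90% quantile
    deciles = statistics.quantiles(samples, n=10, method="inclusive")
    mean_prob = statistics.fmean(samples)
    return {
        "mean_prob": mean_prob,
        "median_prob": statistics.median(samples),
        "baseline_diff": mean_prob - baseline_mean,
        "prob_interval": [deciles[0], deciles[-1]],
    }


# toy samples: 0.0, 0.1, ..., 1.0
samples = [i / 10 for i in range(11)]
summary = summarize(samples, baseline_mean=0.2)
```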

In the example below, we get only one relevant school discipline, which is also the one we would be expecting for the text (Mathematics). Because the text is relatively short, the probability of this fit is still relatively low. We also get a strong match against primary school (indicated by the large negative difference to the baseline probability), which is also what we would expect, given that Pythagoras’ Theorem is usually covered in middle school and above.

curl -X 'POST' \
  'http://localhost:8080/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Der Satz des Pythagoras lautet: a^2 + b^2 = c^2. Er wird benutzt, um die Hypotenuse eines rechtwinkligen Dreiecks zu berechnen.",
  "num_predictions": "7"
}'
properties.ccm:curriculum
--------------------------------------------------------------------
                          mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                       
Lineare Funktionen            0.004        0.000          0.004  [1.52056e-24, 4.45108e-10]
Mönche und Nonnen             0.004        0.000          0.003  [1.61239e-20, 6.26612e-09]
Französisch                   0.002        0.000          0.000  [1.34615e-13, 1.68961e-06]
Tabellenkalkulation           0.002        0.000          0.002  [2.21923e-28, 1.70549e-11]
Zahlenräume                   0.002        0.000          0.001            [0, 1.28007e-27]
Spektakuläre Experimente      0.001        0.000          0.001  [1.91237e-27, 2.25978e-15]
Leben im Nahraum              0.001        0.000          0.000  [9.12031e-27, 1.95862e-13]

properties.ccm:educationalcontext
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                 
Sekundarstufe I         0.843        0.863          0.119        [0.769369, 0.958703]
Sekundarstufe II        0.629        0.644         -0.218         [0.46794, 0.802874]
Primarstufe             0.147        0.135         -0.252       [0.0376089, 0.211655]
Hochschule              0.066        0.052         -0.059     [0.00459312, 0.0968557]
Erwachsenenbildung      0.023        0.017         -0.232     [0.00256588, 0.0320851]
Berufliche Bildung      0.016        0.012         -0.078     [0.00312885, 0.0224867]
Förderschule            0.000        0.000         -0.035  [2.41985e-06, 0.000409124]

properties.ccm:educationalintendedenduserrole
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff               prob_interval
name                                                                         
Lerner/in       0.942        0.955         -0.029         [0.91781, 0.994154]
Lehrer/in       0.700        0.739          0.260        [0.531951, 0.976919]
Eltern          0.035        0.025         -0.284     [0.00220645, 0.0518708]
Berater/in      0.000        0.000         -0.015  [1.14619e-07, 3.63456e-05]
Verwaltung      0.000        0.000         -0.002  [1.83219e-14, 3.44305e-10]
andere          0.000        0.000         -0.000  [8.35377e-20, 4.59975e-13]
Autor/in        0.000        0.000         -0.000  [1.82161e-36, 3.29023e-27]

properties.ccm:oeh_lrt
--------------------------------------------------------------------
                                            mean_prob  median_prob  baseline_diff            prob_interval
name                                                                                                      
Arbeitsblatt                                    0.579        0.580          0.487     [0.367812, 0.850669]
Material                                        0.422        0.411          0.107     [0.151927, 0.602568]
Veranschaulichung, Schaubild und Tafelbild      0.141        0.034          0.134  [0.000135522, 0.236069]
Unterrichtsbaustein                             0.137        0.091         -0.330    [0.0100464, 0.216822]
Übungsmaterial                                  0.119        0.103         -0.047    [0.0114243, 0.167106]
Tool                                            0.112        0.057         -0.306  [0.000500912, 0.182542]
Webseite                                        0.066        0.045         -0.183  [0.00644342, 0.0925276]

properties.ccm:oeh_taxonid_university
--------------------------------------------------------------------
                                          mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                                       
Mathematik, Naturwissenschaften               0.914        0.958          0.800        [0.870101, 0.999261]
Rhythmik                                      0.722        0.993          0.721               [0.115429, 1]
Studienbereich Biologie                       0.180        0.060          0.042     [0.000252873, 0.335375]
Studienbereich Wirtschaftswissenschaften      0.094        0.011          0.076     [1.35853e-07, 0.110211]
Mathematik                                    0.092        0.009          0.091    [8.14909e-08, 0.0948108]
Angewandte Sprachwissenschaft                 0.049        0.000          0.049  [5.39418e-16, 0.000943426]
Kunst, Kunstwissenschaft                      0.035        0.001         -0.059    [3.1506e-07, 0.00998027]

properties.ccm:taxonid
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff             prob_interval
name                                                                       
Mathematik      0.435        0.405          0.432     [0.0369892, 0.723186]
Physik          0.066        0.040          0.055   [0.00364529, 0.0971498]
Geschichte      0.030        0.022         -0.018   [0.00295668, 0.0419527]
Allgemein       0.023        0.019         -0.098    [0.0036331, 0.0323578]
Chemie          0.023        0.015         -0.008    [0.0011269, 0.0312806]
Deutsch         0.020        0.015          0.005   [0.00349125, 0.0284315]
Informatik      0.015        0.010         -0.005  [0.000822825, 0.0213549]
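
The same request can also be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above, including passing the numeric options as strings; the helper names are made up, and the structure of the JSON response is not assumed here (the tables above appear to be a formatted rendering of it).

```python
import json
import urllib.request


def build_predict_request(text, num_predictions=7, num_samples=None,
                          base_url="http://localhost:8080"):
    """Build a POST request for the /predict endpoint, mirroring the curl example."""
    # the curl example passes numbers as strings, so this sketch does the same
    payload = {"text": text, "num_predictions": str(num_predictions)}
    if num_samples is not None:
        payload["num_samples"] = str(num_samples)
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, **kwargs):
    """Send the request to a running service and return the decoded JSON response."""
    with urllib.request.urlopen(build_predict_request(text, **kwargs)) as response:
        return json.load(response)


request = build_predict_request("Der Satz des Pythagoras lautet: a^2 + b^2 = c^2.")
```

With the service running locally, predict("...") returns the decoded response; see the Swagger UI for its exact schema.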

Note that these predictions are stochastic, so another run on the same text may yield slightly different predictions:

curl -X 'POST' \
  'http://localhost:8080/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Der Satz des Pythagoras lautet: a^2 + b^2 = c^2. Er wird benutzt, um die Hypotenuse eines rechtwinkligen Dreiecks zu berechnen.",
  "num_predictions": "7"
}'
properties.ccm:curriculum
--------------------------------------------------------------------
                              mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                           
Deutschland 1949 - 1990           0.007        0.000          0.007  [7.30618e-28, 2.68237e-12]
Theater und Theatergruppen        0.004        0.000          0.004  [2.31128e-31, 4.15018e-14]
Der Energiebegriff                0.003        0.000          0.003  [6.32023e-28, 1.12052e-14]
Zusammenarbeiten und Teilen       0.002        0.000          0.002  [2.11253e-25, 2.17002e-10]
Schalenmodell nach Bohr           0.002        0.000          0.002  [1.56748e-37, 7.54874e-16]
Biologie                          0.002        0.001         -0.004   [7.38205e-06, 0.00228089]
Lineare Funktionen                0.002        0.000          0.002    [1.41112e-34, 1.662e-15]

properties.ccm:educationalcontext
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff              prob_interval
name                                                                                
Sekundarstufe I         0.782        0.802          0.058       [0.704339, 0.939061]
Sekundarstufe II        0.653        0.661         -0.193       [0.493701, 0.802266]
Hochschule              0.233        0.178          0.107      [0.0196969, 0.376932]
Primarstufe             0.102        0.090         -0.297      [0.0260522, 0.153769]
Erwachsenenbildung      0.024        0.014         -0.231    [0.00172697, 0.0320572]
Berufliche Bildung      0.016        0.011         -0.078    [0.00125454, 0.0242766]
Elementarbereich        0.000        0.000         -0.083  [1.6613e-07, 8.04676e-05]

properties.ccm:educationalintendedenduserrole
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff               prob_interval
name                                                                         
Lerner/in       0.956        0.968         -0.016        [0.934812, 0.997192]
Lehrer/in       0.692        0.746          0.251        [0.486368, 0.975209]
Eltern          0.030        0.021         -0.288     [0.00175327, 0.0414335]
Berater/in      0.000        0.000         -0.015  [8.44776e-12, 1.31934e-08]
Verwaltung      0.000        0.000         -0.002  [3.16274e-18, 3.39316e-12]
andere          0.000        0.000         -0.000  [2.77208e-22, 6.01499e-14]
Autor/in        0.000        0.000         -0.000   [1.9371e-29, 2.45001e-23]

properties.ccm:oeh_lrt
--------------------------------------------------------------------
                                    mean_prob  median_prob  baseline_diff            prob_interval
name                                                                                              
Arbeitsblatt                            0.618        0.625          0.527     [0.392966, 0.893472]
Material                                0.358        0.321          0.043      [0.065947, 0.51737]
Tool                                    0.179        0.109         -0.239  [0.000929782, 0.306906]
Übungsmaterial                          0.120        0.091         -0.046     [0.0104841, 0.18181]
Webseite                                0.076        0.053         -0.173    [0.0038096, 0.113826]
Unterrichtsbaustein                     0.058        0.035         -0.409  [0.00196568, 0.0824456]
Dokumente und textbasierte Inhalte      0.053        0.040          0.015  [0.00732069, 0.0795324]

properties.ccm:oeh_taxonid_university
--------------------------------------------------------------------
                                          mean_prob  median_prob  baseline_diff             prob_interval
name                                                                                                     
Mathematik, Naturwissenschaften               0.751        0.815          0.637      [0.562018, 0.997231]
Studienbereich Wirtschaftswissenschaften      0.133        0.023          0.115   [1.04581e-05, 0.231388]
Sozialwissenschaften/Soziologie               0.098        0.020          0.093   [1.40615e-05, 0.132832]
Musik, Musikwissenschaft                      0.063        0.005         -0.085  [4.07849e-06, 0.0648019]
Berufs- und Wirtschaftspädagogik              0.054        0.001         -0.004    [3.787e-08, 0.0368767]
Ingenieurwissenschaften                       0.046        0.024          0.024  [0.000129855, 0.0680821]
Mathematik                                    0.038        0.002          0.038  [9.47249e-08, 0.0228752]

properties.ccm:taxonid
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff              prob_interval
name                                                                        
Mathematik      0.727        0.809          0.724       [0.530948, 0.998777]
Allgemein       0.033        0.026         -0.088    [0.00670422, 0.0469497]
Geschichte      0.025        0.019         -0.023    [0.00497792, 0.0347081]
Chemie          0.011        0.006         -0.020   [0.000441712, 0.0152536]
Biologie        0.009        0.004         -0.046   [0.000291854, 0.0112419]
Informatik      0.008        0.004         -0.012   [0.000223382, 0.0103708]
Physik          0.005        0.003         -0.006  [9.54949e-05, 0.00643221]

To reduce this variance, we can increase the number of samples being drawn for the prediction. Note that the computation time is roughly proportional to the number of such samples. By default, 500 samples are drawn.
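
As a rule of thumb for Monte Carlo estimates, the run-to-run spread of a sample mean shrinks with the square root of the number of samples, provided the samples are roughly independent. Going from 500 to 10000 samples therefore costs about 20 times the computation for a spread that is only about 4.5 times smaller. The toy simulation below illustrates this with Beta-distributed draws; it does not use the model, and how closely the service's samples follow this rule is an assumption.

```python
import math
import random
import statistics


def expected_error_reduction(n_before, n_after):
    """Factor by which the standard error of a mean shrinks for independent samples."""
    return math.sqrt(n_after / n_before)


def run_to_run_spread(num_samples, runs=40, seed=0):
    """Standard deviation of the estimated mean across repeated toy 'prediction' runs."""
    rng = random.Random(seed)
    means = []
    for _ in range(runs):
        draws = [rng.betavariate(2.0, 5.0) for _ in range(num_samples)]
        means.append(statistics.fmean(draws))
    return statistics.stdev(means)


reduction = expected_error_reduction(500, 10000)
spread_default = run_to_run_spread(500)
spread_large = run_to_run_spread(10000)
```

Here reduction is sqrt(20), roughly 4.47, and spread_large should come out several times smaller than spread_default.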

curl -X 'POST' \
  'http://localhost:8080/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Der Satz des Pythagoras lautet: a^2 + b^2 = c^2. Er wird benutzt, um die Hypotenuse eines rechtwinkligen Dreiecks zu berechnen.",
  "num_predictions": "7",
  "num_samples": "10000"
}'
properties.ccm:curriculum
--------------------------------------------------------------------
                                   mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                                
Feste, Bräuche und Traditionen         0.051        0.000          0.050             [0, 4.6837e-08]
Zeitgemäße Bildung                     0.007        0.000          0.007            [0, 1.67043e-19]
Der Mensch                             0.005        0.000          0.004  [5.89068e-25, 1.21774e-06]
Grundlagen der Bruchrechnung           0.004        0.000          0.004            [0, 1.03122e-21]
Katastrophenvorsorge                   0.004        0.000          0.004            [0, 3.00837e-21]
Basketball                             0.003        0.000          0.003            [0, 1.33857e-21]
Abiturprüfung Sport (schriftlich)      0.002        0.000          0.002            [0, 6.99884e-21]

properties.ccm:educationalcontext
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                 
Sekundarstufe I         0.753        0.795          0.029        [0.606005, 0.973296]
Sekundarstufe II        0.655        0.686         -0.191        [0.480043, 0.904731]
Primarstufe             0.155        0.109         -0.244       [0.0118979, 0.251827]
Erwachsenenbildung      0.050        0.022         -0.205    [2.27408e-05, 0.0732303]
Berufliche Bildung      0.038        0.020         -0.055    [0.000561618, 0.0558285]
Hochschule              0.027        0.011         -0.098    [5.40175e-05, 0.0340922]
Elementarbereich        0.001        0.000         -0.082  [1.60651e-08, 0.000269974]

properties.ccm:educationalintendedenduserrole
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff               prob_interval
name                                                                         
Lerner/in       0.965        0.983         -0.007        [0.958688, 0.999439]
Lehrer/in       0.618        0.672          0.178        [0.353411, 0.988062]
Eltern          0.100        0.062         -0.219      [0.00136432, 0.147967]
Berater/in      0.000        0.000         -0.015  [6.40684e-11, 5.44838e-05]
Verwaltung      0.000        0.000         -0.002  [1.05878e-30, 5.00077e-10]
andere          0.000        0.000         -0.000  [4.97662e-23, 1.11391e-11]
Autor/in        0.000        0.000         -0.000  [2.30167e-35, 7.11535e-20]

properties.ccm:oeh_lrt
--------------------------------------------------------------------
                                            mean_prob  median_prob  baseline_diff             prob_interval
name                                                                                                       
Arbeitsblatt                                    0.570        0.594          0.479      [0.286182, 0.911795]
Material                                        0.307        0.258         -0.007      [0.0267569, 0.49869]
Tool                                            0.160        0.074         -0.259   [8.81572e-05, 0.267056]
Unterrichtsbaustein                             0.155        0.098         -0.313     [0.0015225, 0.249671]
Übungsmaterial                                  0.092        0.062         -0.074    [0.00354717, 0.142811]
Veranschaulichung, Schaubild und Tafelbild      0.078        0.004          0.072  [1.20357e-08, 0.0623764]
Wiki (dynamisch)                                0.072        0.034          0.041    [5.27908e-05, 0.10611]

properties.ccm:oeh_taxonid_university
--------------------------------------------------------------------
                                                mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                                             
Mathematik, Naturwissenschaften                     0.650        0.796          0.536        [0.227785, 0.999996]
Mathematik                                          0.321        0.060          0.321     [5.73261e-10, 0.882252]
Rechts-, Wirtschafts- und Sozialwissenschaften      0.177        0.018         -0.089     [2.15329e-08, 0.380878]
Studienbereich Biologie                             0.123        0.001         -0.015    [3.77699e-14, 0.0951025]
Musik, Musikwissenschaft                            0.051        0.000         -0.098   [1.07858e-13, 0.00483353]
Kulturwissenschaften i.e.S.                         0.049        0.000          0.047  [1.72526e-19, 3.54679e-05]
Berufs- und Wirtschaftspädagogik                    0.048        0.000         -0.010   [7.80917e-21, 0.00240386]

properties.ccm:taxonid
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff             prob_interval
name                                                                       
Mathematik      0.557        0.572          0.554      [0.263817, 0.986559]
Allgemein       0.030        0.021         -0.091   [0.00150001, 0.0440911]
Physik          0.030        0.014          0.019  [5.22861e-05, 0.0422167]
Informatik      0.027        0.013          0.007   [2.71841e-05, 0.036847]
Politik         0.013        0.004         -0.012  [0.000213433, 0.0158215]
Geschichte      0.010        0.007         -0.038  [0.000417541, 0.0148008]
Deutsch         0.009        0.005         -0.006  [0.000213155, 0.0119638]

Second run, for comparison

curl -X 'POST' \
  'http://localhost:8080/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Der Satz des Pythagoras lautet: a^2 + b^2 = c^2. Er wird benutzt, um die Hypotenuse eines rechtwinkligen Dreiecks zu berechnen.",
  "num_predictions": "7",
  "num_samples": "10000"
}'
properties.ccm:curriculum
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                 
Mathematik              0.009        0.000          0.009            [0, 3.29932e-16]
Der Mensch              0.009        0.000          0.008   [2.11193e-20, 2.0933e-07]
Fußball                 0.007        0.000          0.007            [0, 2.45432e-22]
Biologie                0.006        0.000         -0.001  [1.09593e-09, 0.000707899]
Verben                  0.003        0.000          0.003            [0, 3.47931e-20]
Der Energiebegriff      0.003        0.000          0.003             [0, 2.4176e-20]
Halbleiterphysik        0.003        0.000          0.002            [0, 3.07845e-21]

properties.ccm:educationalcontext
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                 
Sekundarstufe I         0.772        0.814          0.048          [0.65022, 0.96792]
Sekundarstufe II        0.632        0.649         -0.214        [0.441311, 0.851648]
Primarstufe             0.150        0.106         -0.249      [0.00712552, 0.251011]
Berufliche Bildung      0.047        0.015         -0.046    [0.000188598, 0.0487077]
Hochschule              0.042        0.014         -0.084    [4.59368e-06, 0.0537946]
Erwachsenenbildung      0.037        0.021         -0.218    [0.000368963, 0.0550708]
Elementarbereich        0.001        0.000         -0.082  [1.11833e-08, 0.000524434]

properties.ccm:educationalintendedenduserrole
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff               prob_interval
name                                                                         
Lerner/in       0.967        0.979         -0.005        [0.950385, 0.999286]
Lehrer/in       0.673        0.731          0.233        [0.460879, 0.982141]
Eltern          0.088        0.053         -0.231     [0.000421389, 0.131495]
Berater/in      0.000        0.000         -0.015  [5.40033e-12, 1.07408e-05]
andere          0.000        0.000         -0.000  [6.24496e-24, 1.51063e-11]
Verwaltung      0.000        0.000         -0.002  [5.30929e-19, 4.94847e-10]
Autor/in        0.000        0.000         -0.000  [2.88373e-38, 2.66643e-18]

properties.ccm:oeh_lrt
--------------------------------------------------------------------
                                            mean_prob  median_prob  baseline_diff             prob_interval
name                                                                                                       
Arbeitsblatt                                    0.527        0.527          0.435      [0.228836, 0.834901]
Material                                        0.287        0.240         -0.027     [0.0239176, 0.471234]
Unterrichtsbaustein                             0.200        0.130         -0.268    [0.00591302, 0.327798]
Tool                                            0.125        0.053         -0.294   [7.91584e-05, 0.195609]
Übungsmaterial                                  0.093        0.059         -0.073   [0.000901718, 0.146028]
Veranschaulichung, Schaubild und Tafelbild      0.074        0.003          0.067  [1.85385e-07, 0.0530017]
Wiki (dynamisch)                                0.071        0.031          0.040   [0.000267269, 0.101591]

properties.ccm:oeh_taxonid_university
--------------------------------------------------------------------
                                                mean_prob  median_prob  baseline_diff              prob_interval
name                                                                                                            
Mathematik, Naturwissenschaften                     0.663        0.791          0.549       [0.300775, 0.999953]
Mathematik                                          0.219        0.010          0.219    [1.21539e-10, 0.538809]
Rechts-, Wirtschafts- und Sozialwissenschaften      0.138        0.016         -0.127     [1.8065e-08, 0.235994]
Ingenieurwissenschaften                             0.058        0.006          0.036   [5.91469e-08, 0.0646542]
Studienbereich Biologie                             0.043        0.000         -0.095  [7.74011e-18, 0.00934125]
Wirtschaftspädagogik                                0.033        0.000          0.013  [3.4414e-17, 0.000167364]
Physik, Astronomie                                  0.033        0.000          0.032  [4.74921e-13, 0.00430223]

properties.ccm:taxonid
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff              prob_interval
name                                                                        
Mathematik      0.596        0.631          0.593       [0.318359, 0.996269]
Physik          0.031        0.014          0.020   [0.000159168, 0.0440346]
Informatik      0.029        0.009          0.009   [0.000112398, 0.0329312]
Allgemein       0.027        0.018         -0.094   [0.000895834, 0.0379717]
Geschichte      0.011        0.007         -0.037   [0.000312991, 0.0160785]
Politik         0.007        0.004         -0.018   [0.000209067, 0.0103545]
Deutsch         0.007        0.004         -0.008  [0.000137099, 0.00993397]

You may notice that the probabilities for some other, less fitting categories are still relatively high. This is because the text is short, so the model cannot conclude that e.g. a particular resource type does not fit. The effect becomes more pronounced the shorter the given text is: the model has been given too little data to decide for or against any one category. It also shows in small differences to the baseline probabilities and wide credibility intervals.

For an even more extreme example, consider the following empty text. Its predictions correspond to the baseline, i.e. the case where no information is given.

curl -X 'POST' \
  'http://localhost:8080/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "",
  "num_predictions": "10",
  "num_samples": "10000"
}'
properties.ccm:curriculum
--------------------------------------------------------------------
                                  mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                               
Biologie                              0.015        0.001          0.008    [6.95826e-08, 0.0125325]
Europa                                0.003        0.000          0.003            [0, 1.20335e-23]
Musik                                 0.003        0.000          0.003  [5.00691e-28, 8.26063e-06]
Zukunfts- und Berufsorientierung      0.002        0.000          0.001  [9.29485e-32, 4.90239e-09]
Medienbildung                         0.002        0.000         -0.015  [5.95743e-14, 1.04324e-05]
Deutsch als Zweitsprache              0.001        0.000          0.001  [1.23266e-30, 7.75407e-10]
Atmosphäre und Ozon                   0.001        0.000         -0.002            [0, 1.76859e-25]
Klimaschutz                           0.001        0.000          0.001            [0, 1.03162e-21]
Literatur, Film und Theater           0.001        0.000          0.001            [0, 4.92385e-22]
Leben im Nahraum                      0.001        0.000          0.000            [0, 1.92357e-21]

properties.ccm:educationalcontext
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff              prob_interval
name                                                                                
Sekundarstufe II        0.808        0.838         -0.038       [0.701678, 0.985372]
Sekundarstufe I         0.758        0.786          0.035        [0.65239, 0.926188]
Primarstufe             0.344        0.319         -0.055       [0.119995, 0.505382]
Erwachsenenbildung      0.286        0.236          0.031       [0.034297, 0.475032]
Berufliche Bildung      0.144        0.101          0.050     [0.00405194, 0.239091]
Hochschule              0.126        0.062          0.001    [7.52242e-05, 0.187657]
Elementarbereich        0.096        0.030          0.013    [9.52448e-05, 0.149316]
Förderschule            0.027        0.005         -0.009   [7.66281e-07, 0.0230399]
Fortbildung             0.008        0.001         -0.006  [4.53161e-07, 0.00403007]
Fernunterricht          0.000        0.000          0.000           [0, 1.15427e-33]

properties.ccm:educationalintendedenduserrole
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff               prob_interval
name                                                                         
Lerner/in       0.960        0.974         -0.012        [0.938322, 0.997681]
Lehrer/in       0.447        0.460          0.007       [0.0191275, 0.717772]
Eltern          0.334        0.309          0.016       [0.0248098, 0.533923]
Berater/in      0.008        0.002         -0.007   [4.46296e-07, 0.00792235]
Verwaltung      0.001        0.000         -0.001  [3.18113e-10, 0.000396609]
andere          0.001        0.000          0.001  [2.39438e-12, 1.51568e-05]
Autor/in        0.000        0.000         -0.000  [3.25992e-27, 2.48229e-11]

properties.ccm:oeh_lrt
--------------------------------------------------------------------
                                      mean_prob  median_prob  baseline_diff            prob_interval
name                                                                                                
Tool                                      0.466        0.433          0.047   [0.00549888, 0.822198]
Unterrichtsbaustein                       0.373        0.330         -0.094    [0.0284641, 0.570718]
Material                                  0.310        0.275         -0.004    [0.0293215, 0.509958]
Webseite                                  0.238        0.191         -0.011    [0.00607043, 0.39278]
Übungsmaterial                            0.167        0.105          0.001  [0.000265896, 0.253408]
Unterrichtsidee                           0.110        0.074         -0.120  [0.000552793, 0.180784]
Dokumente und textbasierte Inhalte        0.071        0.051          0.032    [0.0014501, 0.108782]
Erklärvideo und gefilmtes Experiment      0.062        0.046         -0.007  [0.00364017, 0.0913832]
Arbeitsblatt                              0.058        0.036         -0.033  [0.00319444, 0.0881288]
Wiki (dynamisch)                          0.056        0.023          0.025  [0.000394075, 0.071668]

properties.ccm:oeh_taxonid_university
--------------------------------------------------------------------
                                                mean_prob  median_prob  baseline_diff              prob_interval
name                                                                                                            
Rechts-, Wirtschafts- und Sozialwissenschaften      0.382        0.294          0.116     [3.85055e-05, 0.77593]
Studienbereich Biologie                             0.235        0.052          0.097    [3.16341e-10, 0.540553]
Mathematik, Naturwissenschaften                     0.130        0.048          0.016    [3.27426e-08, 0.212614]
Musik, Musikwissenschaft                            0.082        0.004         -0.066   [6.95389e-22, 0.0650433]
Kunst, Kunstwissenschaft                            0.075        0.007         -0.019   [1.63543e-07, 0.0759903]
Berufs- und Wirtschaftspädagogik                    0.075        0.000          0.017   [3.79406e-17, 0.0134142]
Ingenieurwissenschaften                             0.053        0.003          0.031   [1.26806e-09, 0.0330567]
Erziehungswissenschaften                            0.046        0.002         -0.000   [2.08993e-08, 0.0305853]
Studienbereich Wirtschaftswissenschaften            0.027        0.000          0.010  [1.02171e-12, 0.00966656]
Geisteswissenschaften                               0.023        0.002          0.015   [8.90152e-08, 0.0166815]

properties.ccm:taxonid
--------------------------------------------------------------------
               mean_prob  median_prob  baseline_diff             prob_interval
name                                                                          
Allgemein          0.170        0.139          0.048    [0.00465706, 0.264466]
Medienbildung      0.063        0.027         -0.003  [1.34531e-05, 0.0889896]
Geschichte         0.061        0.038          0.012   [0.00165531, 0.0902035]
Informatik         0.045        0.022          0.025  [0.000350364, 0.0612668]
Kunst              0.043        0.015          0.017  [9.86683e-05, 0.0535098]
Chemie             0.040        0.015          0.008   [2.53988e-05, 0.050257]
Biologie           0.036        0.018         -0.019    [0.000471913, 0.04963]
Sport              0.024        0.003          0.013  [1.53912e-05, 0.0175441]
Physik             0.019        0.008          0.008  [1.39617e-05, 0.0249099]
Politik            0.019        0.010         -0.007  [0.000189545, 0.0268863]

The individual probabilities of the categories do not add up to 1. This is intended, as assigning multiple (or no) relevant categories to a text is often desired. As an example, consider the following paragraph from the German Wikipedia page on Germany. It is mostly about the history of Germany, but because it also covers relatively recent developments, it may also be relevant to politics.

curl -X 'POST' \
  'http://localhost:8080/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Die rasche Entwicklung vom Agrar- zum Industriestaat vollzog sich während der Gründerzeit in der zweiten Hälfte des 19. Jahrhunderts. Nach dem Ersten Weltkrieg wurde 1918 die Monarchie abgeschafft und die demokratische Weimarer Republik konstituiert. Ab 1933 führte die nationalsozialistische Diktatur zu politischer und rassistischer Verfolgung und gipfelte in der Ermordung von sechs Millionen Juden und Angehörigen anderer Minderheiten wie Sinti und Roma. Der vom NS-Staat 1939 begonnene Zweite Weltkrieg endete 1945 mit der Niederlage der Achsenmächte. Das von den Siegermächten besetzte Land wurde 1949 geteilt, nachdem bereits 1945 seine Ostgebiete teils unter polnische, teils sowjetische Verwaltungshoheit gestellt worden waren. Der Gründung der Bundesrepublik als demokratischer westdeutscher Teilstaat mit Westbindung am 23. Mai 1949 folgte die Gründung der sozialistischen DDR am 7. Oktober 1949 als ostdeutscher Teilstaat unter sowjetischer Hegemonie. Die innerdeutsche Grenze war nach dem Berliner Mauerbau (ab 13. August 1961) abgeriegelt. Nach der friedlichen Revolution in der DDR 1989 erfolgte die Lösung der deutschen Frage durch die Wiedervereinigung beider Landesteile am 3. Oktober 1990, womit auch die Außengrenzen Deutschlands als endgültig anerkannt wurden. Durch den Beitritt der fünf ostdeutschen Länder sowie die Wiedervereinigung von Ost- und West-Berlin zur heutigen Bundeshauptstadt zählt die Bundesrepublik Deutschland seit 1990 sechzehn Bundesländer.",
  "num_predictions": "7",
  "num_samples": "10000"
}'
properties.ccm:curriculum
--------------------------------------------------------------------
                                  mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                               
Landeskunde und Interkulturelles      0.041        0.000          0.041  [6.21123e-31, 3.21748e-05]
Mittelalter                           0.038        0.000          0.038            [0, 2.25536e-18]
Industrialisierung Deutschland        0.038        0.000          0.038            [0, 3.08135e-14]
Französisch                           0.035        0.000          0.033  [4.82749e-22, 3.90352e-05]
Politische Bildung                    0.033        0.000          0.033            [0, 2.15741e-07]
Salzbildung                           0.030        0.000          0.030            [0, 2.28999e-17]
Lineare Funktionen                    0.030        0.000          0.029            [0, 4.50846e-19]

properties.ccm:educationalcontext
--------------------------------------------------------------------
                    mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                 
Sekundarstufe II        0.907        0.950          0.060        [0.875189, 0.998498]
Sekundarstufe I         0.891        0.938          0.168        [0.828054, 0.999977]
Primarstufe             0.115        0.051         -0.284     [3.47755e-07, 0.176567]
Hochschule              0.066        0.013         -0.059      [6.3634e-06, 0.075665]
Berufliche Bildung      0.056        0.027         -0.038     [0.00022739, 0.0802722]
Erwachsenenbildung      0.043        0.019         -0.212    [0.000123594, 0.0597053]
Förderschule            0.001        0.000         -0.034  [1.21849e-17, 6.89022e-06]

properties.ccm:educationalintendedenduserrole
--------------------------------------------------------------------
            mean_prob  median_prob  baseline_diff               prob_interval
name                                                                         
Lehrer/in       0.947        0.986          0.507         [0.939395, 0.99997]
Lerner/in       0.924        0.974         -0.048        [0.901521, 0.999996]
Eltern          0.190        0.092         -0.128      [0.00044816, 0.331704]
andere          0.000        0.000          0.000  [3.16326e-26, 6.32949e-12]
Verwaltung      0.000        0.000         -0.002  [4.70935e-27, 2.93218e-11]
Berater/in      0.000        0.000         -0.015   [2.14941e-19, 1.3184e-08]
Autor/in        0.000        0.000         -0.000            [0, 3.04704e-21]

properties.ccm:oeh_lrt
--------------------------------------------------------------------
                                      mean_prob  median_prob  baseline_diff             prob_interval
name                                                                                                 
Material                                  0.227        0.158         -0.087   [0.000817452, 0.387017]
Audio                                     0.181        0.103          0.139    [4.9827e-05, 0.300948]
Erklärvideo und gefilmtes Experiment      0.110        0.046          0.041   [4.11858e-05, 0.172665]
Wiki (dynamisch)                          0.106        0.033          0.076   [6.08679e-05, 0.157179]
Radio, Podcastfolge und Interview         0.099        0.014          0.086    [5.22148e-07, 0.11908]
Webseite                                  0.076        0.024         -0.173   [4.03759e-05, 0.103026]
Unterrichtsbaustein                       0.067        0.017         -0.401  [6.88132e-05, 0.0723821]

properties.ccm:oeh_taxonid_university
--------------------------------------------------------------------
                                                                mean_prob  median_prob  baseline_diff            prob_interval
name                                                                                                                          
Geisteswissenschaften                                               0.996        1.000          0.988            [0.999898, 1]
Studienbereich Geschichte                                           0.792        0.998          0.791            [0.492632, 1]
Kulturwissenschaften i.e.S.                                         0.665        0.974          0.664             [0.02861, 1]
Ethnologie                                                          0.627        0.972          0.625          [0.00125713, 1]
Rechts-, Wirtschafts- und Sozialwissenschaften                      0.308        0.076          0.042   [5.2882e-13, 0.813004]
Allgemeine und vergleichende Literatur- und Sprachwissenschaft      0.284        0.005          0.283  [2.53114e-19, 0.898727]
Studienbereich Psychologie                                          0.260        0.000          0.257            [0, 0.990646]

properties.ccm:taxonid
--------------------------------------------------------------------
                         mean_prob  median_prob  baseline_diff               prob_interval
name                                                                                      
Geschichte                   0.999        1.000          0.950               [0.998776, 1]
Politik                      0.177        0.129          0.152      [0.00627711, 0.283361]
Allgemein                    0.039        0.017         -0.083    [0.000100265, 0.0530203]
Deutsch                      0.006        0.001         -0.009   [3.26931e-08, 0.00546786]
Ethik                        0.005        0.001         -0.005   [4.55594e-07, 0.00262509]
Interkulturelle Bildung      0.003        0.000          0.002  [7.85056e-12, 0.000292363]
Gesundheit                   0.002        0.000          0.002  [1.12038e-22, 2.23171e-09]
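The curl requests in the examples above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_predict_request` and `predict` are hypothetical, and the payload fields simply mirror the curl calls. The response is returned as raw text because its exact format is not assumed here (see the Swagger UI for the authoritative schema).

```python
import json
import urllib.request


def build_predict_request(text, num_predictions=7, num_samples=10000,
                          base_url="http://localhost:8080"):
    """Build a POST request for the /predict endpoint, mirroring the curl calls."""
    payload = {
        "text": text,
        # The curl examples pass these values as strings, so we do the same.
        "num_predictions": str(num_predictions),
        "num_samples": str(num_samples),
    }
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, **kwargs):
    """Send the request to a running service and return the raw response body."""
    request = build_predict_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running locally, `print(predict("Der Satz des Pythagoras lautet: a^2 + b^2 = c^2."))` would send the same kind of query as the first example.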

Notes / Limitations

RAM Usage

The service requires roughly 4 GB of RAM to operate. This usage should stay roughly constant over time, though queries will momentarily increase it in proportion to the number of samples used (up to a maximum, at which point batching kicks in).

Cutoffs & Interpretation of Results

Because of the nature of the model, it can be difficult to decide which predictions should count as actually assigned. This is especially true for categories for which a very large or very small number of data points were observed, as the model will essentially replicate these biases in the data. This is why we additionally provide the difference in means to the baseline probabilities (i.e. predictions where the text is empty): a larger difference, whether positive or negative, indicates a stronger prediction, regardless of the underlying base frequencies. However, a small difference to the baseline may also be a very certain prediction that just happens to lie close to the baseline, which is why it can also be helpful to consider the credibility interval of the probability: a narrower interval indicates higher certainty, a wider one lower certainty.
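One possible way to combine these two signals is sketched below. This is only an illustration: the function name `summarize_prediction` and the thresholds `min_diff` and `max_width` are hypothetical and not part of the service, and suitable values would have to be chosen per metadata field.

```python
def summarize_prediction(mean_prob, baseline_diff, prob_interval,
                         min_diff=0.1, max_width=0.5):
    """Classify one result row using baseline difference and interval width.

    min_diff and max_width are hypothetical thresholds, not project defaults.
    """
    lower, upper = prob_interval
    width = upper - lower
    strong_shift = abs(baseline_diff) >= min_diff  # clear move away from baseline
    certain = width <= max_width                   # narrow credibility interval
    if strong_shift and baseline_diff > 0:
        label = "text supports category"
    elif strong_shift:
        label = "text speaks against category"
    elif certain:
        label = "confident, close to baseline"
    else:
        label = "undecided"
    return {"label": label, "mean_prob": mean_prob, "interval_width": width}


# Rows taken from the example outputs above.
geschichte = summarize_prediction(0.999, 0.950, (0.998776, 1.0))
lerner = summarize_prediction(0.967, -0.005, (0.950385, 0.999286))
tool_empty_text = summarize_prediction(0.466, 0.047, (0.00549888, 0.822198))
```

Under these assumed thresholds, "Geschichte" counts as supported by the text, "Lerner/in" as a confident prediction that simply matches the baseline, and "Tool" for the empty text as undecided.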

Hierarchical Metadata

While the model can technically predict some hierarchical metadata (i.e. oeh_lrt and curriculum), these hierarchies are currently flattened, such that any information stemming from the hierarchies is discarded. This may be dealt with at a later date.

Model Details and Possible Improvements

The model is based on the example implementation of the ProdLDA model (arXiv:1703.01488) in Pyro, utilizing black-box variational inference. We modified this unsupervised topic model by introducing a linear relationship between the assigned topic mixture and each category of each metadata field to be predicted. Individual categories between different metadata fields are modeled to be independent.
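To make the last two sentences more concrete, here is a toy sketch of a linear map from a topic mixture to independent per-category probabilities. It is not the project's Pyro code: the logistic link, the function name `category_probabilities`, and all numbers are assumptions chosen for illustration. It also shows why the probabilities within one metadata field need not sum to 1.

```python
import math


def category_probabilities(topic_mixture, weights, biases):
    """Map a topic mixture to one independent probability per category.

    Each category has its own linear score over the topics. The logistic
    link used to turn the score into a probability is an assumption.
    """
    probs = []
    for row, bias in zip(weights, biases):
        score = bias + sum(w * t for w, t in zip(row, topic_mixture))
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs


# Toy setup: 3 topics, 2 categories; all values are made up.
theta = [0.7, 0.2, 0.1]      # topic mixture of one document
W = [[4.0, -2.0, -2.0],      # weights of category 1
     [3.0, 1.0, -4.0]]       # weights of category 2
b = [-1.0, -1.0]
probs = category_probabilities(theta, W, b)
```

Because every category is scored separately, both toy categories can receive a high probability at the same time, which matches the multi-label behavior seen in the examples.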

This could be improved in various ways in the future:

  1. More topics and larger neural networks. Due to the large size of the newest training data, it may be beneficial to increase the current choice of 500 topics and shallow neural networks with a hidden layer size of 1000, e.g. doubling both or adding additional hidden layers to the encoder.
  2. Dependencies between categories (within individual metadata fields and between them) could be modeled. This could improve performance, especially when doing classification on partially labeled data (e.g. some categories or whole metadata fields are already given).
  3. The relationship between topic mixture and metadata field categories is drawn from a global, unchanging distribution. Similarly to the variational parameters of the topic mixture, this relationship could instead be drawn from a document-specific distribution through a neural network, thus increasing the expressiveness of the model.
    • Additionally, it may be worthwhile to also try “inverting” the relationship between topics and targets, i.e. draw targets based on document content first, then draw topics based on targets and document content. This could result in more stable prediction results, as the quality of the predicted targets is no longer as closely linked to the quality of the predicted topics.
  4. The number of topics to be estimated is currently a fixed hyper-parameter. Using a non-parametric hierarchical Dirichlet process (HDP) model instead would allow for a data-specific choice of the number of topics.
  5. Introduce information about the hierarchies of the categories, where relevant (currently learning resource type and topic). This should result in more specificity in the predictions and better quality in general. Possibly relevant for this: Hierarchically supervised latent Dirichlet allocation.
  6. Rather than using the (naive, if pre-processed) bag-of-words representations of documents, utilize modern vectorization methods instead (see Topic Modeling for Short Texts with Auxiliary Word Embeddings or Probabilistic Topic Modelling with Transformer Representations).
  7. Utilize additional metadata for classification, e.g. the mimetype or source.
  8. Utilize the metadata that is assigned to the collections, such as their school discipline or educational context.
  9. Because predictions of probabilistic models are not just points, but rather a whole probability distribution over the entire probability space, we are currently “throwing away” a lot of potential information. It might be interesting, for example, to visualize the entire distribution over each metadatum’s categories, in order to convey the certainty of predictions (a “broader” distribution shape implies less certainty, a “sharper” distribution more).
  10. We currently use Python’s builtin pickle for saving / exporting model parameters. Loading these files is generally considered to be unsafe, as they could execute arbitrary Python code. An alternative could be safetensors.
  11. Texts processed through the REST API do not run through the same pre-processing pipeline as the training data (repeated tokens are not removed, for example). Instead, they are simply tokenized and filtered according to the set of tokens in the training data. Thus, prediction results may be slightly worse than expected, even for texts that appear in the training data.
  12. The memory usage of training / evaluation is directly proportional to the size of the data.
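As an illustration of item 11, the "tokenize, then filter by the training vocabulary" step could look roughly like the following. This is a sketch under assumptions: the regex tokenizer and the function name `bag_of_words` are hypothetical stand-ins, and the tokenizer actually used by the service may differ.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count the tokens of `text` that occur in `vocabulary`.

    Returns one count per vocabulary entry, in vocabulary order. Tokens
    outside the vocabulary are dropped; repeated tokens are kept.
    """
    known = set(vocabulary)
    # Hypothetical tokenizer: lowercase, then split into runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(token for token in tokens if token in known)
    return [counts[word] for word in vocabulary]


vocab = ["satz", "pythagoras", "dreieck"]  # made-up training vocabulary
vector = bag_of_words("Der Satz des Pythagoras: der Satz!", vocab)
```

Here "satz" is counted twice, which is exactly the kind of repetition the training pipeline would have removed according to item 11.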

Installation (through Nix Flakes)

Add this repository to your Flake inputs, for example:

{
  inputs = {
    its-jointprobability = {
      url = "github:openeduhub/its-jointprobability";
      # optional
      # can reduce the total size when installing the application, but may
      # cause problems due to breaking changes in some dependencies
      nixpkgs.follows = "nixpkgs"; 
    };
  };
}

Standalone Application

The micro-service is provided both as a nixpkgs overlay and as an output (packages.${system}.its-jointprobability). Thus, it may be included through

{
  outputs =
    {
      self,
      nixpkgs,
      its-jointprobability,
      ...
    }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system}.extend
        its-jointprobability.overlays.default;
    in
    { };
}

Python Library

The Python library is also provided as an overlay. Please note that this requires a version of nixpkgs later than 02b8c7ddb7fe956871fa65466bf8a30fa69ec078, from 2024-03-14 (i.e. nixos-24.05 or later, or nixpkgs-unstable / nixos-unstable).

{
  outputs =
    {
      self,
      nixpkgs,
      its-jointprobability,
      ...
    }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system}.extend
        its-jointprobability.overlays.python-lib;

      my-python = pkgs.python3.withPackages (
        py-pkgs: with py-pkgs; [
          # some examples
          pandas
          numpy
          # this library
          its-jointprobability
        ]
      );
    in
    { };
}
