The code was tested with `python==3.7.3`. The required libraries are listed in `requirements.txt`.
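They can be installed with `pip`:

```
pip install -r requirements.txt
```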
- Download datasets from here:
- Unpack to `<DATA_DIR>`.
- In `config/config_*.ini`, change the `Paths.data_dir` value to `<DATA_DIR>`.
- Each dataset contains the following files: `train.json`, `dev.json` and `test.json`.
- Each file contains N samples, one sample per line.
- Each sample is a dict with the following important keys:
  - `text`: original document.
  - `title`: original title.
  - `text_processed`: normalized title + document (lower case, no stopwords, no punctuation). Used as input to the model.
  - `label`: list of relevant labels, where each label is a string. Model target.
- The rest of the keys are legacy from the original datasets and can be useful for running other baselines.
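Since each file stores one JSON object per line, a split can be read as JSON lines. A minimal sketch (assuming the file is opened from within `<DATA_DIR>`):

```python
import json

# Each line of a split file is a standalone JSON sample.
with open("train.json") as f:
    samples = [json.loads(line) for line in f]

sample = samples[0]
print(sample["text_processed"])  # model input: normalized title + document
print(sample["label"])           # model target: list of label strings
```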
Each dataset contains two additional files required for training: `ontology.json` and `taxonomy.txt`.

- Ontology: JSONL file, each line describes a single label (key-value mapping):
  - `label`: string with label identifier.
  - `title`: label title in natural language.
  - `definition`: label definition in natural language.
  - `txt`: normalized title + definition.
  - `level`: level of the label in the label tree.
- Taxonomy: TXT file, each line contains space-separated labels, where the first label is a parent and the rest are its children.
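Following the format above, both files can be parsed with a few lines of Python (a sketch, assuming they are opened from within the dataset directory):

```python
import json

# Ontology: JSONL, one label record per line, keyed here by its identifier.
with open("ontology.json") as f:
    ontology = {rec["label"]: rec for rec in map(json.loads, f)}

# Taxonomy: "parent child1 child2 ..." per line -> parent-to-children mapping.
with open("taxonomy.txt") as f:
    taxonomy = {parent: children
                for parent, *children in (line.split() for line in f if line.strip())}
```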
The model is initialized with GloVe embeddings (840B tokens, 2.2M vocab, cased, 300d vectors). Please download them from the official website and put them next to `<DATA_DIR>` (the exact path must be specified in `config["Paths"]["glove_model"]`).
Paths, training hyper-parameters and other model configuration options depend on the dataset and are specified in the corresponding config files: `config/config_*.ini`.
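The `config["Paths"]["glove_model"]` access above matches Python's standard `configparser`; a minimal sketch of reading these values (the config file name below is a placeholder):

```python
import configparser

# Read one of the per-dataset .ini config files.
config = configparser.ConfigParser()
config.read("config/config_example.ini")  # placeholder: use the file for your dataset

data_dir = config["Paths"]["data_dir"]       # <DATA_DIR> set during setup
glove_path = config["Paths"]["glove_model"]  # location of the GloVe vectors
```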
To run the training, use the following command:

```
python main.py --config CONFIG --name NAME
```

- `CONFIG`: path to a config file
- `NAME`: model name prefix

The code evaluates the validation loss after each epoch and saves the best model to the `./models/` directory.
Note that the code was designed for GPU-accelerated training; there is no CPU support.
The prediction script implements a beam search algorithm starting from given prefixes (the label refinement task). The prefixes are constructed from the labels of level < `LEVEL` assigned to a test instance.
To perform predictions using a trained model, use the following command:

```
python predict.py --config CONFIG --model MODEL --level LEVEL --output OUTPUT
```

- `CONFIG`: path to a config file
- `MODEL`: path to a trained model
- `LEVEL`: for the label refinement task, the level from which the prediction starts. For example, when `LEVEL==2`, the model is provided with path prefixes of length 1 and starts predicting labels from level 2. For predicting from scratch (without prefixes), set `LEVEL` to 1.
- `OUTPUT`: path stub for the output files.

The script will generate two files, `<OUTPUT>-labels.npy` and `<OUTPUT>-scores.npy`, with the top-1000 predicted labels and their scores, respectively.
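Both outputs are standard NumPy arrays. A sketch of inspecting them, where the `run1` stub and the one-row-per-test-instance layout are assumptions:

```python
import numpy as np

# Assumed layout: one row per test instance, 1000 entries per row.
labels = np.load("run1-labels.npy", allow_pickle=True)  # allow_pickle in case the
                                                        # labels form an object array
scores = np.load("run1-scores.npy")

# Top-5 predictions and scores for the first test instance.
for label, score in zip(labels[0][:5], scores[0][:5]):
    print(label, score)
```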
The evaluation script evaluates the model predictions generated by the `predict.py` module. It can also be used to evaluate other baseline methods that produce output in the same format (AttentionXML, MATCH).

The script calculates the following metrics:

- `Precision@k` (`k` = 1, 3, 5)
- `NDCG@k` (`k` = 1, 3, 5)
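For reference, the standard binary-relevance definitions of these metrics (a sketch; not necessarily the exact code in `evaluation.py`):

```python
import numpy as np

def precision_at_k(pred, gold, k):
    # Fraction of the top-k predicted labels that are relevant.
    return len(set(pred[:k]) & set(gold)) / k

def ndcg_at_k(pred, gold, k):
    # Binary-relevance nDCG: DCG over the top-k predictions,
    # normalized by the best achievable DCG for this sample.
    gold = set(gold)
    dcg = sum(1.0 / np.log2(i + 2) for i, p in enumerate(pred[:k]) if p in gold)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / idcg if idcg > 0 else 0.0
```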
To run the evaluation, use the following command:

```
python evaluation.py --testset TESTSET --pred PRED --ontology ONTOLOGY --level LEVEL
```

- `TESTSET`: path to the `test.json` file
- `PRED`: path to the `<OUTPUT>-labels.npy` file (see above)
- `ONTOLOGY`: path to the `ontology.json` file
- `LEVEL`: for the label refinement task, only consider labels of level >= `LEVEL`. For all labels, set `LEVEL` to 1.
For the full method description and experimental results, please refer to our paper:

Natalia Ostapuk, Julien Audiffren, Ljiljana Dolamic, Alain Mermoud, and Philippe Cudré-Mauroux. 2024. Follow the Path: Hierarchy-Aware Extreme Multi-Label Completion for Semantic Text Tagging. In Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore.