The code was tested with `python==3.7.3`. The required libraries are listed in `requirements.txt`.
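They can be installed with `pip`:

```
pip install -r requirements.txt
```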
- Download datasets from here:
- Unpack to `<DATA_DIR>`.
- In `config/config_*.ini`, change the `Paths.data_dir` value to `<DATA_DIR>`.
- Each dataset contains the following files: `train.json`, `dev.json` and `test.json`.
- Each file contains N samples, one sample per line.
- Each sample is a dict with the following important keys:
  - `text`: original document.
  - `title`: original title.
  - `text_processed`: normalized title + document (lower case, no stopwords, no punctuation). Used as input to the model.
  - `label`: list of relevant labels, where each label is a string. Model target.
- The rest of the keys are legacy from the original datasets and can be useful for running other baselines.
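Since each file stores one JSON object per line, a split can be read as JSON lines. A minimal sketch (assuming the file is opened from within `<DATA_DIR>`):

```python
import json

# Each line of a split file is a standalone JSON sample.
with open("train.json") as f:
    samples = [json.loads(line) for line in f]

sample = samples[0]
print(sample["text_processed"])  # model input: normalized title + document
print(sample["label"])           # model target: list of label strings
```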
Each dataset contains two additional files required for training: `ontology.json` and `taxonomy.txt`.

- Ontology: JSONL file, each line describes a single label (key-value mapping):
  - `label`: string with label identifier.
  - `title`: label title in natural language.
  - `definition`: label definition in natural language.
  - `txt`: normalized title + definition.
  - `level`: level of the label in the label tree.
- Taxonomy: TXT file, each line contains space-separated labels, where the first label is a parent and the rest are its children.
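Following the format above, both files can be parsed with a few lines of Python (a sketch, assuming they are opened from within the dataset directory):

```python
import json

# Ontology: JSONL, one label record per line, keyed here by its identifier.
with open("ontology.json") as f:
    ontology = {rec["label"]: rec for rec in map(json.loads, f)}

# Taxonomy: "parent child1 child2 ..." per line -> parent-to-children mapping.
with open("taxonomy.txt") as f:
    taxonomy = {parent: children
                for parent, *children in (line.split() for line in f if line.strip())}
```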
The model is initialized with GloVe embeddings (840B tokens, 2.2M vocab, cased, 300d vectors). Please download them from the official website and put them next to `<DATA_DIR>` (the exact path must be specified in `config["Paths"]["glove_model"]`).
Paths, training hyper-parameters and other model configuration options depend on the dataset and are specified in the corresponding config files: `config/config_*.ini`.
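The `config["Paths"]["glove_model"]` access above matches Python's standard `configparser`; a minimal sketch of reading these values (the config file name below is a placeholder):

```python
import configparser

# Read one of the per-dataset .ini config files.
config = configparser.ConfigParser()
config.read("config/config_example.ini")  # placeholder: use the file for your dataset

data_dir = config["Paths"]["data_dir"]       # <DATA_DIR> set during setup
glove_path = config["Paths"]["glove_model"]  # location of the GloVe vectors
```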
To run the training, use the following command:

```
python main.py --config CONFIG --name NAME
```

- `CONFIG`: path to a config file
- `NAME`: model name prefix

The code evaluates the validation loss after each epoch and saves the best model to the `./models/` directory.
Note that the code was designed for GPU-accelerated training; there is no CPU support.
The prediction script implements a beam search algorithm starting from given prefixes (the label refinement task). The prefixes are constructed from the labels of level < `LEVEL` assigned to a test instance.
To perform predictions using a trained model, use the following command:

```
python predict.py --config CONFIG --model MODEL --level LEVEL --output OUTPUT
```

- `CONFIG`: path to a config file
- `MODEL`: path to a trained model
- `LEVEL`: for the label refinement task, the level from which the prediction starts. For example, when `LEVEL==2`, the model is provided with path prefixes of length 1 and starts predicting labels from level 2. For predicting from scratch (without prefixes), set `LEVEL` to 1.
- `OUTPUT`: path stub for the output files.

The script will generate two files, `<OUTPUT>-labels.npy` and `<OUTPUT>-scores.npy`, with the top-1000 predicted labels and their scores, respectively.
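Both outputs are standard NumPy arrays. A sketch of inspecting them, where the `run1` stub and the one-row-per-test-instance layout are assumptions:

```python
import numpy as np

# Assumed layout: one row per test instance, 1000 entries per row.
labels = np.load("run1-labels.npy", allow_pickle=True)  # allow_pickle in case the
                                                        # labels form an object array
scores = np.load("run1-scores.npy")

# Top-5 predictions and scores for the first test instance.
for label, score in zip(labels[0][:5], scores[0][:5]):
    print(label, score)
```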
The evaluation script evaluates the model predictions generated by the `predict.py` module. It can also be used to evaluate other baseline methods that produce output in the same format (AttentionXML, MATCH).

The script calculates the following metrics:

- `Precision@k` (`k` = 1, 3, 5)
- `NDCG@k` (`k` = 1, 3, 5)
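For reference, the standard binary-relevance definitions of these metrics (a sketch; not necessarily the exact code in `evaluation.py`):

```python
import numpy as np

def precision_at_k(pred, gold, k):
    # Fraction of the top-k predicted labels that are relevant.
    return len(set(pred[:k]) & set(gold)) / k

def ndcg_at_k(pred, gold, k):
    # Binary-relevance nDCG: DCG over the top-k predictions,
    # normalized by the best achievable DCG for this sample.
    gold = set(gold)
    dcg = sum(1.0 / np.log2(i + 2) for i, p in enumerate(pred[:k]) if p in gold)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / idcg if idcg > 0 else 0.0
```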
To run the evaluation, use the following command:

```
python evaluation.py --testset TESTSET --pred PRED --ontology ONTOLOGY --level LEVEL
```

- `TESTSET`: path to the `test.json` file
- `PRED`: path to the `<OUTPUT>-labels.npy` file (see above)
- `ONTOLOGY`: path to the `ontology.json` file
- `LEVEL`: for the label refinement task, only consider labels of level >= `LEVEL`. For all labels, set `LEVEL` to 1.
For the full method description and experimental results, please refer to our paper:

Natalia Ostapuk, Julien Audiffren, Ljiljana Dolamic, Alain Mermoud, and Philippe Cudré-Mauroux. 2024. Follow the Path: Hierarchy-Aware Extreme Multi-Label Completion for Semantic Text Tagging. In Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore.