A library to evaluate TREC-like runs with TREC-like qrels. Implements similarity of rankings, ttest between runs etc…

Riozhcd/pytrec_eval

 
 

pytrec_eval

A (tiny) library to evaluate TREC-like runs against TREC-like qrels. It implements Kendall's tau similarity between rankings, a t-test between runs, and more. It can also output metrics for evaluating classification and clustering algorithms. The metrics implemented so far are the following (grouped by task):

  • Document Retrieval: Average Precision (AP), Normalized Discounted Cumulative Gain (NDCG), Precision, Recall, Precision@k.
  • Classification: precision, recall, precision (multi-topic), recall (multi-topic), accuracy (multi-topic), exact match ratio, retrieval f-score.
  • Clustering: purity, NMI, Rand index, f-score.

Loading Data

  • Use the class TrecRun to load a TREC-run from a file

run = pytrec_eval.TrecRun(<fileName>)

  • You can also use loadAll to load all the TREC-runs named in a list of file names.

runs = pytrec_eval.loadAll(['run1', 'run2', 'run3'])

  • Use the class QRels to load the qrels from a file

qrels = pytrec_eval.QRels(<fileName>)
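For readers unfamiliar with the input files, here is a minimal sketch of the conventional trec_eval layouts: a run line is `topic Q0 doc_id rank score tag`, and a qrels line is `topic iteration doc_id relevance`. The helpers `parse_run` and `parse_qrels` are hypothetical names used only for illustration; they are not the library's loaders, and the sketch assumes the files follow the standard layouts.

```python
from collections import defaultdict


def parse_run(lines):
    """Parse run lines of the form 'topic Q0 doc_id rank score tag' (hypothetical helper)."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic, _q0, doc_id, _rank, score, _tag = parts
        scored[topic].append((doc_id, float(score)))
    # This sketch orders each topic's documents by decreasing score
    # and ignores the rank column.
    return {
        topic: [doc for doc, _ in sorted(docs, key=lambda pair: -pair[1])]
        for topic, docs in scored.items()
    }


def parse_qrels(lines):
    """Parse qrels lines of the form 'topic iteration doc_id relevance' (hypothetical helper)."""
    judgments = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        judgments[topic][doc_id] = int(relevance)
    return dict(judgments)


# Made-up sample data, for illustration only.
run_lines = [
    "301 Q0 doc2 1 0.4 demo",
    "301 Q0 doc1 2 0.9 demo",
    "302 Q0 doc3 1 0.7 demo",
]
qrels_lines = ["301 0 doc1 1", "301 0 doc2 0", "302 0 doc3 1"]

example_run = parse_run(run_lines)
example_qrels = parse_qrels(qrels_lines)
print(example_run)
print(example_qrels)
```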

Evaluation

  • Use evaluate (resp., evaluateAll) to evaluate a run (resp., list of runs)

pytrec_eval.evaluate(run, qrels, metrics)

where metrics is a list of metric functions, for example, [pytrec_eval.avgPrec, pytrec_eval.ndcg].
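To make concrete what a metric function measures, the following sketch computes textbook Average Precision and Precision@k for a single topic in plain Python. The function names and signatures (`average_precision`, `precision_at_k`) are assumptions made for this example; the library's own metric functions take a run and qrels and may differ in detail.

```python
def average_precision(ranked_docs, relevant_docs):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_docs)


def precision_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_docs[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_docs) / k


# Made-up single-topic example.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}

ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
print(ap, p_at_2)
```

Here d1 and d3 are found at ranks 1 and 3, so the precision sum is 1/1 + 2/3, divided by the three relevant documents.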

  • A list of runs can be ranked by using pytrec_eval.rankRuns as follows:

ranking = pytrec_eval.rankRuns(<list of TrecRuns>, qrels, measure)

where measure is a function that computes some metrics. A ranking is a list of pairs (TrecRun, score) ordered by decreasing score.
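The shape of such a ranking can be sketched without the library: given one score per run, sort the pairs by decreasing score. In this illustration run names stand in for TrecRun objects, and `rank_by_score` and the scores are made up for the example.

```python
def rank_by_score(scores_by_run):
    """Return (run, score) pairs ordered by decreasing score (hypothetical helper)."""
    return sorted(scores_by_run.items(), key=lambda pair: pair[1], reverse=True)


# Made-up mean scores for three runs.
mean_scores = {"run_a": 0.31, "run_b": 0.42, "run_c": 0.27}
example_ranking = rank_by_score(mean_scores)
print(example_ranking)
```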

pytrec_eval can also create plots:

  • The function plotDifferenceFromAvg plots a histogram highlighting the performance of a run for each topic. The plot can be saved to a file with the optional parameter outputFile.

pytrec_eval.plotDifferenceFromAvg(trecRun, qrels, pytrec_eval.ndcg, outputFile='./ndcg.pdf', showPlot=True)

  • The function plotRecallPrecision plots the recall/precision curve of a given run. It is also possible to have multiple recall/precision curves, one for each topic, by setting perQuery = True.

pytrec_eval.plotRecallPrecision(trecRun, qrels, perQuery=True, outputFile='./recall-precision.pdf', showPlot=False)

  • The function plotRecallPrecisionAll plots the recall/precision curves of all runs contained in the input list of runs.

pytrec_eval.plotRecallPrecisionAll([run0, run1, run2], qrels, outputFile='./recall-precision-all.pdf', showPlot=False)

Statistical Tools

  • pytrec_eval features a function that computes the Student's t-test between a run and a list of other runs, for example,

pValues = pytrec_eval.ttest(run0, [run1, run2], qrels, pytrec_eval.ndcg)

returns a dictionary that maps the name of each compared run to a p-value: the name of run1 maps to the p-value obtained by comparing the NDCG scores of run1 and run0, and the name of run2 maps to the p-value obtained by comparing the NDCG scores of run2 and run0.
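As a rough illustration of the statistic behind such a comparison, the sketch below computes a paired t statistic over per-topic scores. This assumes a paired test, which is the usual choice when two runs are scored on the same topics; the text above does not say which variant the library uses. The helper `paired_t_statistic` and the scores are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-topic scores (hypothetical helper)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least two topics")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all per-topic differences are identical")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up per-topic scores for two runs on the same four topics.
run0_scores = [0.5, 0.7, 0.6, 0.9]
run1_scores = [0.4, 0.5, 0.6, 0.6]

t_stat, dof = paired_t_statistic(run0_scores, run1_scores)
print(t_stat, dof)
```

Turning the statistic into a p-value requires the Student's t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` performs both steps for paired samples.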

  • Given two rankings, Kendall's tau correlation between them can be computed as follows:

tau = pytrec_eval.rankSimilarity(ranking0, ranking1)

where ranking0 and ranking1 are lists of pairs (TrecRun, score) ordered by decreasing score.
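For intuition, Kendall's tau counts the pairs of items that two orderings rank the same way (concordant) against those they rank in opposite ways (discordant). The sketch below implements the tie-free form on two orderings of the same items. The name `kendall_tau` is an assumption for this example, and the sketch works on bare run names rather than (TrecRun, score) pairs.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau for two orderings of the same distinct items, assuming no ties."""
    if set(order_a) != set(order_b) or len(order_a) < 2:
        raise ValueError("both orderings must contain the same two or more items")
    position_b = {item: index for index, item in enumerate(order_b)}
    concordant = 0
    discordant = 0
    # combinations() yields pairs (x, y) where x precedes y in order_a.
    for x, y in combinations(order_a, 2):
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up orderings of three runs under two different measures.
order_by_ndcg = ["run_b", "run_a", "run_c"]
order_by_ap = ["run_a", "run_b", "run_c"]

tau_example = kendall_tau(order_by_ndcg, order_by_ap)
print(tau_example)
```

The value is 1 for identical orderings and -1 for exactly reversed ones.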

For more functions/details, please check the documentation strings in the Python source code.

Requirements
