The tables below show the collection statistics we get using DumpIndex.

no stopwords (default)

| Collection | # of docs | # of docs with content | total terms | unique terms |
|:--|--:|--:|--:|--:|
| Disk12 | 741,676 | 741,675 | 219,327,909 | 950,715 |
| AQUAINT | 1,031,455 | 1,031,326 | 317,703,234 | 966,882 |
| Disk45 | 528,030 | 528,030 | 174,540,587 | 923,435 |
| WT2G | 245,715 | 245,679 | 181,774,134 | 1,653,144 |
| WT10G | 1,688,402 | 1,688,290 | 752,326,031 | 7,532,682 |
| Gov2 | 25,172,934 | 25,170,664 | 17,343,119,816 | 64,672,382 |
| ClueWeb09b | 50,220,189 | 50,220,159 | 31,270,685,466 | 127,464,531 |
| ClueWeb12-B13 | 52,249,039 | 52,238,521 | 30,617,038,149 | 201,838,374 |
| ClueWeb12 | 731,705,088 | 731,556,725 | 428,628,865,985 | 1,364,074,229 |

keep stopwords (with option -keepstopwords)

| Collection | # of docs | # of docs with content | total terms | unique terms |
|:--|--:|--:|--:|--:|
| Disk12 | 741,676 | 741,675 | 307,973,285 | 950,716 |
| AQUAINT | 1,031,455 | 1,031,326 | 444,541,585 | 966,883 |
| Disk45 | 528,030 | 528,030 | 251,357,057 | 923,437 |
| WT2G | 245,715 | 245,679 | 249,819,453 | 1,653,144 |
| WT10G | 1,688,402 | 1,688,290 | 988,159,521 | 7,532,682 |
| Gov2 | 25,172,934 | 25,170,664 | 21,831,927,015 | 64,672,382 |
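
One way to compare the two settings is to compute how much of each collection's total term count is accounted for by stopwords. The snippet below is only an illustrative sketch: the counts are copied from the "total terms" columns above, and the names `TOTAL_TERMS` and `stopword_share` are made up for this example and are not part of DumpIndex.

```python
# Total term counts copied from the two tables:
# (without stopwords, with -keepstopwords).
# Only collections that appear in both tables are listed.
TOTAL_TERMS = {
    "Disk12": (219_327_909, 307_973_285),
    "AQUAINT": (317_703_234, 444_541_585),
    "Disk45": (174_540_587, 251_357_057),
    "WT2G": (181_774_134, 249_819_453),
    "WT10G": (752_326_031, 988_159_521),
    "Gov2": (17_343_119_816, 21_831_927_015),
}


def stopword_share(without_stopwords: int, with_stopwords: int) -> float:
    """Fraction of term occurrences that disappear when stopwords are removed."""
    return (with_stopwords - without_stopwords) / with_stopwords


# Hypothetical helper output: one share per collection.
shares = {name: stopword_share(*counts) for name, counts in TOTAL_TERMS.items()}

for name, share in shares.items():
    print(f"{name:8s} {share:6.1%}")
```

For the collections listed here, stopwords make up roughly a fifth to a third of all term occurrences, while the unique term counts barely change between the two tables.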