This repository contains the training data for the doc-classify document classifier.
The classifier prefilters the more than one million documents obtained through an extremely broad, high-recall approach down to a more manageable subset of suspected linguistic documents, which can then be used as input to the igt-detect IGT instance detector.
Applying these two classifiers in series is crucial to extending the coverage of IGT instances in the ODIN database.
This repository consists of two files:
This document contains a list of doc_id 1|0 pairs, where 1 corresponds to a linguistic document and 0 corresponds to a non-linguistic document.
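As a rough illustration of how such a label list could be read, here is a minimal Python sketch. It assumes one whitespace-separated doc_id/label pair per line; the separator, the function name `parse_labels`, and the sample doc_ids are assumptions made for illustration and are not taken from this repository.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_labels(lines: Iterable[str]) -> Dict[str, int]:
    """Map each doc_id to its label (1 = linguistic, 0 = non-linguistic).

    Assumption: one whitespace-separated `doc_id label` pair per line.
    """
    labels: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, label = line.split()
        if label not in ("0", "1"):
            raise ValueError(f"unexpected label {label!r} for doc_id {doc_id!r}")
        labels[doc_id] = int(label)
    return labels


# Hypothetical sample lines; real doc_ids may look different.
sample = ["1001 1", "1002 0", "1003 1"]
labels = parse_labels(sample)

# Count how many documents carry each label.
counts = Counter(labels.values())
print(counts[1], "linguistic,", counts[0], "non-linguistic")
```

Checking the class balance this way before training can be useful, since the share of linguistic documents in a broad crawl may be small.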
This document contains a list of doc_id {URL} pairs, where the URL is the link to the original PDF.
Only links are provided, because the copyright in each document is retained by its copyright holder.
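The URL list could be loaded in a similar way. The sketch below assumes that each line holds a doc_id, then whitespace, then the URL; the function name `parse_url_list` and the example.org URLs are placeholders chosen for illustration only.

```python
from typing import Dict, Iterable


def parse_url_list(lines: Iterable[str]) -> Dict[str, str]:
    """Map each doc_id to the URL of its original PDF.

    Assumption: each line is `doc_id URL`, separated by whitespace.
    Splitting only once keeps the rest of the line as the URL.
    """
    urls: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, url = line.split(maxsplit=1)
        urls[doc_id] = url
    return urls


# Hypothetical sample lines with placeholder URLs.
url_sample = [
    "1001 http://example.org/papers/a.pdf",
    "1002 http://example.org/papers/b.pdf",
]
urls = parse_url_list(url_sample)
print(urls["1001"])
```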
The process we followed for our experiments was:
- Download all the PDFs contained in the URL list.
- Run PDFlib TET to convert PDF data to TETML. (pdfminer is also supported by our code, but we used TET in our experiments.)
- Use freki to convert the TETML/pdfminer output to the freki output format.
- Train the doc-classify classifier using the training labels provided here.
- Apply the trained model to remaining documents.
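The first step above, downloading the PDFs, might be sketched as follows using only the Python standard library. This is not the script used in our experiments: the names `pdf_path` and `download_pdfs`, the `<doc_id>.pdf` naming scheme, and the assumption that doc_ids are safe to use as file names are all illustrative. The later steps (TET, freki, doc-classify) are left out because their invocations are documented by those tools.

```python
import urllib.request
from pathlib import Path
from typing import Dict


def pdf_path(out_dir: Path, doc_id: str) -> Path:
    """Target path for a document (assumed naming scheme: <doc_id>.pdf)."""
    return out_dir / f"{doc_id}.pdf"


def download_pdfs(
    urls: Dict[str, str], out_dir: Path, timeout: float = 30.0
) -> Dict[str, Path]:
    """Download each URL to out_dir and return the doc_ids that succeeded.

    Files that already exist are not fetched again, and links that can no
    longer be retrieved are reported and skipped.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: Dict[str, Path] = {}
    for doc_id, url in urls.items():
        target = pdf_path(out_dir, doc_id)
        if target.exists():
            saved[doc_id] = target  # already downloaded earlier
            continue
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                target.write_bytes(response.read())
            saved[doc_id] = target
        except (OSError, ValueError) as exc:
            # URLError and HTTPError are subclasses of OSError;
            # ValueError covers malformed URLs.
            print(f"skipping {doc_id}: {exc}")
    return saved
```

Calling `download_pdfs(urls, Path("pdfs"))` with a doc_id to URL mapping would then populate a `pdfs` directory. Because some of the original links may have gone stale, the returned mapping shows which documents are actually available for the conversion step.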