`doc-classify` data

This repository contains the training data for the doc-classify document classifier. The classifier prefilters the more than one million documents obtained through an extremely broad, high-recall approach, reducing them to a more manageable subset of suspected linguistic documents. That subset can then be used as input to the igt-detect IGT instance detector.

Applying these two classifiers in series is crucial to extending the coverage of IGT instances in the ODIN database.

What's Contained Here

This repository consists of two files:

`doc_labels.txt`

This file contains a list of `doc_id 1|0` pairs, where the second field is the label:

1 corresponds to a linguistic document.
0 corresponds to a non-linguistic document.
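
As a rough illustration of how the labels might be read, here is a minimal Python sketch. It assumes one pair per line with the label as the last whitespace-separated field; the actual delimiter is not stated here, so check the file before relying on it. The function name `parse_labels` and the sample doc_ids are hypothetical.

```python
from collections import Counter


def parse_labels(lines):
    """Parse 'doc_id label' lines into a dict mapping doc_id -> bool."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the label is the last whitespace-separated field.
        doc_id, label = line.rsplit(None, 1)
        if label not in ("0", "1"):
            raise ValueError(f"unexpected label {label!r} for {doc_id!r}")
        labels[doc_id] = (label == "1")  # True means linguistic
    return labels


# Hypothetical sample lines; these doc_ids are invented for illustration.
sample = ["1001 1", "1002 0", "1003 1"]
labels = parse_labels(sample)
counts = Counter(labels.values())
print(counts[True], "linguistic,", counts[False], "non-linguistic")
```

To read the real file, pass an open file handle instead of the sample list, for example `parse_labels(open("doc_labels.txt", encoding="utf-8"))`.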

`doc_urls.txt`

This file contains a list of doc_id {URL} pairs, where the URL is the link to the original PDF. In the repository it is stored gzip-compressed as `doc_urls.txt.gz`.

Only links are provided, as the copyrights for these documents are retained by the documents' copyright holders.
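
The URL list could be loaded along the following lines. This is a sketch under assumptions: one pair per line, doc_id first, whitespace between the two fields, and `{URL}` read as placeholder notation rather than literal braces. `parse_urls` and `load_urls` are hypothetical helper names, and the sample entries are invented.

```python
import gzip


def parse_urls(lines):
    """Parse 'doc_id URL' lines into a dict mapping doc_id -> URL."""
    urls = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: doc_id comes first, separated from the URL by whitespace.
        doc_id, url = line.split(None, 1)
        urls[doc_id] = url
    return urls


def load_urls(path):
    """Read a plain or gzip-compressed URL list, chosen by file extension."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return parse_urls(handle)


# Hypothetical sample entries for illustration only.
sample_urls = parse_urls([
    "1001 http://example.org/a.pdf",
    "1002 http://example.org/b.pdf",
])
print(len(sample_urls), "documents")
```

With the repository checked out, `load_urls("doc_urls.txt.gz")` would read the compressed file directly, without unpacking it first.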

How to Use This Data For Replication

The process we followed for our experiments was:

1. Download all the PDFs contained in the URL list.
2. Run PDFlib TET to convert PDF data to TETML. (pdfminer is also supported by our code, but we used TET in our experiments.)
3. Use freki to convert the TETML/pdfminer output to the freki output format.
4. Train the doc-classify classifier using the training labels provided here.
5. Apply the trained model to the remaining documents.
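
The download step and the split between labeled and remaining documents can be sketched in Python as below. This is not the code we used; it is a minimal standard-library illustration. The helper names, the `<doc_id>.pdf` file naming, and the reading of "remaining documents" as those in the URL list without a label are all assumptions. The TET, freki, and training steps are left out because their command lines are not described here.

```python
import os
import urllib.request


def plan_downloads(urls, out_dir):
    """Return (doc_id, url, target_path) triples, sorted by doc_id."""
    # Assumption: each PDF is saved as <out_dir>/<doc_id>.pdf.
    return [(doc_id, url, os.path.join(out_dir, f"{doc_id}.pdf"))
            for doc_id, url in sorted(urls.items())]


def download_all(urls, out_dir, timeout=30):
    """Fetch each PDF, skipping existing files; return the failed doc_ids."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for doc_id, url, path in plan_downloads(urls, out_dir):
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                data = resp.read()
            # Write only after a complete read, so no partial file is left.
            with open(path, "wb") as out:
                out.write(data)
        except (OSError, ValueError):
            # URLError and HTTPError subclass OSError; ValueError covers
            # malformed URLs. Many links may be dead, so record and move on.
            failed.append(doc_id)
    return failed


def split_labeled(urls, labels):
    """Split doc_ids into (labeled, unlabeled) sorted lists."""
    labeled = sorted(d for d in urls if d in labels)
    unlabeled = sorted(d for d in urls if d not in labels)
    return labeled, unlabeled
```

Here `urls` maps doc_id to URL and `labels` maps doc_id to its label. The labeled ids would feed step 4 and the unlabeled ids step 5, after both sets have gone through the TET and freki conversions.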
