Corpus description

Selected fair-use licensed PDF documents generated with a variety of tools.

The documents were picked from the 0000 container of the CC-MAIN-2021-31 project by Common Crawl.

Terms of use for Common Crawl

https://commoncrawl.org/terms-of-use

Download the corpus

https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/
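The link above points to a directory rather than a single file. As a minimal sketch, the URL of one container archive could be built as follows. The `<NNNN>.zip` file name and the `container_url` helper are assumptions made for illustration and are not stated in this README, so check the directory listing before relying on them.

```python
# Directory URL copied from this README.
BASE_URL = (
    "https://downloads.digitalcorpora.org/corpora/files/"
    "CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/"
)


def container_url(container: int) -> str:
    """Build the URL of one container archive.

    Assumption: archives are named '<NNNN>.zip' (four digits, zero padded).
    """
    if not 0 <= container <= 999:
        raise ValueError("this directory is assumed to hold containers 0000-0999")
    return f"{BASE_URL}{container:04d}.zip"
```

If the naming assumption holds, `urllib.request.urlretrieve(container_url(0), "0000.zip")` from the Python standard library would fetch the container this corpus was picked from.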

Tools currently represented in the corpus:

  • pdftex
  • abbyy fine reader
  • illustrator pdf library 15
  • illustrator pdf library 7
  • preview macos quartz pdfcontent
  • canva
  • chromium
  • pages macos quartz pdfcontent
  • word macos quartz pdfcontent
  • word 2016 standalone
  • word adobepdf maker
  • indesign adobe pdf library 15
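To check which tool produced a given file, one option is to look at the `/Producer` entry of the PDF document information dictionary. The sketch below is a heuristic, not the method used to build this corpus: the `find_producer` helper is hypothetical, and it only finds entries stored as uncompressed literal strings, so it will miss hex strings, values containing unescaped nested parentheses, and metadata kept in compressed object streams or in XMP.

```python
import re
from typing import Optional

# Matches an uncompressed literal-string entry such as: /Producer (pdfTeX-1.40.21)
# Backslash escapes inside the value are allowed; bare nested parentheses are not.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)")


def find_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the first plain /Producer value found in the raw bytes, or None."""
    match = _PRODUCER_RE.search(pdf_bytes)
    if match is None:
        return None
    # latin-1 maps every byte to a character, so decoding never fails.
    return match.group(1).decode("latin-1")
```

A file can be checked with `find_producer(pathlib.Path("some.pdf").read_bytes())`, where `some.pdf` is a placeholder path. For anything beyond a quick look, a real PDF parser is the safer choice.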