---
output: github_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
if(!requireNamespace('rtika')){
install.packages('rtika',
repos = 'https://cloud.r-project.org')
}
library('rtika')
if(is.na(tika_jar())){
install_tika()
}
```
# rtika
***Extract text or metadata from over a thousand file types.***
[![R-CMD-check](https://github.com/ropensci/rtika/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/rtika/actions/)
[![ROpenSci](https://badges.ropensci.org/191_status.svg)](https://github.com/ropensci/software-review/issues/191/)
[![Coverage status](https://codecov.io/gh/ropensci/rtika/branch/master/graph/badge.svg)](https://app.codecov.io/github/ropensci/rtika?branch=master)
[![Cranlogs Downloads](https://cranlogs.r-pkg.org/badges/rtika)](https://CRAN.R-project.org/package=rtika)
>Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages ...
>For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. (From <https://en.wikipedia.org/wiki/Apache_Tika>, accessed Jan 18, 2018)
This is an R interface to the Tika software.
## Installation
To start, you need R and `Java 8` (`OpenJDK 1.8`) or higher. To check your version, run the command `java -version` from a terminal.
Get Java installation tips at <https://www.java.com/en/download/> or <https://openjdk.org/install/>.
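The same check can be run from within R. This is a minimal sketch using only base R; it assumes Java is expected on the system `PATH`, and the variable names `java_path` and `java_found` are illustrative.

```r
# Look up the `java` executable on the PATH; Sys.which() returns "" if it is missing.
java_path <- Sys.which("java")
java_found <- nzchar(unname(java_path))

if (java_found) {
  # `java -version` reports the installed Java version.
  system2("java", "-version")
} else {
  message("Java was not found on the PATH; install it before running install_tika().")
}
```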
Because the `rJava` package is ***not*** required, installation is simple. You can cut and paste the following snippet:
```{r, eval=FALSE}
install.packages('rtika', repos = 'https://cloud.r-project.org')
library('rtika')
# You need to install the Apache Tika .jar once.
install_tika()
```
Read the [introductory article](https://docs.ropensci.org/rtika/articles/rtika_introduction.html) to get started.
## Key Features
* `tika_text()` to extract plain text.
* `tika_xml()` and `tika_html()` to get a structured XHTML rendition.
* `tika_json()` to get metadata as `.json`, with XHTML content.
* `tika_json_text()` to get metadata as `.json`, with plain text content.
* `tika()` is the main function; the functions above are wrappers around it.
* `tika_fetch()` to download files with a file extension matching the Content-Type.
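As a rough sketch of how these functions fit together, the following extracts metadata from one of the test files bundled with `rtika` and parses it in R. It assumes the `jsonlite` package is installed and that Tika has already been set up with `install_tika()`. The metadata fields returned vary by file type and Tika version.

```r
library('rtika')

# A sample PDF shipped with the package.
pdf <- system.file("extdata", "jsonlite.pdf", package = "rtika")

# tika_json() returns one JSON string per input file.
json <- tika_json(pdf)

# Parse the JSON string into an R object to inspect it.
metadata <- jsonlite::fromJSON(json[1])

# Field names depend on the document and the Tika version.
names(metadata)
```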
## Supported File Types
Tika parses and extracts text or metadata from over one thousand digital formats, including:
* Portable Document Format (`.pdf`)
* Microsoft Office document formats (Word, PowerPoint, Excel, etc.)
* Rich Text Format (`.rtf`)
* Electronic Publication Format (`.epub`)
* Image formats (`.jpeg`, `.png`, etc.)
* Mail formats (`.mbox`, Outlook)
* HyperText Markup Language (`.html`)
* XML and derived formats (`.xml`, etc.)
* Compression and packaging formats (`.gzip`, `.rar`, etc.)
* OpenDocument Format
* iWork document formats
* WordPerfect document formats
* Text formats
* Feed and Syndication formats
* Help formats
* Audio formats
* Video formats
* Java class files and archives
* Source code
* CAD formats
* Font formats
* Scientific formats
* Executable programs and libraries
* Crypto formats
For a list of MIME types, look for the "Supported Formats" page here: https://tika.apache.org/
## Get Plain Text
**The `rtika` package processes batches of documents efficiently**, so I recommend passing many files in a single call.
Currently, the `tika()` parsers take a moment to start up on every call, and that delay adds up over hundreds of separate calls to the functions.
```{r}
# Test files
batch <- c(
system.file("extdata", "jsonlite.pdf", package = "rtika"),
system.file("extdata", "curl.pdf", package = "rtika"),
system.file("extdata", "table.docx", package = "rtika"),
system.file("extdata", "xml2.pdf", package = "rtika"),
system.file("extdata", "R-FAQ.html", package = "rtika"),
system.file("extdata", "calculator.jpg", package = "rtika"),
system.file("extdata", "tika.apache.org.zip", package = "rtika")
)
# batches are best, and can also be piped with magrittr.
text <- tika_text(batch)
# text has one string for each document:
length(text)
# A snippet:
cat(substr(text[1], 54, 190))
```
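Batch options can be set in the same call. The sketch below assumes that `tika_text()` forwards extra arguments such as `threads` to `tika()`, and it uses the `magrittr` pipe. The thread count of 4 is an arbitrary value to tune for your machine.

```r
library('rtika')
library('magrittr')

# A smaller batch of the bundled test files.
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika")
)

# One call handles the whole batch; `threads` is assumed to be passed on to tika().
text <- batch %>% tika_text(threads = 4)

# Number of characters extracted from each document.
nchar(text)
```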
To learn more and find out how to extract structured text and metadata, read the vignette: https://docs.ropensci.org/rtika/articles/rtika_introduction.html.
## Enhancements
Tika can also interact with the Tesseract OCR program on some Linux variants to extract plain text from images of text. If `tesseract-ocr` is installed, Tika should automatically locate and use it for images and PDFs that contain images of text. However, this does not seem to work on OS X or Windows. To try it on Linux, first follow the [Tesseract installation instructions](https://github.com/tesseract-ocr/tesseract/wiki). The next time Tika is run, it should work. For a different approach, I suggest the [`tesseract`](https://github.com/ropensci/tesseract) package by @jeroen, which is a specialized R interface.
The Apache Tika community welcomes your feedback. Issues regarding the R interface should be raised at the [`rtika` GitHub Issue Tracker](https://github.com/ropensci/rtika/issues). If you are confident the issue concerns Tika or one of its underlying parsers, use the [Tika Bugtracking System](https://issues.apache.org/jira/projects/TIKA/issues).
## Using the Tika App Directly
If your project or package needs to use the Tika App `.jar`, you can include `rtika` as a dependency and call the `rtika::tika_jar()` function to get the path to the Tika app installed on the system.
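For instance, a minimal sketch of calling the Tika app from R might look like the following. It assumes `java` is on the `PATH` and uses the Tika app's `--version` command-line option; the variable names `jar` and `tika_version` are illustrative.

```r
library('rtika')

# Path to the installed Tika app .jar; NA if install_tika() has not been run.
jar <- tika_jar()

if (!is.na(jar)) {
  # Ask the Tika app for its version via the command line.
  # shQuote() protects paths that contain spaces.
  tika_version <- system2("java", c("-jar", shQuote(jar), "--version"), stdout = TRUE)
  print(tika_version)
}
```

Checking for `NA` first avoids handing a missing path to Java when the `.jar` has not been installed yet.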
## Similar R Packages
There are a number of specialized parsers that overlap in functionality. For example, the [`pdftools`](https://github.com/ropensci/pdftools) package extracts metadata and text from PDF files, the [`antiword`](https://github.com/ropensci/antiword) package extracts text from older Word (`.doc`) files, and the [`epubr`](https://github.com/ropensci/epubr) package by @leonawicz processes `epub` files. These packages do not depend on Java, while `rtika` does.
The big difference between Tika and a specialized parser is that Tika integrates dozens of specialist libraries maintained by the Apache Foundation. Apache Tika processes over a thousand file types and multiple versions of each. This eases the processing of digital archives that contain unpredictable files. For example, researchers use Tika to process archives from court cases, governments, or the Internet Archive that span multiple years. These archives frequently contain diverse formats and multiple versions of each format. Because Tika finds the matching parser for each individual file, it is well suited to diverse sets of documents. In general, the parsing quality is consistently good. In contrast, specialized parsers may only work with a particular version of a file, or require extra tinkering.
On the other hand, a specialized library can offer more control and features when it comes to structured data and formatting. For example, the [`tabulizer`](https://github.com/ropensci/tabulizer) package by @leeper and @tpaskhalis includes bindings to the 'Tabula PDF Table Extractor Library'. Because PDF files store tables as a series of positions with no obvious boundaries between data cells, extracting a `data.frame` or `matrix` requires heuristics and customization, which that package provides. To be fair to Tika, there are some formats where `rtika` will extract data as table-like XML. For example, with Word and Excel documents, Tika extracts simple tables as XHTML data that can be turned into a tabular `data.frame` using the `rvest::html_table()` function.
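The Word case can be sketched as follows, using the `table.docx` test file bundled with `rtika`. This assumes the `xml2` and `rvest` packages are installed; how many tables come back, and how clean they are, depends on the document.

```r
library('rtika')

# A sample Word document containing a table, shipped with the package.
docx <- system.file("extdata", "table.docx", package = "rtika")

# Render the document as XHTML.
html <- tika_html(docx)

# Parse the XHTML and convert every <table> element into a data frame.
tables <- rvest::html_table(xml2::read_html(html[1]))

# `tables` is a list with one element per table found.
length(tables)
```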
## History
In September 2017, github.com user *kyusque* released `tikaR`, which uses the `rJava` package to interact with Tika (See: <https://github.com/kyusque/tikaR>).
At the time of writing, it provided similar text and metadata extraction, but only `xml` output.
Back in March 2012, I started a similar project to interface with Apache Tika.
My code also used low-level functions from the `rJava` package.
I halted development after discovering that the Tika command line interface (CLI) was easier to use.
My empty repository is at <https://r-forge.r-project.org/projects/r-tika/>.
I chose to finally develop this package after getting excited by Tika's new 'batch processor' module, written in Java.
The batch processor has very good efficiency when processing tens of thousands of documents.
Further, it is not too slow for a single document either, and handles errors gracefully.
Connecting `R` to the Tika batch processor turned out to be relatively simple, because the `R` code only has to use the CLI to point Tika to the files.
This simplicity, along with continuous testing, should ease integration.
I anticipate that some researchers will need plain text output, while others will want `json` output.
Some will want multiple processing threads to speed things up.
These features are now implemented in `rtika`, although apparently not in `tikaR` yet.
## Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/ropensci/rtika/blob/master/CONDUCT.md). By participating in this project you agree to abide by its terms.
[![ropensci\_footer](http://ropensci.org/public_images/github_footer.png)](https://ropensci.org)