Enable custom Tika Parser #498

Closed
wants to merge 12 commits into from
179 changes: 178 additions & 1 deletion README.md
@@ -622,7 +622,8 @@ Here is a list of Local FS settings (under `fs.` prefix):
| `fs.continue_on_error` | `false` | [Continue on File Permission Error](#continue-on-error) (from 2.3) |
| `fs.pdf_ocr` | `true` | [Run OCR on PDF documents](#ocr-integration) (from 2.3) |
| `fs.indexed_chars` | `100000.0` | [Extracted characters](#extracted-characters) |
| `fs.checksum` | `null` | [File Checksum](#file-checksum) |
| `fs.custom_tika_parsers` | `null` | [Custom Tika Parsers](#custom-tika-parsers) |

#### Root directory

@@ -1198,6 +1199,182 @@ to compute the checksum, such as `MD5` or `SHA-1`.
}
```

#### Custom Tika Parsers

An existing Tika parser may not extract the information you need, or there may be no parser at all for a given format.
This setting lets you use a custom parser instead.
Each parser must be provided as a `.jar` file, but the jar does not need to be on any classpath.
Note that this setting is an array, so several parsers can be declared. Here is an example with just one:

```json
{
  "name": "test",
  "fs": {
    "custom_tika_parsers": [
      {
        "class_name": "org.me.MyParser",
        "path_to_jar": "/some/full/path/to/myParser-0.0.1-SNAPSHOT.jar",
        "mime_types": ["application/dns", "or-another-mimetype-from-tika"]
      }
    ]
  }
}
```
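
The statement that the jar does not need to be on any classpath can be illustrated with plain JDK class loading. The following is a minimal sketch under stated assumptions, not FSCrawler's actual implementation: the class `JarClassLoading` and its `loadClass` helper are hypothetical names, and the sketch assumes the parser class has a public no-argument constructor.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical helper (not part of FSCrawler): loads a class from a jar that is not on the classpath. */
final class JarClassLoading {

    private JarClassLoading() {
    }

    static Class<?> loadClass(String pathToJar, String className)
            throws MalformedURLException, ClassNotFoundException {
        // Turn the file system path into a URL the class loader understands.
        URL jarUrl = new File(pathToJar).toURI().toURL();
        // The parent loader is asked first; the jar is only searched if the parent
        // cannot find the class. The loader is deliberately left open, because
        // closing it would stop the loaded class from lazily loading further
        // classes from the same jar.
        URLClassLoader loader =
                new URLClassLoader(new URL[] {jarUrl}, JarClassLoading.class.getClassLoader());
        return Class.forName(className, true, loader);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: java JarClassLoading <path_to_jar> <class_name>");
            return;
        }
        Class<?> parserClass = loadClass(args[0], args[1]);
        // Assumption: the parser class has a public no-argument constructor.
        Object parser = parserClass.getDeclaredConstructor().newInstance();
        System.out.println("Loaded " + parser.getClass().getName());
    }
}
```

With the values from the example above, the two arguments would correspond to `path_to_jar` and `class_name`. Because the parent loader is consulted first, Tika classes already shipped with the application are shared with the custom parser rather than loaded a second time from the jar.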

Some information about creating a custom parser is available in the [Tika parser guide](https://tika.apache.org/1.17/parser_guide.html).
Alternatively, use an existing parser as a blueprint. Make sure to choose the correct branch.
At the time of this writing, FSCrawler uses Tika 1.17, while the main Tika branch on GitHub is 2.x.
The parsers from ["branch_1x"](https://github.com/apache/tika/tree/branch_1x/tika-parsers/src/main/java/org/apache/tika/parser) should work fine.

To build the custom parser separately, a pom file can be derived from the tika-parsers [pom.xml](https://github.com/apache/tika/blob/branch_1x/tika-parsers/pom.xml).
Much of it can probably be left out. Here is an example that requires fontbox.
(The exclusions are copied 1:1 from FSCrawler's pom.xml, to be on the safe side.)

<details><summary>Example pom.xml</summary>
<p>


```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>

<groupId>org.me</groupId>
<artifactId>myParser</artifactId>
<version>0.0.1-SNAPSHOT</version>

<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<tika.version>1.17</tika.version>
<fontbox.version>2.0.8</fontbox.version>
</properties>

<build>
<sourceDirectory>src</sourceDirectory>
<!--<testSourceDirectory>test</testSourceDirectory>-->
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>3.0.0-M1</version>
<configuration>
<doclint>all,-missing,-accessibility</doclint>
<quiet>true</quiet>
</configuration>
</plugin>
</plugins>
</build>

<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>${tika.version}</version>
<exclusions>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>netcdf</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>cdm</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>httpservices</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>grib</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>netcdf4</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>com.uwyn</groupId>
<artifactId>jhighlight</artifactId>
</exclusion>
<!-- ES core already has these -->
<!-- Review (Owner): ES Core -> FSCrawler Core? -->
<!-- Review (Author): I copied the exclusions from FSCrawler's root pom.xml. The line is still there at the time of this writing. Can it be removed, perhaps? -->
<!-- Review (Owner): That's true. Just remove it here; I will also remove it from the main pom.xml in the future. -->
<exclusion>
<groupId>org.ow2.asm</groupId>
<artifactId>asm-debug-all</artifactId>
</exclusion>
<exclusion>
<groupId>commons-logging</groupId>
<artifactId>commons-logging-api</artifactId>
</exclusion>
<!-- Must be removed because it conflicts with Jersey (another JaxRS implementation) -->
<exclusion>
<groupId>org.apache.cxf</groupId>
<artifactId>cxf-rt-rs-client</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>fontbox</artifactId>
<version>${fontbox.version}</version>
<exclusions>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>netcdf</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>cdm</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>httpservices</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>grib</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>edu.ucar</groupId>
<artifactId>netcdf4</artifactId>
</exclusion>
<!-- Not Apache2 License compatible -->
<exclusion>
<groupId>com.uwyn</groupId>
<artifactId>jhighlight</artifactId>
</exclusion>
<!-- ES core already has these -->
<exclusion>
<groupId>org.ow2.asm</groupId>
<artifactId>asm-debug-all</artifactId>
</exclusion>
<exclusion>
<groupId>commons-logging</groupId>
<artifactId>commons-logging-api</artifactId>
</exclusion>
<!-- Must be removed because it conflicts with Jersey (another JaxRS implementation) -->
<exclusion>
<groupId>org.apache.cxf</groupId>
<artifactId>cxf-rt-rs-client</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>
</project>
```

</p>
</details>

### SSH settings

@@ -0,0 +1,90 @@
package fr.pilato.elasticsearch.crawler.fs.settings;

import java.util.ArrayList;
import java.util.List;

public class CustomTikaParser {

    private String className = "";
    private String pathToJar = "";
    private List<String> mimeTypes = new ArrayList<>();

    public static Builder builder() {
        return new Builder();
    }

    public static class Builder {

        private String className = "";
        private String pathToJar = "";
        private List<String> mimeTypes = new ArrayList<>();

        public Builder setClassName(String className) {
            this.className = className;
            return this;
        }

        public Builder setPathToJar(String pathToJar) {
            this.pathToJar = pathToJar;
            return this;
        }

        public Builder setMimeTypes(List<String> mimeTypes) {
            this.mimeTypes = mimeTypes;
            return this;
        }

        public CustomTikaParser build() {
            return new CustomTikaParser(className, pathToJar, mimeTypes);
        }
    }

    public CustomTikaParser() {

    }

    private CustomTikaParser(String className, String pathToJar, List<String> mimeTypes) {
        this.className = className;
        this.pathToJar = pathToJar;
        this.mimeTypes = mimeTypes;
    }

    public String getClassName() {
        return className;
    }

    public void setClassName(String className) {
        this.className = className;
    }

    public String getPathToJar() {
        return pathToJar;
    }

    public void setPathToJar(String pathToJar) {
        this.pathToJar = pathToJar;
    }

    public List<String> getMimeTypes() {
        return mimeTypes;
    }

    public void setMimeTypes(List<String> mimeTypes) {
        this.mimeTypes = mimeTypes;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        CustomTikaParser ctp = (CustomTikaParser) o;

        if (className != null ? !className.equals(ctp.className) : ctp.className != null) return false;
        if (pathToJar != null ? !pathToJar.equals(ctp.pathToJar) : ctp.pathToJar != null) return false;
        return mimeTypes != null ? mimeTypes.equals(ctp.mimeTypes) : ctp.mimeTypes == null;
    }

    // equals() is overridden, so hashCode() must be overridden too:
    // equal objects have to produce equal hash codes.
    @Override
    public int hashCode() {
        int result = className != null ? className.hashCode() : 0;
        result = 31 * result + (pathToJar != null ? pathToJar.hashCode() : 0);
        result = 31 * result + (mimeTypes != null ? mimeTypes.hashCode() : 0);
        return result;
    }

}
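
The settings class above carries a class name, a jar path, and a list of MIME types per parser. One way a consumer might use a list of such entries is to index them by MIME type, so that the parser for a given document type can be looked up directly. The sketch below is only an illustration: `MimeTypeIndex` and `ParserEntry` are hypothetical stand-ins (not FSCrawler classes), and the "first entry wins" rule for duplicate MIME types is an assumption, since the pull request does not state how conflicts are resolved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not part of FSCrawler): index custom parser entries by MIME type. */
final class MimeTypeIndex {

    /** Hypothetical stand-in for one `custom_tika_parsers` entry. */
    static final class ParserEntry {
        final String className;
        final String pathToJar;
        final List<String> mimeTypes;

        ParserEntry(String className, String pathToJar, List<String> mimeTypes) {
            this.className = className;
            this.pathToJar = pathToJar;
            this.mimeTypes = mimeTypes;
        }
    }

    private MimeTypeIndex() {
    }

    static Map<String, ParserEntry> indexByMimeType(List<ParserEntry> entries) {
        // LinkedHashMap keeps the declaration order of the MIME types.
        Map<String, ParserEntry> index = new LinkedHashMap<>();
        for (ParserEntry entry : entries) {
            for (String mimeType : entry.mimeTypes) {
                // Assumption: if two entries claim the same MIME type, the first one wins.
                index.putIfAbsent(mimeType, entry);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<ParserEntry> entries = Arrays.asList(
                new ParserEntry("org.me.MyParser",
                        "/some/full/path/to/myParser-0.0.1-SNAPSHOT.jar",
                        Arrays.asList("application/dns")));
        Map<String, ParserEntry> index = indexByMimeType(entries);
        System.out.println(index.get("application/dns").className);
    }
}
```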
@@ -49,6 +49,7 @@ public class Fs {
private boolean continueOnError = false;
private boolean pdfOcr = true;
private Ocr ocr = new Ocr();
private List<CustomTikaParser> customTikaParsers = new ArrayList<>();

public static Builder builder() {
return new Builder();
@@ -80,6 +81,7 @@ public static class Builder {
private boolean continueOnError = false;
private boolean pdfOcr = true;
private Ocr ocr = new Ocr();
private List<CustomTikaParser> customTikaParsers = new ArrayList<>();

public Builder setUrl(String url) {
this.url = url;
@@ -212,10 +214,15 @@ public Builder setOcr(Ocr ocr) {
return this;
}

public Builder setTikaCustomParsers(List<CustomTikaParser> customTikaParsers) {
this.customTikaParsers = customTikaParsers;
return this;
}

public Fs build() {
return new Fs(url, updateRate, includes, excludes, jsonSupport, filenameAsId, addFilesize,
removeDeleted, addAsInnerObject, storeSource, indexedChars, indexContent, attributesSupport, rawMetadata,
checksum, xmlSupport, indexFolders, langDetect, continueOnError, pdfOcr, ocr, customTikaParsers);
}
}

@@ -226,7 +233,7 @@ public Fs( ) {
private Fs(String url, TimeValue updateRate, List<String> includes, List<String> excludes, boolean jsonSupport,
boolean filenameAsId, boolean addFilesize, boolean removeDeleted, boolean addAsInnerObject, boolean storeSource,
Percentage indexedChars, boolean indexContent, boolean attributesSupport, boolean rawMetadata, String checksum, boolean xmlSupport,
boolean indexFolders, boolean langDetect, boolean continueOnError, boolean pdfOcr, Ocr ocr, List<CustomTikaParser> customTikaParsers) {
this.url = url;
this.updateRate = updateRate;
this.includes = includes;
@@ -248,6 +255,7 @@ private Fs(String url, TimeValue updateRate, List<String> includes, List<String>
this.continueOnError = continueOnError;
this.pdfOcr = pdfOcr;
this.ocr = ocr;
this.customTikaParsers = customTikaParsers;
}

public String getUrl() {
@@ -418,6 +426,14 @@ public void setOcr(Ocr ocr) {
this.ocr = ocr;
}

public List<CustomTikaParser> getCustomTikaParsers() {
return customTikaParsers;
}

public void setCustomTikaParsers(List<CustomTikaParser> customTikaParsers) {
this.customTikaParsers = customTikaParsers;
}

@Override
public boolean equals(Object o) {
if (this == o) return true;
@@ -444,6 +460,7 @@ public boolean equals(Object o) {
if (includes != null ? !includes.equals(fs.includes) : fs.includes != null) return false;
if (excludes != null ? !excludes.equals(fs.excludes) : fs.excludes != null) return false;
if (indexedChars != null ? !indexedChars.equals(fs.indexedChars) : fs.indexedChars != null) return false;
if (customTikaParsers != null ? !customTikaParsers.equals(fs.customTikaParsers) : fs.customTikaParsers != null) return false;
return checksum != null ? checksum.equals(fs.checksum) : fs.checksum == null;

}
@@ -470,6 +487,7 @@ public int hashCode() {
result = 31 * result + (langDetect ? 1 : 0);
result = 31 * result + (continueOnError ? 1 : 0);
result = 31 * result + (pdfOcr ? 1 : 0);
result = 31 * result + (customTikaParsers != null ? customTikaParsers.hashCode() : 0);
return result;
}
}