This project repackages the cruftstripper Ruby script from the Data Science Toolkit into a JRuby Jar file that can run on servers with only a JVM installed.
The Warbler gem is used to compile the Sentences Jar file. Install it with:
gem install warbler
Yeah, yeah, this project probably needs a Gemfile so Bundler can keep track of things, a Rakefile to call Warbler, and a layout that follows standard Ruby conventions. Really, I should figure out how to get Maven to build it. For now, build the jar directly:
warble compiled runnable jar sentences.rb
A simple Bash script, juice.sh, wraps the Sentences Jar and uses the GNU strings command to extract readable text from any binary file.
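To make the idea concrete, here is a minimal sketch of what a juice.sh-style wrapper could look like. It is not the project's actual script. It assumes the jar is named sentences.jar, sits in the current directory, and reads text on standard input; the SENTENCES_CMD variable is a hypothetical override added only so the pipeline can be tried without a JVM.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a juice.sh-style wrapper, not the project's real script.

juice() {
  local file="$1"
  # Assumption: the jar is called sentences.jar and filters text from stdin.
  # SENTENCES_CMD is a made-up override for swapping in another filter.
  local cmd="${SENTENCES_CMD:-java -jar sentences.jar}"

  if [ ! -r "$file" ]; then
    echo "juice: cannot read '$file'" >&2
    return 1
  fi

  # strings prints the runs of printable characters found in the file;
  # the filter command then decides which of those lines to keep.
  # $cmd is left unquoted on purpose so it splits into command and arguments.
  strings "$file" | $cmd
}
```

With the function loaded, something like `juice report.pdf > report.txt` would write the extracted text to a file, where report.pdf stands in for any binary input.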
Here's a DSpace media filter plugin that uses the juice command to create a full-text index for any binary file.
Pull requests are welcome. This is a work in progress, albeit a pretty useful one.