# indexing document text and metadata for analysis

using a handful of open source tools to examine a large collection of documents:

- Apache Tika (document text and metadata processing)
- Solr (text indexing, search, and faceting)
- ICIJ Extract (for processing documents into e.g. a Solr index)
- Blacklight (Ruby on Rails engine for exploring data stored in a Solr index)

## set up the processing tools

### additions to solrconfig.xml

  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />

  <requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

### build the jar for ICIJ's extract tool

```sh
# the javadoc and gpg tasks gave me build errors, and signing is only needed for
# publishing the build, not for running it locally, so skip them (and the tests)
$ mvn install -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -Dgpg.skip=true
```

## index some documents

### test on one PDF using curl

```sh
$ curl 'http://localhost:8983/solr/document_core/update/extract?literal.id=helloworld&commit=true' -F "myfile=@/path/to/your.pdf"
{
  "responseHeader":{
    "status":0,
    "QTime":2179}}
```

### in bulk using extract

run something like:

```sh
# from the extract project directory, after building
$ java -jar extract-cli/target/extract-cli-3.8.1.jar spew --ocr no -o solr -s 'http://localhost:8983/solr/document_core' --commitInterval 500 'my_cool_pdf_directory'
```

### TODO: try calling tika as a CLI process from a script, no extract

extract might be overkill here. try calling tika directly on the PDFs (since those are the only documents at the moment) and then indexing just the fields i choose from the parse output, using python or ruby. i had trouble figuring out how to get tika to index the fields i wanted.

both pysolr and rsolr seem fine. the tika bindings for python seem better than the ruby ones, so it's probably easier to write the indexing code in python than ruby? either way it should be easy to do simple text extraction with tika, build a solr document from that, and generate the same document IDs as ICIJ extract (i think those are just a hash of the document content?). rough sketch below.

https://github.com/ICIJ/extract/blob/master/extract-lib/src/main/java/org/icij/extract/document/DocumentFactory.java#L44
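
a rough sketch of that idea, shelling out to the tika-app CLI and indexing with rsolr (the same shape would work in python with tika-python and pysolr). the tika jar path, input directory, and field names are placeholders, and the content-hash ID is only a loose approximation of what extract's DocumentFactory does:

```ruby
# sketch: extract plain text by shelling out to tika-app, then index with rsolr.
# TIKA_JAR, the input directory, and the field names are placeholders.
require 'digest'
require 'open3'
require 'rsolr'

TIKA_JAR = 'tika-app.jar' # wherever the tika-app jar lives
solr = RSolr.connect(url: 'http://localhost:8983/solr/document_core')

Dir.glob('my_cool_pdf_directory/**/*.pdf').each do |pdf|
  # plain text only; tika-app can also emit metadata (--metadata) or json (--json)
  text, status = Open3.capture2('java', '-jar', TIKA_JAR, '--text', pdf)
  next unless status.success?

  solr.add(
    id: Digest::SHA256.hexdigest(text), # content hash, roughly like extract's IDs
    _text_: text,                       # matches the fmap.content mapping above
    source_path_s: pdf                  # illustrative field, not part of any schema here
  )
end

solr.commit
```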

related docs and libraries:

- https://solr.apache.org/guide/8_8/overview-of-documents-fields-and-schema-design.html
- https://github.com/projectblacklight/blacklight/wiki/Blacklight-configuration
- https://github.com/chrismattmann/tika-python
- https://github.com/django-haystack/pysolr/
- https://github.com/mrcsparker/ruby_tika_app
- https://github.com/duke-libraries/tika-client
- https://github.com/kanety/tikarb
- https://github.com/rsolr/rsolr

this seems to do something similar to what i'm trying to do: https://github.com/EricLondon/Docker-Rails-Tika-Elasticsearch / https://ericlondon.com/2017/02/01/integrate-tika-rest-service-with-rails-paperclip-attachments-to-extract-text-from-pdf-documents-and-store-in-elasticsearch.html

not sure how useful this is, but maybe? https://github.com/chrismattmann/tika-similarity

## explore your indexed documents

already done in this repo; written up here in case it helps spin up something similar in the future...

### install rails

https://guides.rubyonrails.org/getting_started.html

```sh
$ rails new . --skip-spring --skip-listen
```

### install blacklight

https://github.com/projectblacklight/blacklight/wiki/Quickstart

- confirm the dependencies (ruby, solr, node, yarn, etc) are installed
- go to the "Creating a new application the hard way" section -- https://github.com/projectblacklight/blacklight/wiki/Quickstart#creating-a-new-application-the-hard-way
- skip solr_wrapper because you're already running solr in docker
- `rails generate blacklight:install --devise --solr_version=latest`
- `bin/rails db:migrate`
- `rails s`
- modify the default catalog controller a bunch, because it expects a totally different schema than you get from tika throwing PDFs at solr (see the sketch below)
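
roughly the kind of change that last step means -- a trimmed-down catalog controller pointing blacklight at tika-ish fields. the field names (`title`, `author`, `source_path_s`, `content_type`) are assumptions carried over from the indexing sketch above, not a real schema:

```ruby
# app/controllers/catalog_controller.rb (excerpt) -- a sketch, not the generated file.
# swap the field names below for whatever your solr core actually contains.
class CatalogController < ApplicationController
  include Blacklight::Catalog

  configure_blacklight do |config|
    config.default_solr_params = { rows: 10 }

    # the generated controller expects fields like title_tsim; point it at
    # the fields tika / the indexing script actually produced
    config.index.title_field = 'title'

    config.add_index_field 'author', label: 'Author'
    config.add_index_field 'source_path_s', label: 'Source file'

    config.add_show_field 'author', label: 'Author'
    config.add_show_field 'source_path_s', label: 'Source file'

    config.add_facet_field 'content_type', label: 'Content type'

    config.add_search_field 'all_fields', label: 'All Fields'
  end
end
```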