
Recipe: Add full text indexing to your app

Michael J. Giarlo edited this page Oct 6, 2015 · 2 revisions

There are a number of areas in the Hydra stack that need to be touched to do full-text indexing. Sufia supports full-text indexing using Apache Tika (which ships with Apache Solr), and here's how it's implemented. (Note: if you're using Sufia, this is already done for you!)

Solr

The Solr schema contains a field called all_text_timv.

The Solr config loads the extraction libraries and adds the all_text_timv field to the default qf (query fields) and pf (phrase fields) parameters. The ExtractingRequestHandler must be enabled as well.

Extraction libraries

Sufia uses a rake task to download extraction libraries and store them where Solr looks for them.

Blacklight catalog

The all_text_timv field is added to the qf of the all_fields search in the Catalog controller.
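
As a rough illustration of what "adding the field to qf" amounts to, here is a minimal plain-Ruby sketch that builds the Solr parameters for an all-fields search. The helper name `all_fields_solr_parameters` and the metadata field names in the usage note are hypothetical; only `all_text_timv` comes from this page.

```ruby
# Hypothetical helper: build the Solr parameters for an "all fields" search
# so that the full-text field is queried alongside the metadata fields.
def all_fields_solr_parameters(metadata_fields, full_text_field = 'all_text_timv')
  # qf is a space-separated list of the fields Solr should search.
  fields = (metadata_fields + [full_text_field]).uniq
  { qf: fields.join(' ') }
end
```

For example, `all_fields_solr_parameters(%w[title_tesim description_tesim])` returns a hash whose `:qf` value ends with `all_text_timv`. In a Blacklight catalog controller, a hash like this would typically be assigned to `field.solr_parameters` inside a `config.add_search_field('all_fields')` block.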

Modeling

Sufia's GenericFile model mixes in a module that knows how to talk to Solr's ExtractingRequestHandler. (The #extract_content method is where that happens.)
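
To make the conversation with the ExtractingRequestHandler more concrete, here is a minimal sketch that posts a file's bytes to Solr in extract-only mode and reads back the text. It is not Sufia's actual code. The method names are chosen for illustration, the handler is assumed to be mounted at `/update/extract`, and the extracted text is assumed to come back under the empty-string key of the JSON response (which is what Solr returns for an unnamed content stream; check this against your Solr version).

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the extract-only URL for a Solr core (assumes the handler lives at
# /update/extract). extractOnly=true asks Solr to return the Tika output
# instead of indexing the document.
def extract_uri(solr_url)
  uri = URI.parse("#{solr_url.chomp('/')}/update/extract")
  uri.query = URI.encode_www_form(extractOnly: 'true', extractFormat: 'text', wt: 'json')
  uri
end

# Pull the extracted text out of the JSON response body.
# Assumption: the text is stored under the empty-string key.
def parse_extracted_text(body)
  JSON.parse(body)[''].to_s.strip
end

# POST the raw file bytes to Solr and return the extracted plain text.
def extract_content(solr_url, bytes, mime_type)
  uri = extract_uri(solr_url)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = mime_type
  request.body = bytes
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  raise "Solr extraction failed (HTTP #{response.code})" unless response.is_a?(Net::HTTPSuccess)
  parse_extracted_text(response.body)
end
```

A call might look like `extract_content('http://localhost:8983/solr/development', File.binread('report.pdf'), 'application/pdf')`, where both the Solr URL and the file name are placeholders.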

Indexing

Sufia has an indexing service that takes the output of Apache Tika and indexes it in Solr. (This is the equivalent of overriding #to_solr on an ActiveFedora model.)
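
The idea can be sketched in plain Ruby as follows. This is a simplified stand-in, not Sufia's indexing service: the class name and the `generate_solr_document` method are illustrative, and a plain hash takes the place of the Solr document that ActiveFedora would normally build.

```ruby
# Illustrative stand-in for an indexing service: it takes an already-built
# Solr document (a Hash) and adds the Tika-extracted text to it.
class FullTextIndexingSketch
  FULL_TEXT_FIELD = 'all_text_timv'

  def initialize(base_document, extracted_text)
    @base_document = base_document
    @extracted_text = extracted_text.to_s.strip
  end

  # Return a copy of the base document with the full-text field added.
  # The field is left out entirely when no text was extracted.
  def generate_solr_document
    doc = @base_document.dup
    doc[FULL_TEXT_FIELD] = @extracted_text unless @extracted_text.empty?
    doc
  end
end
```

Leaving the field out when extraction yields nothing keeps empty values out of the index, which is one reasonable design choice among several.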

Workflow

When a file is uploaded, Sufia spawns a background job that characterizes the file. The #characterize method calls #append_metadata, which in turn calls the #extract_content method, which hits Apache Tika via the Solr API.
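
The call chain can be traced with a small stand-alone sketch. Only the three method names come from this page. The classes are hypothetical, the job runs inline rather than in a queue, and a callable stands in for the actual request to Solr.

```ruby
# Hypothetical file object that records the order in which its methods run.
class FileSketch
  attr_reader :full_text, :calls

  # extractor is any callable standing in for the Solr/Tika request.
  def initialize(extractor)
    @extractor = extractor
    @calls = []
  end

  def characterize
    @calls << :characterize
    append_metadata
  end

  def append_metadata
    @calls << :append_metadata
    extract_content
  end

  def extract_content
    @calls << :extract_content
    @full_text = @extractor.call
  end
end

# Hypothetical job: in a real app this would be queued after upload.
class CharacterizeJobSketch
  def initialize(file)
    @file = file
  end

  def run
    @file.characterize
  end
end
```

Running `CharacterizeJobSketch.new(FileSketch.new(-> { 'extracted text' })).run` walks through characterize, append_metadata, and extract_content in that order.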
