Configuring Solr and Search Relevancy
ArcLight is a Blacklight application, so all the Solr documentation in the Blacklight wiki is applicable, and a good place to start. This page will highlight some search-related aspects unique to ArcLight.
ArcLight's default data pipeline will index EAD2002 XML files into Solr through Traject (see docs). Two files establish the rules for capturing data from EAD into Solr fields. These files can be either extended or completely overridden by a local application using the ArcLight engine:
- Collection-level data: ead2_config.rb
- Component-level data: ead2_component_config.rb
The conventions used in those configs are explained in the Traject README and the Traject docs on capturing XML. Data is written into Solr fields. Most field names use "dynamic fields" and as such have suffixes like _ssim or _tesm. These suffixes tell Solr how the data in those fields should be indexed. Many EAD elements are intentionally captured into multiple Solr fields for different purposes.
Some example field suffixes found in ArcLight, and what they signify:
- _ssi: a string field that is stored, indexed for discovery, and can have only a single value
- _ssim: a string field that is stored, indexed for discovery, and can be multi-valued
- _tesim: includes (potentially a lot of) text, indexed for discovery using English-language analysis, multi-valued
- _html_tesm: includes raw XML, intended to be transformed into HTML for display, not indexed for discovery, multi-valued
The full list of dynamic field suffixes configured for use by ArcLight is in schema.xml, e.g.:
<dynamicField name="*_tesm" type="text_en" stored="true" indexed="false" multiValued="true" />
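The suffix convention described above can be sketched as a small decoder. This is only an illustration: `describe_suffix` is a hypothetical helper, not part of ArcLight or Blacklight, and it assumes the letter meanings implied by the examples on this page (`s` or `te` for the field type, then `s` for stored, `i` for indexed, `m` for multi-valued). schema.xml remains the authoritative list.

```ruby
# Hypothetical helper (not an ArcLight or Blacklight API).
# Decodes only the suffix letters discussed on this page.
TYPE_CODES = { 'te' => 'text_en', 's' => 'string' }.freeze

def describe_suffix(field_name)
  # Match the trailing "_<type><flags>" part, e.g. "_tesim" or "_ssi".
  match = field_name.match(/_(te|s)([sim]+)\z/)
  raise ArgumentError, "unrecognized dynamic field suffix: #{field_name}" unless match

  type_code = match[1]
  flags = match[2]
  {
    type: TYPE_CODES.fetch(type_code),
    stored: flags.include?('s'),
    indexed: flags.include?('i'),
    multi_valued: flags.include?('m')
  }
end

p describe_suffix('scopecontent_tesim')
p describe_suffix('scopecontent_html_tesm')
```

For `scopecontent_html_tesm` the sketch reports a stored, non-indexed, multi-valued `text_en` field, which agrees with the `*_tesm` definition quoted from schema.xml.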
Most fields that ArcLight captures in the indexing process will factor into search retrieval. There is a catch-all "text" field that is searched in a default All Fields search, and almost every field has its contents copied into it during indexing via copyField rules in schema.xml, e.g.:
<copyField source="scopecontent_tesim" dest="text" />
In solrconfig.xml, see the default SearchHandler configuration:
<requestHandler name="search" class="solr.SearchHandler" default="true">
ArcLight's <str name="qf"> and <str name="pf"> sections in solrconfig.xml list the fields that are searched by default. Note that these include the catch-all text field, but also list others, either because they are not copied into text or because they warrant additional boosting for relevance ranking.
Relative boosts are applied per field using ^, e.g., title_tesim^100 means a hit in the title field is 100x more important in scoring than a hit in the catch-all text field.
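To make the `^` syntax concrete, here is a minimal sketch that reads a qf-style string into a field-to-boost table. `parse_boosts` is a hypothetical helper written for this page, not something Solr or Blacklight provides, and the sample string is illustrative rather than ArcLight's actual configuration. Fields listed without `^` get Solr's default boost of 1.

```ruby
# Hypothetical helper: turns a qf/pf-style string into { field => boost }.
# Fields with no explicit boost default to 1.0, as in Solr.
def parse_boosts(field_list)
  field_list.split.map do |entry|
    field, boost = entry.split('^', 2)
    [field, boost ? Float(boost) : 1.0]
  end.to_h
end

boosts = parse_boosts('title_tesim^100 text')
p boosts
```

Reading the result, `title_tesim` carries a boost of 100.0 against 1.0 for `text`, which is the ratio the paragraph above describes.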
You may wish to apply larger boosts in the Phrase Fields (pf) configuration. This helps ensure that, for multi-term queries, the scoring boost is higher when the terms appear in close proximity to each other in the document.
ArcLight includes a basic config for Solr's mm (minimum should match) parameter in solrconfig.xml that indicates how many terms in a multi-term query need to match in a document to return it as a hit. This can be revised as needed for local implementations. For example:
<str name="mm">4&lt;90%</str>
<!-- 1-4 term query, all must match
5+ term query, 90% must match; rounded down, so e.g.:
* 5-10 term query, all but one must match
* 11-20 term query, all but two must match
* 21-30 term query, all but three must match, etc.
-->
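The arithmetic in that comment can be checked with a short sketch. `required_matches` is a hypothetical helper that handles only this single "threshold<percent%" form of mm, not the full syntax Solr accepts, and it assumes the round-down behavior stated in the comment.

```ruby
# Hypothetical helper: resolves an mm spec of the form "4<90%" only.
# Up to `threshold` terms, every term must match; beyond that,
# `percent` of the terms must match, rounded down.
def required_matches(term_count, threshold: 4, percent: 90)
  return term_count if term_count <= threshold

  (term_count * percent) / 100 # integer division rounds down
end

[3, 5, 10, 11, 20].each do |n|
  puts "#{n} terms: #{required_matches(n)} must match"
end
```

For an 11-term query this gives 9 required matches (90% of 11 is 9.9, rounded down), so two terms may be missing.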
ArcLight's search box includes a dropdown to search within a particular field. The list of fields is set via config in the CatalogController, e.g., for searching by Place:
config.add_search_field 'place', label: 'Place' do |field|
  field.qt = 'search'
  field.solr_parameters = {
    qf: '${qf_place}',
    pf: '${pf_place}'
  }
end
The Solr fields included in Place search, and their relative boosts, are set in solrconfig.xml, e.g.:
<str name="qf_place">
  place_teim
</str>
<str name="pf_place">
  place_teim
</str>
You may wish to boost Phrase Field (pf) matches higher with ^.
For each EAD file indexed, you'll have one Solr doc that encodes all the collection-level description (from the <archdesc> level), then one Solr doc for each individual component therein (<c>, <c01>, <c02>, etc.), with the corresponding component-level description encoded. When you search ArcLight using the default All Results view, you're searching all component and collection "documents" together -- they are all interleaved. You will see a linked breadcrumb trail in each result to help give a sense of the component's context within its collection.
One aspect of ArcLight that is distinct from other Blacklight apps is that search results can be grouped by collection.
Perhaps unintuitively, a top-level collection record is part of a collection group just like a component is. This means that the collection document itself can and often will appear as a matching record within the collection group. If it didn't work that way, you'd have no way to 1) show highlighted keyword-in-context snippets for query matches in the collection-level description; 2) have the collection-level description weigh heavily in relevance rank for a group.
On a search results page, the matching collection groups appear in relevance order, but there's more to it than meets the eye. They appear in order of their highest-scoring document for the query (remember, that document might be the top-level collection description itself or it might be an individual component from within the collection). E.g.:
Group: Collection A (note: this does not have a "score")
  Collection A doc (score 100)
  Component A1 doc (score 3)
  Component A2 doc (score 2)
Group: Collection B (note: this does not have a "score")
  Component B5 doc (score 95)
  Component B2 doc (score 80)
  Collection B doc (score 75)
Note that the number of components in a collection that match the query has no impact on the relevance ranking whatsoever. One highly relevant component in a not-very-relevant collection will make that collection group beat out a relevant collection that includes thousands of moderately-relevant components.
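The ordering rule can be illustrated with a few lines of Ruby using the scores from the example above. This is a sketch of the ranking behavior only, not how Solr or ArcLight implements result grouping; `Doc` and `rank_groups` are names invented for the illustration.

```ruby
# Illustration only: groups are ordered by their single best-scoring
# document, so the number of matching documents in a group is irrelevant.
Doc = Struct.new(:id, :collection, :score)

def rank_groups(docs)
  docs.group_by(&:collection)
      .map { |collection, members| [collection, members.sort_by { |doc| -doc.score }] }
      .sort_by { |_collection, members| -members.first.score }
end

docs = [
  Doc.new('B2', 'Collection B', 80),
  Doc.new('A1', 'Collection A', 3),
  Doc.new('B5', 'Collection B', 95),
  Doc.new('Collection A', 'Collection A', 100),
  Doc.new('Collection B', 'Collection B', 75),
  Doc.new('A2', 'Collection A', 2)
]

rank_groups(docs).each do |collection, members|
  puts "Group: #{collection}"
  members.each { |doc| puts "  #{doc.id} (score #{doc.score})" }
end
```

Collection A is listed first because its top document scores 100, even though every document in Collection B outscores the two Collection A components.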
All Blacklight 8 applications (including ArcLight) have an out-of-the-box Advanced Search page, reachable at path /catalog/advanced. See Blacklight docs on Advanced Search. Note that several versions of Solr 7-9 include a bug that requires a workaround for this to function as intended.
For more documentation about configuring Solr in any Blacklight application, consult the Blacklight wiki.
solrconfig.xml is mostly used for configuring the search relevancy and boosting settings.
schema.xml is mostly used for configuring how a document field should get parsed and analyzed.
There are several ways to modify document scores in Solr so that the most relevant results appear higher in the results list.
- Query time.
  - Query-time boosting with the DisMax Query Parser:
    - Using fields: The DisMax Query Parser creates a query that is executed across many different fields. The "qf" (query fields) parameter specifies the fields on which to execute the user query, and it lets you treat some fields as more important than others.
    - Using a phrase
    - Using a query: A boost query influences the score of the results. Example: bq=bookmarked:true boosts documents that are bookmarked, regardless of the user query.
    - Using functions: Similar to using queries.
    - Using the tie breaker: example from the PUL catalog.
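As a rough sketch of how these options combine in one request, the snippet below assembles DisMax parameters into a query string. The parameter names (`defType`, `qf`, `pf`, `bq`, `tie`) are standard DisMax parameters; the boost values, the `tie` value, and the `dismax_query_string` helper are assumptions chosen for illustration, not ArcLight defaults.

```ruby
require 'uri'

# Hypothetical helper: builds a DisMax query string for a user query.
# All boost values and the tie value are illustrative assumptions.
def dismax_query_string(user_query)
  params = {
    'defType' => 'dismax',
    'q'       => user_query,
    'qf'      => 'title_tesim^100 text',   # fields searched, with relative boosts
    'pf'      => 'title_tesim^200 text^10', # extra boost for phrase matches
    'bq'      => 'bookmarked:true',        # boost query, independent of user_query
    'tie'     => '0.01'                    # tie breaker across fields
  }
  URI.encode_www_form(params)
end

puts dismax_query_string('civil rights')
```

The resulting string could be appended to a Solr select URL; in a Blacklight application these parameters would more commonly live in solrconfig.xml or in the CatalogController configuration.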
Examples of Queries Submitted to the DisMax Query Parser
For more control over how a search term gets processed by Solr, look at the documentation for Understanding Analyzers, Tokenizers and Filters. Solr comes packaged with sample field configurations to handle searches in multiple languages.
Suggestions
- Solr relevancy tests, rspec-solr, pul_solr.
- Adding extra fields to boost exact matches: left-anchor fields, exact title fields.
- Exclude some words from stemming: protwords.txt.
- Have a version of the field that does language analysis and one that leaves it unchanged (example from the PUL catalog).