Configuring Solr and Search Relevancy
ArcLight is a Blacklight application, so all the Solr documentation in the Blacklight wiki is applicable, and a good place to start. This page will highlight some search-related aspects unique to ArcLight.
ArcLight's default data pipeline will index EAD2002 XML files into Solr through Traject (see docs). Two files establish the rules for capturing data from EAD into Solr fields. These files can be either extended or completely overridden by a local application using the ArcLight engine:
- Collection-level data: ead2_config.rb
- Component-level data: ead2_component_config.rb
The conventions used in those configs are explained in the Traject README and the Traject docs on capturing XML. Data is written into Solr fields. Most field names use "dynamic fields" and as such have suffixes like _ssim or _tesm. These indicate to Solr how the data in those fields should be indexed. Many EAD elements are intentionally captured into multiple Solr fields for different purposes.
ArcLight's Solr config files, located in /solr/conf, determine how the data within each field will be indexed, and how it should factor into search retrieval. You may wish to revise these files in your application to better suit your particular data. The two primary config files are solrconfig.xml and schema.xml.
Some example field suffixes found in ArcLight, and what they signify:
- _ssi: a string field that's stored, indexed for discovery, and can have only a single value
- _ssim: a string field that is stored, indexed for discovery, and can be multi-valued
- _tesim: includes (potentially a lot of) text, indexed for discovery using English, multi-valued
- _html_tesm: includes raw XML, intended to be transformed into HTML for display, not indexed for discovery, multi-valued
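The suffix convention above can be read letter by letter. The following is a minimal sketch, not part of ArcLight: `describe_suffix` is a hypothetical helper, and the letter-to-meaning mapping (a type code, then `s` for stored, `i` for indexed, `m` for multi-valued) is an assumption inferred from the examples listed here. The definitions in schema.xml remain the authoritative source for how any field actually behaves.

```ruby
# Hypothetical helper: interprets a dynamic-field suffix by naming convention only.
# It does not consult Solr; schema.xml decides the real behavior.
SUFFIX_PATTERN = /_(?<type>te|s|t)(?<stored>s?)(?<indexed>i?)(?<multi>m?)\z/

# Assumed mapping of type codes to Solr field types.
FIELD_TYPES = { 's' => 'string', 't' => 'text', 'te' => 'text_en' }.freeze

def describe_suffix(field_name)
  match = SUFFIX_PATTERN.match(field_name)
  return nil unless match

  {
    type: FIELD_TYPES.fetch(match[:type]),
    stored: !match[:stored].empty?,
    indexed: !match[:indexed].empty?,
    multi_valued: !match[:multi].empty?
  }
end

p describe_suffix('scopecontent_html_tesm')
```

Under these assumptions, a name ending in `_tesm` decodes as English text that is stored and multi-valued but not indexed, which matches the description of `_html_tesm` above.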
The full list of dynamic field suffixes configured for use by ArcLight is in schema.xml, e.g.:
<dynamicField name="*_tesm" type="text_en" stored="true" indexed="false" multiValued="true" />
Most fields that ArcLight captures in the indexing process will factor into search retrieval. There is a catch-all "text" field that is searched in a default All Fields search, and almost every field has its contents copied into it during indexing via copyField rules in schema.xml, e.g.:
<copyField source="scopecontent_tesim" dest="text" />
In solrconfig.xml, see the default SearchHandler configuration.
<requestHandler name="search" class="solr.SearchHandler" default="true">
This spreadsheet also provides an overview of Solr fields and their EAD sources. Updated April 2, 2024.
ArcLight's <str name="qf"> and <str name="pf"> sections in solrconfig.xml list the fields that are searched by default. Note that these include the catch-all text, but also list others, either because they are not copied into text or because they warrant additional boosting for relevance ranking.
Relative boosts are applied per field using ^, e.g., title_tesim^100 means a hit in the title field is 100x more important in scoring than a hit in the catch-all text field.
There are larger boosts applied in the Phrase Fields (pf) configuration. This helps ensure that for multi-term queries the scoring boost will be higher when the terms appear in close proximity to each other in the document.
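To make the `field^boost` syntax concrete, here is a small sketch that splits a qf-style string into fields and their boosts. `parse_boosts` is a hypothetical helper written for illustration, not something ArcLight or Blacklight provides, and the sample string is made up rather than copied from ArcLight's solrconfig.xml. A field listed without `^` is given a boost of 1, which is Solr's default.

```ruby
# Hypothetical helper: turns a whitespace-separated qf/pf string into { field => boost }.
def parse_boosts(field_list)
  field_list.split.to_h do |entry|
    field, boost = entry.split('^', 2)
    # No explicit boost means Solr's default of 1.
    [field, boost ? Float(boost) : 1.0]
  end
end

boosts = parse_boosts('title_tesim^100 text')
boosts.each { |field, boost| puts "#{field}: #{boost}" }
```

Comparing the values for two fields gives the relative weight described above: a ratio of 100 to 1 between the title field and the catch-all field in this made-up string.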
ArcLight includes a basic config for Solr's mm (minimum should match) parameter in solrconfig.xml that indicates how many terms in a multi-term query need to match in a document to return it as a hit. This can be revised as needed for local implementations. For example:
<str name="mm">4<90%</str>
<!-- 1-4 term query, all must match
5+ term query, 90% must match; rounded down, so e.g.:
* 5-10 term query, all but one must match
* 11-20 term query, all but two must match
* 21-29 term query, all but three must match, etc.
-->
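The arithmetic in the comment above can be checked with a short sketch. `min_should_match` is a hypothetical helper that mirrors only the single `4<90%` rule shown here (it is not a general parser for Solr's mm syntax), and it assumes the percentage is rounded down, as the comment states.

```ruby
# Hypothetical helper mirroring mm="4<90%":
# up to `threshold` terms, all must match; above that, `percent` of them, rounded down.
def min_should_match(term_count, threshold: 4, percent: 90)
  return term_count if term_count <= threshold

  (term_count * percent) / 100 # integer division rounds down
end

[3, 5, 10, 11, 20, 21].each do |terms|
  required = min_should_match(terms)
  puts "#{terms} terms: #{required} must match (#{terms - required} may be missing)"
end
```

Changing `threshold` or `percent` in this sketch is a quick way to preview how a revised mm value would behave before editing solrconfig.xml.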
ArcLight's search box includes a dropdown to search within a particular field. The list of fields is set via config in the CatalogController, e.g., for searching by Place:
config.add_search_field 'place', label: 'Place' do |field|
  field.qt = 'search'
  field.solr_parameters = {
    qf: '${qf_place}',
    pf: '${pf_place}'
  }
end
The Solr fields included in Place search, and their relative boosts, are set in solrconfig.xml, e.g.:
<str name="qf_place">
place_teim
</str>
<str name="pf_place">
place_teim^2
</str>
Here again, there is a higher boost (^) applied for matches in the Phrase Field (pf). In a multi-term query, the proximity of those terms within the matching documents impacts the relevance score.
For each EAD file indexed, you'll have one Solr doc that encodes all the collection-level description (from the <archdesc> level), then one Solr doc for each individual component therein (<c>, <c01>, <c02>, etc.), with the corresponding component-level description encoded. When you search ArcLight using the default All Results view, you're searching all component and collection "documents" together -- they are all interleaved. You will see a linked breadcrumb trail in each result to help give a sense of the component's context within its collection.
One aspect of ArcLight that is distinct from other Blacklight apps is that search results can be grouped by collection.
Perhaps unintuitively, a top-level collection record is part of a collection group just like a component is. This means that the collection document itself can and often will appear as a matching record within the collection group. If it didn't work that way, you'd have no way to 1) show highlighted keyword-in-context snippets for query matches in the collection-level description; 2) have the collection-level description weigh heavily in relevance rank for a group.
On a search results page, the matching collection groups appear in relevance order, but there's more to it than meets the eye. They appear in order of their highest-scoring document for the query (remember, that document might be the top-level collection description itself or it might be an individual component from within the collection). E.g.:
Group: Collection A (note: this does not have a "score")
  Collection A doc (score 100)
  Component A1 doc (score 3)
  Component A2 doc (score 2)
Group: Collection B (note: this does not have a "score")
  Component B5 doc (score 95)
  Component B2 doc (score 80)
  Collection B doc (score 75)
Note that the number of components in a collection that match the query has no impact on the relevance ranking whatsoever. One highly relevant component in a not-very-relevant collection will make that collection group beat out a relevant collection that includes thousands of moderately-relevant components.
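The ordering rule described above can be illustrated with a small sketch. This is not how ArcLight implements grouping (Solr does that work on the server); `Doc` and `rank_groups` are hypothetical names, and the scores are the made-up values from the example. The sketch only shows that a group's position depends on its single best-scoring document, not on how many documents it contains.

```ruby
# Hypothetical stand-in for a matching Solr document.
Doc = Struct.new(:id, :collection, :score)

# Orders groups by their highest-scoring member; within a group, docs are sorted by score.
def rank_groups(docs)
  docs.group_by(&:collection)
      .map { |collection, members| [collection, members.sort_by { |doc| -doc.score }] }
      .sort_by { |_collection, members| -members.first.score }
end

docs = [
  Doc.new('Component B5', 'Collection B', 95),
  Doc.new('Component A1', 'Collection A', 3),
  Doc.new('Collection B', 'Collection B', 75),
  Doc.new('Collection A', 'Collection A', 100),
  Doc.new('Component B2', 'Collection B', 80),
  Doc.new('Component A2', 'Collection A', 2)
]

rank_groups(docs).each do |collection, members|
  puts "Group: #{collection}"
  members.each { |doc| puts "  #{doc.id} (score #{doc.score})" }
end
```

Collection A sorts first here even though Collection B has more high-scoring documents, because only the top score in each group is compared.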
All Blacklight 8 applications (including ArcLight) have an out-of-the-box Advanced Search page, reachable at path /catalog/advanced. See Blacklight docs on Advanced Search. Note that several versions of Solr 7-9 include a bug that requires a workaround for this to function as intended.
For more documentation about configuring Solr in any Blacklight application, consult the Blacklight wiki.
solrconfig.xml is mostly used for configuring the search relevancy and boosting settings.
schema.xml is mostly used for configuring how a document field should get parsed and analyzed.
Solr offers several options for modifying document scores so that the most relevant results appear higher in the results list:
- Query time
  - Query-time boosting with the DisMax query parser:
    - Using fields: The DisMax query parser (get more info) creates a query that is executed against several fields. Its “qf” (query fields) parameter both specifies the fields on which to execute the user query and lets you treat some fields as more important than others.
    - Using a phrase
    - Using a query: A boost query influences the score of the results. Example: bq=bookmarked:true boosts the documents that are bookmarked, regardless of the user query.
    - Using functions: Similar to using queries.
    - Using the tie breaker: example from the PUL catalog.
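As a rough sketch of how the boosting options above combine in a single request, the following builds a query string using DisMax parameters (`defType`, `q`, `qf`, `pf`, `bq`, `tie`). The helper name `dismax_query_string`, the field names, the boost values, and the tie value are all illustrative assumptions; in a Blacklight application these parameters would normally be set in solrconfig.xml or the CatalogController rather than assembled by hand.

```ruby
require 'uri'

# Hypothetical helper: assembles DisMax request parameters into a URL query string.
# Every value below except the user query is an illustrative assumption.
def dismax_query_string(user_query)
  params = {
    defType: 'dismax',          # use the DisMax query parser
    q: user_query,              # the user's search terms
    qf: 'title_tesim^100 text', # query fields with relative boosts
    pf: 'title_tesim^200',      # phrase fields: reward terms appearing close together
    bq: 'bookmarked:true',      # boost query applied regardless of the user query
    tie: '0.1'                  # tie breaker between per-field scores
  }
  URI.encode_www_form(params)
end

puts dismax_query_string('civil war letters')
```

Printing the query string this way can be handy for comparing against what appears in the Solr logs when tuning relevancy.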
Examples of Queries Submitted to the DisMax Query Parser
For more control over how a search term gets processed by Solr, look at the documentation for Understanding Analyzers, Tokenizers and Filters. Solr comes packaged with sample field configurations to handle searches in multiple languages.
Suggestions
Solr relevancy tests, rspec-solr, pul_solr.
Adding extra fields to boost exact matches: left-anchor fields, exact title fields.
Exclude some words from stemming - protwords.txt.
Have a version of the field that does language analysis and one that leaves it unchanged, example from PUL catalog.