🚧 Deprecated in favor of solr-ocrhighlighting
Efficient indexing and bounding-box "highlighting" for OCR text
- Store OCR bounding box information and token position directly in the Solr index in a space-efficient manner
- Retrieve bounding box and token position directly in your Solr query results, no additional parsing necessary
Indexing:
The OCR information is appended after each token as a concatenated list of <key>:<val> pairs; see further down for a detailed description of the available keys.
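Assembling such a field value can be sketched in a few lines of Python. This is only an illustration: the `annotate` helper and the shape of the `words` list are hypothetical, and the sketch assumes the default `|` delimiter, relative coordinates given as percentages, and tokens that contain neither whitespace nor the delimiter.

```python
import json

def annotate(word, page, line, index, box, delimiter="|"):
    """Return one token followed by its OCR information as key:value pairs."""
    x, y, w, h = box  # relative coordinates as percentages (0-100)
    return f"{word}{delimiter}p:{page},l:{line},n:{index},x:{x},y:{y},w:{w},h:{h}"

# Hypothetical OCR output: (text, page, line, word index, (x, y, w, h))
words = [
    ("this", 13, 5, 6, (11.1, 22.2, 33.3, 44.4)),
    ("is", 13, 5, 7, (22.2, 33.3, 44.4, 55.5)),
]

# Tokens are separated by whitespace, matching a whitespace tokenizer.
ocr_text = " ".join(annotate(*word) for word in words)

# JSON body for an update request against the core.
body = json.dumps([{"id": "test_document", "ocr_text": ocr_text}])
```

The resulting string is what goes into the OCR field of an update request.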
POST /solr/mycore/update
[{ "id": "test_document",
"ocr_text": "this|p:13,l:5,n:6,x:11.1,y:22.2,w:33.3,h:44.4 is|p:13,l:5,n:7,x:22.2,y:33.3,w:44.4,h:55.5 a|p:13,l:5,n:8,x:33.3,y:33.3,w:44.4,h:55.5 test|p:13,l:5,n:9,x:44.4,y:33.3,w:44.4,h:55.5" }]
Querying:
The plugin adds a new top-level key (ocr_highlight in this case) that contains the OCR information for each matching token as a structured object.
GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&wt=json&q=test
{
  "responseHeader": "...",
  "response": {
    "numFound": 1,
    "docs": [{"id": "test_document"}]
  },
  "ocr_highlight": {
    "test_document": {
      "ocr_text": [{
        "term": "test",
        "page": 13,
        "line": 5,
        "word": 9,
        "x": 0.444,
        "y": 0.333,
        "width": 0.444,
        "height": 0.555}]
    }
  }
}
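On the client side, the returned boxes can be scaled to the dimensions of the page image in order to draw the highlights. The following sketch assumes that the values in the response are fractions of the page size between 0 and 1, as in the example above; `to_pixels` and the image dimensions are hypothetical.

```python
def to_pixels(box, page_width, page_height):
    """Scale a relative highlight box (values in 0..1) to integer pixels."""
    return {
        "x": round(box["x"] * page_width),
        "y": round(box["y"] * page_height),
        "width": round(box["width"] * page_width),
        "height": round(box["height"] * page_height),
    }

# One entry from the "ocr_highlight" part of the response shown above.
match = {"term": "test", "page": 13, "x": 0.444, "y": 0.333,
         "width": 0.444, "height": 0.555}

# Assumed size of the page image in pixels.
rect = to_pixels(match, page_width=1000, page_height=2000)
```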
At the Bavarian State Library, we try to provide full-text search over all of our OCRed content. In addition to obtaining matching documents, the user should also get a small snippet of the corresponding part of the page image, with the matching words highlighted, similar to what e.g. Google Books provides.
For this to work, we need some way of mapping matching tokens to their corresponding location in the underlying OCR text. A common approach used by a number of libraries is to use a secondary microservice for this that takes as input a document identifier and a text snippet and will return all coordinates of matching text snippets on the page. While this approach generally works okay, it has several drawbacks:
- Performance: Every snippet requires a query to the OCR service, which itself has to do a linear scan through the OCR document. For a result set of 100 snippets, for example, this results in 101 queries (the initial Solr query and 100 snippet queries). Of course this can be optimized by batching and by having a good index structure for the coordinate lookup, but it's still less than ideal.
- Storage: To reliably map text matches to the base text, you have to store a copy of the full text in the index, alongside the regular index. This blows up the index size significantly. Forgoing the stored text and using only the normalized terms from the index for matching will break the mapping to the OCR, since, depending on the analyzer configuration, Lucene will perform stemming, etc.
Alternatively, you could store the coordinates directly as strings in the index. This works by, for example, indexing each token as <token>|<coordinates> and telling Lucene to ignore everything after the pipe during analysis. As the full text of the document is stored, you will get back a series of these annotated tokens as query results and can then parse the coordinates from your highlighting information. This solves the Performance part of the above approach, but worsens the Storage problem: for every token, we now have to store not only the token itself, but an expensive coordinate string as well.
This plugin uses a similar approach to the above, but solves the Storage problem by using an efficient binary format to store the OCR coordinate information in the index: We use bit-packing to combine a number of OCR coordinate parameters into a byte payload, which is not stored in the field itself, but as an associated Lucene Payload:
- x, y, w, h: Coordinates of the bounding box on the page as either:
  - absolute unsigned integer offsets between 0 and 2^coordinateBits (see below)
  - relative floating point percentages between 0 and 100 (e.g. x:42.3 for a horizontal offset of 42.3%)
- pageIndex: Unsigned integer that stores the page index of a token (optional)
- lineIndex: Unsigned integer that stores the line index of a token (optional)
- wordIndex: Unsigned integer that stores the word index of a token (optional)
For each of these values, you can configure the number of bits the plugin should use to store them, or disable certain parameters entirely. This allows you to fine-tune the settings to your needs. In our case, for example, we use these values: 4 * 12 bits (coordinates) + 9 bits (word index) + 11 bits (line index) + 12 bits (page index), resulting in an 80-bit (10-byte) payload per token. A comparable string representation p0l0n0x000y000w000h000 would take at least 22 bytes, so we save more than 50% for every token.
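The bit-packing idea can be illustrated with a short Python sketch. The field order, the big-endian layout and the names `FIELDS`, `pack` and `unpack` are assumptions made for illustration; the plugin's actual binary format may differ.

```python
# (name, number of bits) for each value, using the bit widths quoted above.
# The order of the fields is an assumption.
FIELDS = [("page", 12), ("line", 11), ("word", 9),
          ("x", 12), ("y", 12), ("w", 12), ("h", 12)]
TOTAL_BITS = sum(width for _, width in FIELDS)  # 80

def pack(values):
    """Pack unsigned integers into a compact byte string."""
    bits = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit into {width} bits")
        bits = (bits << width) | value
    return bits.to_bytes((TOTAL_BITS + 7) // 8, "big")

def unpack(payload):
    """Reverse of pack(): read the fields back from the byte string."""
    bits = int.from_bytes(payload, "big")
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = bits & ((1 << width) - 1)
        bits >>= width
    return values

token_info = {"page": 13, "line": 12, "word": 30,
              "x": 512, "y": 1024, "w": 3192, "h": 256}
payload = pack(token_info)
```

With these widths the payload occupies 10 bytes, and `unpack(payload)` returns the original values.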
At query time, we then retrieve the payload for each matching token and put the decoded information into the ocr_highlight result key, which can be used directly without any additional parsing.
Download the latest release from GitHub and put the JAR into your $SOLR_HOME/$SOLR_CORE/lib/ directory.
To use it, first add the DelimitedOcrInfoPayloadTokenFilterFactory filter to your analyzer chain (e.g. for a text_ocr field type):
<fieldtype name="text_ocr" class="solr.TextField" omitTermFreqAndPositions="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="de.digitalcollections.lucene.analysis.util.DelimitedOcrInfoPayloadTokenFilterFactory"
            delimiter="☞" absoluteCoordinates="false" coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
The filter takes the following parameters:
- delimiter: Character used for delimiting the payload from the token in the input document (default: |)
- absoluteCoordinates: true or false, to configure whether the stored coordinates are absolute
- coordinateBits: Number of bits to use for encoding OCR coordinates in the index (mandatory). A value of 10 (default) is recommended, resulting in a coordinate precision of approximately two decimal places.
- wordBits: Number of bits to use for encoding the word index. Set to 0 (default) to disable storage of the word index.
- lineBits: Number of bits to use for encoding the line index. Set to 0 (default) to disable storage of the line index.
- pageBits: Number of bits to use for encoding the page index. Set to 0 (default) to disable storage of the page index.
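As a rough illustration of how the number of coordinate bits relates to precision for relative coordinates, the sketch below applies plain uniform quantization. This is an assumption about how such an encoding could work, not a description of the plugin's encoder; `quantize`, `dequantize` and `step` are hypothetical helpers.

```python
def quantize(percent, bits):
    """Map a percentage in [0, 100] to an unsigned integer with `bits` bits."""
    levels = (1 << bits) - 1
    return round(percent / 100 * levels)

def dequantize(value, bits):
    """Map the stored integer back to a percentage."""
    levels = (1 << bits) - 1
    return value / levels * 100

def step(bits):
    """Distance between two neighbouring representable percentages."""
    return 100 / ((1 << bits) - 1)

stored = quantize(50.7, bits=10)
restored = dequantize(stored, bits=10)
error = abs(restored - 50.7)  # at most half a step under this scheme
```

Under this scheme, 10 bits give a step of slightly less than 0.1 percentage points, and each additional bit roughly halves it.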
The filter expects an input payload after the configured delimiter in the input stream, with the payload being a pseudo-JSON structure (e.g. k1:1,k2:3) with the following keys:
- p: Page index (if pageBits > 0)
- l: Line index (if lineBits > 0)
- n: Word index (if wordBits > 0)
- x, y, w, h: Coordinates of the OCR box as floating point percentages, or as integers (if absoluteCoordinates is true)
As an example, consider the token foobar with an OCR box of (0.50712, 0.31432, 0.87148, 0.05089) (i.e. with absoluteCoordinates="false"), the configured delimiter ☞ and storage of indices for the word (30), line (12) and page (13):
foobar☞p:13,l:12,n:30,x:50.7,y:31.4,w:87.1,h:5.1
Alternatively, with absoluteCoordinates="true", an OCR box of (512, 1024, 3192, 256) and otherwise the same settings:
foobar☞p:13,l:12,n:30,x:512,y:1024,w:3192,h:256
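Before indexing, it can be useful to check that generated tokens follow this format. The parser below is a minimal sketch for that purpose and is not part of the plugin; `parse_token` is a hypothetical helper that treats values containing a decimal point as floats and all other values as integers.

```python
def parse_token(annotated, delimiter="☞"):
    """Split '<token><delimiter><k>:<v>,...' into the token and a dict of values."""
    token, found, payload = annotated.partition(delimiter)
    if not found:
        raise ValueError(f"no delimiter {delimiter!r} in {annotated!r}")
    info = {}
    for pair in payload.split(","):
        key, _, value = pair.partition(":")
        # Percentages carry a decimal point, indices and absolute offsets do not.
        info[key] = float(value) if "." in value else int(value)
    return token, info

relative = parse_token("foobar☞p:13,l:12,n:30,x:50.7,y:31.4,w:87.1,h:5.1")
absolute = parse_token("foobar☞p:13,l:12,n:30,x:512,y:1024,w:3192,h:256")
```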
Finally, you just have to configure your schema to use the field type defined above. Storing the content is not recommended, since it significantly increases the index size and is not used at all for querying and highlighting:
<field name="ocr_text" type="text_ocr" indexed="true" stored="false" />
To enable highlighting using the OCR payloads, add the OcrHighlighting component to your Solr configuration and configure it with the same absoluteCoordinates, coordinateBits, wordBits, lineBits and pageBits values that were used for the filter in the analyzer chain:
<config>
  <searchComponent name="ocr_highlight"
                   class="de.digitalcollections.solr.plugin.components.ocrhighlighting.OcrHighlighting"
                   absoluteCoordinates="false" coordinateBits="10" wordBits="0" lineBits="0" pageBits="12" />
  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <arr name="last-components">
      <str>ocr_highlight</str>
    </arr>
  </requestHandler>
</config>
Now, at query time, you can set the ocr_hl=true parameter, specify the fields you want highlighted via ocr_hl.fields=myfield,myotherfield, and retrieve highlighted matches with their OCR coordinates:
GET /solr/mycore/select?ocr_hl=true&ocr_hl.fields=ocr_text&indent=true&q=augsburg&wt=json
{
  "responseHeader":{
    "status":0,
    "QTime":158},
  "response":{"numFound":526,"start":0,"docs":[
      {
        "id":"bsb10502835"},
      {
        "id":"bsb11032147"},
      {
        "id":"bsb10485243"},
      ...]
  },
  "ocr_highlight":{
    "bsb10502835":{
      "ocr_text":[{
          "page":7,
          "position":9,
          "term":"augsburg",
          "x":0.111,
          "y":0.062,
          "width":0.075,
          "height":0.013},
        {
          "page":7,
          "position":264,
          "term":"augsburg",
          "x":0.320,
          "y":0.670,
          "width":0.099,
          "height":0.012},
        ...]},
    ...
  }
}
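A client typically builds the query URL and then groups the returned boxes by page, so that all highlights for one page image can be drawn together. The sketch below uses only the Python standard library; the base URL, `build_query_url` and `boxes_by_page` are hypothetical, and the response structure is assumed to match the example above.

```python
from collections import defaultdict
from urllib.parse import urlencode

def build_query_url(base, query, fields):
    """Build a select URL with OCR highlighting enabled for the given fields."""
    params = {
        "q": query,
        "ocr_hl": "true",
        "ocr_hl.fields": ",".join(fields),
        "wt": "json",
    }
    return f"{base}/select?{urlencode(params)}"

def boxes_by_page(doc_highlights, field):
    """Group the highlight boxes of one document by their page number."""
    pages = defaultdict(list)
    for box in doc_highlights.get(field, []):
        pages[box["page"]].append(box)
    return dict(pages)

# Assumed location of the Solr core.
url = build_query_url("http://localhost:8983/solr/mycore", "augsburg", ["ocr_text"])

# The "ocr_highlight" entry for one document, taken from the response above.
doc_highlights = {"ocr_text": [
    {"page": 7, "position": 9, "term": "augsburg",
     "x": 0.111, "y": 0.062, "width": 0.075, "height": 0.013},
    {"page": 7, "position": 264, "term": "augsburg",
     "x": 0.320, "y": 0.670, "width": 0.099, "height": 0.012},
]}
grouped = boxes_by_page(doc_highlights, "ocr_text")
```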
- How does highlighting work with phrase queries?
  You will receive a bounding box object for every individual matching term in the phrase.
- What are the performance and storage implications of using this plugin?
  Performance: With an Intel Xeon [email protected] on a single core, we measured (with JMH):
  - Encoding the payload: 1,484,443.200 payloads/second or ~14.2 MiB/s with an 80-bit payload
  - Decoding the payload: 1,593,036.372 payloads/second or ~15.2 MiB/s with an 80-bit payload
  Storage: This depends on your configuration. With our sample configuration of an 80-bit payload (see above), the payload overhead is 10 bytes per token. That is, for a corpus size of 10 million tokens, you will need approximately 95 MiB to store the payloads. The actual storage required might be lower, since Lucene compresses the payloads with LZ4.
- Does this work with SolrCloud?
  It does! We're running it with SolrCloud ourselves.
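The storage estimate from the FAQ can be reproduced with a few lines of Python. The helper names are hypothetical, rounding the payload up to whole bytes is an assumption, and the result is an upper bound that ignores any compression.

```python
def payload_bytes(coordinate_bits, word_bits, line_bits, page_bits):
    """Payload size per token: four coordinates plus the optional indices."""
    total_bits = 4 * coordinate_bits + word_bits + line_bits + page_bits
    return (total_bits + 7) // 8  # assumed: rounded up to whole bytes

def corpus_overhead_mib(tokens, bytes_per_token):
    """Uncompressed payload storage for a corpus, in MiB."""
    return tokens * bytes_per_token / 2**20

per_token = payload_bytes(coordinate_bits=12, word_bits=9, line_bits=11, page_bits=12)
overhead = corpus_overhead_mib(10_000_000, per_token)
```

For the 80-bit configuration this gives 10 bytes per token and roughly 95 MiB for 10 million tokens, in line with the figures above.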