How best to extend indexing for different languages and scripts on AndBible #3273

MarkLee196 · 2024-06-18T01:56:17Z

Texts in different languages and scripts sometimes need different handling, whether for rendering or for indexing, and the open source nature of AndBible means that it is better placed than many apps to provide this. AndBible already has significant capacity for rendering less well supported scripts, with features such as customizable fonts. For indexing, however, the existing documentation on Lucene indexing is limited, so it is not easy to say what is already possible, and features such as customizable Lucene indexes are not yet implemented. Thoughts and ideas on this matter are invited below.

MarkLee196 · 2024-06-18T02:14:52Z

I have confirmed by testing that changing the language in the configuration file can in some cases change the analyzer used by Lucene to build the index. Though in most cases the same analyzer seemed to be used, the configuration "LANG=zh-Hans" (Chinese) produced a notably different index. It seems that Lucene was using the 'zh' (Zhongwen, i.e. Chinese) part of the language code to choose which analyzer to use, rather than the script part 'Hans' (Hanzi Simplified, i.e. simplified CJKV characters). It would be useful for module producers to know which values for LANG call specific analyzers.
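
To illustrate the behaviour described above, here is a minimal Java sketch of how a LANG value such as "zh-Hans" splits into a language part and a script part, and how a lookup keyed only on the language part would ignore the script. This is not JSword's or Lucene's actual selection code: the table and the analyzer labels in it are hypothetical placeholders, and only `java.util.Locale` is a real API.

```java
import java.util.Locale;
import java.util.Map;

final class LangTagSketch {
    // Hypothetical table: the labels are placeholders, not real JSword or Lucene class names.
    private static final Map<String, String> ANALYZER_BY_LANGUAGE = Map.of(
        "zh", "chinese-analyzer",
        "en", "english-analyzer");

    // Language subtag of a LANG value, e.g. "zh" for "zh-Hans".
    static String languagePart(String langValue) {
        return Locale.forLanguageTag(langValue).getLanguage();
    }

    // Script subtag of a LANG value, e.g. "Hans" for "zh-Hans"; empty if none is given.
    static String scriptPart(String langValue) {
        return Locale.forLanguageTag(langValue).getScript();
    }

    // Assumed selection rule: only the language subtag is consulted, the script is ignored.
    static String chooseAnalyzer(String langValue) {
        return ANALYZER_BY_LANGUAGE.getOrDefault(languagePart(langValue), "default-analyzer");
    }

    public static void main(String[] args) {
        for (String lang : new String[] {"zh-Hans", "zh-Hant", "en", "grc"}) {
            System.out.println(lang + " -> language=" + languagePart(lang)
                + ", script=" + scriptPart(lang)
                + ", analyzer=" + chooseAnalyzer(lang));
        }
    }
}
```

Under this assumed rule, "zh-Hans" and "zh-Hant" end up with the same analyzer, and any language without an entry falls back to a default, which would match the observation that most LANG values seem to produce the same kind of index.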

MarkLee196 · 2024-06-19T04:45:41Z

There are at least 3 ways that it would be possible to implement customized indexing: (1) by implementing the use of custom Lucene indexes for modules, (2) by the use of custom Lucene analyzers, see for example here, and (3) by implementing the use of non-Lucene searches (this is what diatheke, the command line frontend to SWORD, has as an option). Of course, implementing more than one way to customize indexes and searching would be better than just one.
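
As a rough idea of what option (3) could look like, here is a minimal Java sketch of a non-Lucene search that scans the text directly for a phrase instead of consulting an index. It is not diatheke's implementation: the class name, the method name and the verse map are hypothetical, and a real frontend would read the verses from a module rather than from a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class LinearScanSearch {
    // Returns the keys of all verses whose text contains the phrase exactly.
    // No analyzer or tokenizer is involved, so nothing is stripped from the text.
    static List<String> search(Map<String, String> verses, String phrase) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : verses.entrySet()) {
            if (entry.getValue().contains(phrase)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

A scan like this is slower than an index lookup on large modules, but it compares the stored text with the query as written, so it does not depend on how any analyzer treats a particular script.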

MarkLee196 · 2024-06-20T08:20:15Z

At present, the issue with Lucene indexes on AndBible that I have not yet been able to solve by using different values for LANG in the configuration files is the handling of characters at U+10000 and above, which are stored as surrogate pairs in UTF-16. It seems that the analyzers are unable to deal with these properly and remove part of the string. In some cases, searching for a string that includes such characters does not match, because what is stored in the index is only partial; in other cases, a search string that includes such characters produces extra false matches. This suggests that the solution is for the analyzer to use an appropriate tokenizer.
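
The kind of failure described above can be reproduced outside Lucene with a minimal Java sketch. It is an illustration only, on the assumption that the problem comes from text being split per UTF-16 code unit instead of per code point; it is not the code of any Lucene tokenizer, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointTokens {
    // Splits on UTF-16 code units: a character at U+10000 or above
    // is cut into two unpaired surrogate halves.
    static List<String> byCodeUnit(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Splits on code points: a surrogate pair stays together as one token.
    static List<String> byCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            tokens.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+5929 followed by U+20000 (CJK Extension B), written as a surrogate pair.
        String text = "\u5929\uD840\uDC00";
        System.out.println("code units:  " + byCodeUnit(text).size());
        System.out.println("code points: " + byCodePoint(text).size());
    }
}
```

If the halves produced by the first method are later dropped or indexed separately, the index holds only part of the original string, which would fit both the missed matches and the false matches. A tokenizer that walks the text by code point, as in the second method, avoids splitting the pair.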

tuomas2 · 2024-08-10T07:07:22Z

Maybe upgrading Lucene (AndBible/jsword#16) could help?

MarkLee196 added the Type: Discussion label Jun 18, 2024

github-project-automation bot added this to Tuomas' project board Jun 18, 2024

github-project-automation bot moved this to Needs triage in Tuomas' project board Jun 18, 2024

tuomas2 added this to Discussions Jun 18, 2024

github-project-automation bot moved this to Ongoing in Discussions Jun 18, 2024

tuomas2 moved this from Needs triage to Prio 1 in Tuomas' project board Aug 10, 2024
