How best to extend indexing for different languages and scripts on AndBible #3273

Open
MarkLee196 opened this issue Jun 18, 2024 · 4 comments

Comments

@MarkLee196

Texts in different languages and scripts sometimes need different handling, whether for rendering or for indexing, and because AndBible is open source it is better placed than many apps to provide this. AndBible already has significant capacity for rendering less well supported scripts, with features such as customizable fonts. For indexing, however, the existing documentation on Lucene indexing is limited, so it is not easy to say what is already possible, and features such as customizable Lucene indexes are not yet implemented. Thoughts and ideas on this matter are invited below.

@MarkLee196
Author

I have confirmed by testing that changing the language in the configuration file can in some cases change the analyzer that Lucene uses to build the index. In most cases the same analyzer seemed to be used, but the configuration "LANG=zh-Hans" (Chinese) produced a notably different index. It seems that Lucene was using the 'zh' (Zhongwen, i.e. Chinese) part of the language code to choose the analyzer, rather than the script part 'Hans' (Hanzi Simplified, i.e. simplified CJKV characters). It would be useful for module producers to know which values for LANG select specific analyzers.
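
To illustrate the behaviour described above, here is a minimal Java sketch of how a LANG value such as "zh-Hans" splits into a language part and a script part, and how a lookup keyed only on the language part would ignore the script. It uses the standard `java.util.Locale` class. The class `LangTagDemo`, the map, and the analyzer labels are hypothetical placeholders for illustration; they are not taken from JSword or Lucene, and the real selection logic may differ.

```java
import java.util.Locale;
import java.util.Map;

final class LangTagDemo {
    // Hypothetical table: primary language subtag -> analyzer label.
    // The labels are placeholders, not real JSword or Lucene class names.
    static final Map<String, String> ANALYZER_BY_LANGUAGE = Map.of(
            "zh", "chinese-analyzer",
            "en", "english-analyzer");

    // Returns the language part of a LANG value, e.g. "zh" for "zh-Hans".
    static String primaryLanguage(String langValue) {
        return Locale.forLanguageTag(langValue).getLanguage();
    }

    // Returns the script part of a LANG value, e.g. "Hans" for "zh-Hans",
    // or an empty string when no script subtag is present.
    static String script(String langValue) {
        return Locale.forLanguageTag(langValue).getScript();
    }

    // Chooses an analyzer label from the language part only, so the
    // script part has no influence on the result.
    static String chooseAnalyzer(String langValue) {
        return ANALYZER_BY_LANGUAGE.getOrDefault(
                primaryLanguage(langValue), "default-analyzer");
    }

    public static void main(String[] args) {
        for (String lang : new String[] {"zh-Hans", "zh-Hant", "en", "grc"}) {
            System.out.println(lang + " -> language=" + primaryLanguage(lang)
                    + ", script=" + script(lang)
                    + ", analyzer=" + chooseAnalyzer(lang));
        }
    }
}
```

Under this assumption, "zh-Hans" and "zh-Hant" would end up with the same analyzer, which matches the observation that only the 'zh' part appeared to matter.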

@MarkLee196
Author

There are at least three ways in which customized indexing could be implemented: (1) by implementing the use of custom Lucene indexes for modules; (2) by the use of custom Lucene analyzers, see for example here; (3) by implementing the use of non-Lucene searches (this is what diatheke, the command-line frontend to SWORD, has as an option). Of course, implementing more than one way to customize indexes and searching would be better than just one.
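
As a rough illustration of option (3), the following Java sketch searches without any index by scanning each verse for an exact substring. It is only a sketch under stated assumptions: the class `LinearSearch`, the `sample()` data, and the reference keys are hypothetical, and this is not how diatheke or AndBible actually implement searching.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LinearSearch {
    // Hypothetical sample data: reference -> verse text.
    // "\uD800\uDF30" is U+10330 and "\uD800\uDF32" is U+10332, both above U+FFFF.
    static Map<String, String> sample() {
        Map<String, String> verses = new LinkedHashMap<>();
        verses.put("Ref 1", "alpha \uD800\uDF30\uD800\uDF31 beta");
        verses.put("Ref 2", "alpha \uD800\uDF32 gamma");
        verses.put("Ref 3", "delta");
        return verses;
    }

    // Returns the references whose text contains the query exactly.
    // No analyzer or tokenizer is involved, so nothing is stripped from the text.
    static List<String> search(Map<String, String> verses, String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : verses.entrySet()) {
            if (entry.getValue().contains(query)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search(sample(), "alpha"));
        System.out.println(search(sample(), "\uD800\uDF30"));
    }
}
```

The trade-off is speed: a scan reads every verse for every query, whereas an index is built once and then queried quickly. In exchange, the result does not depend on how an analyzer treats a given script.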

@MarkLee196
Author

At present, the issue with Lucene indexes on AndBible that I have not yet been able to solve by using different values for LANG in the configuration files is the handling of surrogate characters, that is, characters at U+10000 and above, which UTF-16 encodes as surrogate pairs. It seems that the analyzers are unable to deal with these properly and remove part of the string. As a result, in some cases a search for a string that includes surrogate characters finds no match, because what is stored in the index is only partial, and in other cases a search string that includes surrogate characters produces extra false matches. This suggests that the solution is for the analyzer to use an appropriate tokenizer.
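
The kind of failure described above can be reproduced in plain Java, independent of Lucene. The sketch below compares a tokenizer that inspects one UTF-16 `char` at a time with one that inspects whole code points. The class `CodePointTokens` and both methods are hypothetical illustrations; this is not a claim about what the analyzers used by AndBible actually do internally, only one plausible way that characters at U+10000 and above can get dropped.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointTokens {
    // Tests each UTF-16 unit on its own. A surrogate half is never a
    // letter by itself, so characters above U+FFFF are dropped and the
    // surrounding word is split at that point.
    static List<String> tokenizeByChar(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Tests whole code points, so a surrogate pair is treated as the
    // single character it encodes and stays inside its word.
    static List<String> tokenizeByCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\uD800\uDF30" is U+10330, a letter outside the Basic Multilingual Plane.
        String text = "ab\uD800\uDF30cd ef";
        System.out.println(tokenizeByChar(text).size());      // 3 tokens
        System.out.println(tokenizeByCodePoint(text).size()); // 2 tokens
    }
}
```

With the char-based version, the word containing U+10330 is stored as two partial tokens, so a search for the full word finds nothing while a search for either fragment produces a false match. That is consistent with both symptoms reported above.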

@tuomas2
Contributor

tuomas2 commented Aug 10, 2024

Maybe upgrading Lucene (AndBible/jsword#16) could help?

@tuomas2 tuomas2 moved this from Needs triage to Prio 1 in Tuomas' project board Aug 10, 2024
Projects
Status: Ongoing
Status: Prio 1
Development

No branches or pull requests

2 participants