Update lucene to version 8.11.2 #16

tuomas2 · 2024-07-19T16:00:46Z

Replaces #15

This gave access to some new features in Lucene, such as Regular Expression search. This is a major refactor because I updated Lucene 5 major versions.

I tested several languages, English, Czech, Chinese, Japanese, Thai and search works in these languages. I am not capable to test if the stemming is good for all languages, so some more testing by native speakers is necessary.

…queries, search as before

I think "a" is not a stop word in this context, because it is a verb here. But my French is not that good.

I don't speak all of these languages, so I sometimes just changed the test to reflect the output. At least that should prevent regression.

tuomas2 · 2024-07-19T16:08:23Z

So summarizing @JJK96 , I would like that we try to:

Remove AbstractBookAnalyzer alltogether, and all custom analyzers that are based on that.
Use StopwordAnalyzer as a baseclass for our custom analyzers (KeyAnalyzer etc)
Modify properties file / factory accordingly to use classes from core and other libs.
Change filter classes (used by some analyzers like KeyAnalyzer) so that they do not store book (as it does not seem to be used)

(related to discussion started here: #15 (comment))

Also removed LuceneAnalyzer and moved it's functionality into AnalyzerFactory AnalyzerFactory now returns a real subclass of Analyzer, instead of a wrapper. For all languages, language-specific analyzers are used, instead of Snowball Analyzers

Removed EnglishAnalyzer test in AnalyzerFactoryTest

JJK96 · 2024-10-14T18:41:24Z

Remove AbstractBookAnalyzer alltogether, and all custom analyzers that are based on that.
Use StopwordAnalyzer as a baseclass for our custom analyzers (KeyAnalyzer etc)
- I used Analyzer as the base, since stopwording was not used by these classes.
Modify properties file / factory accordingly to use classes from core and other libs.
Change filter classes (used by some analyzers like KeyAnalyzer) so that they do not store book (as it does not seem to be used)

…ries would always search the whole bible.

Added check for index version when getting index status. This ensures that the status correctly represents if the index is invalid.

tuomas2

Some comments.

Will fix most of them myself in an upcoming commit.

tuomas2 · 2024-11-14T11:40:04Z

notes.md

@@ -0,0 +1,18 @@
+Functionality AbstractBookAnalyzer


this file should be deleted before merging, right?

tuomas2 · 2024-11-14T11:44:17Z

src/main/java/org/crosswire/jsword/index/Index.java

@@ -44,6 +44,7 @@ public interface Index {
     * @throws BookException 
     */
    Key find(String query) throws BookException;
+    Key find(String query, boolean full_text) throws BookException;


Suggested change

Key find(String query, boolean full_text) throws BookException;

Key find(String query, boolean fullText) throws BookException;

tuomas2 · 2024-11-14T12:06:38Z

src/main/java/org/crosswire/jsword/index/lucene/LuceneIndex.java

-        Field headingField = new Field(FIELD_HEADING, "", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO);
-        Field headingStemField = new Field(FIELD_HEADING_STEM, "", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
-        Field morphologyField  = new Field(FIELD_MORPHOLOGY , "", Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
+        FieldType stored_not_analyzed = new FieldType(StringField.TYPE_STORED);


tuomas2 · 2024-11-14T12:22:24Z

src/main/java/org/crosswire/jsword/index/lucene/analysis/AnalyzerFactory.java

+        return createAnalyzer(book, false);
+    }
+
+    public Analyzer createAnalyzer(Book book, Boolean stopwording) {


stopWording

tuomas2 · 2024-11-14T12:26:46Z

src/main/java/org/crosswire/jsword/index/lucene/analysis/AnalyzerFactory.java

+        analyzerPerField.put(LuceneIndex.FIELD_INTRO_STEM, analyzer);
+        analyzerPerField.put(LuceneIndex.FIELD_HEADING_STEM, analyzer);
+        //analyzerPerField.put(LuceneIndex.FIELD_HEADING, myNaturalLanguageAnalyzer);  //heading to use same analyzer as BODY
+        //analyzerPerField.put(LuceneIndex.FIELD_INTRO, myNaturalLanguageAnalyzer);


unnecessary comments?

tuomas2 · 2024-11-14T12:36:09Z

src/main/resources/IndexMetadata.properties

-Latest.Index.Version=1.2
-Lucene.Version=3.0
-
+Latest.Index.Version=1.3


Should add comment above (as there seem to be version history)

tuomas2 · 2024-11-14T12:46:00Z

I'll merge these both to develop and start preparing a beta release. Looks good so far, but haven't tested yet in practice.

JJK96 added 17 commits July 8, 2024 19:54

Compiles

8258128

Uncleaned version that supports regex searching

41a8b6d

For regex queries search in full non-canonical text, while for other …

fbeaac7

…queries, search as before

Add switch for regex search type

982ce80

Make Regex search case insensitive

4239e9c

Fix Thai analyzer

4c92c9c

Fix Hebrew analyser

a06ecda

Fix Arabic

c784ccc

Fix Persian

7c43cca

Remove local.properties

d7616bc

Fix analyzer references

02fa61f

Fix tests

54c73b6

Add local.properties to gitignore

a4f26c2

Add smartcn analyzer

c3933c7

Fix Chinese and Japanese

d26a312

Fix French stemmer test

f00f512

I think "a" is not a stop word in this context, because it is a verb here. But my French is not that good.

Fix all tests

f355696

I don't speak all of these languages, so I sometimes just changed the test to reflect the output. At least that should prevent regression.

tuomas2 changed the base branch from master to develop July 19, 2024 16:01

tuomas2 changed the base branch from develop to master July 19, 2024 16:02

tuomas2 mentioned this pull request Jul 19, 2024

Update lucene to version 8.11.2 #15

Closed

tuomas2 assigned JJK96 Jul 19, 2024

tuomas2 mentioned this pull request Aug 10, 2024

How best to extend indexing for different languages and scripts on AndBible AndBible/and-bible#3273

Open

JJK96 added 6 commits August 19, 2024 21:17

Removed AbstractBookAnalyzer

d830c48

Also removed LuceneAnalyzer and moved it's functionality into AnalyzerFactory AnalyzerFactory now returns a real subclass of Analyzer, instead of a wrapper. For all languages, language-specific analyzers are used, instead of Snowball Analyzers

All tests compiling, but not completely working yet

25ceeb8

Update test, stemming has been implemented now

c588094

Make stopwording optional but disabled by default

069667a

Removed EnglishAnalyzer test in AnalyzerFactoryTest

Make code cleaner

dd6c939

Restructured

4aaf655

JJK96 added 3 commits October 14, 2024 21:08

Remove print

89d6f45

Apply range query to regex queries as well. Fixes bug where regex que…

9e6da6d

…ries would always search the whole bible.

Invalidate old Lucene indices

c2a7da0

Added check for index version when getting index status. This ensures that the status correctly represents if the index is invalid.

tuomas2 commented Nov 14, 2024

View reviewed changes

Code review fixes

b2989d4

tuomas2 changed the base branch from master to develop November 14, 2024 12:45

tuomas2 merged commit 1445a75 into develop Nov 14, 2024
1 check passed

tuomas2 mentioned this pull request Jan 8, 2025

Is this project still maintained ? Else, where should contributions be sent ? crosswire/jsword#131

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update lucene to version 8.11.2 #16

Update lucene to version 8.11.2 #16

tuomas2 commented Jul 19, 2024

tuomas2 commented Jul 19, 2024 •

edited

Loading

JJK96 commented Oct 14, 2024 •

edited by tuomas2

Loading

tuomas2 left a comment

tuomas2 Nov 14, 2024

tuomas2 Nov 14, 2024

tuomas2 Nov 14, 2024

tuomas2 Nov 14, 2024

tuomas2 Nov 14, 2024

tuomas2 Nov 14, 2024

tuomas2 commented Nov 14, 2024

	Key find(String query, boolean full_text) throws BookException;
	Key find(String query, boolean fullText) throws BookException;

Update lucene to version 8.11.2 #16

Update lucene to version 8.11.2 #16

Conversation

tuomas2 commented Jul 19, 2024

tuomas2 commented Jul 19, 2024 • edited Loading

JJK96 commented Oct 14, 2024 • edited by tuomas2 Loading

tuomas2 left a comment

Choose a reason for hiding this comment

tuomas2 Nov 14, 2024

Choose a reason for hiding this comment

tuomas2 Nov 14, 2024

Choose a reason for hiding this comment

tuomas2 Nov 14, 2024

Choose a reason for hiding this comment

tuomas2 Nov 14, 2024

Choose a reason for hiding this comment

tuomas2 Nov 14, 2024

Choose a reason for hiding this comment

tuomas2 Nov 14, 2024

Choose a reason for hiding this comment

tuomas2 commented Nov 14, 2024

tuomas2 commented Jul 19, 2024 •

edited

Loading

JJK96 commented Oct 14, 2024 •

edited by tuomas2

Loading