
feat: Support more tokenizers/stemmers/filter #10

Open

VoVAllen opened this issue Dec 9, 2024 · 9 comments

VoVAllen (Member) commented Dec 9, 2024

Design an extensible syntax that lets users add tokenizers with different configurations.

-- Provide either config or index_name --
CREATE FUNCTION create_tokenizer(tokenizer_name text, table_name text, column_name text, config text);
CREATE FUNCTION tokenize(query text, tokenizer_name text)
RETURNS bm25vector;

SELECT create_tokenizer('document_standard', $$
tokenizer = 'standard'
table = "documents"
column = "text"
pretokenizer = 'standard'
[tokenizer.config]
stemmer = 'porter2'
[pretokenizer.config]
punctuation = 'removed'
whitespace = '\w+|[^\w\s]+'
$$);
SELECT tokenize('I''m a doctor', 'document_standard');


CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops) WITH (options = $$
tokenizer = 'standard'
pre_tokenizers =
$$);
SELECT tokenize('I''m a doctor', tokenizer_name => 'documents_standard');
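
For illustration only (not part of the proposal), a minimal Rust sketch of how such a config text could be deserialized with `serde` and the `toml` crate. The struct and field names are hypothetical, and the nested sections are renamed to `tokenizer_config` / `pretokenizer_config` here because in TOML a key cannot be both a scalar and a table:

```rust
use serde::Deserialize;

/// Hypothetical shape of the config text passed to create_tokenizer().
#[derive(Debug, Deserialize)]
struct TokenizerSpec {
    tokenizer: String,
    table: String,
    column: String,
    pretokenizer: Option<String>,
    tokenizer_config: Option<TokenizerConfig>,
    pretokenizer_config: Option<PreTokenizerConfig>,
}

#[derive(Debug, Deserialize)]
struct TokenizerConfig {
    stemmer: Option<String>,
}

#[derive(Debug, Deserialize)]
struct PreTokenizerConfig {
    punctuation: Option<String>,
    whitespace: Option<String>,
}

fn parse_spec(raw: &str) -> Result<TokenizerSpec, toml::de::Error> {
    // toml::from_str maps the TOML document onto the structs above.
    toml::from_str(raw)
}
```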

Reference:

    es_bm25_settings = {
        "settings": {
            "index": {
                "similarity": {
                    "default": {
                        "type": "BM25",
                        "k1": k1,
                        "b": b,
                    }
                }
            },
            "analysis": {
                "analyzer": {
                    "custom_analyzer": {
                        "type": "standard",
                        "max_token_length": 1_000_000,
                        "stopwords": "_english_",
                        "filter": [ "lowercase", "custom_snowball"]
                    }
                },
                "filter": {
                    "custom_snowball": {
                        "type": "snowball",
                        "language": "English"
                    }
                }
            }
        }
    }
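As a rough Rust counterpart to the ES `custom_analyzer` above (lowercase → stopword removal → snowball stemming), here is a minimal sketch assuming the `regex` and `tantivy_stemmers` crates; the token pattern and stopword set are placeholders, not an exact match for the ES "standard" analyzer:

```rust
use std::collections::HashSet;
use regex::Regex;

/// Minimal analysis chain mirroring the ES settings above:
/// rough pre-tokenization, lowercasing, stopword removal, snowball/porter2 stemming.
fn analyze(text: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    // Placeholder token pattern; the real ES "standard" tokenizer is more involved.
    let token_pattern = Regex::new(r"(?u)\b\w\w+\b").unwrap();
    let lower = text.to_lowercase();
    token_pattern
        .find_iter(&lower)
        .map(|m| m.as_str())
        .filter(|t| !stop_words.contains(t))
        .map(|t| tantivy_stemmers::algorithms::english_porter_2(t).to_string())
        .collect()
}
```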
kemingy self-assigned this Dec 10, 2024

kemingy (Member) commented Dec 13, 2024

  1. Does the tokenizer require a model (even for the word tokenizer)?

Yes. Without a pre-trained model on a large dataset, the NDCG drops a lot.

  2. Can we support this kind of config?

No, because the tokenizer model is tightly coupled with those configurations (pre_tokenizer, stemmer, stopwords, etc.). What we can support is:

  • a hardcoded process like what we do now:

    impl Tokenizer for BertWithStemmerAndSplit {
        fn encode(&self, text: &str) -> Vec<u32> {
            let mut results = Vec::new();
            let lower_text = text.to_lowercase();
            let split = TOKEN_PATTERN_RE.find_iter(&lower_text);
            for token in split {
                // Skip stopwords entirely.
                if STOP_WORDS.contains(token.as_str()) {
                    continue;
                }
                // Stem each token before feeding it to the BERT vocabulary.
                let stemmed_token =
                    tantivy_stemmers::algorithms::english_porter_2(token.as_str()).to_string();
                let encoding = self.0.encode_fast(stemmed_token, false).unwrap();
                results.extend_from_slice(encoding.get_ids());
            }
            results
        }
    }
  • dynamically loading user-trained Hugging Face tokenizer models (limited feature) with several hardcoded processes, mainly for stemming or other language-specific processing (see the sketch below)
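
A minimal sketch of that second option, assuming the `tokenizers` crate plus a hardcoded English stemming step; the file path, function name, and whitespace pre-tokenization are placeholders rather than the actual design:

```rust
use tokenizers::Tokenizer;

/// Load a user-trained Hugging Face tokenizer and wrap it with a
/// hardcoded stemming step (sketch only; error handling kept minimal).
fn encode_with_user_model(path: &str, text: &str) -> Vec<u32> {
    // Hypothetical path to a user-provided tokenizer.json.
    let tokenizer = Tokenizer::from_file(path).expect("failed to load tokenizer model");
    let mut ids = Vec::new();
    for token in text.to_lowercase().split_whitespace() {
        // Hardcoded language-specific processing: English porter2 stemming.
        let stemmed = tantivy_stemmers::algorithms::english_porter_2(token).to_string();
        let encoding = tokenizer.encode(stemmed.as_str(), false).expect("encode failed");
        ids.extend_from_slice(encoding.get_ids());
    }
    ids
}
```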

VoVAllen (Member, Author) commented
What I mean is to create a standard tokenizer like ES does: store all the tokens in a table, and when new documents are added, check whether each token already exists; if not, add it to the table.

What do you mean by model? Do you mean the statistical model for the BPE tokenizer?
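
For illustration, a minimal in-memory sketch of that incremental vocabulary idea; in the actual proposal the mapping would live in a Postgres table rather than a HashMap:

```rust
use std::collections::HashMap;

/// Incrementally growing vocabulary: unseen tokens get the next id.
struct Vocab {
    token_to_id: HashMap<String, u32>,
}

impl Vocab {
    fn new() -> Self {
        Self { token_to_id: HashMap::new() }
    }

    /// Return the id for a token, inserting it if it has not been seen before.
    fn get_or_insert(&mut self, token: &str) -> u32 {
        let next_id = self.token_to_id.len() as u32;
        *self.token_to_id.entry(token.to_string()).or_insert(next_id)
    }
}
```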

VoVAllen (Member, Author) commented

ES mentions Unicode segmentation (https://docs.rs/unicode-segmentation/latest/unicode_segmentation/). Does it help the results?
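
For reference, a minimal sketch of what the linked crate provides (word boundaries per UAX #29):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // unicode_words() yields the alphanumeric "words" between UAX #29 word
    // boundaries, dropping whitespace and punctuation-only segments.
    let words: Vec<&str> = "I'm a doctor".unicode_words().collect();
    println!("{:?}", words); // e.g. ["I'm", "a", "doctor"]
}
```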

kemingy (Member) commented Dec 13, 2024

Experiments

  • Current BERT uncased
  • WordLevel trained on wikitext-103-raw-v1 with r"(?u)\b\w\w+\b" and the English snowball stemmer
  • WordLevel trained on fiqa with r"(?u)\b\w\w+\b" and the English snowball stemmer
  • Tocken trained on wikitext-103-raw-v1 with Unicode segmentation and the snowball stemmer
  • Unicode is trained online on the dataset, so the indexing time is longer; otherwise it is similar to Tocken
  • Unicode(L) uses the Lucene stopwords, which is a smaller set than the NLTK stopwords
  • Unicode(C) uses customized stopwords based on Lucene plus a few from NLTK
  • Unicode(W) is weighted with both Lucene and NLTK stopwords

Tested with top-k=10.

| Tokenizer | Dataset | QPS | NDCG@10 |
| --- | --- | --- | --- |
| BERT uncased | fiqa | 455.22/s | 0.22669 |
| Word(wiki/30k) | fiqa | 847.28/s | 0.2026 |
| Word(wiki/100k) | fiqa | 881.75/s | 0.21836 |
| Word(wiki/500k) | fiqa | 890.11/s | 0.22807 |
| Word(fiqa/30k) | fiqa | 751.22/s | 0.14533 |
| Word(fiqa/100k) | fiqa | 780.91/s | 0.16659 |
| Tocken | fiqa | 346.58/s | 0.24268 |
| Unicode | fiqa | 905.15/s | 0.23496 |
| Unicode(L) | fiqa | 340.32/s | 0.25295 |
| Unicode(C) | fiqa | 387.12/s | 0.25213 |
| Unicode(W) | fiqa | 448.29/s | 0.25476 |
| ES | fiqa | 350.59/s | 0.25364 |
| - | - | - | - |
| BERT uncased | trec-covid | 96.19/s | 0.67545 |
| Word(wiki/30k) | trec-covid | 287.03/s | 0.57424 |
| Word(wiki/100k) | trec-covid | 292.57/s | 0.63196 |
| Word(wiki/500k) | trec-covid | 282.97/s | 0.64036 |
| Tocken | trec-covid | 155.86/s | 0.59249 |
| Unicode | trec-covid | 268.10/s | 0.67253 |
| Unicode(L) | trec-covid | 148.46/s | 0.61241 |
| Unicode(C) | trec-covid | 170.53/s | 0.63289 |
| Unicode(W) | trec-covid | 188.11/s | 0.6457 |
| ES | trec-covid | 127.36/s | 0.68803 |
| - | - | - | - |
| BERT uncased | webis-touche2020 | 178.83/s | 0.31151 |
| Word(wiki/100k) | webis-touche2020 | 414.55/s | 0.31562 |
| Word(wiki/500k) | webis-touche2020 | 448.72/s | 0.31418 |
| Tocken | webis-touche2020 | 279.20/s | 0.34596 |
| Unicode | webis-touche2020 | 439.86/s | 0.32139 |
| Unicode(L) | webis-touche2020 | 287.05/s | 0.34009 |
| Unicode(C) | webis-touche2020 | 338.40/s | 0.32646 |
| Unicode(W) | webis-touche2020 | 290.63/s | 0.33961 |
| ES | webis-touche2020 | 137.45/s | 0.34707 |
| - | - | - | - |
| BERT uncased | quora | 456.39/s | 0.80833 |
| Tocken | quora | 475.44/s | 0.80833 |
| Unicode(W) | quora | 479.13/s | 0.80833 |
| ES | quora | 706.51/s | 0.80776 |
| - | - | - | - |
| Unicode(W) | arguana | - | - |
| ES | arguana | 56.62/s | 0.47204 |

kemingy (Member) commented Dec 13, 2024

> What I mean is to create a standard tokenizer like ES does: store all the tokens in a table, and when new documents are added, check whether each token already exists; if not, add it to the table.
>
> What do you mean by model? Do you mean the statistical model for the BPE tokenizer?

I understand this method, but it's not suitable:

  1. usually we limit the vocabulary to about 30k tokens by frequency, which is not applicable with this method (see the sketch below)
  2. poor NDCG score: check the tests above for the tokenizer trained on the fiqa dataset
  3. poor performance, because you would need to sync the table every time you encounter new tokens
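
A minimal sketch of the frequency-based vocabulary cut-off mentioned in point 1, assuming token counts have already been collected; the names are illustrative:

```rust
use std::collections::HashMap;

/// Keep only the `limit` most frequent tokens (e.g. limit = 30_000) and assign ids.
/// A vocabulary that grows with every unseen token cannot be capped this way
/// without re-numbering ids, which is the point being made above.
fn build_vocab(counts: &HashMap<String, u64>, limit: usize) -> HashMap<String, u32> {
    let mut sorted: Vec<(&String, &u64)> = counts.iter().collect();
    // Most frequent first; break ties by token text for determinism.
    sorted.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
    sorted
        .into_iter()
        .take(limit)
        .enumerate()
        .map(|(id, (token, _))| (token.clone(), id as u32))
        .collect()
}
```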

kemingy (Member) commented Dec 13, 2024

> ES mentions Unicode segmentation (https://docs.rs/unicode-segmentation/latest/unicode_segmentation/). Does it help the results?

I tried Unicode normalization, but it doesn't help. Will do more experiments.

VoVAllen (Member, Author) commented

Also, it's possible that the index might be wrong. You may want to try the query without the index if needed.

VoVAllen (Member, Author) commented Dec 16, 2024

SELECT create_standard_tokenizer(tokenizer_name, table_name, column_name, config);

This will create a trigger on the column, like:

-- Create the trigger function
CREATE OR REPLACE FUNCTION trigger_update_tokenizer()
RETURNS TRIGGER AS $$
BEGIN
    -- Check if the specific column is set
    IF NEW.column_name IS NOT NULL THEN
        -- Call the update_tokenizer function with the new value
        PERFORM update_tokenizer(NEW.column_name);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create the trigger on the table
CREATE TRIGGER trigger_on_column_name
AFTER INSERT ON table_name
FOR EACH ROW
EXECUTE FUNCTION trigger_update_tokenizer();

And update_tokenizer will update the vocab dict in a specific table (under our own schema like vchord_bm25.tokenizer_name).

Then the user should call tokenize(document, tokenizer_name => 'documents_standard') to tokenize the document into a BM25 vector.
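
For illustration, a rough Rust sketch of what such a tokenize call would produce conceptually: a sparse vector of (token id, term frequency) pairs built from the stored vocabulary. The types and names here are hypothetical, not the extension's actual API:

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical sparse bm25vector representation: token id -> term frequency.
fn to_bm25_vector(tokens: &[String], vocab: &HashMap<String, u32>) -> BTreeMap<u32, u32> {
    let mut vector = BTreeMap::new();
    for token in tokens {
        // Only tokens already present in the tokenizer's vocabulary table contribute.
        if let Some(&id) = vocab.get(token) {
            *vector.entry(id).or_insert(0) += 1;
        }
    }
    vector
}
```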
