feat: Support more tokenizers/stemmers/filter #10
Yes. Without a pre-trained model on a large dataset, the NDCG drops a lot.
No, because the tokenizer model is tightly coupled with those configurations (pre_tokenizer, stemmer, stopwords, etc.). What we can support is:
What I mean is to create a standard tokenizer like ES does. It stores all the tokens in a table; when new documents are added, it checks whether each token already exists and, if not, adds it to the table. What do you mean by model? Do you mean the statistical model for the BPE tokenizer?
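For illustration, here is a minimal sketch of that token-table idea in SQL. The `tokens` table and `add_tokens` function are made-up names for this sketch, not part of any existing extension:

```sql
-- Hypothetical token table: one row per distinct token.
CREATE TABLE tokens (
    id    serial PRIMARY KEY,
    token text   NOT NULL UNIQUE
);

-- Add only the tokens that are not already known; existing ones are skipped.
CREATE OR REPLACE FUNCTION add_tokens(new_tokens text[])
RETURNS void AS $$
    INSERT INTO tokens (token)
    SELECT DISTINCT u.tok
    FROM unnest(new_tokens) AS u(tok)
    ON CONFLICT (token) DO NOTHING;
$$ LANGUAGE sql;
```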
ES mentioned Unicode segmentation (https://docs.rs/unicode-segmentation/latest/unicode_segmentation/). Does it help the results?
Experiments
Tested with:
I understand this method, but it's not suitable:
I tried Unicode normalization, but it doesn't help. Will do more experiments.
Can we align with https://github.com/xhluca/bm25s/blob/main/bm25s/tokenization.py first?
Also, it's possible that the index itself might be wrong. You may want to try the query without the index if needed.
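One way to do that comparison, as a sketch (the index and table names below are placeholders): drop the index inside a transaction, rerun the query, then roll back so the index is restored:

```sql
BEGIN;
DROP INDEX documents_bm25_idx;  -- placeholder index name
-- rerun the query here and compare the results, e.g.:
-- SELECT id FROM documents ORDER BY score LIMIT 10;
ROLLBACK;                       -- the index is restored
```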
This will create a trigger on the column, like:

```sql
-- Create the trigger function
CREATE OR REPLACE FUNCTION trigger_update_tokenizer()
RETURNS TRIGGER AS $$
BEGIN
    -- Check if the specific column is set
    IF NEW.column_name IS NOT NULL THEN
        -- Call the update_tokenizer function with the new value
        PERFORM update_tokenizer(NEW.column_name);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create the trigger on the table
CREATE TRIGGER trigger_on_column_name
AFTER INSERT ON table_name
FOR EACH ROW
EXECUTE FUNCTION trigger_update_tokenizer();
```

And then the user should call …
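For example, with the placeholder `table_name`/`column_name` above (and assuming `update_tokenizer` is the function that updates the tokenizer's vocabulary), an insert would fire the trigger:

```sql
-- Inserting a row fires the trigger, which passes the new text
-- to update_tokenizer.
INSERT INTO table_name (column_name)
VALUES ('a new document to tokenize');
```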
Design an extensive syntax to let users add tokenizers with different configurations.
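As a purely hypothetical sketch of what such a syntax could look like (`create_tokenizer` and every option name below are invented for illustration, not an existing API):

```sql
-- Hypothetical configuration syntax, for discussion only.
SELECT create_tokenizer('my_tokenizer', $$
    pre_tokenizer = 'unicode_segmentation',
    stemmer       = 'porter2',
    stopwords     = 'english'
$$);
```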
Reference: