- Add support for numo matrix library. @yagince @srapilly
- Drop support for Ruby versions less than 2.4.
- Fix a bug in the `term_frequency` method of the `BM25Model` class, caused by a typographical error (`documents.size` instead of `document.size`).
- Add `tokenizer` option to `Document` class. @satoryu

  The value is an object with a `tokenize` method that accepts a string and returns an array of `Token` instances. For example, to use natto instead of unicode_utils for Japanese, install MeCab (`brew install mecab`), and then:

  ```ruby
  require 'natto'
  require 'tf-idf-similarity'

  class Tokenizer
    def initialize
      @nm = Natto::MeCab.new
    end

    # Accepts a string and returns an array of Token instances.
    def tokenize(text)
      @nm.enum_parse(text).map do |node|
        # node.surface is the token text as parsed by MeCab.
        TfIdfSimilarity::Token.new(node.surface)
      end
    end
  end

  tokenizer = Tokenizer.new
  document = TfIdfSimilarity::Document.new("こんにちは世界", tokenizer: tokenizer)
  ```
- Add `to_s` method to `Token` class, to use less memory than chaining `lowercase_filter` with `classic_filter`. @satoryu
- Add support for recent RubyGems and Ruby versions (`require 'delegate'`). @diasks2
- Drop support for Ruby 1.9.3.
- Update the `classic_filter` method of the `Token` class to remove possessives when the apostrophe is a backtick (`` ` ``) or a single quotation mark (’). @diasks2
- Drop support for Ruby 1.9.2.
- Add the `document_index` and `text_index` methods to the `Model` class and its subclasses.
- Extract logic from the `BM25Model` and `TfIdfModel` classes to a new `Model` class.
- Drop support for Ruby 1.8.7.
- Load only the required methods from the `unicode_utils` gem, to use less memory.
- Install the `unicode_utils` gem only on Ruby versions greater than 1.8.
- Remove the `:function` option from the `TfIdfModel` class. Use the `BM25Model` class instead.
Major refactor of v0.0.x.