XapianFu is a Ruby library for working with Xapian databases. It builds on the GPL licensed Xapian Ruby bindings but provides an interface more in-line with “The Ruby Way”(tm) and is considerably easier to use.
For example, you can work almost entirely with Hash objects - XapianFu will handle converting the Hash keys into Xapian term prefixes when indexing and when parsing queries.
It also handles storing and retrieving hash entries as Xapian::Document values. XapianFu basically gives you a persistent Hash with full text indexing (and ACID transactions).
sudo gem install xapian-fu
Xapian Fu requires the Xapian Ruby bindings to be available. On Debian/Ubuntu, you can install the ‘libxapian-ruby1.8` package to get them. Alternatively, you can install the `xapian-ruby` gem, which reportedly will also provide them. You can also just get the upstream Xapian release and manually install it.
XapianFu::XapianDb is the corner-stone of XapianFu. A XapianDb instance will handle setting up a XapianFu::XapianDocumentsAccessor for reading and writing documents from and to a Xapian database. It makes use of XapianFu::QueryParser for parsing and setting up a query.
XapianFu::XapianDoc represents a document retrieved from or to be added to a Xapian database.
Create a database, add 3 documents to it and then search and retrieve them.
require 'xapian-fu' include XapianFu db = XapianDb.new(:dir => 'example.db', :create => true, :store => [:title, :year]) db << { :title => 'Brokeback Mountain', :year => 2005 } db << { :title => 'Cold Mountain', :year => 2004 } db << { :title => 'Yes Man', :year => 2008 } db.flush db.search("mountain").each do |match| puts match.values[:title] end
Create an in-memory database, add 3 documents to it and then search and retrieve them in year order.
db = XapianDb.new(:store => [:title], :sortable => [:year]) db << { :title => 'Brokeback Mountain', :year => 2005 } db << { :title => 'Cold Mountain', :year => 2004 } db << { :title => 'Yes Man', :year => 2008 } db.search("mountain", :order => :year)
Simple integration with the will_paginate Rails helpers.
@results = db.search("mountain", :page => 1, :per_page => 5) will_paginate @results
Spelling suggestions, like Google’s “Did you mean…” feature:
db = XapianDb.new(:dir => 'example.db', :create => true) db << "There is a mouse in this house" @results = db.search "moose house" unless @results.corrected_query.empty? puts "Did you mean '#{@results.corrected_query}'" end
Ensure that a group of documents are either entirely added to the database or not at all - the transaction is aborted if an exception is raised inside the block. The documents only become available to searches at the end of the block, when the transaction is committed.
db = XapianDb.new(:store => [:title, :year], :sortable => [:year]) db.transaction do db << { :title => 'Brokeback Mountain', :year => 2005 } db << { :title => 'Cold Mountain', :year => 2004 } db << { :title => 'Yes Man', :year => 2008 } end db.search("mountain")
Fields can be described in more detail using a hash. For example, telling XapianFu that a particular field is a Date or Integer will allow very efficient on-disk storage and will ensure the same type of object is instantiated when returning those stored values. And in the case of Integer, allows you to order search results without worrying about leading zeros.
db = XapianDb.new(:fields => { :title => { :store => true }, :released => { :type => Date, :store => true }, :votes => { :type => Integer, :store => true } }) db << { :title => 'Brokeback Mountain', :released => Date.parse('13th January 2006'), :votes => 105302 } db << { :title => 'Cold Mountain, :released => Date.parse('2nd January 2004'), :votes => 45895 } db << { :title => 'Yes Man', :released => Date.parse('26th December 2008'), :votes => 44936 } db.search("mountain", :order => :votes)
Find the document with the highest :year value
db.documents.max(:year)
XapianFu supports Xapian’s ‘MatchAll` and `MatchNothing` queries:
db.search(:all) db.search(:nothing)
Search on particular fields
db.search("title:mountain year:2005")
Boolean AND (default)
db.search("ruby AND rails") db.search("ruby rails")
Boolean OR
db.search("rails OR sinatra") db.search("rails sinatra", :default_op => :or)
Exclude certain terms
db.search("ruby -rails")
Wildcards
db.search("xap*")
Phrase searches
db.search("'a steamer in the gene pool'", :phrase => true)
And any combinations of the above:
db.search("(ruby OR sinatra) -rails xap*")
Sometimes you may want to increase the weight of a particular term in a document. Xapian supports adding {extra weight}(trac.xapian.org/wiki/FAQ/ExtraWeight) to a term at index time by providing an integer “wdf” (default is 1).
You may set an optional :weights option when initializing a XapianDb. The :weights option accepts a Proc or Lambda that will be called with the key, value and list of document fields as each term is indexed. Your function should return an integer to set the weight to.
XapianDb.new(:weights => lambda {|k, v, f| k == :title ? 3 : 1}
If you want to implement something like [this](getting-started-with-xapian.readthedocs.org/en/latest/howtos/boolean_filters.html#searching), then:
db = XapianDb.new( fields: { name: {:index => true}, colors: {:boolean => true} } ) db << {name: "Foo", colors: ["red", "black"]} db << {name: "Foo", colors: ["red", "green"]} db << {name: "Foo", colors: ["blue", "yellow"]} db.search("foo", filter: {:colors => ["red"]})
The main thing here is that filtering by color doesn’t affect the relevancy of the documents returned.
Many times you want to allow users to narrow down the search results by restricting the query to specific values of a given category. This is called [faceted search](readthedocs.org/docs/getting-started-with-xapian/en/latest/xapian-core-rst/facets.html).
To find out which values you can display to your users, you can do something like this:
results = db.search("foo", facets: [:colors, :year]) results.facets # { # :colors => [ # ["blue", 4] # ["red", 1] # ], # # :year => [ # [2010, 3], # [2011, 2], # [2012, 1] # ] # }
When filtering by one of these values, it’s best to define the field as boolean (see section above) and then use ‘:filter`:
db.search("foo", filter: {colors: ["blue"], year: [2010]})
XapianFu always stores the :id field, so you can easily use it with something like ActiveRecord to index database records:
db = XapianDb.new(:dir => 'posts.db', :create => true) Post.all.each { |p| db << p.attributes } docs = db.search("custard") docs.each_with_index { |doc,i| docs[i] = Post.find(doc.id) }
Combine it with the max value search to do batch delta updates by primary key:
db = XapianDb.new(:dir => 'posts.db') latest_doc = db.documents.max(:id) new_posts = Post.find(:all, :conditions => ['id > ?', lastest_doc.id]) new_posts.each { |p| db << p.attributes }
Or by :updated_at field if you prefer:
db = XapianDb.new(:dir => 'posts.db', :fields => { :updated_at => { :type => Time, :store => true } }) last_updated_doc = db.documents.max(:updated_at) new_posts = Post.find(:all, :conditions => ['updated_at >= ?', last_updated_doc.updated_at]) new_posts.each { |p| db << p.attributes }
Deleted records won’t show up in results but can eventually put your result pagination out of whack. So, you’ll need to track deletions yourself, either with a deleted_at field, some kind of delete log or perhaps by reindexing once in a while.
db = XapianDb.new(:dir => 'posts.db') deleted_posts = Post.find(:all, :conditions => 'deleted_at is not null') deleted_posts.each do |post| db.documents.delete(post.id) post.destroy end
- Author
-
John Leach ([email protected])
- Copyright
-
Copyright © 2009-2012 John Leach
- License
-
MIT (The Xapian library is GPL)
- Mailing list
- Web page
- Github