Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.
Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.
This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.
Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.

March 22, 2005 at 5:35 pm |
The technique you mention is indeed a generalization of the “logarithmic method” of Overmars and is similar to the one used in this paper:
http://www.cs.duke.edu/~jsv/Papers/catalog/node54.html
But it’s a different application of the technique, one that I haven’t seen before. I wonder if this is better in practice than B-tree insertions (it’s probably more space-efficient, at least).