Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.

Internally, Lucene keeps a stack of *segment* indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are *b* indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.

This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.

Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.

### Like this:

Like Loading...

*Related*

This entry was posted on November 20, 2004 at 1:59 am and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.

March 22, 2005 at 5:35 pm |

The technique you mention is indeed a generalization of the “logarithmic method” of Overmars and is similar to the one used in this paper:

http://www.cs.duke.edu/~jsv/Papers/catalog/node54.html

But it’s a different application of the technique, one that I haven’t seen before. I wonder if this is better in practice than B-tree insertions (it’s probably more space-efficient, at least).