Dynamization and Lucene

Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.

Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.

This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.

Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.

One Response to “Dynamization and Lucene”

  1. Octavian Procopiuc Says:

    The technique you mention is indeed a generalization of the “logarithmic method” of Overmars and is similar to the one used in this paper:
    But it’s a different application of the technique, one that I haven’t seen before. I wonder if this is better in practice than B-tree insertions (it’s probably more space-efficient, at least).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: