Paolo Ferragina, my host in Pisa, pointed out to me some theoretical underpinnings for Lucene’s indexing method.
Internally, Lucene keeps a stack of segment indexes. Their size increases logarithmically down the stack. New documents are added as tiny indexes pushed onto the top of the stack. When there are b indexes on the top of the stack that are the same size, they are all popped off the stack and merged into a single index which is pushed back onto the stack. Thus the number of segments of each size resembles the digits in the base-b representation of the total number of documents. For example, if b=10 and there are 234 documents indexed, then there would be a four one-document indexes on the top of the stack, followed by three ten-document indexes and finally two one-hundred document indexes.
This keeps the incremental cost of adding a document fairly low, while keeping the number of indexes to be searched fairly small. It’s also still optimal for batch indexing, since adding n tokens still only requires order n*log_b(n) comparisons, as good as any other sorting algorithm.
Apparently this is all theoretically justified by Mark Overmars‘ dynamization work, done around 1980. This paper might be the appropriate reference.
