
Full-Text Search (Lucene)


How Full-Text Search Works

A SQL LIKE '%database%' query forces a full table scan — it cannot use a B-tree index because the pattern starts with a wildcard. For searching natural language text across millions of documents, we need a fundamentally different data structure: the inverted index.

The Inverted Index

An inverted index maps every term to the list of document IDs containing that term (called a posting list). To answer "which documents contain the word database?", you look up the term in the index and immediately get the answer — no scanning required.
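The lookup can be sketched in a few lines of Python; the document set and helper name here are illustrative, not part of Lucene's API:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted posting list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "relational database systems",
    2: "database indexing with b-trees",
    3: "full text search engines",
}
index = build_inverted_index(docs)
# index["database"] is the posting list [1, 2]: no document scan required
```

The key property: query cost depends on the length of the posting list, not on the total number of documents.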

Lucene's Analyzer Pipeline

Before text enters the inverted index, it passes through an analyzer with three stages:

  1. Character filters — strip HTML tags, normalize unicode, replace patterns (e.g., convert "café" → "cafe").
  2. Tokenizer — split text into individual tokens. The standard tokenizer splits on whitespace and punctuation: "full-text search" → ["full", "text", "search"].
  3. Token filters — transform each token:
    • Lowercase: "Database" → "database"
    • Stemming: "running" → "run", "databases" → "databas" (Porter stemmer)
    • Stop word removal: remove "the", "is", "a", "an" — they add noise, not signal
    • Synonyms: "quick" → ["quick", "fast"] so searching for either finds both

The same analyzer must be applied at index time and query time — if you stem "running" to "run" at index time but query for the literal "running", you get zero results.
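A toy analyzer in Python mirroring the three stages; `simple_stem` is a deliberately crude stand-in for the Porter stemmer, and the whole pipeline is an illustration, not Lucene's implementation:

```python
import re

STOP_WORDS = {"the", "is", "a", "an"}

def simple_stem(token):
    # Toy stemmer: strips a few common suffixes ("running" -> "run", "shoes" -> "shoe")
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Char filter -> tokenizer -> token filters, in that order."""
    text = text.replace("é", "e")             # char filter: normalize accents (toy)
    tokens = re.findall(r"[a-zA-Z]+", text)   # tokenizer: split on non-letters
    out = []
    for t in tokens:
        t = t.lower()                         # lowercase filter
        t = simple_stem(t)                    # stemming filter
        if t not in STOP_WORDS:               # stop word filter
            out.append(t)
    return out

analyze("Running Shoes!")
```

Running the same `analyze` on both documents and queries is what guarantees that "Running" and "running shoes" meet at the same term "run" in the index.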

Segments and Near-Real-Time Search

Lucene does not update the inverted index in place. Instead:

  1. New documents are buffered in memory (the in-memory buffer).
  2. Periodically (or on refresh, default every 1 second in Elasticsearch), the buffer is flushed to a new immutable segment on disk.
  3. Each segment is a self-contained mini inverted index with its own dictionary and posting lists.
  4. A search fans out across all segments and merges results.

Because segments are immutable, deletes are handled by a separate "delete bitmap" — the document is marked as deleted and filtered out at query time. The actual data is only physically removed during segment merging.
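A minimal Python model of this lifecycle; the class and method names are invented for illustration, and real Lucene segments are on-disk structures, not dicts:

```python
class NRTIndex:
    """Toy model of Lucene's segment lifecycle: buffer -> refresh -> immutable segments."""

    def __init__(self):
        self.buffer = {}     # in-memory buffer: doc_id -> set of terms
        self.segments = []   # each segment: immutable dict of term -> sorted doc IDs
        self.deleted = set() # "delete bitmap": doc IDs filtered out at query time

    def add(self, doc_id, text):
        self.buffer[doc_id] = set(text.lower().split())

    def refresh(self):
        """Flush the buffer to a new immutable segment (Elasticsearch: every 1s by default)."""
        if not self.buffer:
            return
        segment = {}
        for doc_id, terms in self.buffer.items():
            for term in terms:
                segment.setdefault(term, []).append(doc_id)
        self.segments.append({t: sorted(ids) for t, ids in segment.items()})
        self.buffer = {}

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # segments are immutable: only mark, never rewrite

    def search(self, term):
        """Fan out across all segments, merge hits, filter out deleted docs."""
        hits = set()
        for segment in self.segments:
            hits.update(segment.get(term, []))
        return sorted(d for d in hits if d not in self.deleted)
```

Note that a document added but not yet refreshed is invisible to search, which is exactly the "near" in near-real-time.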

Segment Merging

Over time, many small segments accumulate. Lucene runs a background merge policy (similar to LSM-tree compaction) that combines multiple small segments into fewer large ones. This reduces the number of segments a search must query, improves query latency, and reclaims space from deleted documents.
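Assuming the same toy representation of a segment as a term-to-posting-list dict, the effect of a merge can be sketched as follows (real merge policies also pick which segments to merge based on size tiers, which this omits):

```python
def merge_segments(segments, deleted):
    """Combine several segments into one, physically dropping deleted docs."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            live = [d for d in doc_ids if d not in deleted]  # reclaim deleted docs
            if live:
                merged.setdefault(term, []).extend(live)
    # one big segment with sorted, de-duplicated posting lists
    return {t: sorted(set(ids)) for t, ids in merged.items()}
```

After the merge, the delete bitmap entries for the merged segments can be discarded, since the deleted documents no longer exist anywhere.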

PostgreSQL Full-Text Search

PostgreSQL provides built-in full-text search using the tsvector type (a sorted list of lexemes with positions) and tsquery type (a boolean query). A GIN index on a tsvector column gives inverted-index performance without leaving PostgreSQL. Example: SELECT * FROM articles WHERE to_tsvector('english', body) @@ to_tsquery('database & indexing').

Elasticsearch = Distributed Lucene

Elasticsearch wraps Lucene with distribution: it shards the index across nodes, replicates for fault tolerance, and provides a REST API. Each Elasticsearch shard is a single Lucene index. When you search, the coordinating node fans out to all shards, each shard searches its local Lucene segments, and results are merged.

Real-World Example: Building a Search Pipeline

Consider building search for an e-commerce site with 5 million product listings. A user searches for "running shoes".

Step 1 — Indexing: Each product's title and description are passed through the analyzer. "Men's Lightweight Running Shoes — Trail Edition" becomes the tokens ["men", "lightweight", "run", "shoe", "trail", "edit"]. Note: "Running" was lowercased and stemmed to "run", "Shoes" to "shoe", and "Edition" to "edit"; the possessive 's and the dash were dropped by the tokenizer.

Step 2 — Inverted index: The term "run" now has a posting list pointing to this product (and thousands of others). The term "shoe" has its own posting list.

Step 3 — Query: The user's query "running shoes" goes through the same analyzer → ["run", "shoe"]. Lucene intersects the posting lists for "run" and "shoe", then scores matches using BM25.
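Because posting lists are kept sorted, the intersection is a linear merge of the two lists. The sketch below uses the doc IDs from this example plus a simplified BM25 term score with Lucene's default parameters k1=1.2 and b=0.75; it is not Lucene's actual scorer:

```python
import math

def intersect(p1, p2):
    """Merge-intersect two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 contribution of one term in one document."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

run = [2, 15, 42, 108, 330, 1024]    # posting list for "run" (illustrative)
shoe = [2, 7, 42, 89, 330, 998]      # posting list for "shoe" (illustrative)
intersect(run, shoe)                  # → [2, 42, 330]
```

The BM25 formula rewards rare terms (low df), saturates with term frequency, and penalizes long documents, which is why it usually outranks plain tf-idf for relevance.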

Step 4 — Near-real-time: A new product added 500ms ago is already searchable because Elasticsearch refreshes segments every second by default.

PostgreSQL alternative: For smaller scale (under 1M documents), PostgreSQL's built-in tsvector with a GIN index is often sufficient and avoids running a separate Elasticsearch cluster.

Lucene Analyzer Pipeline and Inverted Index

[Diagram: "Running Shoes!" passes through the char filter (strip HTML, normalize unicode), the tokenizer (split on spaces), and the token filters (lowercase, stem, stop words), producing ["run", "shoe"]. The inverted index (a merged view of 3 segments) maps each term to its posting list, e.g. "run" → doc2, doc15, doc42, doc108, doc330, doc1024 and "shoe" → doc2, doc7, doc42, doc89, doc330, doc998. The query "run" AND "shoe" intersects the posting lists to yield doc2, doc42, doc330; each result is scored with BM25, then sorted by relevance.]