What is an Inverted Index?
An inverted index is a data structure that maps each unique term (word) to a list of document IDs that contain that term. This list is called a posting list. Inverted indexes are the backbone of full-text search engines like Lucene, Elasticsearch, Solr, and PostgreSQL's GIN indexes.
Why "inverted"?
A normal (forward) index maps: document ID -> list of terms in the document. An inverted index flips this: term -> list of document IDs containing the term. This inversion is what makes keyword search efficient — instead of scanning every document for a word, you look up the word directly and get all matching documents.
Building an inverted index
- Tokenization: split each document's text into individual terms (words). "The quick brown fox" becomes ["The", "quick", "brown", "fox"].
- Normalization: transform tokens to a canonical form:
- Lowercasing: "The" -> "the"
- Stemming: "running" -> "run", "foxes" -> "fox"
- Stop-word removal: remove common words like "the", "is", "a" (optional, modern engines often keep them)
- Index construction: for each term, record which document IDs contain it. The resulting posting lists are typically sorted by document ID.
Querying
- Single-term query ("fox"): look up the posting list for "fox" — O(1) lookup, returns all matching doc IDs.
- AND query ("quick AND fox"): retrieve posting lists for both terms, then intersect them. Since lists are sorted by doc ID, this is a merge-based intersection in O(n + m) time, where n and m are the posting list lengths.
- OR query ("quick OR fox"): merge (union) the two posting lists.
Skip pointers
For large posting lists, intersection can be optimized with skip pointers: every k-th element in the posting list has a pointer that jumps ahead, allowing you to skip over large segments that cannot match. This reduces intersection time for lists of very different lengths.
Scoring and ranking
Beyond simple boolean matching, search engines use TF-IDF (Term Frequency - Inverse Document Frequency) or BM25 to rank results by relevance. The inverted index typically stores additional metadata per posting (term frequency, positions) to support these calculations.
Real implementations
- Lucene/Elasticsearch: inverted index stored as immutable segments (similar to SSTables), merged in the background. Each segment has its own inverted index.
- PostgreSQL GIN: Generalized Inverted Index, used for full-text search (
tsvector), array containment, and JSONB queries.
Real-Life: How Search Engines Work
When you type "database crash recovery" into a search engine, here is what happens at the inverted index level:
Index (simplified):
| Term | Posting List (doc IDs) |
|---|---|
| database | [1, 3, 7, 12, 15, 23, ...] |
| crash | [3, 7, 19, 42, ...] |
| recovery | [3, 7, 8, 19, 31, ...] |
Query execution:
- Look up "database" -> [1, 3, 7, 12, 15, 23]
- Look up "crash" -> [3, 7, 19, 42]
- Look up "recovery" -> [3, 7, 8, 19, 31]
- Intersect all three lists: documents [3, 7] contain ALL three terms.
- Rank by relevance (BM25 scoring) and return the top results.
Intersection via sorted merge: Two pointers walk through the sorted lists simultaneously. If both point to the same doc ID, it is a match. If one is smaller, advance that pointer. This is O(n + m) — very efficient.
Real-world scale:
- Google's index contains hundreds of billions of documents. The inverted index is sharded across thousands of machines, with each shard holding a subset of documents.
- Elasticsearch clusters at companies like Netflix, Uber, and GitHub handle billions of log entries, each indexed for sub-second search.
- PostgreSQL GIN index enables full-text search:
SELECT * FROM articles WHERE to_tsvector(body) @@ to_tsquery('database & crash').