Back to DAG

TF-IDF & BM25

databases

How TF-IDF Scores Documents

When a search engine receives a query like "database indexing", it needs to rank thousands of matching documents by relevance. TF-IDF (Term Frequency–Inverse Document Frequency) is the foundational scoring formula that makes this possible.

Term Frequency (TF)

TF measures how often a term appears in a single document:

TF(t, d) = count of term t in document d / total terms in document d

A document that mentions "indexing" 5 times out of 200 words has TF = 5/200 = 0.025. The intuition: the more often a term appears in a document, the more relevant that document likely is to that term.

Inverse Document Frequency (IDF)

IDF measures how rare a term is across the entire corpus:

IDF(t) = log(N / df(t))

Where N is the total number of documents and df(t) is the number of documents containing term t. The word "the" might appear in every document (df = N), giving IDF = log(1) = 0 — it carries no discriminative value. A rare technical term like "B-tree" might appear in only 50 out of 1,000,000 documents, giving IDF = log(20000) ≈ 9.9 — very high signal.

Combined TF-IDF Score

TF-IDF(t, d) = TF(t, d) × IDF(t)

This product ensures that a term scores high only when it is frequent in the document AND rare across the corpus. Common words like "is" or "the" get near-zero scores regardless of frequency, while rare domain-specific terms dominate the ranking.

The BM25 Improvement

Plain TF-IDF has two problems: (1) TF grows linearly — a document mentioning "database" 100 times scores 10x higher than one mentioning it 10 times, which is rarely 10x more relevant. (2) It ignores document length — a 10,000-word document naturally has higher term counts than a 200-word document.

BM25 (Best Matching 25) fixes both:

BM25(t, d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl))

  • k1 (typically 1.2): controls TF saturation. Higher k1 means TF contributes more before saturating.
  • b (typically 0.75): controls document length normalization. b=0 means no length normalization; b=1 means full normalization.
  • |d|/avgdl: ratio of document length to average document length.

BM25 is the default ranking function in Elasticsearch, Apache Lucene, and most modern search engines. It consistently outperforms raw TF-IDF in retrieval quality benchmarks.

Real-Life: Scoring a Search Query

Real-World Example

Suppose you have a corpus of 10,000 documents and search for "database indexing". The engine scores each document by summing BM25 scores for each query term.

Document A (500 words): mentions "database" 8 times, "indexing" 3 times. Document B (2000 words): mentions "database" 20 times, "indexing" 1 time.

Even though Document B has more raw mentions of "database", BM25's saturation means those extra mentions yield diminishing returns. And BM25's length normalization penalizes Document B for being 4x longer — those 20 mentions across 2000 words are less concentrated than 8 mentions across 500 words.

Result: Document A likely scores higher, which matches human intuition — it is a more focused, relevant document.

IDF in action: if "database" appears in 5,000 of 10,000 documents (IDF ≈ 0.7) but "indexing" appears in only 200 (IDF ≈ 3.9), the term "indexing" contributes roughly 5x more to the score. Rare terms dominate ranking — this is what makes TF-IDF and BM25 effective.

Elasticsearch default: when you run a match query in Elasticsearch, it uses BM25 with k1=1.2, b=0.75. You can tune these per-field in the index mapping.

TF-IDF vs BM25: Term Frequency Saturation

TF Component: TF-IDF (linear) vs BM25 (saturating) Score 0 2 4 6 8 Term Frequency (count in document) 2 5 8 12 16 BM25 saturates diminishing returns TF-IDF grows forever Raw TF-IDF BM25 (k1=1.2)
Step 1 of 2