How TF-IDF Scores Documents
When a search engine receives a query like "database indexing", it needs to rank thousands of matching documents by relevance. TF-IDF (Term Frequency–Inverse Document Frequency) is the foundational scoring formula that makes this possible.
Term Frequency (TF)
TF measures how often a term appears in a single document:
TF(t, d) = count of term t in document d / total terms in document d
A document that mentions "indexing" 5 times out of 200 words has TF = 5/200 = 0.025. The intuition: the more often a term appears in a document, the more relevant that document likely is to that term.
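The normalized TF above translates directly into code. A minimal sketch (the toy document and whitespace-free tokens are assumptions for illustration):

```python
def term_frequency(term: str, tokens: list[str]) -> float:
    """Normalized TF: occurrences of `term` divided by document length."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)

# A 200-word document containing "indexing" 5 times; "filler" stands in
# for the remaining text.
doc = ["indexing"] * 5 + ["filler"] * 195
print(term_frequency("indexing", doc))  # 0.025
```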
Inverse Document Frequency (IDF)
IDF measures how rare a term is across the entire corpus:
IDF(t) = log(N / df(t))
Where N is the total number of documents, df(t) is the number of documents containing term t, and log is the natural logarithm. The word "the" might appear in every document (df = N), giving IDF = log(1) = 0 — it carries no discriminative value. A rare technical term like "B-tree" might appear in only 50 out of 1,000,000 documents, giving IDF = log(20,000) ≈ 9.9 — very high signal.
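Both corner cases fall out of a direct implementation (natural log, matching the numbers above):

```python
import math

def inverse_document_frequency(n_docs: int, doc_freq: int) -> float:
    """IDF(t) = log(N / df(t)); higher for rarer terms."""
    return math.log(n_docs / doc_freq)

print(inverse_document_frequency(1_000_000, 1_000_000))  # "the": 0.0
print(inverse_document_frequency(1_000_000, 50))         # "B-tree": ~9.9
```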
Combined TF-IDF Score
TF-IDF(t, d) = TF(t, d) × IDF(t)
This product ensures that a term scores high only when it is frequent in the document AND rare across the corpus. Common words like "is" or "the" get near-zero scores regardless of frequency, while rare domain-specific terms dominate the ranking.
The BM25 Improvement
Plain TF-IDF has two problems: (1) TF grows linearly — a document mentioning "database" 100 times scores 10x higher than one mentioning it 10 times, which is rarely 10x more relevant. (2) It ignores document length — a 10,000-word document naturally has higher term counts than a 200-word document.
BM25 (Best Matching 25) fixes both:
BM25(t, d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl))
Here TF(t, d) is the raw count of term t in document d, not the length-normalized frequency from earlier; BM25 handles length separately through the b term.
- k1 (typically 1.2): controls TF saturation. Higher k1 means TF contributes more before saturating.
- b (typically 0.75): controls document length normalization. b=0 means no length normalization; b=1 means full normalization.
- |d|/avgdl: ratio of document length to average document length.
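The formula and both parameters translate directly into code. A sketch assuming the simple natural-log IDF used earlier (production engines such as Lucene use a smoothed IDF variant):

```python
import math

def bm25_term_score(tf: int, doc_len: int, avgdl: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term; tf is the raw count in the document."""
    idf = math.log(n_docs / doc_freq)
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation in action: 10x the term count yields far less than 10x the score.
low = bm25_term_score(tf=10, doc_len=1000, avgdl=1000, n_docs=10_000, doc_freq=200)
high = bm25_term_score(tf=100, doc_len=1000, avgdl=1000, n_docs=10_000, doc_freq=200)
print(high / low)  # ~1.1, not 10
```

Setting b=0 in the call removes the length penalty entirely, which is an easy way to see each knob's effect in isolation.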
BM25 is the default ranking function in Elasticsearch, Apache Lucene, and most modern search engines. It consistently outperforms raw TF-IDF in retrieval quality benchmarks.
Real-Life: Scoring a Search Query
Suppose you have a corpus of 10,000 documents and search for "database indexing". The engine scores each document by summing BM25 scores for each query term.
Document A (500 words): mentions "database" 8 times, "indexing" 3 times. Document B (2000 words): mentions "database" 20 times, "indexing" 1 time.
Even though Document B has more raw mentions of "database", BM25's saturation means those extra mentions yield diminishing returns. And BM25's length normalization penalizes Document B for being 4x longer — those 20 mentions across 2000 words are less concentrated than 8 mentions across 500 words.
Result: Document A likely scores higher, which matches human intuition — it is a more focused, relevant document.
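Plugging the two documents into the BM25 formula confirms this ranking. The sketch below assumes an average document length of 1,000 words and illustrative document frequencies of 5,000 for "database" and 200 for "indexing"; none of these corpus statistics are given for this comparison in the text.

```python
import math

def bm25(tf: int, doc_len: int, avgdl: float, n_docs: int, doc_freq: int,
         k1: float = 1.2, b: float = 0.75) -> float:
    idf = math.log(n_docs / doc_freq)
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

N, AVGDL = 10_000, 1000.0                     # avgdl is an assumed value
DF = {"database": 5_000, "indexing": 200}     # assumed document frequencies

# Document A: 500 words, "database" x8, "indexing" x3.
score_a = (bm25(8, 500, AVGDL, N, DF["database"]) +
           bm25(3, 500, AVGDL, N, DF["indexing"]))
# Document B: 2000 words, "database" x20, "indexing" x1.
score_b = (bm25(20, 2000, AVGDL, N, DF["database"]) +
           bm25(1, 2000, AVGDL, N, DF["indexing"]))
print(score_a > score_b)  # True: the shorter, more focused document wins
```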
IDF in action: if "database" appears in 5,000 of 10,000 documents (IDF ≈ 0.7) but "indexing" appears in only 200 (IDF ≈ 3.9), the term "indexing" contributes roughly 5x more to the score. Rare terms dominate ranking — this is what makes TF-IDF and BM25 effective.
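The "roughly 5x" figure is easy to verify (natural log, as in the earlier IDF examples):

```python
import math

idf_database = math.log(10_000 / 5_000)  # ~0.69
idf_indexing = math.log(10_000 / 200)    # ~3.91
print(idf_indexing / idf_database)       # ~5.6: the rare term dominates
```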
Elasticsearch default: when you run a match query in Elasticsearch, it uses BM25 with k1=1.2, b=0.75. You can tune these per-field in the index mapping.
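As a sketch, such per-field tuning could look like the index definition below. The index name, field name, and the k1/b values are illustrative; defining a custom similarity of type BM25 in the index settings and referencing it from a field mapping is Elasticsearch's documented mechanism for this.

```json
PUT /articles
{
  "settings": {
    "index": {
      "similarity": {
        "tuned_bm25": { "type": "BM25", "k1": 1.4, "b": 0.6 }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "similarity": "tuned_bm25" }
    }
  }
}
```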