Time-Series Database Fundamentals
A time-series database is optimized for storing and querying data points indexed by timestamp. Each data point typically consists of a timestamp, a set of tags (metadata), and one or more field values (measurements). Time-series data arises naturally in monitoring (CPU usage, request latency), IoT (sensor readings, vehicle telemetry), financial markets (stock prices, trade volumes), and event tracking (user actions, log events).
Write Pattern: Append-Only, High Cardinality
Time-series workloads are write-heavy and append-only. Data arrives continuously and is rarely updated. The write rate can be enormous: a fleet of 10,000 servers each reporting 100 metrics every 10 seconds generates 100,000 data points per second. The "cardinality" of a time series is the number of unique tag combinations. High cardinality (e.g., per-user metrics) stresses indexing because each unique tag combination creates a separate series.
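The arithmetic above can be sketched directly. A minimal back-of-envelope calculation (the one-series-per-host-and-metric assumption is illustrative):

```python
# Write rate and series cardinality for the fleet described above.
servers = 10_000
metrics_per_server = 100
report_interval_s = 10

writes_per_second = servers * metrics_per_server // report_interval_s
print(writes_per_second)  # 100000 data points per second

# Cardinality = number of unique tag combinations. Here, one series per
# (host, metric) pair. Adding a per-user tag (say, 50,000 users) would
# multiply the series count 50,000-fold, which is why per-user metrics
# stress the index.
series = servers * metrics_per_server
print(series)  # 1000000 distinct series
```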
Query Pattern: Range Scans and Aggregations
Queries almost always involve a time range: "give me CPU usage for server X over the last 24 hours." Aggregation is common: average, percentile, min/max over time windows (e.g., "average CPU every 5 minutes"). Point queries by primary key are rare. This is the opposite of OLTP workloads.
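A windowed aggregation like "average CPU every 5 minutes" can be sketched by bucketing timestamps to window boundaries (a toy in-memory version, not how any particular TSDB implements it):

```python
from collections import defaultdict

def avg_per_window(points, window_s=300):
    """Average value per fixed time window (e.g., 5-minute buckets).
    points: iterable of (unix_timestamp, value) pairs."""
    acc = defaultdict(lambda: [0.0, 0])  # bucket -> [sum, count]
    for ts, val in points:
        bucket = ts - ts % window_s      # floor to window boundary
        acc[bucket][0] += val
        acc[bucket][1] += 1
    return {b: s / n for b, (s, n) in sorted(acc.items())}

pts = [(0, 10.0), (60, 20.0), (300, 30.0)]
print(avg_per_window(pts))  # {0: 15.0, 300: 30.0}
```

A real engine pushes this aggregation down into the storage layer so only the relevant time range is scanned.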
Storage Engine
Time-series databases use specialized storage:
- Time-partitioned shards: Data is divided into chunks by time range (e.g., one shard per day or week). Old shards can be dropped without expensive deletion.
- Columnar layout: Each field is stored in a separate column, enabling efficient compression (timestamps are monotonically increasing, values are often similar) and fast scans.
- LSM-based write path: Incoming data is buffered in memory and flushed to immutable on-disk segments, similar to LSM trees but optimized for time-ordered data.
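One reason the columnar layout compresses so well: monotonically increasing timestamps turn into long runs of small, repeated deltas. A minimal delta-encoding sketch (real engines go further, e.g., delta-of-delta and bit packing):

```python
def delta_encode(timestamps):
    """Store the first timestamp, then successive differences.
    Regular report intervals yield runs of identical small integers,
    which generic codecs compress dramatically."""
    encoded = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        encoded.append(cur - prev)
    return encoded

ts = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_030]
print(delta_encode(ts))  # [1700000000, 10, 10, 10]
```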
Retention Policies and Downsampling
Old data is often less valuable at full resolution. A retention policy automatically deletes data older than a threshold (e.g., 30 days of raw data). Downsampling aggregates old data to lower resolution: instead of per-second readings, keep per-hour averages. This dramatically reduces storage while preserving trends.
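Retention is cheap precisely because of time partitioning: expiring data means dropping whole shards, never deleting row by row. A toy sketch (the in-memory dict stands in for on-disk chunks; all names are illustrative):

```python
shards = {}  # day index -> list of (timestamp, value) points

def write(ts, value):
    """Route each point to the shard for its day."""
    shards.setdefault(ts // 86_400, []).append((ts, value))

def enforce_retention(now, keep_days=30):
    """Drop every shard older than the retention threshold.
    Cost is O(1) per shard, independent of how many points it holds."""
    cutoff_day = now // 86_400 - keep_days
    for day in [d for d in shards if d < cutoff_day]:
        del shards[day]
```

Downsampling would run alongside this: before a raw shard is dropped, its per-hour aggregates are written into a longer-lived, lower-resolution series.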
Notable Systems
- InfluxDB: Purpose-built time-series DB with its own query language (InfluxQL / Flux). Uses a TSI (Time-Series Index) for tag indexing and TSM (Time-Structured Merge tree) for data storage.
- TimescaleDB: PostgreSQL extension. Transparent hypertables partition data by time. Full SQL compatibility.
- Prometheus: Pull-based monitoring system. Scrapes metrics from HTTP endpoints at configurable intervals. Local storage with remote-write for long-term retention (e.g., Thanos, Cortex).
Real-Life: Monitoring Infrastructure with Prometheus + InfluxDB
Prometheus for infrastructure monitoring:
- Each server runs a metrics exporter (e.g., node_exporter for Linux). Prometheus scrapes these endpoints every 15 seconds.
- Data is stored locally on the Prometheus server. PromQL enables powerful queries:
  rate(http_requests_total{status="500"}[5m]) computes the per-second rate of 500 errors, averaged over a 5-minute window.
- Alertmanager fires alerts when PromQL conditions are met (e.g., error rate > 1% for 5 minutes).
- For long-term storage, Prometheus remote-writes to Thanos or Cortex, which store data in object storage (S3).
InfluxDB for IoT:
- A factory has 1,000 sensors each reporting temperature, pressure, and vibration every second. That's 1,000 data points per second carrying 3,000 field values.
- InfluxDB stores this in TSM format, partitioned by 1-hour shards. Queries like "average temperature per sensor per 5 minutes over the last 24 hours" scan only the relevant time shards.
- Retention policy: keep raw data for 7 days, keep 1-hour downsampled aggregates for 1 year.
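The storage payoff of that retention policy is easy to quantify. A back-of-envelope sketch using the factory numbers above (point counts only; actual bytes depend on compression):

```python
sensors, fields = 1_000, 3  # temperature, pressure, vibration

# Raw data: one value per field per sensor per second, kept for 7 days.
raw_per_day = sensors * fields * 86_400
raw_total = raw_per_day * 7
print(raw_total)  # 1814400000 stored values

# Downsampled: one hourly average per field per sensor, kept for 1 year.
down_per_day = sensors * fields * 24
down_total = down_per_day * 365
print(down_total)  # 26280000 stored values

# The year of hourly aggregates is ~69x smaller than one week of raw data.
```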
TimescaleDB at real scale:
- Timescale reports customers storing trillions of data points. As a PostgreSQL extension, it supports JOINs between time-series hypertables and regular relational tables—a major advantage over purpose-built TSDBs.