
Change Data Capture (CDC)


What is Change Data Capture?

Change Data Capture (CDC) is a technique for capturing row-level changes (INSERT, UPDATE, DELETE) from a database and streaming them to external systems in near real-time. Rather than periodically polling the database for changes, CDC reads the database's internal change log and emits each change as an event.

WAL-Based CDC

The most reliable approach reads changes from the database's Write-Ahead Log (WAL):

  • PostgreSQL logical decoding: PostgreSQL writes all changes to its WAL. Logical decoding translates the physical WAL entries into a logical stream of row changes (table name, operation type, old and new column values). Replication slots ensure no changes are missed even if the consumer is temporarily down.
  • MySQL binlog: MySQL writes all data modifications to the binary log. In ROW format, each binlog event contains the before-and-after images of modified rows. A CDC tool reads the binlog as a replica would.

WAL-based CDC has several advantages: it captures every change (no missed updates), it does not require modifying the application or schema (no triggers), and it has minimal impact on database performance.
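To make the shape of a logical-decoding stream concrete, here is a minimal sketch that parses output in the style of PostgreSQL's wal2json plugin into simplified change records. The exact payload layout is an assumption based on wal2json's default format; the sample data is invented for illustration.

```python
import json

def parse_wal2json_change(raw: str) -> list[dict]:
    """Parse one wal2json-style payload into simplified change records.

    Assumes the plugin's default format: a "change" array whose entries
    carry kind/schema/table plus parallel column name/value lists.
    """
    payload = json.loads(raw)
    changes = []
    for entry in payload.get("change", []):
        columns = entry.get("columnnames", [])
        values = entry.get("columnvalues", [])
        changes.append({
            "op": entry["kind"],                       # insert | update | delete
            "table": f'{entry["schema"]}.{entry["table"]}',
            "row": dict(zip(columns, values)),
        })
    return changes

# A payload like the one emitted for a single UPDATE on orders:
raw = json.dumps({"change": [{
    "kind": "update", "schema": "public", "table": "orders",
    "columnnames": ["id", "status"], "columnvalues": [42, "shipped"],
}]})
print(parse_wal2json_change(raw))
```

In a real deployment this parsing happens inside the CDC tool; the point is that logical decoding hands you structured row changes, not raw WAL bytes.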

Debezium: Open-Source CDC Platform

Debezium is the leading open-source CDC framework. It runs as a set of Kafka Connect connectors:

  1. The Debezium connector connects to the database as a replication client.
  2. It reads changes from the WAL/binlog and converts them into structured Kafka messages.
  3. Each table maps to a Kafka topic. Each change is a message with the operation type, before/after row images, and metadata (timestamp, transaction ID, LSN).
  4. Downstream consumers read from Kafka topics and react to changes.

Debezium supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Cassandra, and more.
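The message structure described in step 3 can be sketched as a small consumer-side helper. The envelope fields (op, before, after, source) follow Debezium's standard change-event format; the sample event and the summary format are made up for illustration.

```python
import json

def summarize_debezium_event(message: str) -> str:
    """Turn a Debezium change-event value into a one-line summary.

    Relies on the standard envelope fields: op, before, after, source.
    """
    event = json.loads(message)
    ops = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT READ"}
    src = event["source"]
    return (f'{ops[event["op"]]} on {src["db"]}.{src["table"]}: '
            f'{event["before"]} -> {event["after"]}')

# A hypothetical event from the db.public.orders topic:
msg = json.dumps({
    "op": "u",
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "shipped"},
    "source": {"db": "shop", "table": "orders"},
})
print(summarize_debezium_event(msg))
```

Because before/after images travel with every message, a consumer can react to a change without querying the source database at all.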

CDC Use Cases

  • Sync to search index: When a product row changes in PostgreSQL, the CDC event triggers an update to the corresponding Elasticsearch document. The search index stays in sync without the application needing to dual-write.
  • Event-driven microservices: Service A writes to its database. CDC captures the change and publishes it to Kafka. Service B consumes the event and updates its own state. This replaces fragile dual-writes (write to DB + send message) with a reliable outbox pattern.
  • Cache invalidation: When a row changes in the database, the CDC event triggers deletion of the corresponding cache entry. The next read fills the cache with fresh data.
  • Real-time analytics: Stream changes to a data warehouse or analytics engine (e.g., BigQuery, ClickHouse) for up-to-the-second dashboards.
  • Audit logging: Every change is captured with its timestamp, transaction ID, and before/after values, providing a complete audit trail.
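The cache-invalidation case above is small enough to sketch end to end. This toy version uses an in-process dict as the cache; the "&lt;table&gt;:&lt;id&gt;" key scheme and the event shape are assumptions, and a real consumer would call this from its Kafka poll loop.

```python
cache = {"product:42": {"name": "Lamp", "price": 19.99}}  # toy in-process cache

def invalidate_on_change(event: dict) -> None:
    """Drop the cache entry for any row that was updated or deleted.

    Key scheme "<table>:<id>" is an assumption for this sketch.
    The next read misses and repopulates from the database.
    """
    if event["op"] in ("u", "d"):
        key = f'{event["source"]["table"]}:{event["before"]["id"]}'
        cache.pop(key, None)

invalidate_on_change({
    "op": "u",
    "before": {"id": 42, "price": 19.99},
    "after": {"id": 42, "price": 17.99},
    "source": {"db": "shop", "table": "product"},
})
print(cache)  # {}
```

Deleting rather than updating the cache entry sidesteps ordering races between concurrent CDC events and application reads.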

Real-Life: Debezium + Kafka Pipeline


A typical Debezium deployment for e-commerce:

  1. Source: PostgreSQL database with tables orders, products, customers.
  2. Debezium connector: Reads PostgreSQL logical replication slot, emits changes to Kafka topics db.public.orders, db.public.products, db.public.customers.
  3. Kafka: Stores the change events durably. Multiple consumers can read independently.
  4. Consumers:
    • Elasticsearch sink: Reads from db.public.products, updates the product search index in real-time.
    • Analytics consumer: Reads from db.public.orders, writes to ClickHouse for real-time revenue dashboards.
    • Notification service: Reads from db.public.orders, sends order confirmation emails.
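The fan-out in step 4 can be sketched as a tiny topic-to-handler dispatcher. The topic names follow Debezium's default "&lt;prefix&gt;.&lt;schema&gt;.&lt;table&gt;" convention from the pipeline above; the handler functions are hypothetical stand-ins for the Elasticsearch sink, analytics consumer, and notification service.

```python
from collections import defaultdict

handlers = defaultdict(list)   # Kafka topic -> consumer callbacks
log = []                       # records which consumer saw which order

def subscribe(topic):
    """Register a handler for one CDC topic (toy stand-in for a consumer group)."""
    def register(fn):
        handlers[topic].append(fn)
        return fn
    return register

@subscribe("db.public.orders")
def update_revenue_dashboard(event):
    log.append(("analytics", event["after"]["id"]))

@subscribe("db.public.orders")
def send_confirmation_email(event):
    log.append(("notifications", event["after"]["id"]))

@subscribe("db.public.products")
def update_search_index(event):
    log.append(("search", event["after"]["id"]))

def dispatch(topic, event):
    """Deliver one change event to every handler subscribed to its topic."""
    for fn in handlers[topic]:
        fn(event)

dispatch("db.public.orders", {"op": "c", "after": {"id": 7}})
print(log)  # [('analytics', 7), ('notifications', 7)]
```

In production each consumer is an independent Kafka consumer group with its own offsets, which is what lets the sinks fail and recover independently.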

Netflix uses CDC extensively to sync data across its microservices. Changes in one service's database are captured via CDC and published to a shared event bus, enabling other services to maintain derived views.

Airbnb uses Debezium to capture changes from MySQL databases and stream them to their data lake, enabling real-time analytics on booking data.

WePay (a JPMorgan Chase company) was an early adopter of Debezium for processing financial transactions, using Kafka Connect's offset tracking to guarantee no changes are lost or duplicated.
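One way to get the lost-or-duplicated guarantee on the consumer side is to track a watermark such as the PostgreSQL LSN and skip redeliveries. This is a sketch under the assumption that events arrive in order per partition; real deployments persist the watermark (e.g. in Kafka Connect offsets or the sink database) atomically with the applied change.

```python
class IdempotentConsumer:
    """Skip CDC events at or below the last LSN already applied.

    Assumes in-order delivery within a partition; the event shape
    mirrors the Debezium envelope used earlier in this article.
    """
    def __init__(self):
        self.last_lsn = 0   # highest LSN applied so far
        self.applied = []

    def process(self, event: dict) -> bool:
        lsn = event["source"]["lsn"]
        if lsn <= self.last_lsn:      # duplicate delivery after a retry
            return False
        self.applied.append(event["after"])
        self.last_lsn = lsn           # advance the watermark
        return True

c = IdempotentConsumer()
e = {"op": "u", "after": {"id": 42, "status": "shipped"}, "source": {"lsn": 84920}}
print(c.process(e), c.process(e))  # True False
```

Because Kafka gives at-least-once delivery by default, this kind of deduplication is what turns "at least once" into "effectively exactly once" at the sink.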

CDC Pipeline with Debezium and Kafka

App writes → PostgreSQL WAL (logical decoding) → Debezium (Kafka Connect) → Kafka topics (orders, products, customers) → consumers (Elasticsearch, analytics, cache).

Example CDC event (JSON):

  {
    "op": "u",                                                  // u=update, c=create, d=delete
    "before": {"id": 42, "status": "pending", "total": 99.00},
    "after":  {"id": 42, "status": "shipped", "total": 99.00},
    "source": {"db": "shop", "table": "orders", "lsn": 84920}
  }