What is Change Data Capture?
Change Data Capture (CDC) is a technique for capturing row-level changes (INSERT, UPDATE, DELETE) from a database and streaming them to external systems in near real-time. Rather than periodically polling the database for changes, CDC reads the database's internal change log and emits each change as an event.
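To make "each change as an event" concrete, here is a minimal sketch of what a row-change event looks like. The envelope (operation code, before/after row images, source metadata) follows the shape Debezium emits by default; the actual values are illustrative.

```python
import json

# A CDC change event for an UPDATE on an "orders" row. Field names follow
# the Debezium-style JSON envelope; the row values are made up.
event = json.loads("""
{
  "op": "u",
  "before": {"id": 42, "status": "pending"},
  "after":  {"id": 42, "status": "shipped"},
  "source": {"table": "orders", "ts_ms": 1700000000000},
  "ts_ms": 1700000000123
}
""")

op_names = {"c": "INSERT", "u": "UPDATE", "d": "DELETE"}
print(op_names[event["op"]], "on", event["source"]["table"])
# UPDATE on orders
print("status:", event["before"]["status"], "->", event["after"]["status"])
# status: pending -> shipped
```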
WAL-Based CDC
The most reliable approach reads changes from the database's Write-Ahead Log (WAL):
- PostgreSQL logical decoding: PostgreSQL writes all changes to its WAL. Logical decoding translates the physical WAL entries into a logical stream of row changes (table name, operation type, old and new column values). Replication slots ensure no changes are missed even if the consumer is temporarily down.
- MySQL binlog: MySQL writes all data modifications to the binary log. In ROW format, each binlog event contains the before-and-after images of modified rows. A CDC tool reads the binlog as a replica would.
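What the "logical stream of row changes" looks like can be seen with PostgreSQL's built-in `test_decoding` output plugin, which renders each decoded WAL change as one text line. Production CDC tools use binary plugins like `pgoutput`, but the decoded information (table, operation, column names/types/values) is the same. A small parser, assuming the `test_decoding` line format:

```python
import re

# One line of output from PostgreSQL's test_decoding logical decoding
# plugin, e.g. as returned by pg_logical_slot_get_changes().
line = "table public.orders: UPDATE: id[integer]:42 status[text]:'shipped'"

# Split into table, operation, and the column list.
m = re.match(r"table (\S+): (\w+): (.*)", line)
table, operation, rest = m.groups()

# Each column is rendered as name[type]:value.
columns = {name: value for name, _type, value in
           re.findall(r"(\w+)\[([\w ]+)\]:('[^']*'|\S+)", rest)}

print(table, operation, columns)
# public.orders UPDATE {'id': '42', 'status': "'shipped'"}
```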
WAL-based CDC has several advantages: it captures every change (no missed updates), it does not require modifying the application or schema (no triggers), and it has minimal impact on database performance.
Debezium: Open-Source CDC Platform
Debezium is the leading open-source CDC framework. It runs as a set of Kafka Connect connectors:
- The Debezium connector connects to the database as a replication client.
- It reads changes from the WAL/binlog and converts them into structured Kafka messages.
- Each table maps to a Kafka topic. Each change is a message with the operation type, before/after row images, and metadata (timestamp, transaction ID, LSN).
- Downstream consumers read from Kafka topics and react to changes.
Debezium supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Cassandra, and more.
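A downstream consumer's core logic is a dispatch on the operation type in the message envelope. The sketch below handles the raw message value only; the Kafka consumer loop around it (e.g. kafka-python or confluent-kafka) is omitted, and the envelope assumes Debezium's default JSON format with schemas disabled.

```python
import json

def handle_change(raw_value: bytes) -> str:
    """Dispatch one Debezium change message by operation type.

    Assumes the JSON envelope with "op", "before", and "after" fields.
    """
    event = json.loads(raw_value)
    op = event["op"]
    before, after = event.get("before"), event.get("after")
    if op == "c":
        return f"index new row {after}"
    if op == "u":
        return f"reindex row {after} (was {before})"
    if op == "d":
        return f"remove row {before}"
    return f"ignore op {op!r}"  # e.g. "r" for initial snapshot reads

# Simulated message value from a topic such as db.public.products:
msg = json.dumps({"op": "u",
                  "before": {"id": 7, "price": 10},
                  "after":  {"id": 7, "price": 12}}).encode()
print(handle_change(msg))
# reindex row {'id': 7, 'price': 12} (was {'id': 7, 'price': 10})
```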
CDC Use Cases
- Sync to search index: When a product row changes in PostgreSQL, the CDC event triggers an update to the corresponding Elasticsearch document. The search index stays in sync without the application needing to dual-write.
- Event-driven microservices: Service A writes to its database. CDC captures the change and publishes it to Kafka. Service B consumes the event and updates its own state. Combined with an outbox table (events written in the same transaction as the business data), this replaces fragile dual-writes (write to the DB, then send a message, and hope both succeed) with a reliable, transactional pipeline.
- Cache invalidation: When a row changes in the database, the CDC event triggers deletion of the corresponding cache entry. The next read fills the cache with fresh data.
- Real-time analytics: Stream changes to a data warehouse or analytics engine (e.g., BigQuery, ClickHouse) for up-to-the-second dashboards.
- Audit logging: Every change is captured with its timestamp, transaction ID, and before/after values, providing a complete audit trail.
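The outbox pattern mentioned above hinges on one thing: the business row and the event row commit in a single transaction, so there is no window where one exists without the other. A minimal sketch using SQLite as a stand-in database (table and column names are illustrative; in production, a CDC connector or polling relay publishes the outbox rows to Kafka):

```python
import json
import sqlite3

# Stand-in database; in practice this would be the service's own DB.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT);
""")

def place_order(order_id: int) -> None:
    # "with db" wraps both INSERTs in one transaction:
    # either both rows commit, or neither does.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'pending')", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders.events",
                    json.dumps({"type": "OrderPlaced", "order_id": order_id})))

place_order(42)
pending = db.execute("SELECT topic, payload FROM outbox").fetchall()
print(pending)
# [('orders.events', '{"type": "OrderPlaced", "order_id": 42}')]
```

CDC then reads the outbox table's inserts from the WAL and publishes them, giving at-least-once delivery without the application ever talking to Kafka directly.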
Real-Life: Debezium + Kafka Pipeline
A typical Debezium deployment for e-commerce:
- Source: PostgreSQL database with tables `orders`, `products`, `customers`.
- Debezium connector: Reads from the PostgreSQL logical replication slot, emits changes to Kafka topics `db.public.orders`, `db.public.products`, `db.public.customers`.
- Kafka: Stores the change events durably. Multiple consumers can read independently.
- Consumers:
  - Elasticsearch sink: Reads from `db.public.products`, updates the product search index in real time.
  - Analytics consumer: Reads from `db.public.orders`, writes to ClickHouse for real-time revenue dashboards.
  - Notification service: Reads from `db.public.orders`, sends order confirmation emails.
Netflix uses CDC extensively to sync data across its microservices. Changes in one service's database are captured via CDC and published to a shared event bus, enabling other services to maintain derived views.
Airbnb uses Debezium to capture changes from MySQL databases and stream them to their data lake, enabling real-time analytics on booking data.
WePay (a Chase company) pioneered the use of Debezium for processing financial transactions, using Kafka Connect's offset tracking to ensure no change events are lost.