
Change Data Capture (CDC) is eliminating the batch processing bottleneck that has plagued data engineering for decades. Pinterest’s recent migration from 24-hour latency to 15-minute freshness offers a blueprint for modern event-driven architecture, but the implementation reveals critical trade-offs between storage costs and query performance.
This analysis explores the technical mechanics of CDC pipelines, from SQLite trigger patterns to petabyte-scale Iceberg implementations, and examines when real-time streams actually matter versus when you’re just burning infrastructure budget for marginal gains.
The Batch Latency Tax
Pinterest’s legacy data infrastructure was a textbook example of batch processing debt. Their previous system relied on multiple independently maintained pipelines running full-table batch jobs, resulting in data latency exceeding 24 hours. For a platform serving 500 million monthly active users, waiting a full day for analytics and ML workflows to reflect user behavior isn’t just inconvenient; it’s a competitive disadvantage.
The operational reality was worse than the latency metrics suggested. Daily changes for many tables sat below 5%, yet the system reprocessed 100% of records every cycle. Row-level deletions weren’t natively supported, forcing workarounds that further complicated data consistency.
When an incremental load at scale means scanning petabytes to capture gigabytes of actual changes, you’re paying cloud bills to shuffle immutable data around.
Engineers following Pinterest’s migration noted that collapsing latency from 24 hours to 15 minutes represents a fundamental shift in data freshness. But achieving this required more than flipping a configuration switch; it demanded a complete architectural rethink.
How CDC Actually Works
CDC captures database modifications (inserts, updates, and deletes) as they happen, streaming changes through an event pipeline rather than waiting for the next batch window. The pattern sounds straightforward on paper: detect the change, emit an event, process downstream. In practice, implementing this at scale introduces complexity around exactly-once semantics, schema evolution, and storage optimization.
Pinterest’s solution illustrates the modern CDC stack. They deployed Debezium and TiCDC to capture changes from MySQL and TiDB instances, streaming events through Kafka to Flink and Spark for processing, ultimately landing in Iceberg tables on S3. This isn’t a vendor-specific stack; it has become the de facto industry standard for teams choosing CDC over microbatching when latency matters more than raw throughput.
The critical insight isn’t the tool selection; it’s the separation of concerns. CDC tables function as append-only ledgers capturing every change event with sub-five-minute latency. Base tables maintain historical snapshots, updated via Spark MERGE INTO operations every 15 minutes to an hour. This dual-table approach isolates the high-velocity stream from the analytical query workload.
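The mechanics of that dual-table split can be sketched in a few lines of Python. This toy example (field names are illustrative) replays a Debezium-style append-only change log onto an in-memory base snapshot, mirroring at small scale what Spark’s MERGE INTO does against Iceberg:

```python
# Toy illustration of the dual-table pattern: an append-only CDC log
# is periodically merged into a base snapshot keyed by primary key.
# Event shape loosely follows a Debezium envelope (op, before, after).

def merge_into(base: dict, cdc_log: list) -> dict:
    """Apply change events to a base snapshot, last writer wins by log order."""
    for event in cdc_log:
        op = event["op"]  # "c" = create, "u" = update, "d" = delete
        if op in ("c", "u"):
            row = event["after"]
            base[row["user_id"]] = row
        elif op == "d":
            base.pop(event["before"]["user_id"], None)
    return base

base = {1: {"user_id": 1, "plan": "free"}}
log = [
    {"op": "u", "before": {"user_id": 1}, "after": {"user_id": 1, "plan": "pro"}},
    {"op": "c", "before": None, "after": {"user_id": 2, "plan": "free"}},
    {"op": "d", "before": {"user_id": 2}, "after": None},
]
merged = merge_into(base, log)
print(merged)  # {1: {'user_id': 1, 'plan': 'pro'}}
```

Because the log is append-only, the merge can be replayed from any snapshot, which is what makes the periodic 15-minute compaction safe to retry on failure.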
Pinterest’s Architecture: A Case Study in Incremental Processing
Pinterest’s implementation reveals the engineering decisions that separate functional CDC from cost-effective CDC. Their architecture handles petabyte-scale data across thousands of pipelines while processing only the 5% of records that actually change daily.

The storage layer choice proved particularly consequential. Iceberg offers two update strategies: Copy on Write (COW) and Merge on Read (MOR). COW rewrites entire data files during updates, creating significant storage and compute overhead. MOR writes changes to separate files and applies them at read time, reducing write amplification but potentially slowing queries.
Pinterest standardized on Merge on Read after evaluating both approaches. COW introduced storage costs that outweighed its benefits for their workload patterns. By partitioning base tables using Iceberg bucketing on primary key hashes, they enabled Spark to parallelize upserts efficiently while minimizing data scanned per operation. They also addressed the small files problem by instructing Spark to distribute writes by partition, preventing the metadata explosion that plagues poorly configured streaming pipelines.
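The bucketing idea is essentially a stable hash modulo a bucket count. A rough Python analogue (the bucket count is illustrative, and Iceberg actually uses Murmur3 rather than MD5) shows why it lets upserts parallelize: each changed row deterministically maps to exactly one partition, so Spark tasks can work on disjoint file groups:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; real bucket counts are tuned per table

def bucket_for(primary_key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash-bucket assignment, analogous to Iceberg's bucket() transform.
    (Iceberg uses Murmur3; MD5 here just keeps the sketch dependency-free.)"""
    digest = hashlib.md5(str(primary_key).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Changed rows group into buckets, so each upsert task touches only
# its own bucket's files instead of scanning the whole table.
changes = [101, 202, 303, 101]
by_bucket: dict[int, list[int]] = {}
for pk in changes:
    by_bucket.setdefault(bucket_for(pk), []).append(pk)
```

The same property is what prevents small-file explosion when writes are distributed by partition: a given key always lands in the same bucket, so compaction stays local.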
The result: infrastructure cost savings alongside the latency reduction. When you’re only processing changed records rather than full table scans, your compute bill drops proportionally to your data churn rate.
Implementation Patterns: From SQLite to SQL Server
CDC isn’t exclusive to petabyte-scale distributed systems. The same patterns apply to embedded databases and traditional RDBMS, though the implementation mechanics differ significantly.
SQLite Trigger-Based Pattern
For SQLite, common in edge devices and embedded applications, CDC requires trigger-based patterns since SQLite doesn’t expose logical replication streams:
```sql
CREATE TABLE IF NOT EXISTS cdc_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT NOT NULL,
    operation TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    payload TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX IF NOT EXISTS idx_cdc_processed ON cdc_log(processed);
```
Triggers capture changes into this log table, which a streaming worker polls and publishes to Kafka or Pulsar:
```sql
CREATE TRIGGER users_after_update
AFTER UPDATE ON users
BEGIN
    INSERT INTO cdc_log (table_name, operation, row_id, payload)
    VALUES (
        'users',
        'UPDATE',
        NEW.user_id,
        json_object(
            'user_id', NEW.user_id,
            'email', NEW.email,
            'plan', NEW.plan,
            'updated_at', NEW.updated_at
        )
    );
END;
```

A Python worker then polls unprocessed rows and publishes to your streaming platform of choice. For production deployments, switch from boolean processed flags to checkpoint-based ID tracking to reduce write amplification, and ensure WAL mode is enabled for concurrency between writes and streaming reads.
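A minimal version of such a worker, using the checkpoint approach rather than per-row processed flags, might look like this (the demo runs against a simplified in-memory copy of the cdc_log schema; the publish step is stubbed where Kafka or Pulsar would go):

```python
import sqlite3

def poll_changes(conn: sqlite3.Connection, last_id: int, batch_size: int = 100):
    """Fetch CDC rows past the checkpoint; returns (events, new_checkpoint)."""
    rows = conn.execute(
        "SELECT id, table_name, operation, payload FROM cdc_log "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    new_checkpoint = rows[-1][0] if rows else last_id
    return rows, new_checkpoint

# Demo against an in-memory database with a pared-down cdc_log.
# In production the checkpoint would be persisted, and each batch
# published to the streaming platform before the checkpoint advances.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cdc_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT, operation TEXT, payload TEXT)""")
conn.executemany(
    "INSERT INTO cdc_log (table_name, operation, payload) VALUES (?, ?, ?)",
    [("users", "UPDATE", '{"user_id": 1}'), ("users", "INSERT", '{"user_id": 2}')],
)
events, checkpoint = poll_changes(conn, last_id=0)
print(len(events), checkpoint)  # 2 2
```

Advancing the checkpoint only after a successful publish gives at-least-once delivery; downstream consumers still need to deduplicate.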
SQL Server Native CDC
SQL Server offers native CDC capabilities that operate at the transaction log level, avoiding the trigger overhead:
```sql
USE [YourDatabaseName];
GO

-- Enable CDC at the database level
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC for a specific table (requires a primary key
-- when @supports_net_changes = 1)
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'YourTableName',
    @role_name = NULL,
    @supports_net_changes = 1;
GO
```
SQL Server CDC reads the transaction log directly, capturing changes with minimal performance impact on the transactional database. Change tables store the operation type, timestamp, and affected data, queryable via functions like cdc.fn_cdc_get_all_changes.
The Real-Time Illusion: When CDC Makes Sense
When CDC Shines
- Your daily change rate is low relative to total dataset size (Pinterest’s 5%)
- Downstream systems need immediate consistency (recommendation engines, fraud detection)
- You’re currently paying to reprocess immutable data daily
When CDC Is Overkill
- Your analytics queries run weekly
- Data freshness requirements exceed your actual business decision cycles
- You haven’t optimized your batch pipeline costs yet
Not every pipeline needs sub-15-minute latency. The Pinterest migration makes sense because their use cases (analytics, machine learning, and product features) directly benefit from fresh data. But CDC introduces operational complexity: monitoring at-least-once delivery guarantees, handling schema evolution safely, and managing the storage tradeoffs between historical snapshots and real-time streams.
Columnar format choices matter here too: whether you pick Iceberg, Delta Lake, or Hudi affects your ability to handle late-arriving data and schema changes efficiently.
Tooling Reality Check: Sub-Second vs. Polling
The CDC landscape reveals a latency spectrum that vendors often obscure. Salesforce’s ecosystem illustrates this clearly: traditional polling-based integrations introduce 15 to 60 seconds of staleness, while Change Data Capture pushes events with sub-second latency.
Platforms like Ampersand leverage Salesforce CDC for sub-second webhooks, critical for AI agents and voice products where polling latency breaks the user experience. In contrast, polling-based platforms (Nango at 15 to 30 second intervals, Paragon and Workato with queue-based polling) introduce architectural latency that compounds under load: what engineers call the “noisy neighbor” problem, where one customer’s large sync backs up the queue for everyone else.
This distinction matters for enterprise data platform consolidation trends because unified APIs often normalize data to common schemas, losing access to CDC-specific features. When your enterprise customers require custom object support and real-time sync, deep integration platforms handle the complexity natively while unified APIs force passthrough workarounds.
Migration Path: From Batch to Stream
Moving from hourly batches to event-driven pipelines requires incremental migration, not big-bang rewrites. Start by dual-writing: run your CDC pipeline alongside existing batch jobs, validating data consistency before cutting over.
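Consistency validation during the dual-write phase doesn’t need to be elaborate. Comparing row counts plus an order-independent checksum per table catches most divergence; a sketch, assuming both paths can be read back as dictionaries of rows:

```python
import hashlib
import json

def table_fingerprint(rows):
    """Order-independent (count, checksum) pair for comparing the batch
    pipeline's output against the CDC pipeline's output for one table."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in rows
    )
    combined = hashlib.sha256("".join(digests).encode()).hexdigest()
    return len(rows), combined

# Same rows in a different order fingerprint identically, so the two
# pipelines can be compared without enforcing a sort on either side.
batch_rows = [{"id": 1, "plan": "pro"}, {"id": 2, "plan": "free"}]
cdc_rows = [{"id": 2, "plan": "free"}, {"id": 1, "plan": "pro"}]
assert table_fingerprint(batch_rows) == table_fingerprint(cdc_rows)
```

Run the comparison per partition rather than per table as volumes grow, so a mismatch points at a narrow range to investigate rather than the whole dataset.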
Monitor For These Issues:
- Schema drift between source and target
- Duplicate event processing (idempotency failures)
- Small file accumulation in your storage layer
- Consumer lag in your streaming platform
Pinterest’s approach of separating CDC tables (append-only, high-velocity) from base tables (compacted, query-optimized) provides a migration template. Historical data loads through bootstrap pipelines, while ongoing changes stream through the CDC path. Maintenance jobs handle compaction and snapshot expiration, preventing storage costs from exploding as your event log grows.
The 24-hour to 15-minute improvement isn’t magic; it’s the result of processing only what changed, storing it efficiently, and querying it intelligently. For data teams drowning in batch pipeline debt, CDC offers a path out. Just ensure you’re solving a latency problem, not creating an operational one.