SQLite Is All You Need For Durable Workflows (And Why That Scares the Temporal Crowd)

Why the Hacker News crowd is ditching Temporal and Kafka for SQLite-backed durable execution, and what it means for your AI agent architecture.

Figure: The SQLite + Litestream pattern. Each workflow gets its own SQLite database file, which Litestream streams asynchronously to S3-compatible object storage.

A curious thing happened on Hacker News this week. A post titled “SQLite is all you need for durable workflows” racked up 352 points in hours, sparking a debate that has the distributed systems community sharply divided. The gist? You don’t need Temporal, Kafka, or even Postgres to build reliable, durable workflows. A local SQLite database with asynchronous backup might be all you need.

This isn’t just lightweight architecture nostalgia. It’s a direct challenge to a decade of conventional wisdom that said resilient systems require heavyweight orchestration layers. And for AI agent workloads, those bursty, experimental, state-hungry processes that are eating the world, the SQLite approach might actually be the smarter default.

The Argument That’s Making Engineers Uncomfortable

The core insight is deceptively simple: durable execution is about preserving workflow state, not about complex infrastructure. The compute can be ephemeral, disposable, and cheap. The state just needs to survive failures.

SQLite gives you transactional durability without a separate database service, without network hops, without connection pools, and without the operational overhead of managing a Postgres cluster. Every transaction is ACID-compliant by default, and with the default synchronous=FULL setting each commit is fsync’d to disk before it returns. There’s no network partition to worry about because the database lives in the same process as your application.
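To make the durability claim concrete, here is a minimal sketch of checkpointing one step with Python’s standard sqlite3 module. The `steps` table and the function names are assumptions made for illustration, not part of any particular workflow engine.

```python
import sqlite3


def open_workflow_db(path: str) -> sqlite3.Connection:
    """Open (or create) a workflow database with durability-oriented settings."""
    conn = sqlite3.connect(path)
    # WAL mode keeps a write-ahead log, which is what Litestream replicates.
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL asks SQLite to sync to disk on every commit.
    conn.execute("PRAGMA synchronous=FULL")
    # Hypothetical schema: one row per completed step.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        " name TEXT PRIMARY KEY,"
        " result TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def record_step(conn: sqlite3.Connection, name: str, result: str) -> None:
    """Persist a step result atomically: it is either fully saved or not at all."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO steps (name, result) VALUES (?, ?)",
            (name, result),
        )
```

Once `record_step` returns, the result has been committed locally, so a process crash after that point does not lose the step.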

The emerging pattern looks like this:

  1. Each AI agent or workflow runs in its own micro-VM or container.
  2. Each instance gets its own SQLite database file.
  3. Litestream asynchronously streams every change to S3-compatible object storage.
  4. On failure, spin up a new instance, download the latest database snapshot, and resume.

That’s it. No Kafka brokers. No Temporal server. No Postgres replication slots. No orchestrator to babysit.
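The resume step in the pattern above can be sketched in a few lines. This is an illustrative loop, not a real engine: the `steps` table, the `run_workflow` name, and the convention that each step is a function receiving earlier results are all assumptions.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

# A step is a name plus a function that receives the results of earlier steps.
Step = Tuple[str, Callable[[Dict[str, str]], str]]


def run_workflow(db_path: str, steps: List[Step]) -> Dict[str, str]:
    """Run steps in order, skipping any that a previous run already completed."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            " name TEXT PRIMARY KEY,"
            " result TEXT NOT NULL)"
        )
        conn.commit()
        # Load whatever was checkpointed before the last failure.
        done: Dict[str, str] = dict(conn.execute("SELECT name, result FROM steps"))
        for name, fn in steps:
            if name in done:
                continue  # already completed in an earlier run
            result = fn(done)
            with conn:  # checkpoint the result before moving on
                conn.execute(
                    "INSERT INTO steps (name, result) VALUES (?, ?)",
                    (name, result),
                )
            done[name] = result
        return done
    finally:
        conn.close()
```

Calling `run_workflow` again with the same database file after a crash re-runs only the steps that never committed, which is why step functions should be safe to retry.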

The Litestream Escape Hatch

The missing piece that made this viable is Litestream. It addresses SQLite’s most glaring limitation, single-machine failure, by continuously streaming write-ahead log (WAL) changes to S3-compatible storage. The replication is asynchronous, which means there’s a window where recent writes could be lost. But for a lot of AI workloads, that’s an acceptable trade-off.

Here’s what a minimal setup looks like:

# Install Litestream
curl -fsSL https://litestream.io/install.sh | bash

# Configure continuous replication to S3.
# {instance-id} is a placeholder: substitute this workflow instance's unique ID.
cat > /etc/litestream.yml << 'EOF'
dbs:
  - path: /data/workflow.db
    replicas:
      - url: s3://my-bucket/workflows/{instance-id}
        retention: 72h
EOF

# Run your workflow engine with Litestream
litestream replicate -exec "./workflow-engine"

On failure recovery, the startup sequence is equally minimal:

# Restore the latest snapshot
litestream restore -o /data/workflow.db s3://my-bucket/workflows/{instance-id}

# Resume execution
litestream replicate -exec "./workflow-engine"
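A startup wrapper might decide whether the restore is needed at all. The sketch below only builds the two commands shown above and restores when the local file is missing. The `startup_commands` helper is hypothetical, and the restore-only-if-missing policy is an assumption about what a wrapper would want, not Litestream behavior.

```python
import os
from typing import List


def startup_commands(db_path: str, replica_url: str, engine_cmd: str) -> List[List[str]]:
    """Return the commands to run at boot, in order."""
    commands: List[List[str]] = []
    if not os.path.exists(db_path):
        # Fresh instance: pull the latest replicated state first.
        commands.append(["litestream", "restore", "-o", db_path, replica_url])
    # Always run the engine under Litestream so new writes keep replicating.
    commands.append(["litestream", "replicate", "-exec", engine_cmd])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, so a failed restore stops the boot instead of starting the engine on an empty database.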

The HN crowd loves this because it eliminates an entire category of operational complexity. You don’t need a dedicated database administrator for SQLite. You don’t need to tune connection pools. You don’t need to worry about replication lag between a central database and your compute nodes.

Where This Breaks Down (Be Honest)

Let’s not pretend this is a universal solution. The SQLite approach has sharp edges that will cut you if you’re not careful.

High availability is not native. SQLite doesn’t support clustering, leader election, or multi-node reads. If your instance dies and Litestream hasn’t replicated the latest writes, you lose data. That’s fine for AI experimentation. It’s not fine for payment processing or medical records.

Concurrent writes from multiple processes are a no-go. SQLite allows only one writer per database file at a time; even in WAL mode, which lets readers proceed alongside a writer, a second writer has to wait or fail with a busy error. If you need multiple workers to write to the same workflow state, you’re better off with Postgres or a purpose-built engine. Hatchet’s “Durable Execution the Hard Way” tutorial uses Postgres specifically because it handles concurrent writers natively.
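The single-writer limit is easy to see for yourself. In this small sketch, two connections to the same file both try to start a write transaction. The `state` table and the function name are made up for the demo, and `timeout=0` is set so the second writer fails immediately instead of waiting.

```python
import sqlite3


def second_writer_blocked(db_path: str) -> bool:
    """Return True if a second connection cannot start a write while the first holds one."""
    # isolation_level=None gives manual transaction control; timeout=0 disables waiting.
    first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS state (k TEXT PRIMARY KEY, v TEXT)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        first.execute("INSERT OR REPLACE INTO state VALUES ('owner', 'first')")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer asks for the same lock
        except sqlite3.OperationalError:
            return True  # SQLite reported the database as locked
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()
```

With a non-zero timeout the second writer would wait for the lock instead of failing, which is workable for occasional contention but not for many workers hammering one file.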

The 1 MiB barrier is real. That is the cap on how much state a single step can persist, and Cloudflare’s Workflows V2 enforces it explicitly, as documented in their control plane scaling analysis. If your workflow steps produce larger payloads, SQLite isn’t the problem; your architecture is. Store large data externally (R2, S3, D1) and pass references between steps.
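Passing references instead of payloads can look like the sketch below. A local directory stands in for object storage, and the `outputs` table, the helper names, and the content-hash key scheme are all assumptions made for illustration.

```python
import hashlib
import os
import sqlite3

MAX_INLINE_BYTES = 1024 * 1024  # 1 MiB threshold for keeping data in SQLite


def init_outputs(conn: sqlite3.Connection) -> None:
    """Create the hypothetical table holding either inline data or a reference."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outputs ("
        " step TEXT PRIMARY KEY, inline BLOB, ref TEXT)"
    )
    conn.commit()


def save_step_output(conn: sqlite3.Connection, blob_dir: str, step: str, payload: bytes) -> str:
    """Store small payloads inline; write large ones externally and keep only a key."""
    if len(payload) <= MAX_INLINE_BYTES:
        inline, ref = payload, None
    else:
        ref = hashlib.sha256(payload).hexdigest()  # content hash used as the object key
        with open(os.path.join(blob_dir, ref), "wb") as f:  # stand-in for an S3/R2 upload
            f.write(payload)
        inline = None
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO outputs (step, inline, ref) VALUES (?, ?, ?)",
            (step, inline, ref),
        )
    return "inline" if ref is None else "ref"


def load_step_output(conn: sqlite3.Connection, blob_dir: str, step: str) -> bytes:
    """Fetch a step output, following the reference when the data lives externally."""
    inline, ref = conn.execute(
        "SELECT inline, ref FROM outputs WHERE step = ?", (step,)
    ).fetchone()
    if ref is None:
        return bytes(inline)
    with open(os.path.join(blob_dir, ref), "rb") as f:
        return f.read()
```

The database stays small and quick to replicate because only the key flows through SQLite, while the bulk data goes straight to storage built for it.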

Asynchronous replication is a gamble. Litestream streams changes asynchronously. If your machine dies between a write and its replication, that write is gone. For many AI agent workflows, where the cost of re-running a step is cheaper than maintaining a synchronous replication setup, this is acceptable. For others, it’s a hard no.

Why AI Agents Fit This Model Perfectly

The emerging consensus is that SQLite-backed durable execution is especially suited for AI agent workloads. Here’s why:

AI agents are inherently bursty and experimental. They spin up, make API calls, generate tokens, maybe fail, and rerun. The traditional approach of routing all state through a centralized Kafka or Temporal cluster adds latency and operational overhead that’s completely at odds with the fast-iteration, high-failure-rate nature of agent development.

Fault isolation matters. Each agent gets its own SQLite database. That means one agent’s corruption, data explosion, or infinite loop doesn’t cascade to others. In a centralized model, a runaway agent can saturate the event log, degrade the control plane, or bring down unrelated workflows. With SQLite, each agent’s state is encapsulated: the worst case is that one database file gets bloated and you delete it.

The debugging story is incredible. Want to replay an agent’s thinking process? Download its SQLite database file, open it in any SQLite viewer, and inspect every step, every decision, every API call. The entire execution history is in a single file. Compare that to digging through Kafka logs or Temporal event histories.
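Inspecting a downloaded database takes only a few lines. This sketch assumes the agent recorded its work in a `steps` table with `name` and `result` columns; that schema is hypothetical, so adapt the query to whatever your engine actually writes.

```python
import sqlite3
from pathlib import Path
from typing import List, Tuple


def dump_history(db_path: str) -> List[Tuple[str, str]]:
    """Return every recorded step in insertion order, without modifying the file."""
    # Open read-only through a file: URI so inspection cannot alter the evidence.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(
            "SELECT name, result FROM steps ORDER BY rowid"
        ).fetchall()
    finally:
        conn.close()
```

Looping over `dump_history("agent.db")` and printing each pair gives a step-by-step replay of what the agent did and what it got back.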

The cost curve bends the right way. Running a 50-step workflow in Cloudflare Workflows costs approximately $0.0003 per execution. At 10,000 daily executions, that’s $90/month in orchestration overhead alone. SQLite eliminates these per-step costs entirely. The database is local. Checkpointing is free. The only cost is storage and compute.
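The arithmetic behind that monthly figure is worth spelling out. This quick check uses the numbers quoted above; the 30-day month is an assumption.

```python
from decimal import Decimal


def monthly_cost(cost_per_execution: Decimal, executions_per_day: int, days: int = 30) -> Decimal:
    """Orchestration cost per month: per-execution cost x daily volume x days."""
    return cost_per_execution * executions_per_day * days


# $0.0003 per 50-step execution at 10,000 executions per day.
monthly = monthly_cost(Decimal("0.0003"), 10_000)
print(f"${monthly:.2f} per month")
```

That works out to $3 per day, or $90 over a 30-day month, and it grows linearly with execution volume.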

How the Heavyweights Compare

For context, here’s where the major durable execution approaches sit on the simplicity-to-scale spectrum:

| Approach | Complexity | Scale Ceiling | Data Loss Window | Best For |
| --- | --- | --- | --- | --- |
| SQLite + Litestream | Very Low | Single machine | Seconds to minutes | AI agents, experimental workflows |
| Postgres (Hatchet approach) | Medium | Multi-node reads | Zero (synchronous) | Production workflows, shared state |
| Temporal / Cadence | High | Global (partitioned) | Zero | Enterprise orchestration |
| Kafka Streams | Very High | Global (partitioned) | Configurable | Event-sourced systems |

The SQLite approach isn’t trying to compete at the high end. It’s competing at the low end, where the complexity tax of Temporal or Kafka often outweighs the reliability benefits.

Cloudflare Workflows V2: The Best of Both Worlds?

Interestingly, Cloudflare’s Workflows V2 architecture demonstrates a hybrid approach that borrows heavily from the SQLite philosophy. Each workflow instance runs on a dedicated Durable Object backed by SQLite. The control plane was rearchitected to eliminate the centralized bottleneck that capped V1 at 4,500 concurrent instances.

The scaling improvements are dramatic:

| Metric | Workflows V1 | Workflows V2 | Improvement |
| --- | --- | --- | --- |
| Concurrent Instances | 4,500 per account | 50,000 per account | 11.1x increase |
| Creation Rate | 100 executions/sec | 300 executions/sec | 3x increase |
| Queue Depth | 1 million per workflow | 2 million per workflow | 2x increase |

This works because Cloudflare uses Durable Objects as “independent, hibernating schedulers” for each workflow instance. Each DO has its own SQLite-backed storage, its own execution context, and its own failure domain. The result is an architecture that scales horizontally without a centralized coordinator bottleneck.

The lesson is clear: SQLite at the edge is a powerful orchestration primitive when you distribute the state management instead of centralizing it.

The Uncomfortable Truth About Complexity Tax

The broader trend here is a recognition that complexity has a tax that most teams underestimate. Every time you add a Kafka topic, a Temporal worker, or a Postgres replica, you’re not just paying for infrastructure; you’re paying in cognitive load, debugging time, failure modes, and operational burnout.

The SQLite approach forces an honest question: Do you actually need a distributed system, or do you just need a reliable local database with good backup?

For AI agent workloads, the answer is increasingly the latter. Agents are inherently experimental. They fail often, they iterate fast, and they benefit from having state that’s co-located with compute. The argument for SQL over specialized vector databases follows a similar logic: sometimes the simplest tool is the right one, and the specialized alternative is over-engineering.

Where We Go From Here

The SQLite-for-workflows trend is still early, but the signals are strong. Cloudflare’s V2 rearchitecture proves that SQLite-backed distributed state can scale to 50,000 concurrent instances. Litestream has made disaster recovery practical for single-file databases. And the AI agent boom is creating demand for lightweight, isolated, debuggable state management that existing orchestration tools weren’t designed for.

The next steps include:

  • Broader production testing of SQLite-backed workflow engines at scale
  • Tooling improvements for backup, restore, and disaster recovery
  • Integration with AI agent frameworks that can natively leverage per-instance SQLite databases
  • Standardization around patterns for reference-based data management (large payloads go to object storage, IDs flow through SQLite)

This doesn’t mean Temporal or Kafka are dead. They solve real problems for high-availability, multi-tenant, globally distributed systems. But for the vast majority of AI agent use cases, where state is per-instance, failure is expected, and debugging speed is paramount, SQLite might just be the smarter default.

The question isn’t whether SQLite can compete with Temporal. It’s whether you needed Temporal in the first place.
