Practice Architecture Under Fire: The Case for Weekly Production Incident Challenges
A recent proposal floating through architecture forums highlighted a blind spot in how we train engineers. While we have endless ways to practice designing systems (RFCs, review boards, design patterns), we have almost no structured way to practice reasoning through catastrophic failures. The suggestion was deceptively simple: weekly challenges based on messy, real-world production incidents, where the goal isn’t elegant design but forensic diagnosis under pressure.
The response revealed a deeper anxiety about our industry’s preparedness. One engineer recounted watching their company fire its support provider and shove on-call duties onto the dev team without any training. Costly architectural decisions had already set the stage for 2 AM production incidents, and predictably, when systems collapsed on a weekend, the untrained on-call developer couldn’t fix them. The disaster was entirely preventable, but only if someone had rehearsed the failure modes first.
Why Chaos Engineering Isn’t Enough
Wait, you might say, we already do chaos engineering. We break things in production on purpose.
That’s not what I’m talking about. Traditional chaos engineering is about validating hypotheses. You define steady state, inject a failure, and confirm the system self-heals. It’s a validation tool, not a training methodology.
```python
# Hypothetical helpers: get_error_rate() etc. wrap your metrics client.
steady_state = {
    "error_rate":   lambda: get_error_rate() < 0.01,     # < 1% errors
    "latency_p99":  lambda: get_p99_latency() < 500,     # < 500 ms
    "throughput":   lambda: get_rps() > 1000,            # > 1000 RPS
    "availability": lambda: get_availability() > 0.999,  # 99.9%
}
```
But knowing that your circuit breaker works doesn’t teach you how to diagnose why it tripped in the first place. It doesn’t teach you to read the tea leaves of a slow memory leak, or to spot the architectural shortcuts that make production fragile before they cascade into outages.
The missing piece is deliberate practice in diagnostic reasoning: the specific skill of looking at symptoms (latency spikes, error rates, log anomalies) and working backwards through the architecture to find the root cause while the pager is screaming.
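That backwards-from-symptoms reasoning can be made concrete. Here is a minimal sketch of a triage helper: the `Symptom` fields and the `RULES` table are hypothetical placeholders (a real team would seed the table from its own post-mortems), but the shape of the exercise is the point — rank what deviates most from baseline, then walk an ordered list of candidate causes.

```python
from dataclasses import dataclass

@dataclass
class Symptom:
    metric: str      # e.g. "latency_p99"
    observed: float  # current value
    baseline: float  # normal value

# Hypothetical rule table: symptom -> candidate causes, most common first.
RULES = {
    "latency_p99": ["connection pool exhaustion", "downstream timeout", "GC pressure"],
    "error_rate":  ["bad deploy", "dependency outage", "config drift"],
}

def rank_hypotheses(symptoms: list[Symptom]) -> list[str]:
    """Order candidate causes by how far each symptom deviates from baseline."""
    hypotheses: list[str] = []
    # Biggest relative deviation first -- that's the symptom to chase.
    for s in sorted(symptoms, key=lambda s: s.observed / max(s.baseline, 1e-9), reverse=True):
        for cause in RULES.get(s.metric, []):
            if cause not in hypotheses:
                hypotheses.append(cause)
    return hypotheses
```

Run it against the payment-latency scenario below (50ms baseline, 4s observed) and the pool-exhaustion hypothesis surfaces first — exactly the check an experienced responder makes before touching the CDN dashboard.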

The Weekly Incident Challenge Framework
Here’s how this actually works. Instead of theoretical design reviews, you run weekly simulations using real incident post-mortems (sanitized, of course) from your own system or public databases.
The Setup:
You present the team with a scenario: “Payment service latency has spiked from 50ms to 4 seconds. CPU is normal. Database connections are maxed out. The CDN is showing green across the board. You have 30 minutes to find the blast radius and propose a mitigation.”
The Rules:
- No looking at the actual architecture diagrams. You debug from observability data only: logs, metrics, traces.
- Time-boxed pressure. Real incidents don’t wait for inspiration.
- Rotate the “incident commander” every week to build communication skills under stress.
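The setup and rules above fit in a few lines of facilitator tooling. This is a sketch under assumptions — the scenario texts, field names, and `start_drill` helper are all hypothetical — but it shows the key design choice: the team only ever sees `prompt`; `hidden_cause` stays with the facilitator until the retro.

```python
import random
from dataclasses import dataclass

@dataclass
class IncidentScenario:
    prompt: str        # what the team is shown
    hidden_cause: str  # what was actually broken -- never announced
    time_box_min: int = 30

# Hypothetical scenario bank, seeded from sanitized post-mortems.
SCENARIOS = [
    IncidentScenario(
        prompt="Payment latency 50ms -> 4s. CPU normal, DB connections maxed, CDN green.",
        hidden_cause="connection pool sized for one service, shared by five",
    ),
    IncidentScenario(
        prompt="5xx rate at 12% on checkout. Deploys frozen, traffic flat.",
        hidden_cause="expired TLS cert on an internal dependency",
    ),
]

def start_drill(commander: str, rng: random.Random) -> IncidentScenario:
    """Pick a scenario at random and brief this week's incident commander."""
    scenario = rng.choice(SCENARIOS)
    print(f"{commander}: {scenario.time_box_min} min on the clock. {scenario.prompt}")
    return scenario
```

Rotating the `commander` argument each week is all the scheduling machinery you need to start.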
This mirrors the structured approach of chaos engineering but focuses on the human response rather than the system response. Where chaos engineering asks “will the system recover?”, incident challenges ask “can the human diagnose this before the SLA explodes?”
Building the Simulation Muscle
Starting this practice requires the same rigor you would apply to automated pre-deployment checks, but pointed at your incident response capabilities instead. You need to build the infrastructure for fake incidents before you can run them.
Begin with low-stakes scenarios:
- Service Failure: Kill a single pod and see how quickly the team identifies the missing replica
- Resource Exhaustion: Simulate memory pressure without actually crashing the node
- Dependency Timeout: Inject latency into a downstream service and watch for cascading retry storms
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency
spec:
  action: delay
  mode: one
  duration: "60s"
  selector:
    namespaces:
      - payments
  delay:
    latency: "500ms"
    correlation: "25%"
```
But here’s the critical difference from standard chaos engineering: don’t tell them what you broke. In a real incident, nobody announces “Attention: I have just terminated the database primary.” The skill is in discovering the failure mode, not watching a pre-announced experiment.
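Keeping the injection blind is easy to automate. A minimal sketch, assuming you keep a folder of pre-built manifests (the `EXPERIMENTS` names and paths here are invented): the facilitator's script picks one at random, records the choice privately, and the team hears nothing but the pager.

```python
import random

# Hypothetical pre-built chaos manifests, kubectl-applied by the facilitator.
EXPERIMENTS = {
    "network-latency": "manifests/network-latency.yaml",
    "pod-kill": "manifests/pod-kill.yaml",
    "memory-pressure": "manifests/memory-pressure.yaml",
}

def pick_blind_experiment(rng: random.Random, facilitator_log: list[str]) -> str:
    """Choose an experiment at random; record it only in the facilitator's log."""
    name = rng.choice(sorted(EXPERIMENTS))
    facilitator_log.append(name)   # the team is told nothing
    return EXPERIMENTS[name]       # manifest path to apply
```

During the retro, the facilitator's log is the answer key: compare it against the team's diagnosis timeline to score the drill.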
The Resilience Scorecard
| Metric | Target | Why It Matters |
|---|---|---|
| Mean Time to Diagnosis (MTTD) | < 10 min | Speed of understanding the failure mode |
| Blast Radius Assessment | < 5 min | Identifying which customers/services are hit |
| Recovery Option Generation | < 15 min | Having multiple mitigation paths ready |
| False Positive Rate | < 10% | Not every alert is the root cause |
Notice what’s missing: uptime percentage. These drills aren’t about keeping the system up, they’re about training the humans who have to fix it when it goes down.
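Scoring the table above only requires two timestamps per drill. A sketch, assuming you log `alerted_at` and `diagnosed_at` for each exercise (field names are mine, not a standard):

```python
from datetime import datetime, timedelta

def mean_time_to_diagnosis(drills: list[dict]) -> timedelta:
    """Average gap between the first alert and the correct root-cause call."""
    gaps = [d["diagnosed_at"] - d["alerted_at"] for d in drills]
    return sum(gaps, timedelta()) / len(gaps)

# Two drills: diagnosed in 8 minutes, then 12 minutes.
drills = [
    {"alerted_at": datetime(2024, 1, 8, 14, 0),  "diagnosed_at": datetime(2024, 1, 8, 14, 8)},
    {"alerted_at": datetime(2024, 1, 15, 14, 0), "diagnosed_at": datetime(2024, 1, 15, 14, 12)},
]
print(mean_time_to_diagnosis(drills))  # -> 0:10:00
```

The number itself matters less than the trend: plot MTTD week over week and expect it to drop as the team internalizes the failure modes.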
From Monoliths to Micro-Incidents
The shape of your backend architecture directly determines your incident patterns. A modular monolith fails differently than a distributed mesh of services, and your drills should reflect your actual architecture.
If you’re running Kubernetes, practice pod-level failures and node drains. If you’re on serverless containers (AWS Fargate), simulate task failures and cold-start latency spikes. If you’re managing Aurora clusters, rehearse reader node reboots and failover scenarios. The training must match the terrain.
```yaml
# Example: Fargate task failure simulation (Litmus pod-delete against
# pods running on Fargate)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: fargate-task-kill
spec:
  appinfo:
    appns: payments
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```
The Cultural Shift
Implementing this requires what chaos engineering practitioners call a “Chaos Committee”: a group that approves experiments, sets safety guardrails, and reviews results. But for incident challenges, the committee’s job is different: they design the scenarios, evaluate the response quality, and ensure psychological safety.
The goal isn’t blame. When a team misses the diagnosis in a drill, that’s a learning opportunity, not a performance review. The only failure is refusing to participate or ignoring the gaps these exercises reveal.
Start Ugly, Start Now
You don’t need a fancy platform. Start with a simple script that injects latency into a test endpoint, then gather the team in a war room (physical or virtual) and run through the diagnostic paces. Record everything. Review the recording. Ask: “Why did you check the CDN first? What metric would have told you it was the database connection pool?”
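That "simple script" really can be this simple. Here is a minimal sketch of a latency-injecting proxy in the standard library — `UPSTREAM`, the port, and `DELAY_S` are assumptions; point it at a staging endpoint, never production:

```python
import http.server
import time
import urllib.request

UPSTREAM = "http://localhost:8000"  # hypothetical test endpoint (staging only)
DELAY_S = 0.5                       # the fault you are injecting

def delayed(fn, delay_s=DELAY_S):
    """Run fn after an artificial delay -- the entire 'chaos' in this drill."""
    time.sleep(delay_s)
    return fn()

class LatencyProxy(http.server.BaseHTTPRequestHandler):
    """Forwards GETs to UPSTREAM, adding DELAY_S of latency to every response."""
    def do_GET(self):
        body = delayed(lambda: urllib.request.urlopen(UPSTREAM + self.path).read())
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Point the team's dashboards at :8080 and watch who notices first.
    http.server.HTTPServer(("", 8080), LatencyProxy).serve_forever()
```

Swap the team's client config to hit the proxy, start the clock, and the drill is live — no chaos platform required.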
After six weeks, you’ll have a team that thinks in failure modes automatically. They’ll start designing systems that fail better because they’ve personally experienced how hard it is to fix fragile architectures while the revenue counter spins backward.
The alternative is learning these lessons during actual outages, with actual customers screaming, and actual money bleeding. That’s expensive tuition.
Run the drills. Break things on purpose. Train the muscle before you need it.

