The 3 AM Wake-Up Call
It was November 25th, 2023. The engineering team at a major e-commerce platform was about to experience their worst Black Friday ever, not because of high traffic, but because of a subtle Kubernetes misconfiguration that had been lurking in production for months.
At 3:47 AM EST, the first PagerDuty alert fired. By 4:15 AM, 73% of the checkout-service pods were down. By sunrise, they were losing an estimated $100,000 in revenue per minute.
This is the story of how a simple memory leak, combined with missing resource limits, created a cascading failure that brought down an entire Kubernetes cluster, and of the critical lessons learned.
The Setup: What Could Go Wrong?
The Stack:
- Kubernetes 1.24 running on AWS EKS
- 50+ microservices in production
- Peak traffic: 50,000 requests/second
- Node count: 120 c5.4xlarge instances
The Configuration (The Fatal Flaw):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 20
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: checkout:v2.3.1
        # HORROR: No resource limits defined!
        # resources:
        #   limits:
        #     memory: "2Gi"
        #     cpu: "1000m"
        #   requests:
        #     memory: "1Gi"
        #     cpu: "500m"
The Horror Unfolds: Timeline of Disaster
Hour 1 (00:00 – 01:00): The Silent Buildup
The checkout service had a memory leak in a newly deployed caching layer. Without memory limits, pods slowly consumed more and more RAM:
- 00:00: Normal operation, ~800MB per pod
- 00:30: Memory creep to 1.2GB per pod
- 01:00: 2GB per pod, still no alarms
Hour 2 (01:00 – 02:00): Node Pressure Begins
kubectl top nodes
NAME                        CPU   MEMORY
ip-10-0-1-45.ec2.internal   45%   89%     # Getting close!
ip-10-0-1-67.ec2.internal   38%   91%     # Danger zone
ip-10-0-1-89.ec2.internal   52%   94%     # Critical!
Nodes began reporting memory pressure, but because no pod declared requests or limits, the kubelet had no usage-versus-request signal to rank candidates and couldn't make intelligent eviction decisions.
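Eviction order is driven by QoS class, and pods with no requests or limits at all land in BestEffort, the class that gives the kubelet the least to work with. A sketch of the same container with identical requests and limits, which would have placed it in the Guaranteed class (sizes are illustrative):

```yaml
# Sketch: identical requests and limits yield Guaranteed QoS, so the
# kubelet can protect this workload and evict looser pods first
containers:
- name: checkout
  image: checkout:v2.3.1
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "1"
```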
Hour 3 (02:00 – 03:00): The Point of No Return
At 02:47 AM, the first node ran out of memory completely, and the kernel's OOM killer started terminating processes indiscriminately:
[1234567.890] Out of memory: Killed process 12345 (checkout-service)
[1234568.123] Out of memory: Killed process 12346 (kubelet) # OH NO!
The horror: When kubelet died, the node became NotReady, triggering pod rescheduling to other already-stressed nodes.
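The kubelet itself can be shielded from this failure mode by reserving node memory for system daemons and setting a hard eviction threshold, so pods are evicted well before the kernel OOM killer ever reaches kubelet. A sketch of the relevant KubeletConfiguration fields (values are illustrative, not from the incident):

```yaml
# Illustrative KubeletConfiguration fragment: reserved memory keeps the
# OOM killer pointed at pods, never at kubelet or system daemons
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "1Gi"
  cpu: "500m"
kubeReserved:
  memory: "1Gi"
  cpu: "500m"
evictionHard:
  memory.available: "500Mi"
```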
Hour 4 (03:00 – 04:00): Cascading Failure
This is where it became a horror story:
- Pod Rescheduling Storm: 20 checkout pods tried to reschedule to remaining nodes
- Node Death Spiral: Each new pod increased memory pressure on healthy nodes
- Cluster API Overload: kube-apiserver was overwhelmed with 10,000+ events/second
- Monitoring Goes Dark: Prometheus pods were OOMKilled, losing visibility
- etcd Quorum Loss: Network congestion caused etcd to lose quorum
# The logs told the horror story
E1125 03:47:23.456789 12345 kubelet.go:1234] Failed to get node info
E1125 03:47:24.789012 12346 scheduler.go:567] Failed to schedule pod
E1125 03:47:25.123456 12347 apiserver.go:890] etcd cluster unavailable
E1125 03:47:26.456789 12348 controller.go:234] Failed to update deployment
The Root Cause: A Perfect Storm of Misconfigurations
Issue #1: No Resource Limits The biggest mistake: no memory or CPU limits meant pods could consume unlimited resources.
Issue #2: No LimitRange Policy No cluster-wide enforcement of resource limits:
# What SHOULD have been in place
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: production
spec:
  limits:
  - max:
      memory: "4Gi"
      cpu: "2"
    min:
      memory: "128Mi"
      cpu: "100m"
    default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "250m"
    type: Container
Issue #3: No Pod Disruption Budget No PDB meant the node drains during recovery could take down any number of checkout pods at once (PDBs only govern voluntary disruptions, but that covers every drain and upgrade):
# Missing PDB configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 15   # Keep at least 15 of the 20 replicas running during drains
  selector:
    matchLabels:
      app: checkout
Issue #4: No Horizontal Pod Autoscaler Safeguards An HPA existed, but its maxReplicas ceiling was set far above what the nodes could absorb, so it kept scaling the leaking service up during the escalation.
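A sketch of a bounded HPA for this service; the name, replica counts, and target utilization are illustrative assumptions, not values from the incident:

```yaml
# Hypothetical HPA sketch: maxReplicas is a hard ceiling sized to node
# capacity, so a leaking service cannot recruit the whole cluster
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 10
  maxReplicas: 30          # ceiling chosen from real node headroom
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```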
Issue #5: Inadequate Monitoring
- No alerts on node memory pressure
- No alerts on missing resource limits
- Prometheus running without resource guarantees
The Fix: Multi-Layered Defense Strategy
Immediate Actions (Day 1):
- Emergency Resource Limits Rollout
# Applied to all 50+ services. `kubectl set resources` updates every
# container in the pod template, so container names don't have to match.
for deployment in $(kubectl get deployments -n production -o name); do
  kubectl set resources "$deployment" -n production \
    --limits=memory=2Gi,cpu=1 \
    --requests=memory=1Gi,cpu=500m
done
- Memory Leak Fix
// The actual bug in the caching layer.
// BEFORE (leaky): entries were inserted on every checkout and never removed
var cacheMap = make(map[string]*Order) // grows without bound

// AFTER (fixed): a sync.Map swept by a janitor goroutine
var cache sync.Map

func startCacheJanitor() {
	go func() {
		ticker := time.NewTicker(5 * time.Minute)
		defer ticker.Stop()
		for range ticker.C {
			cache.Range(func(key, value any) bool {
				if time.Since(value.(*Order).Timestamp) > 10*time.Minute {
					cache.Delete(key)
				}
				return true
			})
		}
	}()
}
Long-term Hardening (Week 1-4):
- Admission Controllers
# OPA Gatekeeper policy (a constraint like this assumes a matching
# K8sRequiredResources ConstraintTemplate is already installed)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: must-have-resource-limits
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet"]
  parameters:
    limits:
    - memory
    - cpu
- Node Auto-scaling Safeguards
# Cluster Autoscaler configuration
--max-nodes-total=200
--balance-similar-node-groups
--skip-nodes-with-system-pods=false
--expendable-pods-priority-cutoff=-10
- Enhanced Monitoring
# Critical alerts added
- alert: PodMemoryUsageHigh
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
  for: 5m
- alert: NodeMemoryPressure
  expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
  for: 2m
# A container with no limit exports no limits series at all, so join
# against kube_pod_container_info instead of testing for zero
# (metric names follow kube-state-metrics v2)
- alert: PodMissingResourceLimits
  expr: |
    count by (namespace, pod, container) (kube_pod_container_info)
      unless
    count by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  for: 1m
The Financial Impact
Direct Costs:
- Lost revenue: ~$3.2M (32 minutes of downtime during peak)
- AWS over-provisioning post-incident: $45K/month increase
- Engineering overtime: $180K
- Customer compensation: $750K
Indirect Costs:
- Customer trust damage (18% increase in cart abandonment for 2 weeks)
- Competitor advantage during Black Friday
- Team morale and burnout
Total Estimated Impact: $5.2M+
Key Takeaways: How to Avoid This Horror Story
1. Always Set Resource Limits
# Non-negotiable for production
resources:
  limits:
    memory: "2Gi"
    cpu: "1000m"
  requests:
    memory: "1Gi"
    cpu: "500m"
2. Implement LimitRanges Enforce at the namespace level, not just deployment level.
3. Use Pod Disruption Budgets Protect critical services during disruptions.
4. Monitor Resource Usage Actively
- Pod-level memory/CPU tracking
- Node pressure conditions
- Missing resource limit detection
5. Load Test with Realistic Scenarios Include memory leak simulation in chaos engineering.
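The memory-stress pod from the Kubernetes docs (the polinux/stress image) is a convenient way to rehearse this exact scenario; the sizes and namespace below are illustrative:

```yaml
# Illustrative chaos pod: allocates ~1.5Gi against a 2Gi limit so you can
# watch memory pressure and OOMKill behavior in a test namespace
apiVersion: v1
kind: Pod
metadata:
  name: memory-leak-sim
  namespace: chaos-testing
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "1500M", "--vm-hang", "0"]
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"
```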
6. Implement Circuit Breakers Prevent cascading failures between services.
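If a service mesh is in play (Istio here is an assumption, not part of the original stack), circuit breaking can be declared rather than coded. A sketch of a DestinationRule that sheds load from failing checkout instances:

```yaml
# Hypothetical Istio DestinationRule: eject instances that keep failing
# instead of letting callers pile on and cascade the failure
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-circuit-breaker
spec:
  host: checkout-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```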
7. Practice Incident Response Regular game days for OOMKilled scenarios.
Conclusion: Horror Stories Are Teachers
This Black Friday disaster cost millions but taught invaluable lessons about Kubernetes resource management. The most insidious production failures often stem from seemingly minor omissions: a missing resource limit here, a skipped PDB there.
In the words of their CTO post-mortem: “We treated Kubernetes like magic. It’s not. It’s a powerful tool that requires discipline, monitoring, and respect for the fundamentals.”
Next in this series: “When cert-manager Forgot to Renew: The HTTPS Apocalypse” – How expired certificates brought down a multi-region Kubernetes cluster.