Kubernetes Orchestration Security

Kubernetes Horror Story: The $100K-a-Minute Memory Leak: When OOMKilled Took Down Black Friday

The 3 AM Wake-Up Call

It was November 25th, 2023. The engineering team at a major e-commerce platform was about to experience their worst Black Friday ever: not because of high traffic, but because of a subtle Kubernetes misconfiguration that had been lurking in production for months.

At 3:47 AM EST, the first PagerDuty alert fired. By 4:15 AM, 73% of their checkout services were down. By sunrise, they were losing an estimated $100,000 in revenue per minute.

This is the story of how a simple memory leak combined with missing resource limits created a cascading failure that brought down an entire Kubernetes cluster, and of the critical lessons learned.

The Setup: What Could Go Wrong?

The Stack:

  • Kubernetes 1.24 running on AWS EKS
  • 50+ microservices in production
  • Peak traffic: 50,000 requests/second
  • Node count: 120 c5.4xlarge instances

The Configuration (The Fatal Flaw):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 20
  template:
    spec:
      containers:
      - name: checkout
        image: checkout:v2.3.1
        # HORROR: No resource limits defined!
        # resources:
        #   limits:
        #     memory: "2Gi"
        #     cpu: "1000m"
        #   requests:
        #     memory: "1Gi"
        #     cpu: "500m"

The Horror Unfolds: Timeline of Disaster

Hour 1 (00:00 – 01:00): The Silent Buildup

The checkout service had a memory leak in a newly deployed caching layer. Without memory limits, pods slowly consumed more and more RAM:

  • 00:00: Normal operation, ~800MB per pod
  • 00:30: Memory creep to 1.2GB per pod
  • 01:00: 2GB per pod, still no alarms
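
The creep above went unnoticed because nothing watched per-pod heap growth. A cheap early-warning signal is an in-process heap gauge; the sketch below is illustrative (the helper names `heapInUseMB` and `reportHeap` are ours, not from the incident's codebase) and logs the Go heap on a ticker, which, exported as a metric, would have surfaced the leak hours before node pressure:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// heapInUseMB returns the live heap size in megabytes.
func heapInUseMB() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse / 1024 / 1024
}

// reportHeap logs the heap every interval; wiring this value into a
// Prometheus gauge would have made the 800MB -> 2GB creep visible.
func reportHeap(interval time.Duration) {
	for {
		fmt.Printf("heap in use: %d MB\n", heapInUseMB())
		time.Sleep(interval)
	}
}

func main() {
	go reportHeap(30 * time.Second)
	select {} // stand-in for the real service's request loop
}
```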

Hour 2 (01:00 – 02:00): Node Pressure Begins

kubectl top nodes
NAME                          CPU   MEMORY
ip-10-0-1-45.ec2.internal    45%   89%    # Getting close!
ip-10-0-1-67.ec2.internal    38%   91%    # Danger zone
ip-10-0-1-89.ec2.internal    52%   94%    # Critical!

Kubernetes began reporting node memory pressure, but because the checkout pods declared no requests or limits, they all ran in the BestEffort QoS class, leaving the kubelet no meaningful signal for ranking eviction candidates. Memory climbed faster than graceful eviction could react.

Hour 3 (02:00 – 03:00): The Point of No Return

At 02:47 AM, the first node ran out of memory completely. The kernel OOM killer started terminating processes, picking victims by OOM score with no awareness of Kubernetes priorities:

[1234567.890] Out of memory: Killed process 12345 (checkout-service)
[1234568.123] Out of memory: Killed process 12346 (kubelet)  # OH NO!

The horror: When kubelet died, the node became NotReady, triggering pod rescheduling to other already-stressed nodes.

Hour 4 (03:00 – 04:00): Cascading Failure

This is where it became a horror story:

  1. Pod Rescheduling Storm: 20 checkout pods tried to reschedule to remaining nodes
  2. Node Death Spiral: Each new pod increased memory pressure on healthy nodes
  3. Cluster API Overload: kube-apiserver was overwhelmed with 10,000+ events/second
  4. Monitoring Goes Dark: Prometheus pods were OOMKilled, losing visibility
  5. etcd Split Brain: Network congestion caused etcd quorum loss

# The logs told the horror story
E1125 03:47:23.456789   12345 kubelet.go:1234] Failed to get node info
E1125 03:47:24.789012   12346 scheduler.go:567] Failed to schedule pod
E1125 03:47:25.123456   12347 apiserver.go:890] etcd cluster unavailable
E1125 03:47:26.456789   12348 controller.go:234] Failed to update deployment

The Root Cause: A Perfect Storm of Misconfigurations

Issue #1: No Resource Limits The biggest mistake: with no memory or CPU limits, pods could consume unbounded resources.

Issue #2: No LimitRange Policy No cluster-wide enforcement of resource limits:

# What SHOULD have been in place
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: production
spec:
  limits:
  - max:
      memory: "4Gi"
      cpu: "2"
    min:
      memory: "128Mi"
      cpu: "100m"
    default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "250m"
    type: Container

Issue #3: No Pod Disruption Budget No PDB meant chaos during node failures:

# Missing PDB configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 15  # Ensure 15 pods always available
  selector:
    matchLabels:
      app: checkout

Issue #4: No Horizontal Pod Autoscaler Safeguards An HPA existed, but its maxReplicas ceiling was set so high that it kept adding leaking replicas as latency climbed during the escalation.

Issue #5: Inadequate Monitoring

  • No alerts on node memory pressure
  • No alerts on missing resource limits
  • Prometheus running without resource guarantees

The Fix: Multi-Layered Defense Strategy

Immediate Actions (Day 1):

  1. Emergency Resource Limits Rollout

# Applied to all 50+ services. kubectl set resources updates every
# container in each deployment, so it works regardless of container names
# (a strategic-merge patch keyed on a container named "main" would only
# match containers that actually have that name).
kubectl set resources deployment --all -n production \
  --limits=memory=2Gi,cpu=1 \
  --requests=memory=1Gi,cpu=500m
  2. Memory Leak Fix

// The actual bug in the caching layer (Order is the service's order
// struct; it carries a Timestamp time.Time field)
// BEFORE (leaky): entries were added but never removed
var cacheMap = make(map[string]*Order) // grows without bound

// AFTER (fixed): concurrent map plus a background janitor goroutine
var cache = &sync.Map{}

func startCacheJanitor() {
	go func() {
		ticker := time.NewTicker(5 * time.Minute)
		defer ticker.Stop()
		for range ticker.C {
			cache.Range(func(key, value interface{}) bool {
				if time.Since(value.(*Order).Timestamp) > 10*time.Minute {
					cache.Delete(key)
				}
				return true
			})
		}
	}()
}

Long-term Hardening (Week 1-4):

  1. Admission Controllers

# OPA Gatekeeper policy (assumes a ConstraintTemplate defining the
# K8sRequiredResources kind is already installed in the cluster)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: must-have-resource-limits
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet"]
  parameters:
    limits:
    - memory
    - cpu
  2. Node Auto-scaling Safeguards

# Cluster Autoscaler configuration
--max-nodes-total=200
--balance-similar-node-groups
--skip-nodes-with-system-pods=false
--expendable-pods-priority-cutoff=-10
  3. Enhanced Monitoring

# Critical alerts added
- alert: PodMemoryUsageHigh
  # the "and" clause guards against limit-less containers, where the
  # limit metric is 0 and the division would otherwise always fire
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9 and container_spec_memory_limit_bytes > 0
  for: 5m

- alert: NodeMemoryPressure
  expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
  for: 2m

- alert: PodMissingResourceLimits
  # the limits series is absent (not zero) when no limit is set, so
  # flag containers that have no matching limits series instead
  expr: count by (namespace, pod, container) (kube_pod_container_info) unless count by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  for: 1m

The Financial Impact

Direct Costs:

  • Lost revenue: ~$3.2M (32 minutes of downtime during peak)
  • AWS over-provisioning post-incident: $45K/month increase
  • Engineering overtime: $180K
  • Customer compensation: $750K

Indirect Costs:

  • Customer trust damage (18% increase in cart abandonment for 2 weeks)
  • Competitor advantage during Black Friday
  • Team morale and burnout

Total Estimated Impact: $5.2M+

Key Takeaways: How to Avoid This Horror Story

1. Always Set Resource Limits

# Non-negotiable for production
resources:
  limits:
    memory: "2Gi"
    cpu: "1000m"
  requests:
    memory: "1Gi"
    cpu: "500m"

2. Implement LimitRanges Enforce at the namespace level, not just deployment level.

3. Use Pod Disruption Budgets Protect critical services during disruptions.

4. Monitor Resource Usage Actively

  • Pod-level memory/CPU tracking
  • Node pressure conditions
  • Missing resource limit detection

5. Load Test with Realistic Scenarios Include memory leak simulation in chaos engineering.
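
One cheap way to rehearse this failure mode is a tiny chaos pod that leaks memory at a controlled rate, deployed once with limits and once without to confirm the OOMKilled behavior. A minimal sketch (the `grab` helper and `chunkMB` rate are illustrative choices, not from the incident):

```go
package main

import (
	"fmt"
	"time"
)

// grab allocates and touches one chunk of nMB megabytes so the
// memory is actually committed, not just reserved by the allocator.
func grab(nMB int) []byte {
	chunk := make([]byte, nMB*1024*1024)
	for i := range chunk {
		chunk[i] = byte(i)
	}
	return chunk
}

func main() {
	const chunkMB = 10
	var hoard [][]byte // keep references so the GC cannot reclaim them
	for i := 1; ; i++ {
		hoard = append(hoard, grab(chunkMB))
		fmt.Printf("leaked ~%d MB\n", i*chunkMB)
		time.Sleep(time.Second)
	}
}
```

With a 2Gi limit, this pod should be OOMKilled and restarted in isolation after a few minutes; without one, it reproduces the node-pressure spiral described above.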

6. Implement Circuit Breakers Prevent cascading failures between services.

7. Practice Incident Response Regular game days for OOMKilled scenarios.

Conclusion: Horror Stories Are Teachers

This Black Friday disaster cost millions but taught invaluable lessons about Kubernetes resource management. The most insidious production failures often stem from seemingly minor omissions: a missing resource limit here, a skipped PDB there.

In the words of their CTO post-mortem: “We treated Kubernetes like magic. It’s not. It’s a powerful tool that requires discipline, monitoring, and respect for the fundamentals.”

Next in this series: “When cert-manager Forgot to Renew: The HTTPS Apocalypse” – How expired certificates brought down a multi-region Kubernetes cluster.
