Cloud-Native AI Kubernetes Security

Kubernetes Horror Story: Reddit’s 314-Minute Pi-Day Outage

Reddit’s 314-minute Pi-Day outage was caused by a Kubernetes 1.24 upgrade that broke Calico networking. Learn what went wrong and how to prevent it.

On March 14, 2023 (Pi Day), Reddit experienced one of its worst outages in history—314 minutes of complete downtime affecting 52 million active users. The mathematical coincidence (π ≈ 3.14) might seem humorous, but the engineering implications are sobering.

The culprit? A seemingly innocuous label rename in Kubernetes 1.24 combined with undocumented Calico Route Reflector configuration that had been set up years ago by engineers who had since left the company.

This is a story about:

  • The dangers of tribal knowledge
  • Why change logs matter (even when you think you read them)
  • The hidden coupling between infrastructure components
  • Why “works in test” doesn’t guarantee production success

Timeline: The Disaster Unfolds

T-0 Minutes: 19:00 UTC – The Upgrade Begins

# Standard Kubernetes upgrade procedure
kubeadm upgrade apply v1.24.0

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.24.0". Enjoy!

The Reddit infrastructure team initiated a routine upgrade from Kubernetes 1.23 to 1.24 on one of their most critical clusters. This wasn’t their first rodeo—they’d performed dozens of these upgrades before.

Confidence level: High
Testing status: Passed in staging
Expected impact: Zero downtime
Actual result: Catastrophic failure

T+40 Minutes: 19:40 UTC – The System Crashes

Forty minutes after the upgrade, the entire cluster went dark. Traffic graphs flatlined. The front page of the internet went offline.

# What engineers saw (or rather, didn't see)
$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.k8s.reddit.com: no such host

$ kubectl top pods
Error from server: Metrics API not available

# Grafana dashboards: All metrics showing "No Data"
# PagerDuty: 🚨 CRITICAL - Reddit Production Cluster Unresponsive

The Visibility Problem: The monitoring system suffered the same naming dependency. It failed gracefully—continuing to report “healthy” while observing nothing. False negatives at scale.

T+60 Minutes: 20:00 UTC – The Investigation Begins

Two teams split up:

  • Team Alpha: Investigating DNS issues
  • Team Bravo: Planning the rollback procedure

When trying to troubleshoot the problem, engineers couldn’t see any metrics or logs. The cluster was not reporting anything.

# Attempting to diagnose
$ kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
coredns-64897985d-4fkxm                   0/1     Pending   0          45m
coredns-64897985d-9xltq                   0/1     Pending   0          45m
calico-node-abc123                        0/1     Creating  0          45m  # STUCK!
calico-node-def456                        0/1     Creating  0          45m  # STUCK!

# The smoking gun - Calico CNI was stuck in "Creating" state

T+90 Minutes: 20:30 UTC – The Root Cause Discovered

After checking the logs and the Calico networking stack, they found that the Calico Route Reflector (which is responsible for distributing routing information across nodes) required custom actions during the upgrade.

The Breaking Change:

# Kubernetes 1.24 changelog (explicitly documented!)
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.24.md

## Urgent Upgrade Notes

### (No, really, you MUST read this before you upgrade)

- **The `node-role.kubernetes.io/master` label is deprecated**
  - Replaced with: `node-role.kubernetes.io/control-plane`
  - Both labels exist during upgrade for compatibility
  - Post-upgrade: only `control-plane` label remains

Reddit’s Undocumented Calico Configuration:

# What Reddit had configured (years ago, by engineers who left)
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: route-reflector-config
spec:
  nodeSelector: node-role.kubernetes.io/master == 'true'  # ⚠️ DANGER!
  peerIP: 10.0.1.100
  asNumber: 64512

Why This Mattered:

For Calico to work, every node needs to periodically exchange routing information with every other node. Because they had so many nodes, they designated 3 servers to aggregate the routes and spread them to the others (those are the route reflectors). These route reflectors were designated by a special label on them: node-role.kubernetes.io/master
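To make the coupling concrete, here is a toy shell sketch (sample label data, not live kubectl output) of what the route-reflector selector saw before and after the rename:

```shell
# Hypothetical node labels for three control-plane nodes, per cluster version
labels_v123="node-role.kubernetes.io/master
node-role.kubernetes.io/master
node-role.kubernetes.io/master"
labels_v124="node-role.kubernetes.io/control-plane
node-role.kubernetes.io/control-plane
node-role.kubernetes.io/control-plane"

# Count nodes matching the selector Calico was configured with
count_masters() { printf '%s\n' "$1" | grep -c '^node-role.kubernetes.io/master$' || true; }

echo "route reflectors matched on 1.23: $(count_masters "$labels_v123")"
echo "route reflectors matched on 1.24: $(count_masters "$labels_v124")"
# On 1.24 the selector matches nothing: Calico loses all of its route reflectors.
```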

T+120 Minutes: 21:00 UTC – The Rollback Decision

With the root cause identified, Reddit faced a critical choice:

Option A: Fix Forward (2-4 hours, high uncertainty)

  • Manually reconfigure Calico Route Reflectors
  • Unknown complexity with current cluster state
  • Risk of making things worse

Option B: Rollback (1 hour estimated, untested in production)

  • Restore from backup
  • Follow documented procedure (last updated 3 years ago)
  • Hope it works

There was no supported downgrade procedure for Kubernetes, so the team decided to restore from a backup and state reload. However, the restore procedure was written several years ago and had not been updated with changes in the environment.

They chose Option B.

T+150 Minutes: 21:30 UTC – The Rollback Nightmare

# The rollback procedure was outdated
# Written for: Kubernetes 1.19 with Docker runtime
# Current cluster: Kubernetes 1.24 with CRI-O runtime

# Example of problems encountered:
Error: CRI-O config not found at expected path
Error: etcd snapshot restore failed - incompatible version
Error: Node TLS certificates expired during restore

The procedure had been written against a now end-of-life Kubernetes version and pre-dated the switch to CRI-O, so it had to be rewritten live to accommodate the changes.

T+180 Minutes: 22:00 UTC – Additional Issues Surface

Pods were taking too long to start and stop, and container images were taking a very long time to download. The Calico network interface had been stuck in the “creating” state since the upgrade, and simply restarting it (turning it off and on again) cleared that problem.

T+200 Minutes: 22:20 UTC – Nodes Won’t Connect

One by one they started to bring the nodes back while shutting down the autoscaler. But the nodes couldn’t connect with each other. The problem was quickly isolated to the TLS certificate issue.

# Certificate expiration during the chaos
$ kubectl get csr
NAME        AGE    SIGNERNAME                            REQUESTOR           CONDITION
node-csr-1  3h     kubernetes.io/kubelet-serving         system:node:node1   Pending
node-csr-2  3h     kubernetes.io/kubelet-serving         system:node:node2   Pending

# Manual certificate approval required
$ kubectl certificate approve node-csr-1
$ kubectl certificate approve node-csr-2

T+314 Minutes: 00:14 UTC (Next Day) – Reddit Returns

Reddit came back online at 00:14 UTC, exactly 314 minutes after the upgrade began.

Final Status:

  • ✅ All nodes restored
  • ✅ Calico networking operational
  • ✅ Traffic gradually restored to avoid thundering herd
  • ✅ 52 million users can browse cat memes again

The Technical Deep Dive: What Really Went Wrong

1. The Kubernetes 1.24 Breaking Change

Kubernetes introduced a well-documented change: the label was deprecated in version 1.20 and removed entirely in 1.24:

# Old label (deprecated in 1.20, removed in 1.24)
node-role.kubernetes.io/master: ""

# New label (required in 1.24+)
node-role.kubernetes.io/control-plane: ""

Why this change was made:

  • Terminology shift: “master/slave” → “control-plane/worker”
  • Improved clarity and inclusivity
  • Better alignment with actual architecture

The Changelog Entry:

The fundamental reason Reddit wasn’t able to prevent this outage is that they were evaluating the warnings in the CHANGELOG against a model of the system, not against the actual system.

2. Calico Route Reflectors: The Hidden Dependency

What are Route Reflectors?

In large Kubernetes clusters (1000+ nodes), having every node maintain a full mesh of BGP connections becomes impractical:

Full Mesh Connections = N * (N-1) / 2
1000 nodes = 499,500 connections ❌

With 3 Route Reflectors = 3,000 connections ✅
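These counts can be sanity-checked with a few lines of shell arithmetic (the reflector figure comes out slightly below 3,000 because the three reflectors peer with each other rather than with themselves):

```shell
nodes=1000
rr=3

# Full mesh: every pair of nodes holds a BGP session
full_mesh=$(( nodes * (nodes - 1) / 2 ))

# With reflectors: each ordinary node peers with every reflector,
# plus a small mesh among the reflectors themselves
with_rr=$(( (nodes - rr) * rr + rr * (rr - 1) / 2 ))

echo "full mesh sessions:      $full_mesh"
echo "with 3 route reflectors: $with_rr"
```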

Reddit’s Architecture:

graph TB
    subgraph "Control Plane Nodes (with master label)"
        RR1[Route Reflector 1]
        RR2[Route Reflector 2]
        RR3[Route Reflector 3]
    end
    
    subgraph "Worker Nodes (1000+)"
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker 3]
        WN[Worker N...]
    end
    
    W1 --> RR1
    W2 --> RR1
    W3 --> RR2
    WN --> RR3
    
    RR1 <--> RR2
    RR2 <--> RR3
    RR3 <--> RR1

The Configuration That Failed:

# Calico BGP Configuration (stored in etcd or as CRD)
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false  # Disable full mesh
  asNumber: 64512

---
# Route Reflector Node Configuration
# This relied on the deprecated label!
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  nodeSelector: node-role.kubernetes.io/master == 'true'  # 💀 DOOM
  peerSelector: route-reflector == 'true'

What Happened During Upgrade:

# Before upgrade (K8s 1.23)
$ kubectl get nodes -l node-role.kubernetes.io/master
NAME              STATUS   ROLES    AGE   VERSION
control-plane-1   Ready    master   365d  v1.23.0
control-plane-2   Ready    master   365d  v1.23.0
control-plane-3   Ready    master   365d  v1.23.0

# After upgrade (K8s 1.24)
$ kubectl get nodes -l node-role.kubernetes.io/master
No resources found  # 💥 ROUTE REFLECTORS GONE!

$ kubectl get nodes -l node-role.kubernetes.io/control-plane
NAME              STATUS   ROLES           AGE   VERSION
control-plane-1   Ready    control-plane   365d  v1.24.0
control-plane-2   Ready    control-plane   365d  v1.24.0
control-plane-3   Ready    control-plane   365d  v1.24.0

The Cascade Effect:

1. Kubernetes 1.24 upgrades → master label removed
2. Calico can't find route reflectors (no matching label)
3. Worker nodes can't get routing information
4. Pod-to-pod networking fails
5. DNS (CoreDNS pods) becomes unreachable
6. Service discovery fails
7. API server becomes partially unreachable
8. kubectl commands fail
9. Monitoring systems fail (same networking dependency)
10. Total cluster collapse

3. The Observability Paradox

The monitoring system suffered the same naming dependency. It failed gracefully—continuing to report “healthy” while observing nothing. False negatives at scale.

Why Monitoring Failed:

# Prometheus configuration (hypothetical but likely)
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_label_node_role_kubernetes_io_master]
        action: keep  # Only monitor master nodes

When the label changed, Prometheus silently stopped scraping metrics from control plane nodes, creating a false sense of normalcy.
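One guard against this failure mode is meta-monitoring: alert when a scrape job stops returning targets entirely, instead of trusting silence. A hedged sketch of such a Prometheus rule (the job name and duration are illustrative):

```yaml
groups:
- name: meta-monitoring
  rules:
  # Fires when the node scrape job produces no `up` series at all,
  # e.g. because relabeling silently dropped every target
  - alert: NodeScrapeJobEmpty
    expr: absent(up{job="kubernetes-nodes"})
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Scrape job kubernetes-nodes has zero targets"
      description: "Relabeling may be dropping all targets (e.g. after a node label rename)."
```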

4. The Knowledge Loss Problem

The people who set this up had since moved to other departments or companies, and they never documented it.






The Fix: How Reddit Recovered and Hardened

Immediate Actions (During Outage)

1. Emergency Calico Route Reflector Fix

# Quick fix during recovery
# Manually add control-plane label to route reflector nodes

kubectl label nodes control-plane-1 route-reflector=true
kubectl label nodes control-plane-2 route-reflector=true
kubectl label nodes control-plane-3 route-reflector=true

# Update Calico BGP configuration
cat <<EOF | kubectl apply -f -
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  nodeSelector: route-reflector == 'true'  # Use custom label, not K8s internal
  peerSelector: route-reflector == 'true'
EOF

2. Manual Certificate Renewal

# Renew kubelet certificates that expired during chaos
for csr in $(kubectl get csr -o name); do
  kubectl certificate approve "$csr"
done

3. Gradual Traffic Restoration

# Don't restore all 52M users at once!
# Gradually increase allowed connections

# Start with 10% traffic
kubectl scale deployment frontend --replicas=100  # from 1000

# Monitor for 5 minutes
# Increase to 25%
kubectl scale deployment frontend --replicas=250

# Continue until 100%
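The ramp above is simple percentage arithmetic against the normal replica count; a sketch (the 1000-replica baseline is taken from the example above):

```shell
total=1000   # normal frontend replica count

for pct in 10 25 50 100; do
  replicas=$(( total * pct / 100 ))
  echo "kubectl scale deployment frontend --replicas=$replicas   # ${pct}% traffic"
done
```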

Long-term Preventive Measures

1. Declarative Calico Configuration (Modern Approach)

The goal is to be able to configure Calico entirely from declarative configuration. Nothing should require bespoke configuration per-cluster.

# Modern Calico Route Reflector Configuration
# Uses Kubernetes API, fully declarative, version controlled

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false
  asNumber: 64512
  serviceClusterIPs:
    - cidr: 10.96.0.0/12  # K8s service CIDR
  
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-config
  namespace: kube-system
data:
  # Explicit route reflector configuration
  route_reflector_cluster_id: "10.0.0.1"
  
---
# Use node selectors that won't break with K8s upgrades
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: global-rr-config
spec:
  nodeSelector: topology.kubernetes.io/zone == 'us-east-1a'  # Stable selector
  peerSelector: calico-route-reflector == 'enabled'
  asNumber: 64512

2. Pre-Upgrade Cluster Validation

#!/bin/bash
# pre-upgrade-validator.sh
# Run this BEFORE any Kubernetes upgrade!

echo "🔍 Checking for deprecated API usage..."

# Check for deprecated labels
deprecated_labels=(
  "node-role.kubernetes.io/master"
  "beta.kubernetes.io/arch"
  "beta.kubernetes.io/os"
)

for label in "${deprecated_labels[@]}"; do
  echo "Checking for label: $label"
  matches=$(kubectl get nodes -l "$label" --no-headers 2>/dev/null | wc -l)
  if [ "$matches" -gt 0 ]; then
    echo "⚠️  WARNING: Found $matches nodes with deprecated label: $label"
    echo "   Resources depending on this label:"
    
    # Check BGPPeers
    kubectl get bgppeer -A -o yaml | grep -A2 "$label" || true
    
    # Check BGPConfiguration
    kubectl get bgpconfiguration -A -o yaml | grep -A2 "$label" || true
    
    # Check NetworkPolicies
    kubectl get networkpolicy -A -o yaml | grep -A2 "$label" || true
  fi
done

echo ""
echo "🔍 Checking Calico Route Reflector configuration..."

# Verify route reflectors exist and are accessible
RR_COUNT=$(calicoctl get node --output=yaml | grep -c "routeReflectorClusterID")
if [ "$RR_COUNT" -eq 0 ]; then
  echo "❌ ERROR: No route reflectors found!"
  exit 1
fi

echo "✅ Found $RR_COUNT route reflector(s)"

# Test BGP connectivity
echo "🔍 Testing BGP sessions..."
calicoctl node status | grep -E "Established|Ready" || {
  echo "❌ ERROR: BGP sessions not healthy!"
  exit 1
}

echo "✅ All pre-flight checks passed!"

3. Comprehensive Documentation

# Calico Route Reflector Runbook

## Architecture Overview
[Diagram of route reflector topology]

## Critical Dependencies
- Node label: `calico-route-reflector=enabled` (DO NOT use K8s internal labels!)
- BGP ASN: 64512
- Route Reflector Cluster ID: 10.0.0.1

## Upgrade Procedure
1. Before ANY Kubernetes upgrade:
   - Run `./scripts/pre-upgrade-validator.sh`
   - Verify route reflector health: `calicoctl node status`
   - Take snapshot of Calico config: `calicoctl get bgpconfig -o yaml > backup.yaml`

2. During upgrade:
   - Monitor BGP session count: should remain stable
   - Watch for route reflector pod restarts

3. After upgrade:
   - Verify BGP sessions re-established within 5 minutes
   - Check pod networking: `kubectl run -it --rm debug --image=nicolaka/netshoot -- ping 8.8.8.8`

## Troubleshooting
If route reflectors fail:
1. Check node labels: `kubectl get nodes --show-labels | grep calico`
2. Verify BGPPeer config: `calicoctl get bgppeer -o yaml`
3. Emergency fix: [link to procedure]

## Contacts
- Primary: SRE Team Lead (@john-doe)
- Secondary: Network Engineering (@jane-smith)
- Escalation: VP Engineering

Last Updated: 2024-03-15
Reviewed By: SRE Team

4. Improved Testing Strategy

# test-cluster-config.yaml
# TEST clusters should match PRODUCTION as closely as possible!

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-parity-config
data:
  # Ensure test cluster has SAME configuration as prod
  
  calico_version: "v3.25.0"  # Must match prod
  kubernetes_version: "v1.24.0"  # Target upgrade version
  cni_plugin: "calico"
  container_runtime: "cri-o"  # Must match prod (not Docker!)
  
  # Route reflector configuration
  route_reflector_count: "3"
  route_reflector_label: "calico-route-reflector=enabled"
  
  # Critical: Test should have route reflectors like prod!
  enable_route_reflectors: "true"

5. Automated Compliance Scanning

# OPA Gatekeeper Policy
# Prevent use of deprecated labels in any manifests

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8snodeprecatedlabels
spec:
  crd:
    spec:
      names:
        kind: K8sNodeDeprecatedLabels
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8snodeprecatedlabels
        
        violation[{"msg": msg}] {
          deprecated_labels := [
            "node-role.kubernetes.io/master",
            "beta.kubernetes.io/arch",
            "beta.kubernetes.io/os"
          ]
          
          input.review.object.spec.nodeSelector[label]
          label == deprecated_labels[_]
          
          msg := sprintf("Deprecated node label used: %v. Update to current label format.", [label])
        }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNodeDeprecatedLabels
metadata:
  name: no-deprecated-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
      - apiGroups: ["apps"]
        kinds: ["Deployment", "DaemonSet", "StatefulSet"]
      - apiGroups: ["projectcalico.org"]
        kinds: ["BGPPeer", "BGPConfiguration"]

6. Enhanced Monitoring

# Prometheus alerting rules
# Alert on route reflector health (metric names are illustrative)

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  calico-alerts.yaml: |
    groups:
    - name: calico_networking
      interval: 30s
      rules:
      
      # Alert if route reflectors are missing
      - alert: CalicoRouteReflectorMissing
        expr: count(calicoctl_bgp_peer_established{type="route_reflector"}) < 3
        for: 2m
        labels:
          severity: critical
          component: calico
        annotations:
          summary: "Insufficient Calico Route Reflectors"
          description: "Only {{ $value }} route reflectors found, expected 3"
      
      # Alert if BGP sessions are down
      - alert: CalicoBGPSessionDown
        expr: calicoctl_bgp_peer_established == 0
        for: 1m
        labels:
          severity: critical
          component: calico
        annotations:
          summary: "Calico BGP session down"
          description: "BGP session on node {{ $labels.node }} is not established"
      
      # Alert on deprecated label usage
      - alert: DeprecatedKubernetesLabels
        expr: count(kube_node_labels{label_node_role_kubernetes_io_master="true"}) > 0
        for: 5m
        labels:
          severity: warning
          component: kubernetes
        annotations:
          summary: "Deprecated Kubernetes labels in use"
          description: "Found nodes using deprecated master label - will break in K8s 1.24+"

Key Lessons Learned

1. Changelogs Are Not Enough

The fundamental reason Reddit wasn’t able to prevent this outage is that they were evaluating the warnings in the CHANGELOG against a model of the system, not against the actual system.

The Problem: Reddit engineers DID read the changelog. They saw the master → control-plane label change. But they didn’t know that Calico was using this label in production.

The Solution:

# Don't trust your mental model - query the actual system!
# Before any K8s upgrade, run these queries:

# Find all resources using the deprecated label
# Find all resources using the deprecated label
# (check both bare pods and workload pod templates)
kubectl get all -A -o json | jq -r '
  .items[] |
  select(
    .spec.nodeSelector."node-role.kubernetes.io/master" != null or
    .spec.template.spec.nodeSelector."node-role.kubernetes.io/master" != null
  ) |
  "\(.kind)/\(.metadata.name) in \(.metadata.namespace)"
'

# Check Calico BGP configuration
calicoctl get bgppeer -o yaml | grep -i "master"

# Check custom resources
kubectl get crd -o name | xargs -I {} kubectl get {} -A -o yaml | grep -i "master"
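The same checks should run against configuration stored in git, not only the live cluster; undocumented manifests are exactly what bit Reddit. A self-contained sketch (the repo path and manifest below are illustrative stand-ins):

```shell
# Stand-in for a checked-out infrastructure repo
mkdir -p /tmp/infra-repo
cat > /tmp/infra-repo/route-reflectors.yaml <<'EOF'
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  nodeSelector: node-role.kubernetes.io/master == 'true'
EOF

# Fail loudly if any manifest references the deprecated label
if grep -Rln "node-role.kubernetes.io/master" /tmp/infra-repo; then
  echo "DEPRECATED LABEL FOUND - fix before upgrading"
fi
```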

2. Test Environments Must Match Production

Reddit’s Situation:

  • ✅ Staging cluster tested successfully
  • ❌ Staging cluster didn’t have route reflectors configured
  • ❌ Staging cluster was too small to need route reflectors
  • 💥 Production cluster had route reflectors (undocumented)

Best Practice:

# Enforce production parity in CI/CD

- name: Validate Test Cluster Parity
  run: |
    # Compare test vs prod configurations
    diff \
      <(kubectl --context=prod get configmap cluster-config -o yaml) \
      <(kubectl --context=test get configmap cluster-config -o yaml)
    
    # Ensure test cluster has route reflectors if prod does
    PROD_RR=$(kubectl --context=prod get nodes -l calico-route-reflector=enabled --no-headers | wc -l)
    TEST_RR=$(kubectl --context=test get nodes -l calico-route-reflector=enabled --no-headers | wc -l)
    
    if [ "$PROD_RR" -gt 0 ] && [ "$TEST_RR" -eq 0 ]; then
      echo "ERROR: Prod has route reflectors but test doesn't!"
      exit 1
    fi

3. Document Everything (Especially Custom Configurations)

But the people who set that up had moved to other departments or companies and never documented it.

Create Living Documentation:

# Infrastructure Decision Record (IDR)

## IDR-042: Calico Route Reflector Configuration

### Status
Active | Date: 2019-03-15 | Last Reviewed: 2024-03-15

### Context
As our Kubernetes cluster grew beyond 500 nodes, full-mesh BGP became 
impractical (124,750 connections). We implemented route reflectors to 
reduce BGP session count to ~1,500.

### Decision
Deploy 3 route reflectors on control plane nodes using custom labels
(NOT Kubernetes internal labels which may change).

### Configuration
```yaml
# Exact configuration stored in git
# repo: infrastructure/calico-configs/route-reflectors.yaml
```

### Dependencies
- ⚠️ **CRITICAL**: Do NOT rely on `node-role.kubernetes.io/*` labels
- ✅ Use custom label: `calico-route-reflector=enabled`
- Must have minimum 3 route reflectors for quorum

### Upgrade Considerations
Before ANY Kubernetes upgrade:
1. Verify route reflector health
2. Check BGP session count
3. Ensure labels still exist

### Contacts
- Owner: Network SRE Team
- Reviewers: Platform Engineering, Infrastructure
- SME: @network-architect (original implementer)

### Related Incidents
- INC-2023-0314: Pi-Day outage (route reflectors failed)

4. Have a Tested Rollback Plan

Reddit’s Challenge: There was no supported downgrade procedure for Kubernetes, and the restore procedure was written several years ago.

Best Practice: GitOps + Backup Strategy

# backup-strategy.yaml

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-state-backup
  namespace: velero
spec:
  # Backup every 6 hours
  schedule: "0 */6 * * *"
  
  template:
    # Include everything needed for restore
    includedNamespaces:
    - '*'
    
    includedResources:
    - '*'
    
    # Backup Calico configurations
    labelSelector:
      matchLabels:
        backup: required
    
    # Include custom Calico CRDs
    includeClusterResources: true
    
    snapshotVolumes: true
    ttl: 720h  # 30 days retention

---
# Disaster recovery procedure (tested quarterly)

apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-runbook
data:
  recovery.sh: |
    #!/bin/bash
    # Last tested: 2024-Q1 DR drill
    
    # 1. Restore cluster state from Velero backup
    velero restore create \
      --from-backup cluster-state-latest \
      --restore-volumes=true
    
    # 2. Verify Calico route reflectors
    calicoctl node status | grep Established
    
    # 3. Test pod networking
    kubectl run test --image=nicolaka/netshoot --rm -it -- ping 8.8.8.8
    
    # 4. Gradually restore traffic (documented in runbook)

5. Automate Pre-Flight Checks

#!/bin/bash
# cluster-upgrade-preflight.sh
# Run this before EVERY upgrade!

set -euo pipefail

echo "🚀 Kubernetes Upgrade Pre-Flight Checklist"
echo "=========================================="

# Check 1: Changelog review
echo "✅ 1. Have you read the Kubernetes CHANGELOG?"
read -p "   Confirm (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
  echo "❌ STOP: Read the changelog first!"
  exit 1
fi

# Check 2: Deprecated API usage
echo "🔍 2. Checking for deprecated APIs..."
# NOTE: this only enumerates served API groups; pair it with a dedicated
# scanner such as pluto or kubent for actual deprecation detection
kubectl get --raw /apis | jq -r '.groups[].preferredVersion.groupVersion' | while read -r gv; do
  echo "   Serving API group/version: $gv"
done

# Check 3: Custom resources using deprecated labels
echo "🔍 3. Checking for deprecated label usage..."
DEPRECATED_LABELS=$(kubectl get all -A -o json | \
  jq -r '.items[] | select(
    .spec.nodeSelector."node-role.kubernetes.io/master" != null or
    .spec.template.spec.nodeSelector."node-role.kubernetes.io/master" != null
  ) | .metadata.name' | \
  wc -l)

if [ "$DEPRECATED_LABELS" -gt 0 ]; then
  echo "❌ CRITICAL: Found $DEPRECATED_LABELS resources using deprecated master label!"
  echo "   Resources:"
  kubectl get all -A -o json | \
    jq -r '.items[] | select(
      .spec.nodeSelector."node-role.kubernetes.io/master" != null or
      .spec.template.spec.nodeSelector."node-role.kubernetes.io/master" != null
    ) | "   - \(.kind)/\(.metadata.name)"'
  exit 1
fi

# Check 4: Calico health
echo "🔍 4. Verifying Calico route reflector health..."
RR_COUNT=$(kubectl get nodes -l calico-route-reflector=enabled --no-headers | wc -l)
if [ "$RR_COUNT" -lt 3 ]; then
  echo "⚠️  WARNING: Only $RR_COUNT route reflectors (expected 3+)"
fi

# Check 5: BGP session health
echo "🔍 5. Checking BGP session status..."
if ! calicoctl node status | grep -q "Established"; then
  echo "❌ CRITICAL: BGP sessions not healthy!"
  calicoctl node status
  exit 1
fi

# Check 6: Backup verification
echo "🔍 6. Verifying recent backup exists..."
LATEST_BACKUP=$(velero backup get --output json | jq -r '.items[0].metadata.creationTimestamp')
BACKUP_AGE=$(( ($(date +%s) - $(date -d "$LATEST_BACKUP" +%s)) / 3600 ))

if [ "$BACKUP_AGE" -gt 24 ]; then
  echo "⚠️  WARNING: Latest backup is $BACKUP_AGE hours old (>24h)"
fi

# Check 7: Rollback procedure tested
echo "✅ 7. Has rollback procedure been tested in last 90 days?"
read -p "   Confirm (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
  echo "⚠️  WARNING: Rollback procedure not recently tested!"
fi

echo ""
echo "=========================================="
echo "✅ Pre-flight checks complete!"
echo ""
echo "📋 Final Checklist:"
echo "   [ ] Change freeze lifted"
echo "   [ ] On-call team notified"
echo "   [ ] Incident response ready"
echo "   [ ] Communication plan in place"
echo "   [ ] Rollback window calculated"
echo ""
echo "Ready to proceed with upgrade? (yes/no)"
read final_confirm

if [ "$final_confirm" != "yes" ]; then
  echo "Upgrade aborted by operator"
  exit 1
fi

echo "🚀 Proceeding with upgrade..."

The Business Impact

Quantified Losses

Direct Impact:

  • Downtime: 314 minutes (5.23 hours)
  • Affected Users: 52 million active users
  • Lost Page Views: ~780 million (estimated)
  • Ad Revenue Loss: ~$500K – $1M (estimated based on traffic)
  • AWS Costs: Increased due to emergency resources

Indirect Impact:

  • User Trust: Temporary migration to competitors
  • Brand Damage: Negative press coverage
  • Engineering Costs: 100+ engineer-hours during incident
  • Opportunity Cost: Delayed feature releases
  • Market Position: Competitor advantage during outage window

Reputational Cost

Media Coverage:

  • Featured on Hacker News (3,500+ upvotes)
  • Multiple tech blog analyses
  • Social media memes (#RedditDown trending)
  • Used as case study in SRE courses

Preventing This in Your Cluster

Quick Action Checklist

🔴 CRITICAL (Do This Week):

# 1. Find deprecated label usage in your cluster
kubectl get all -A -o json | \
  jq -r '.items[] | select(
    .spec.nodeSelector."node-role.kubernetes.io/master" != null or
    .spec.template.spec.nodeSelector."node-role.kubernetes.io/master" != null or
    .spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[]?.matchExpressions[]?.key == "node-role.kubernetes.io/master"
  ) | "\(.kind)/\(.metadata.namespace)/\(.metadata.name)"'

# 2. Check Calico configuration
calicoctl get bgppeer -o yaml | grep -i "master"
calicoctl get bgpconfig -o yaml | grep -i "master"

# 3. Review custom CRDs
kubectl get crd -o name | xargs -I {} sh -c 'kubectl get {} -A -o yaml' | grep -i "node-role.kubernetes.io/master"

🟡 HIGH PRIORITY (Do This Month):

  1. Document all custom networking configurations
  2. Create infrastructure decision records (IDRs)
  3. Test your rollback procedure
  4. Implement automated pre-flight checks
  5. Set up proper Calico monitoring

🟢 IMPORTANT (Do This Quarter):

  1. Migrate to declarative Calico configuration
  2. Implement GitOps for all infrastructure
  3. Set up compliance scanning (OPA Gatekeeper)
  4. Run disaster recovery drills
  5. Create comprehensive runbooks

The Kubernetes Upgrade Checklist

# Pre-Upgrade (T-7 days)
- [ ] Read complete CHANGELOG for target version
- [ ] Review deprecation warnings
- [ ] Scan cluster for deprecated API usage
- [ ] Verify backup recent (<24h) and restorable
- [ ] Test upgrade in staging (identical to prod!)
- [ ] Document rollback procedure
- [ ] Calculate rollback window
- [ ] Schedule change with stakeholders
- [ ] Prepare incident response team

# Pre-Upgrade (T-1 hour)
- [ ] Run automated pre-flight checks
- [ ] Verify monitoring healthy
- [ ] Take final backup
- [ ] Enable verbose audit logging
- [ ] Put cluster in maintenance mode
- [ ] Notify all stakeholders

# During Upgrade (T+0)
- [ ] Monitor control plane health
- [ ] Watch for CNI issues (pods starting slow?)
- [ ] Check node status every 5 minutes
- [ ] Verify BGP sessions stable (if using Calico)
- [ ] Monitor pod creation/deletion times
- [ ] Watch for certificate issues

# Post-Upgrade (T+1 hour)
- [ ] Verify all nodes Ready
- [ ] Check pod networking (curl tests)
- [ ] Validate DNS resolution
- [ ] Test ingress/egress traffic
- [ ] Verify persistent volumes accessible
- [ ] Review metrics for anomalies
- [ ] Disable maintenance mode
- [ ] Gradually restore traffic

# Post-Upgrade (T+24 hours)
- [ ] Review incident logs (even if successful)
- [ ] Update documentation
- [ ] Share learnings with team
- [ ] Plan next upgrade

Additional Resources

Tools for Preventing Similar Outages

  1. Pluto – Find deprecated Kubernetes APIs

   # https://github.com/FairwindsOps/pluto
   pluto detect-helm -owide
   pluto detect-files -d . -owide

  2. kubent (Kube No Trouble) – Easy upgrade checks

   # https://github.com/doitintl/kube-no-trouble
   kubent

  3. Nova – Check for outdated Helm charts

   # https://github.com/FairwindsOps/nova
   nova find --wide

  4. Velero – Backup and restore

   # https://velero.io/
   velero backup create pre-upgrade-backup --include-cluster-resources=true

  5. OPA Gatekeeper – Policy enforcement

   # https://open-policy-agent.github.io/gatekeeper/
   kubectl apply -f policy-no-deprecated-labels.yaml


Conclusion: The Horror Story That Teaches

The failure wasn’t in any single component—it was in the spaces between components. The system failed at the interfaces, not the implementations.

Reddit’s Pi-Day outage is a masterclass in emergent complexity. Every individual decision was sound:

  • ✅ Kubernetes needed upgrading for security
  • ✅ The label rename improved clarity
  • ✅ Route reflectors were necessary at scale
  • ✅ The staging tests passed

Yet the combination created catastrophic failure.

The Core Truth: Modern distributed systems fail at the edges of our mental models. We can’t prevent all outages, but we can:

  1. Make our mental models explicit (documentation)
  2. Validate against reality (automated checks)
  3. Test our assumptions (DR drills)
  4. Learn from failures (postmortems)
  5. Share knowledge (like this blog post!)


Next in the Series

Coming Next Week: “How Spotify Accidentally Deleted All Its Kubernetes Clusters (With Zero User Impact)” – A story about amazing disaster recovery and the importance of multi-cluster architecture.
