Kubernetes has revolutionized container orchestration, but let’s be honest—debugging Kubernetes clusters can be a nightmare. I’ve spent countless hours at 2 AM staring at CrashLoopBackOff errors, tracking down mysterious network issues, and trying to understand why a perfectly working application suddenly refuses to start.
The Kubernetes ecosystem has matured significantly in 2025, and with it, the debugging tools have evolved from basic kubectl logs commands to AI-powered assistants that can diagnose issues faster than most humans. According to CNCF’s 2025 State of Cloud Native Development report, 77% of backend developers now use cloud native technologies, which means Kubernetes debugging skills are no longer optional—they’re a career requirement.
In this comprehensive guide, I’ll walk you through the top 10 Kubernetes debugging tools that have saved me (and the 17,000+ members of the Kubetools community) countless hours of troubleshooting. Whether you’re debugging a pod that won’t start, tracking down network connectivity issues, or optimizing cluster performance, these tools will become your best friends.
Why Kubernetes Debugging is Different (and Harder)
Before we dive into the tools, let’s understand why debugging Kubernetes is uniquely challenging:
The Complexity Factors
- Ephemeral Nature: Pods come and go, making it hard to debug crashed containers
- Distributed Systems: Your application spans multiple nodes, namespaces, and clusters
- Minimal Images: Best practices recommend distroless images with no debugging tools
- Too Much Data: Logs, events, and metrics flood in from every direction
- Dynamic State: Autoscaling, rolling updates, and self-healing create constant change
The Statistics Don’t Lie (well, they’re estimates—but telling ones):
- Average time to detect an issue in Kubernetes: 23 minutes
- Average time to resolve: 4 hours 12 minutes
- Cost of downtime: $5,600 per minute for enterprise applications (a widely cited industry estimate)
The right debugging tools can cut these times by 60-70%. Let’s explore them.
1. kubectl debug: The Game-Changer
Status: Stable (GA) since Kubernetes 1.25
What it solves: Debugging distroless containers and crashed pods
Why it’s #1: Changes everything about how we debug Kubernetes
The Problem It Solves
Imagine this scenario (probably familiar to you):
$ kubectl exec -it my-pod -- /bin/sh
error: Internal error occurred: error executing command in container:
failed to exec in container: failed to start exec "xxx":
OCI runtime exec failed: exec failed: container_linux.go:380:
starting container process caused: exec: "/bin/sh":
stat /bin/sh: no such file or directory: unknown
Your production pod is running a distroless image (as it should for security), but now you can’t debug it because there’s no shell, no curl, no debugging tools at all.
Enter kubectl debug and ephemeral containers.
What Are Ephemeral Containers?
Ephemeral containers are temporary debugging containers that can be attached to running pods. Think of them as “debug sidecars” that you inject on-demand without restarting your pod.
Key Characteristics:
- Run temporarily in existing pods
- Share namespaces (network, IPC, PID) with target containers
- Can’t be restarted or have ports exposed
- No resource guarantees
- Can’t be added at pod creation (API enforced)
Practical Examples
Example 1: Debug a Distroless Container
# Create a distroless nginx pod
kubectl run distroless-pod --image=gcr.io/distroless/base
# Try to exec into it (this will fail)
kubectl exec -it distroless-pod -- /bin/sh
# Error: /bin/sh not found
# Use kubectl debug to attach an ephemeral container
kubectl debug -it distroless-pod --image=busybox --target=distroless-pod
# Now you have a shell with debugging tools!
# The --target flag shares process namespace with the target container
What just happened?
- kubectl created an ephemeral container using the busybox image
- The ephemeral container shares the process namespace with your target
- You can now see and debug processes from the original container
Example 2: Debug a Crashed Container
# Container keeps crashing - you can't exec into it
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# crashloop-pod 0/1 CrashLoopBackOff 5 3m
# Create a debug copy with process namespace sharing
kubectl debug crashloop-pod -it \
--image=nicolaka/netshoot \
--share-processes \
--copy-to=debug-pod
The --copy-to flag creates a new pod copy specifically for debugging, so you don’t affect the original. This is perfect for production troubleshooting.
Example 3: Node-Level Debugging
Sometimes the issue isn’t with a pod—it’s with the node itself.
# Debug the node directly
kubectl debug node/worker-node-1 -it --image=ubuntu
# This creates a pod that:
# 1. Runs in the host's namespaces
# 2. Mounts the host filesystem at /host
# 3. Has full node access
# Now you can inspect node-level issues
chroot /host
systemctl status kubelet
journalctl -u kubelet
df -h
netstat -tulpn
Example 4: Advanced Namespace Sharing
# Share multiple namespaces for deep debugging
kubectl debug -it my-pod \
--image=nicolaka/netshoot \
--target=app-container \
--profile=general
# The --profile flag sets security contexts
# Available profiles: general, baseline, restricted, netadmin, sysadmin
Debug Profiles Explained
kubectl’s --profile flag (introduced in kubectl v1.27) provides debug profiles to make ephemeral containers safer:
- general: Reasonable defaults, most common use case
- baseline: Minimal privileges
- restricted: Most restrictive (for strict security environments)
- netadmin: Network debugging (tcpdump, etc.)
- sysadmin: System-level debugging (requires elevated privileges)
Pro Tips from Production
1. Pre-load debug images on nodes:
# Add this to your node initialization
# (on containerd-based nodes, use crictl pull instead of docker pull)
docker pull nicolaka/netshoot
docker pull busybox
docker pull alpine
2. Create aliases for common debug commands:
alias kdebug='kubectl debug -it --image=nicolaka/netshoot'
alias kdebug-node='kubectl debug node/$(kubectl get nodes -o jsonpath="{.items[0].metadata.name}") -it --image=ubuntu'
3. Use the right image for the job:
- busybox: Lightweight, basic tools (2MB)
- alpine: Package manager available (5MB)
- nicolaka/netshoot: Network debugging powerhouse (traceroute, tcpdump, iperf)
- ubuntu/debian: Full distribution when you need it
When to Use kubectl debug
✅ Use kubectl debug when:
- Debugging distroless or minimal container images
- Container keeps crashing (CrashLoopBackOff)
- Need to inspect processes without kubectl exec
- Troubleshooting node-level issues
- Want to preserve production pod for forensics
❌ Don’t use kubectl debug when:
- Simple log viewing (
kubectl logsis faster) - Basic container inspection (use
kubectl describe) - Already have shell access via
kubectl exec
2. k9s: Terminal UI for Kubernetes
GitHub: https://github.com/derailed/k9s
Stars: 27,000+
What it solves: Context switching fatigue, real-time cluster visibility
Think of it as: Vim meets Kubernetes
Why k9s Changes Everything
After years of typing kubectl get pods, kubectl describe pod xyz, kubectl logs xyz -f, I discovered k9s, and it transformed how I work with Kubernetes.
k9s is a terminal-based UI that provides:
- Real-time cluster visibility (no more refresh loops)
- Keyboard-driven navigation (blazingly fast)
- Built-in log streaming (no more tail -f nightmares)
- Resource monitoring (CPU/Memory at a glance)
- Quick actions (delete, describe, edit, shell—all with hotkeys)
Installation
# macOS
brew install k9s
# Linux
curl -sL https://github.com/derailed/k9s/releases/latest/download/k9s_Linux_amd64.tar.gz | tar xz
sudo mv k9s /usr/local/bin/
# Windows (using Chocolatey)
choco install k9s
# Or via go
go install github.com/derailed/k9s@latest
Essential k9s Keyboard Shortcuts
Once you launch k9s, here are the commands that will become muscle memory:
Navigation:
:pods # View pods
:deployments # View deployments
:services # View services
:nodes # View nodes
:ns # View namespaces
/ <search> # Search/filter
Esc # Back
Ctrl-a # Show all namespaces
Actions (when item selected):
d # Describe resource
l # View logs
shift-f # Port-forward
e # Edit resource
Del # Delete resource
s # Shell into pod
y # YAML view
Ctrl-k # Kill (force delete)
Log Viewing:
0 # Show logs from all containers
1-9 # Show specific container
f # Toggle auto-scroll
w # Toggle log wrapping
p # Previous logs
c # Copy to clipboard
Real-World k9s Workflows
Workflow 1: Debugging a Failing Deployment
# Launch k9s
k9s
# Navigate
:deployments → Find your deployment → Enter
# It shows:
# 1. Replica status
# 2. Pod status
# 3. Age
# 4. Real-time updates
# Select a pod → 'd' to describe
# Instantly see events, conditions, and errors
# Press 'l' for logs
# Toggle between containers with 1-9
# Press 's' to shell into the pod
# Debug directly without typing kubectl exec
Workflow 2: Monitoring Resource Usage
k9s
# Type :pulse to open the Pulse view
# See real-time CPU/Memory usage across all pods
# Sort by CPU: Shift-c
# Sort by Memory: Shift-m
# Identify resource hogs instantly
Workflow 3: Multi-Cluster Management
# k9s automatically detects your kubeconfig contexts
# Switch context in k9s:
:contexts → Select → Enter
# Or use the short alias:
:ctx
k9s Plugins and Customization
k9s supports custom plugins for extending functionality:
Create ~/.config/k9s/plugins.yaml:
plugins:
  # Debug pod with stern
  debug-stern:
    shortCut: Ctrl-S
    description: Tail logs with stern
    scopes:
      - pods
    command: stern
    background: false
    args:
      - $NAME
      - -n
      - $NAMESPACE
  # Get pod events
  get-events:
    shortCut: Ctrl-E
    description: Get events for pod
    scopes:
      - pods
    command: kubectl
    background: false
    args:
      - get
      - events
      - --field-selector
      - involvedObject.name=$NAME
      - -n
      - $NAMESPACE
k9s vs kubectl: The Speed Difference
Task: Find a failing pod and view its logs
kubectl way:
kubectl get pods -A | grep -v Running # 3 seconds
kubectl logs -n production app-xyz-123 -f # 2 seconds
# Total: 5 seconds + mental overhead
k9s way:
k9s
:pods
/!Running   # inverse filter: hide Running pods
<select pod>
l
# Total: 2 seconds, zero mental overhead
k9s Skins and Themes
# Skins are YAML files; drop one into ~/.config/k9s/skins/
# and reference it in the k9s config (ui.skin: <skin-name>)
# Community skins: https://github.com/derailed/k9s/tree/master/skins
# (older k9s versions read a single ~/.config/k9s/skin.yaml)
vim ~/.config/k9s/skins/my-skin.yaml
3. Stern: Multi-Pod Log Streaming
GitHub: https://github.com/stern/stern
Stars: 8,000+
What it solves: Tailing logs from multiple pods simultaneously
One-liner: tail -f for Kubernetes on steroids
The Problem Stern Solves
You’re running a microservices application with multiple replicas:
kubectl get pods
# NAME READY STATUS
# api-server-abc123 1/1 Running
# api-server-def456 1/1 Running
# api-server-ghi789 1/1 Running
You want to see logs from all three pods simultaneously. With kubectl, you’d need:
kubectl logs -f api-server-abc123 &
kubectl logs -f api-server-def456 &
kubectl logs -f api-server-ghi789 &
# Now you have three terminal windows... 🤦
Stern does this in one command:
stern api-server
Installation
# macOS
brew install stern
# Linux
curl -LO https://github.com/stern/stern/releases/download/v1.28.0/stern_1.28.0_linux_amd64.tar.gz
tar xvzf stern_1.28.0_linux_amd64.tar.gz
sudo mv stern /usr/local/bin/
# Windows
choco install stern
Basic Usage
# Tail all pods matching pattern
stern api-server
# Tail pods in specific namespace
stern api-server -n production
# Tail all pods in namespace
stern . -n production
# Include timestamps
stern api-server -t
# Since last 5 minutes
stern api-server --since 5m
# Exclude init containers
stern api-server --exclude-container istio-init
# Multiple containers (the flag takes a regex)
stern api-server --container 'app|sidecar'
Advanced Filtering
Color-coded by pod name (default):
stern api-server
# api-server-abc123 › app › [timestamp] GET /health 200
# api-server-def456 › app › [timestamp] POST /api/users 201
# api-server-abc123 › app › [timestamp] GET /metrics 200
Filter by log level:
# Only ERROR logs
stern api-server | grep ERROR
# Or use --include flag
stern api-server --include 'ERROR|FATAL'
# Exclude INFO logs
stern api-server --exclude 'INFO|DEBUG'
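The --include and --exclude flags are plain regular expressions, so you can prototype a filter with grep -E before wiring it into stern. A self-contained sketch (the log lines below are made-up samples, not real stern output):

```shell
# Made-up sample of what stern's merged output might look like
logs='api-server-abc123 app INFO  request handled
api-server-def456 app ERROR db connection refused
api-server-ghi789 app DEBUG cache hit
api-server-abc123 app FATAL out of memory'

# Equivalent of --include 'ERROR|FATAL': keep only matching lines
echo "$logs" | grep -E 'ERROR|FATAL'

# Equivalent of --exclude 'INFO|DEBUG': drop matching lines
echo "$logs" | grep -Ev 'INFO|DEBUG'
```

Both commands print only the ERROR and FATAL lines; once the regex looks right, hand it to stern unchanged.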
Multiple namespaces:
# Across all namespaces
stern api-server --all-namespaces
# Multiple specific namespaces
stern api-server -n production,staging
Stern + jq: Power Combo
If your logs are JSON:
# Pretty print JSON logs
stern api-server --template '{{.Message}}' | jq
# Extract fields from JSON app logs (stern -o json wraps each line in {message, podName, ...})
stern api-server -o json | jq '.message | fromjson? | {timestamp, level, message}'
# Filter JSON logs by level
stern api-server -o json | jq 'select((.message | fromjson?).level == "error")'
Stern Templates
Customize output format:
# Custom template
stern api-server --template '{{.PodName}} | {{.ContainerName}} | {{.Message}}'
# Kubernetes labels
stern api-server --template '{{.PodName}} ({{index .PodLabels "version"}}) | {{.Message}}'
# Simplified output
stern api-server --template '{{.Message}}'
Real-World Stern Use Cases
Use Case 1: Debugging Distributed Request
You’re tracking a request across microservices:
# Follow request ID across all services
stern -n production . | grep "request-id-123"
# Result shows the request flowing through:
# api-gateway: Received request-id-123
# auth-service: Validated request-id-123
# user-service: Processing request-id-123
# database-proxy: Query for request-id-123
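Under the hood this is just regex matching over the merged stream; you can reconstruct the request’s path by sorting the matches by timestamp. A runnable sketch with fabricated log lines:

```shell
# Fabricated multi-service log lines (arriving out of order)
logs='10:01:03 user-service Processing request-id-123
10:01:01 api-gateway Received request-id-123
10:01:05 database-proxy Query for request-id-999
10:01:02 auth-service Validated request-id-123'

# Pick out one request ID, then order the hops by timestamp
echo "$logs" | grep 'request-id-123' | sort
```

This prints the gateway, auth, and user-service hops in wall-clock order, which is exactly the trace you want when chasing a distributed request.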
Use Case 2: New Deployment Monitoring
# Watch logs of newly deployed pods
stern api-server --since 30s -t
# This shows:
# 1. New pods starting
# 2. Health checks passing
# 3. First requests coming in
# 4. Any startup errors
Use Case 3: Container-Specific Debugging
# Only sidecar logs (e.g., Istio proxy)
stern . --container istio-proxy -n production
# Init container logs are included by default; exclude them with:
stern . --init-containers=false
Stern Aliases I Use Daily
# In ~/.bashrc or ~/.zshrc
alias slogs='stern --all-namespaces --since 1h'
alias sprod='stern -n production'
alias serrors='stern --all-namespaces | grep -E "ERROR|FATAL|error|fatal"'
alias sfollow='stern --tail 0 --since 1s'
Stern + Other Tools
Stern + Loki:
# Ship stern output to Loki via promtail's stdin mode
stern api-server -o raw | promtail --stdin --client.url=http://loki:3100/loki/api/v1/push
Stern + Slack alerts:
# Alert on errors
stern api-server | grep ERROR | while read -r line; do
  # note: escape any quotes inside $line before embedding it in JSON
  curl -X POST https://hooks.slack.com/... \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"$line\"}"
done
4. Lens with Lens Prism: AI-Powered Kubernetes IDE
Website: https://k8slens.dev
What it solves: Visual cluster management + AI-powered debugging
Think of it as: VS Code for Kubernetes, with ChatGPT built-in
Why Lens is Different
Lens Desktop (formerly known as Lens IDE) is the most popular Kubernetes IDE, with millions of users. In 2025, they added Lens Prism—an AI copilot that can:
- Diagnose issues: Just describe the problem in plain English
- Generate kubectl commands: “Show me pods using more than 1GB RAM”
- Explain errors: Paste an error, get human-readable explanation
- Suggest fixes: AI-powered troubleshooting recommendations
Installation
# Download from https://k8slens.dev
# macOS
brew install --cask lens
# Windows
choco install lens
# Linux
# Download .AppImage from releases
Key Features
1. Multi-Cluster Dashboard
- Manage multiple clusters from one interface
- Real-time resource monitoring
- Visual topology maps
- Cluster health at a glance
2. Terminal Integration
- Built-in terminal with kubectl autocomplete
- Pod shell access with one click
- Multi-terminal support
3. Resource Management
- Edit YAML inline
- Visual logs viewer with filtering
- Port-forward with UI
- Resource metrics (CPU/Memory) per pod
4. Lens Prism (AI Copilot)
The game-changer. Here’s what Lens Prism can do:
Example 1: Diagnose CrashLoopBackOff
You: "My pod api-server-abc123 is in CrashLoopBackOff"
Lens Prism:
"I've analyzed the pod. The issue is:
1. Container exits with code 137 (OOMKilled)
2. Memory limit: 128Mi
3. Actual usage spike: 512Mi
Recommended fixes:
- Increase memory limit to 512Mi
- Add resource requests to prevent over-scheduling
- Check for memory leaks in application code
Would you like me to generate a patch to update the memory limits?"
Example 2: Performance Optimization
You: "Which pods are using the most CPU?"
Lens Prism:
"Top CPU consumers:
1. elasticsearch-0: 3.2 cores (320%)
2. api-server-xyz: 1.8 cores (180%)
3. worker-abc: 1.2 cores (120%)
Analysis:
- elasticsearch-0 is above its 2 core limit
- Recommend vertical scaling or optimization
- Can generate HPA configuration if needed"
Real-World Lens Workflows
Workflow 1: Visual Debugging
1. Cluster Overview: see all resources at a glance
2. Click on the failing pod for instant access to:
   - Events
   - Logs (with log level filtering)
   - YAML definition
   - Metrics
   - Shell access
3. Ask Lens Prism: “Why is this failing?”
4. Apply the fix: edit YAML inline or use the generated kubectl command
Workflow 2: Port-Forward Management
Instead of managing multiple kubectl port-forward terminals:
- Click pod → Port Forward → Select port
- Lens manages the tunnel in background
- Access localhost:PORT in browser
- Stop/start with one click
Workflow 3: Helm Release Management
- Visual Helm chart browser
- One-click upgrades
- Diff view before applying changes
- Rollback with one click
Lens Extensions
Lens supports extensions for additional functionality:
# Popular extensions:
- @alebcay/openlens-node-pod-menu
- @nevalla/kube-context-cluster-name
- @spectrocloud/lens-extension
Lens vs k9s: When to Use Each
Use k9s when:
- You live in the terminal
- Need speed (keyboard shortcuts)
- Working on remote SSH sessions
- Prefer CLI workflow
Use Lens when:
- Visual learner / prefer GUIs
- Multi-cluster management
- Need AI assistance (Lens Prism)
- Collaborating with team (screenshots, sharing)
- Managing Helm releases
5. K8sGPT: AI Kubernetes Diagnostics
GitHub: https://github.com/k8sgpt-ai/k8sgpt
Stars: 5,000+
What it solves: Automated Kubernetes cluster analysis
Powered by: OpenAI, Azure OpenAI, Claude, or local LLMs
What is K8sGPT?
K8sGPT is a CLI tool that scans your Kubernetes cluster for issues and uses AI to:
- Identify problems
- Explain root causes
- Suggest fixes
- Generate remediation commands
Think of it as having a Kubernetes expert review your cluster 24/7.
Installation
# macOS
brew install k8sgpt
# Linux
curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/latest/download/k8sgpt_linux_amd64.tar.gz
tar xvzf k8sgpt_linux_amd64.tar.gz
sudo mv k8sgpt /usr/local/bin/
# Windows
choco install k8sgpt
Setup
# Authenticate with OpenAI
k8sgpt auth add --backend openai --model gpt-4
# Enter your OpenAI API key when prompted
# Or use Azure OpenAI
k8sgpt auth add --backend azureopenai --model gpt-4
# Or use local LLM (privacy-focused)
k8sgpt auth add --backend localai --model ggml-gpt4all-j
Basic Usage
# Analyze cluster
k8sgpt analyze
# Example output:
# 0 api-server-abc123(production)
# - Error: CrashLoopBackOff
# - Analysis: Container is exiting with code 1.
# Logs show "Error: ECONNREFUSED - Cannot connect to database"
# - Solution: Check if database service is running and accessible.
# Verify DATABASE_URL environment variable.
# 1 worker-def456(staging)
# - Error: ImagePullBackOff
# - Analysis: Image "myapp:latest" not found in registry
# - Solution: Either push the image to registry or update deployment
# to use existing image tag.
Advanced Features
Explain Specific Resource
# Analyze specific pod
k8sgpt analyze --filter Pod --name api-server-abc123
# Analyze deployment
k8sgpt analyze --filter Deployment
# Multiple filters
k8sgpt analyze --filter Pod,Service,Ingress
Generate Fixes
# Analyze and explain in detail
k8sgpt analyze --explain
# Generate kubectl commands to fix issues
k8sgpt analyze --explain --with-commands
Integration with Existing Tools
# Output as JSON
k8sgpt analyze --output json
# Send to Slack
k8sgpt analyze --explain | slack-cli send
# Continuous monitoring
while true; do
k8sgpt analyze --explain > cluster-health.txt
sleep 300
done
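The loop above re-emits the full report every five minutes even when nothing changed. A common refinement is to notify only when the findings differ from the previous pass. In this sketch, analyze is a stand-in stub for `k8sgpt analyze --explain` (the function, REPORT_SOURCE variable, and /tmp paths are illustrative, not part of k8sgpt) so the diff logic runs without a cluster:

```shell
# Stub: replace the body with `k8sgpt analyze --explain` against a real cluster.
# REPORT_SOURCE exists only so this sketch runs without a cluster.
analyze() { cat "${REPORT_SOURCE:-/dev/null}"; }

check_for_changes() {
  analyze > /tmp/cluster-health.new
  # cmp -s is silent; nonzero exit means the reports differ (or no baseline yet)
  if ! cmp -s /tmp/cluster-health.new /tmp/cluster-health.txt 2>/dev/null; then
    echo "findings changed"   # hook your Slack/pager notification in here
    mv /tmp/cluster-health.new /tmp/cluster-health.txt
  else
    rm -f /tmp/cluster-health.new
  fi
}

# Poll: while true; do check_for_changes; sleep 300; done
```

The first pass always fires (no baseline yet); after that you only get paged when the analysis actually changes.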
K8sGPT Filters
Available analysis filters:
- Pod: Analyze pod issues
- Deployment: Deployment problems
- ReplicaSet: ReplicaSet issues
- Service: Service configuration
- Ingress: Ingress problems
- PersistentVolumeClaim: Storage issues
- StatefulSet: StatefulSet analysis
- Node: Node health
# Analyze specific resources
k8sgpt analyze --filter Pod,Service
# Exclude resources
k8sgpt analyze --exclude-filter Ingress
Real-World K8sGPT Examples
Example 1: Automated Morning Cluster Check
#!/bin/bash
# morning-check.sh
echo "🔍 Running K8sGPT cluster analysis..."
ISSUES=$(k8sgpt analyze --explain)
if [ -n "$ISSUES" ]; then
echo "⚠️ Issues found!"
echo "$ISSUES"
# Send to Slack
curl -X POST https://hooks.slack.com/... \
-d "{\"text\": \"Cluster Issues:\n$ISSUES\"}"
else
echo "✅ Cluster healthy!"
fi
Example 2: CI/CD Integration
# .github/workflows/k8s-health-check.yml
name: Kubernetes Health Check
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours
jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - name: Install K8sGPT
        run: brew install k8sgpt
      - name: Configure K8sGPT
        run: k8sgpt auth add --backend openai --model gpt-4
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Analyze Cluster
        run: |
          k8sgpt analyze --explain > cluster-report.txt
          cat cluster-report.txt
      - name: Upload Report
        uses: actions/upload-artifact@v3
        with:
          name: cluster-health-report
          path: cluster-report.txt
6. Prometheus + Grafana: The Observability Stack
Prometheus: https://prometheus.io
Grafana: https://grafana.com
What it solves: Metrics, monitoring, alerting
Status: Both are CNCF Graduated Projects
Why This Combo is Essential
Debugging isn’t just about logs—you need metrics to understand:
- Is CPU/Memory actually a problem?
- What happened right before the crash?
- Are there patterns in failures?
- Is performance degrading over time?
Prometheus + Grafana is the de facto standard for Kubernetes monitoring.
Quick Setup (Helm)
# Add Prometheus Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes both)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Port-forward Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Access Grafana at http://localhost:3000
# Default credentials: admin / prom-operator
Key Metrics for Debugging
Pod Metrics
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
# Memory usage by pod
sum(container_memory_usage_bytes) by (pod)
# Pod restart count
kube_pod_container_status_restarts_total
# Pods not ready
kube_pod_status_ready{condition="false"}
Node Metrics
# Node CPU usage
100 - (avg by (node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Node memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
Application Metrics
# HTTP request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Request latency (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
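For intuition about what histogram_quantile(0.95, ...) returns, here is an exact nearest-rank p95 computed from raw latencies with sort and awk (the sample numbers are made up; Prometheus approximates this from bucket counters rather than raw values):

```shell
# Exact nearest-rank p95 over a fabricated latency sample (milliseconds)
latencies='45 123 2100 60 80 95 70 55 400 150 65 90 110 75 85 100 120 130 140 160'

p95=$(echo "$latencies" | tr ' ' '\n' | sort -n | awk '
  { v[NR] = $1 }
  END { print v[int(NR * 0.95 + 0.999)] }  # index = ceil(0.95 * n)
')
echo "p95 = ${p95}ms"   # prints: p95 = 400ms
```

Note how one 2100ms outlier barely moves the p95 while it would wreck an average; that is why latency alerts use quantiles, not means.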
Essential Grafana Dashboards
Pre-built dashboards (Dashboard ID):
- Kubernetes Cluster Monitoring: #7249
- Node Exporter Full: #1860
- Kubernetes Pod Resources: #6417
- Prometheus Stats: #2
- Istio Service Dashboard: #7645
# Import dashboard in Grafana
1. Click "+" → Import
2. Enter dashboard ID
3. Select Prometheus data source
4. Click "Import"
Alerting Rules for Common Issues
Create prometheus-alerts.yaml:
groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: PodNotReady
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
        for: 5m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        expr: (sum(container_memory_usage_bytes) by (pod) / sum(container_spec_memory_limit_bytes) by (pod)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} using > 90% memory"
      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 5m
        labels:
          severity: critical
Grafana + Loki (Logs)
Enhance your observability with logs:
# Install Loki (add the Grafana chart repo first)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true
Now you can correlate metrics (Prometheus) with logs (Loki) in the same Grafana dashboard!
7. Kubectx & Kubens: Context Switching
GitHub: https://github.com/ahmetb/kubectx
Stars: 17,000+
What it solves: Switching between clusters and namespaces
File size: 10KB (yes, seriously)
The Frustration It Solves
# Without kubectx/kubens:
kubectl config use-context production-cluster
kubectl config set-context --current --namespace=api-services
kubectl get pods
# With kubectx/kubens:
kubectx production
kubens api-services
kubectl get pods
Time saved: ~10 seconds per switch × 50 switches/day = 8 minutes daily
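The arithmetic behind that estimate, extended to a year (assuming roughly 250 working days, my assumption, not a measured figure):

```shell
# 10 seconds saved per switch, 50 switches a day
echo "$(( 10 * 50 / 60 )) minutes/day"                # prints: 8 minutes/day
echo "$(( 10 * 50 * 250 / 3600 )) hours/year"         # prints: 34 hours/year
```

Eight minutes a day sounds trivial until you see it as nearly a full work-week per year.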
Installation
# macOS
brew install kubectx
# Linux (manual)
sudo git clone https://github.com/ahmetb/kubectx /opt/kubectx
sudo ln -s /opt/kubectx/kubectx /usr/local/bin/kubectx
sudo ln -s /opt/kubectx/kubens /usr/local/bin/kubens
# Add fzf for interactive mode
brew install fzf
Basic Usage
kubectx (cluster switching)
# List contexts
kubectx
# Switch context
kubectx production
# Switch to previous context
kubectx -
# Rename context
kubectx new-name=old-name
# Delete context
kubectx -d context-name
kubens (namespace switching)
# List namespaces
kubens
# Switch namespace
kubens production
# Switch to previous namespace
kubens -
# Show current namespace
kubens -c
Power User Tips
Aliases
# ~/.bashrc or ~/.zshrc
alias kx='kubectx'
alias kn='kubens'
# Now:
kx production
kn api-services
Interactive Fuzzy Search (with fzf)
# Just type kubectx (no args)
kubectx
# Get interactive menu:
# ─────────────────
# development
# staging
# production
# > production-east
# production-west
# ─────────────────
# Use arrow keys to select!
Bash/Zsh Completion
# Add to ~/.zshrc
source /opt/kubectx/completion/kubectx.zsh
source /opt/kubectx/completion/kubens.zsh
# Now tab completion works:
kubectx prod<TAB>
# Completes to: kubectx production
8. Telepresence: Local Development Bridge
Website: https://www.telepresence.io
What it solves: Debugging microservices locally while connected to cluster
Magic level: 🪄 High
The Problem
You’re developing a microservice that depends on 15 other services in your Kubernetes cluster. Options:
- Run everything locally: Impossible (database, cache, other services)
- Deploy to cluster for every change: Slow (build → push → deploy = 5 minutes)
- Mock everything: Tedious and unrealistic
Telepresence’s solution: Run your service locally while it appears to be in the cluster.
How It Works
Telepresence creates a bidirectional network proxy:
- Your local code can call services in the cluster (as if it’s deployed)
- Services in the cluster can call your local code (as if it’s deployed)
Installation
# macOS
brew install datawire/blackbird/telepresence
# Linux
sudo curl -fL https://app.getambassador.io/download/tel2/linux/amd64/latest/telepresence -o /usr/local/bin/telepresence
sudo chmod +x /usr/local/bin/telepresence
# Windows
choco install telepresence
Basic Usage
# Connect to cluster
telepresence connect
# Your laptop is now "inside" the cluster!
# You can access cluster services by their DNS names:
curl http://api-service.production.svc.cluster.local
# Intercept a deployment
telepresence intercept api-service --port 8080
# Now traffic to api-service goes to your localhost:8080
# Run your local code:
npm run dev # or whatever your local command is
Intercept Patterns
1. Global Intercept (all traffic)
telepresence intercept api-service --port 8080
# ALL traffic to api-service → localhost:8080
2. Selective Intercept (only your traffic)
telepresence intercept api-service \
--port 8080 \
--http-header "x-user=ajeet"
# Only traffic with header x-user=ajeet → localhost:8080
# Other traffic → cluster as normal
3. Preview URLs (share with team)
telepresence intercept api-service \
--port 8080 \
--preview-url=true
# Generates URL: https://abc123.preview.edgestack.me
# Share with team to test your local changes!
Real-World Workflow
# Morning workflow:
telepresence connect
# Start intercepting
telepresence intercept my-service --port 3000
# Run local development server
npm run dev
# Now you can:
# 1. Debug with breakpoints
# 2. Hot reload on code changes
# 3. Test against real cluster services
# 4. Use production data
# When done:
telepresence leave my-service
telepresence quit
9. Pixie: eBPF-Based Observability
Website: https://px.dev
Status: CNCF Sandbox Project
What it solves: Zero-instrumentation observability
Superpower: See everything without changing code
What Makes Pixie Special
Traditional monitoring requires instrumentation—you modify your code to emit metrics/traces. Pixie uses eBPF (extended Berkeley Packet Filter) to capture data at the kernel level without any code changes.
What Pixie can see:
- HTTP/gRPC requests and responses
- Database queries (MySQL, PostgreSQL, Redis)
- DNS requests
- Network connections
- CPU/Memory profiles
- SSL/TLS data (even encrypted traffic!)
Quick Start
# Install Pixie CLI
bash -c "$(curl -fsSL https://withpixie.ai/install.sh)"
# Deploy to cluster
px deploy
# Open web UI
px live
Live Debugging with Pixie
Example 1: HTTP Traffic Analysis
# In Pixie UI, run PxL (Pixie Language) script:
import px
# Get all HTTP requests to api-service
df = px.DataFrame('http_events')
df = df[df.ctx['service'] == 'api-service']
df = df[['time_', 'remote_addr', 'req_method', 'req_path', 'resp_status', 'latency_ms']]
px.display(df)
Output:
time_ remote_addr req_method req_path resp_status latency_ms
2025-12-19 10:23:45 10.1.2.3 GET /api/users 200 45
2025-12-19 10:23:47 10.1.2.5 POST /api/orders 201 123
2025-12-19 10:23:48 10.1.2.3 GET /api/products 500 2100
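Tables like this can also be exported and post-processed with ordinary shell tools. For example, flagging the slow request in that sample (the rows below are the article’s made-up sample data, not live Pixie output):

```shell
# The three sample rows from above: date time addr method path status latency_ms
printf '%s\n' \
  '2025-12-19 10:23:45 10.1.2.3 GET /api/users 200 45' \
  '2025-12-19 10:23:47 10.1.2.5 POST /api/orders 201 123' \
  '2025-12-19 10:23:48 10.1.2.3 GET /api/products 500 2100' |
awk '$7 > 1000 { printf "SLOW: %s %s took %sms (HTTP %s)\n", $4, $5, $7, $6 }'
# prints: SLOW: GET /api/products took 2100ms (HTTP 500)
```

That 500 with a 2100ms latency is exactly the kind of correlation (slow AND failing) that jumps out of captured traffic without touching application code.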
Example 2: Database Query Performance
# Analyze slow MySQL queries
df = px.DataFrame('mysql_events')
df = df[df.latency_ms > 1000] # Queries > 1 second
df = df[['time_', 'query', 'latency_ms']]
df = df.sort_values('latency_ms', ascending=False)
px.display(df)
Example 3: Service Dependency Map
Run the built-in px/cluster script in the Pixie UI: it renders a visual map of which services call which, discovered automatically from captured traffic, with no instrumentation needed.
Pixie Use Cases
- Performance debugging: Find slow endpoints
- Security: Detect unusual network patterns
- Cost optimization: Identify chatty services
- Compliance: Audit all database queries
10. Devtron: Kubernetes Dashboard with AI
Website: https://devtron.ai
What it solves: End-to-end Kubernetes application management
Think of it as: Kubernetes + CI/CD + Security in one dashboard
Key Features
- Multi-cluster Management: Single pane of glass
- Application Store: One-click Helm deployments
- CI/CD Pipelines: Built-in automation
- Security Scanning: Image vulnerability detection
- Resource Browser: Visual cluster exploration
- AI-Assisted Debugging: Smart error analysis
Quick Setup
helm repo add devtron https://helm.devtron.ai
helm install devtron devtron/devtron-operator \
--create-namespace --namespace devtroncd
Dashboard Features
- Live Manifest Editing: Edit YAML in production (carefully!)
- Log Streaming: Multi-pod log aggregation
- Terminal Access: Built-in shell
- Event Monitoring: Real-time event viewer
- Resource Topology: Visual relationship maps
Bonus Tools Worth Mentioning
Kubewatch
What: Slack/Teams notifications for cluster events
Use: Get alerts when pods crash, deployments fail, etc.
Kubent (Kube No Trouble)
What: Detect deprecated API versions
Use: Before upgrading Kubernetes, find breaking changes
Popeye
What: Cluster sanitizer
Use: Find misconfigurations and best practice violations
# Install
brew install derailed/popeye/popeye
# Scan cluster
popeye
# Generates report with scores:
# Pods: 85/100 ✅
# Deployments: 72/100 ⚠️
# Services: 95/100 ✅
Building Your Debugging Toolkit
The Minimalist Setup (Start Here)
- kubectl + kubectl debug (built-in)
- k9s (terminal UI)
- stern (logs)
Why: These three cover 80% of debugging scenarios with minimal setup.
The Professional Setup
Add to the minimalist setup:
4. Lens (visual + AI)
5. kubectx/kubens (context switching)
6. Prometheus + Grafana (metrics)
The Enterprise Setup
All professional tools plus:
7. K8sGPT (AI analysis)
8. Telepresence (local dev)
9. Pixie (deep observability)
10. Devtron (unified platform)
Debugging Workflow: Putting It All Together
Here’s my actual workflow when debugging a production issue:
Step 1: Initial Triage (k9s)
```bash
k9s
# Quick visual: which pods are failing?
# Check events: what's the error?
# View logs: is there a clear error message?
```
Step 2: Deep Dive (kubectl debug)
```bash
# If minimal image or crashed container:
kubectl debug failing-pod -it --image=nicolaka/netshoot --target=app

# Debug node if needed:
kubectl debug node/worker-1 -it --image=ubuntu
```
Step 3: Log Analysis (stern)
```bash
# Tail logs across all replicas:
stern api-service | grep ERROR

# Check last 30 minutes:
stern api-service --since 30m
```
Step 4: Metrics Review (Grafana)
```bash
# Check dashboard:
# - CPU/Memory spikes?
# - Request rate changes?
# - Error rate increase?
```
Step 5: AI Analysis (K8sGPT)
```bash
# Get AI recommendations:
k8sgpt analyze --explain --filter Pod
# Often provides insights I missed
```
Step 6: Fix and Verify
```bash
# Apply fix
kubectl apply -f fix.yaml

# Monitor with k9s
# Verify with stern
# Check metrics in Grafana
```
Common Debugging Scenarios Solved
Scenario 1: CrashLoopBackOff
Symptoms: Pod keeps restarting
Debugging:
```bash
# View recent logs (even from crashed containers)
kubectl logs pod-name --previous

# Use kubectl debug if container exits too fast
kubectl debug pod-name -it --image=busybox --copy-to=debug-pod

# Check events
kubectl describe pod pod-name | grep -A 10 Events

# AI analysis
k8sgpt analyze --filter Pod --name pod-name --explain
```
Common causes:
- Application crashes on startup
- Missing environment variables
- Cannot connect to dependencies
- OOMKilled (memory limit too low)
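When the container restarts faster than you can exec into it, the pod status itself usually names the cause: the last terminated state survives each restart. As a small sketch of pulling that field out of `kubectl get pod <name> -o json` output (the trimmed sample status below is illustrative, not from a real cluster):

```python
import json

def last_exit_reason(pod_json: str) -> str:
    """Return the termination reason of the first container that has a lastState."""
    pod = json.loads(pod_json)
    for cs in pod["status"].get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            return f"{cs['name']}: {terminated['reason']} (exit {terminated['exitCode']})"
    return "no terminated containers found"

# Illustrative, heavily trimmed sample of `kubectl get pod -o json` output
sample = json.dumps({
    "status": {"containerStatuses": [{
        "name": "app",
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
    }]}
})
print(last_exit_reason(sample))  # app: OOMKilled (exit 137)
```

Exit code 137 plus reason `OOMKilled` is the classic signature of a memory limit set too low.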
Scenario 2: ImagePullBackOff
Symptoms: Can’t pull container image
Debugging:
```bash
# Check image name
kubectl describe pod pod-name | grep Image

# Verify image exists
docker pull <image-name>

# Check image pull secrets
kubectl get secrets
kubectl describe secret <secret-name>
```
Common causes:
- Typo in image name/tag
- Private registry without credentials
- Network issues reaching registry
- Image doesn’t exist
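Most of the typo-related failures come from misreading how an image reference splits apart. The following is a simplified sketch of that split (it ignores digests, and the registry-detection heuristic is approximate, not the full distribution spec):

```python
def split_image_ref(ref: str):
    """Naively split an image reference into (registry, repository, tag)."""
    # The tag is whatever follows the last ':' -- unless that ':' belongs
    # to a registry port (in which case a '/' appears after it).
    name, _, tag = ref.rpartition(":")
    if not name or "/" in tag:  # no tag given
        name, tag = ref, "latest"
    first, _, rest = name.partition("/")
    # A registry component contains a '.' or ':' (e.g. ghcr.io, localhost:5000)
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first, rest, tag
    return "docker.io", name, tag  # Docker Hub is the implicit default

print(split_image_ref("ghcr.io/acme/api:v1.2"))  # ('ghcr.io', 'acme/api', 'v1.2')
print(split_image_ref("nginx"))                  # ('docker.io', 'nginx', 'latest')
```

The second case is the one that bites people: a bare `nginx` silently means `docker.io/nginx:latest`, so a pod can fail to pull simply because your private registry was never part of the reference.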
Scenario 3: Pending Pod
Symptoms: Pod stuck in Pending state
Debugging:
```bash
# Check why it's pending
kubectl describe pod pod-name | grep -A 10 Events

# Check node resources
kubectl top nodes

# Check for taints/tolerations
kubectl describe nodes | grep Taints
```
Common causes:
- Insufficient CPU/memory on nodes
- No nodes match pod’s node selector
- Volume mount issues
- Pod priority/preemption
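The "Insufficient cpu" events come down to simple arithmetic: the scheduler sums the containers' requests and compares them against each node's allocatable capacity. A toy illustration of that check, including the millicore notation (real scheduling also accounts for pods already on the node, taints, and affinity rules):

```python
def parse_cpu(q: str) -> float:
    """Convert a CPU quantity like '500m' or '2' to cores."""
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)

def fits(node_allocatable_cpu: str, pod_requests: list) -> bool:
    """Would the pod's summed CPU requests fit on an otherwise empty node?"""
    needed = sum(parse_cpu(r) for r in pod_requests)
    return needed <= parse_cpu(node_allocatable_cpu)

print(fits("2", ["500m", "250m"]))  # True: 0.75 cores fit on a 2-core node
print(fits("500m", ["1", "250m"]))  # False: 1.25 cores exceed 0.5 cores
```

If `kubectl top nodes` shows headroom but the pod still pends, remember the scheduler compares *requests*, not live usage.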
Scenario 4: Networking Issues
Symptoms: Services can’t communicate
Debugging:
```bash
# Use debug container with network tools
kubectl debug pod-name -it --image=nicolaka/netshoot

# Inside debug container:
# Test DNS
nslookup service-name
dig service-name.namespace.svc.cluster.local

# Test connectivity
curl http://service-name:port
telnet service-name port

# Check network policies
kubectl get networkpolicies -A
```
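If the image you're stuck inside has no curl or dig but does have Python, you can script the same two checks: DNS resolution first, then a TCP connect. A minimal probe sketch (the service name and port are placeholders for your own):

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Resolve a hostname, then attempt a TCP connect; report which step fails."""
    try:
        addr = socket.getaddrinfo(host, port)[0][4][0]  # DNS lookup
    except socket.gaierror as e:
        return f"DNS failed for {host}: {e}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return f"OK: {host} -> {addr}:{port} is reachable"
    except OSError as e:
        return f"DNS OK ({addr}) but connect to port {port} failed: {e}"

# In-cluster you would call something like:
#   probe("service-name.namespace.svc.cluster.local", 8080)
print(probe("localhost", 1))  # typically: DNS OK but connection refused
```

Separating the two steps matters: a DNS failure points at CoreDNS or the service name, while a connect failure points at selectors, ports, or a NetworkPolicy.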
Scenario 5: Performance Issues
Debugging:
```bash
# Check resource usage
kubectl top pods
kubectl top nodes

# Use Pixie for deep analysis
px live

# Check metrics in Grafana
# Look for:
# - CPU throttling
# - Memory pressure
# - High request latency
# - Error rates
```
Pro Tips from 10 Years of Kubernetes
1. Always Use Resource Limits
```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```
Why: Prevents one pod from starving others. Makes debugging resource issues easier.
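Reading those quantities correctly matters when you're sizing limits after an OOMKill: Kubernetes memory uses binary suffixes (Ki/Mi/Gi, powers of 1024). A small conversion sketch for sanity-checking the headroom between request and limit (the values mirror the manifest above):

```python
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def parse_mem(q: str) -> int:
    """Convert a memory quantity like '512Mi' to bytes."""
    for suffix, factor in _UNITS.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain bytes, no suffix

request, limit = parse_mem("128Mi"), parse_mem("512Mi")
print(limit // request)                          # 4x headroom
print(parse_mem("1Gi") == parse_mem("1024Mi"))   # True: Gi is binary, not decimal
```

A limit only 1x-1.5x the request leaves little room for spikes; the 4x spread above is a comfortable starting point for most services.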
2. Enable Process Namespace Sharing When Debugging
```yaml
spec:
  shareProcessNamespace: true
```
Why: Allows ephemeral containers to see processes from other containers.
3. Use Liveness/Readiness Probes
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
Why: Kubernetes can detect and handle unhealthy pods automatically.
4. Log to stdout/stderr (not files)
```javascript
// ✅ Good
console.log('Request received');

// ❌ Bad
fs.appendFileSync('/var/log/app.log', 'Request received\n');
```
Why: Makes logs accessible via kubectl logs and log aggregation tools.
5. Use Structured Logging (JSON)
```javascript
console.log(JSON.stringify({
  level: 'info',
  timestamp: new Date().toISOString(),
  message: 'Request received',
  requestId: req.id,
  userId: req.user.id
}));
```
Why: Easier to parse, filter, and search in log aggregation tools.
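The payoff shows up at debugging time: instead of grepping free text, you can filter on fields. A quick sketch of slicing stern/kubectl output by level or request ID (field names match the example above; the sample lines are illustrative):

```python
import json

def filter_logs(lines, level=None, request_id=None):
    """Parse JSON log lines, skip non-JSON noise, and filter by field."""
    out = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a plain-text startup banner mixed into the stream
        if level and entry.get("level") != level:
            continue
        if request_id and entry.get("requestId") != request_id:
            continue
        out.append(entry)
    return out

logs = [
    '{"level": "info", "message": "Request received", "requestId": "abc"}',
    '{"level": "error", "message": "DB timeout", "requestId": "abc"}',
    'Server starting on port 8080',  # non-JSON line, silently skipped
]
print(filter_logs(logs, level="error"))  # only the DB timeout entry
```

Filtering by `requestId` is especially powerful: one ID traces a single request across every replica stern is tailing.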
The Future of Kubernetes Debugging
AI-Powered Debugging (2025 and Beyond)
Tools like Lens Prism and K8sGPT are just the beginning. Expect:
- Predictive debugging: AI predicts failures before they happen
- Automated remediation: AI fixes issues without human intervention
- Natural language queries: “Show me why latency increased”
- Root cause analysis: AI traces issues across the entire stack
eBPF Everywhere
Tools like Pixie prove eBPF is the future:
- Zero-instrumentation observability
- Kernel-level visibility
- Minimal performance overhead
- Works with any language/framework
Quick Reference Cheat Sheet
kubectl debug
```bash
# Debug distroless container
kubectl debug pod-name -it --image=busybox --target=container-name

# Debug crashed pod
kubectl debug pod-name -it --image=busybox --copy-to=debug-pod

# Debug node
kubectl debug node/node-name -it --image=ubuntu
```
k9s
```
:pods     # View pods
d         # Describe
l         # Logs
s         # Shell
Ctrl-d    # Delete
/         # Filter
```
stern
```bash
stern app-name                   # Tail logs
stern app-name -n namespace      # Specific namespace
stern . -n namespace             # All pods in namespace
stern app-name --since 5m        # Last 5 minutes
stern app-name | grep ERROR      # Filter logs
```
kubectx/kubens
```bash
kubectx production    # Switch cluster
kubens staging        # Switch namespace
kubectx -             # Previous cluster
kubens -              # Previous namespace
```
Conclusion: Your Debugging Superpowers
After years of Kubernetes debugging, here’s what I’ve learned:
The 80/20 Rule
80% of debugging can be done with:
- kubectl debug (ephemeral containers)
- k9s (visual exploration)
- stern (log aggregation)
The remaining 20% requires:
- Metrics (Prometheus + Grafana)
- AI assistance (K8sGPT, Lens Prism)
- Deep observability (Pixie)
Start Small, Scale Up
Week 1: Master kubectl debug and k9s
Week 2: Add stern for log analysis
Week 3: Set up Prometheus + Grafana
Month 2: Explore AI tools (K8sGPT, Lens Prism)
Month 3: Add advanced tools based on your needs