
Top 10 Kubernetes Debugging Tools Every DevOps Engineer Needs in 2026

Kubernetes has revolutionized container orchestration, but let’s be honest—debugging Kubernetes clusters can be a nightmare. I’ve spent countless hours at 2 AM staring at CrashLoopBackOff errors, tracking down mysterious network issues, and trying to understand why a perfectly working application suddenly refuses to start.

The Kubernetes ecosystem has matured significantly in 2025, and with it, the debugging tools have evolved from basic kubectl logs commands to AI-powered assistants that can diagnose issues faster than most humans. According to CNCF’s 2025 State of Cloud Native Development report, 77% of backend developers now use cloud native technologies, which means Kubernetes debugging skills are no longer optional—they’re a career requirement.

In this comprehensive guide, I’ll walk you through the top 10 Kubernetes debugging tools that have saved me (and the 17,000+ members of the Kubetools community) countless hours of troubleshooting. Whether you’re debugging a pod that won’t start, tracking down network connectivity issues, or optimizing cluster performance, these tools will become your best friends.

Why Kubernetes Debugging is Different (and Harder)

Before we dive into the tools, let’s understand why debugging Kubernetes is uniquely challenging:

The Complexity Factors

  1. Ephemeral Nature: Pods come and go, making it hard to debug crashed containers
  2. Distributed Systems: Your application spans multiple nodes, namespaces, and clusters
  3. Minimal Images: Best practices recommend distroless images with no debugging tools
  4. Too Much Data: Logs, events, and metrics flood in from every direction
  5. Dynamic State: Autoscaling, rolling updates, and self-healing create constant change

The statistics tell the story (industry estimates):

  • Average time to detect issues in Kubernetes: 23 minutes
  • Average time to resolve: 4 hours 12 minutes
  • Cost of downtime: $5,600 per minute for enterprise applications
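To make that downtime figure concrete, here is the back-of-the-envelope arithmetic for one average-length incident using the numbers above:

```shell
# Cost of one average incident: 4 hours 12 minutes at $5,600/minute
minutes=$((4 * 60 + 12))      # 252 minutes
cost=$((minutes * 5600))      # dollars
echo "$cost"                  # 1411200 -> roughly $1.4M per incident
```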

The right debugging tools can cut these times by 60-70%. Let’s explore them.


1. kubectl debug: The Game-Changer

Status: Stable (GA) since Kubernetes 1.25
What it solves: Debugging distroless containers and crashed pods
Why it’s #1: Changes everything about how we debug Kubernetes

The Problem It Solves

Imagine this scenario (probably familiar to you):

$ kubectl exec -it my-pod -- /bin/sh
error: Internal error occurred: error executing command in container: 
failed to exec in container: failed to start exec "xxx": 
OCI runtime exec failed: exec failed: container_linux.go:380: 
starting container process caused: exec: "/bin/sh": 
stat /bin/sh: no such file or directory: unknown

Your production pod is running a distroless image (as it should for security), but now you can’t debug it because there’s no shell, no curl, no debugging tools at all.

Enter kubectl debug and ephemeral containers.

What Are Ephemeral Containers?

Ephemeral containers are temporary debugging containers that can be attached to running pods. Think of them as “debug sidecars” that you inject on-demand without restarting your pod.

Key Characteristics:

  • Run temporarily in existing pods
  • Share namespaces (network, IPC, PID) with target containers
  • Can’t be restarted or have ports exposed
  • No resource guarantees
  • Can’t be added at pod creation (API enforced)
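Under the hood, kubectl debug patches the pod's ephemeralcontainers subresource. Here is a sketch of what that entry looks like in the pod spec afterwards (container and image names are illustrative):

```yaml
# Visible via: kubectl get pod my-pod -o yaml
spec:
  ephemeralContainers:
    - name: debugger-x7k2p          # name auto-generated by kubectl
      image: busybox
      targetContainerName: app      # whose process namespace to share
      stdin: true
      tty: true
```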

Practical Examples

Example 1: Debug a Distroless Container

# Create a distroless nginx pod
kubectl run distroless-pod --image=gcr.io/distroless/base

# Try to exec into it (this will fail)
kubectl exec -it distroless-pod -- /bin/sh
# Error: /bin/sh not found

# Use kubectl debug to attach an ephemeral container
kubectl debug -it distroless-pod --image=busybox --target=distroless-pod

# Now you have a shell with debugging tools!
# The --target flag shares process namespace with the target container

What just happened?

  1. kubectl created an ephemeral container using the busybox image
  2. The ephemeral container shares the process namespace with your target
  3. You can now see and debug processes from the original container

Example 2: Debug a Crashed Container

# Container keeps crashing - you can't exec into it
kubectl get pods
# NAME                          READY   STATUS             RESTARTS   AGE
# crashloop-pod                 0/1     CrashLoopBackOff   5          3m

# Create a debug copy with process namespace sharing
kubectl debug crashloop-pod -it \
  --image=nicolaka/netshoot \
  --share-processes \
  --copy-to=debug-pod

The --copy-to flag creates a new pod copy specifically for debugging, so you don’t affect the original. This is perfect for production troubleshooting.

Example 3: Node-Level Debugging

Sometimes the issue isn’t with a pod—it’s with the node itself.

# Debug the node directly
kubectl debug node/worker-node-1 -it --image=ubuntu

# This creates a pod that:
# 1. Runs in the host's namespaces
# 2. Mounts the host filesystem at /host
# 3. Has full node access

# Now you can inspect node-level issues
chroot /host
systemctl status kubelet
journalctl -u kubelet
df -h
netstat -tulpn

Example 4: Advanced Namespace Sharing

# Share multiple namespaces for deep debugging
kubectl debug -it my-pod \
  --image=nicolaka/netshoot \
  --target=app-container \
  --profile=general

# The --profile flag sets security contexts
# Available profiles: general, baseline, restricted, netadmin, sysadmin

Debug Profiles Explained

Recent kubectl releases introduced debug profiles to make ephemeral containers safer:

  • general: Reasonable defaults, most common use case
  • baseline: Minimal privileges
  • restricted: Most restrictive (for strict security environments)
  • netadmin: Network debugging (tcpdump, etc.)
  • sysadmin: System-level debugging (requires elevated privileges)

Pro Tips from Production

1. Pre-load debug images on nodes:

# Add this to your node initialization (on containerd nodes, use crictl pull instead)
docker pull nicolaka/netshoot
docker pull busybox
docker pull alpine

2. Create aliases for common debug commands:

alias kdebug='kubectl debug -it --image=nicolaka/netshoot'
alias kdebug-node='kubectl debug node/$(kubectl get nodes -o jsonpath="{.items[0].metadata.name}") -it --image=ubuntu'

3. Use the right image for the job:

  • busybox: Lightweight, basic tools (2MB)
  • alpine: Package manager available (5MB)
  • nicolaka/netshoot: Network debugging powerhouse (traceroute, tcpdump, iperf)
  • ubuntu/debian: Full distribution when you need it

When to Use kubectl debug

Use kubectl debug when:

  • Debugging distroless or minimal container images
  • Container keeps crashing (CrashLoopBackOff)
  • Need to inspect processes without kubectl exec
  • Troubleshooting node-level issues
  • Want to preserve production pod for forensics

Don’t use kubectl debug when:

  • Simple log viewing (kubectl logs is faster)
  • Basic container inspection (use kubectl describe)
  • Already have shell access via kubectl exec

2. k9s: Terminal UI for Kubernetes

GitHub: https://github.com/derailed/k9s
Stars: 27,000+
What it solves: Context switching fatigue, real-time cluster visibility
Think of it as: Vim meets Kubernetes

Why k9s Changes Everything

After years of typing kubectl get pods, kubectl describe pod xyz, kubectl logs xyz -f, I discovered k9s, and it transformed how I work with Kubernetes.

k9s is a terminal-based UI that provides:

  • Real-time cluster visibility (no more refresh loops)
  • Keyboard-driven navigation (blazingly fast)
  • Built-in log streaming (no more tail -f nightmares)
  • Resource monitoring (CPU/Memory at a glance)
  • Quick actions (delete, describe, edit, shell—all with hotkeys)

Installation


# macOS
brew install k9s

# Linux
curl -sL https://github.com/derailed/k9s/releases/latest/download/k9s_Linux_amd64.tar.gz | tar xz
sudo mv k9s /usr/local/bin/

# Windows (using Chocolatey)
choco install k9s

# Or via go
go install github.com/derailed/k9s@latest

Essential k9s Keyboard Shortcuts

Once you launch k9s, here are the commands that will become muscle memory:

Navigation:

:pods         # View pods
:deployments  # View deployments
:services     # View services
:nodes        # View nodes
:ns           # View namespaces
/ <search>    # Search/filter
Esc           # Back
Ctrl-a        # Show all namespaces

Actions (when item selected):

d             # Describe resource
l             # View logs
shift-f       # Port-forward
e             # Edit resource
Del           # Delete resource
s             # Shell into pod
y             # YAML view
Ctrl-k        # Kill (force delete)

Log Viewing:

0             # Show logs from all containers
1-9           # Show specific container
f             # Toggle auto-scroll
w             # Toggle log wrapping
p             # Previous logs
c             # Copy to clipboard

Real-World k9s Workflows

Workflow 1: Debugging a Failing Deployment

# Launch k9s
k9s

# Navigate
:deployments → Find your deployment → Enter

# It shows:
# 1. Replica status
# 2. Pod status
# 3. Age
# 4. Real-time updates

# Select a pod → 'd' to describe
# Instantly see events, conditions, and errors

# Press 'l' for logs
# Toggle between containers with 1-9

# Press 's' to shell into the pod
# Debug directly without typing kubectl exec

Workflow 2: Monitoring Resource Usage

k9s

# Type :pulse to open the Pulses (cluster health) view
# See real-time CPU/Memory activity at a glance

# Sort by CPU: Shift-c
# Sort by Memory: Shift-m

# Identify resource hogs instantly

Workflow 3: Multi-Cluster Management


# k9s automatically detects your kubeconfig contexts

# Switch context in k9s:
:contexts → Select → Enter

# Or use kubectx integration:
:ctx

k9s Plugins and Customization

k9s supports custom plugins for extending functionality:

Create ~/.config/k9s/plugins.yaml:

plugins:
  # Debug pod with stern
  debug-stern:
    shortCut: Ctrl-S
    description: Tail logs with stern
    scopes:
      - pods
    command: stern
    background: false
    args:
      - $NAME
      - -n
      - $NAMESPACE
      
  # Get pod events
  get-events:
    shortCut: Ctrl-E
    description: Get events for pod
    scopes:
      - pods
    command: kubectl
    background: false
    args:
      - get
      - events
      - --field-selector
      - involvedObject.name=$NAME
      - -n
      - $NAMESPACE

k9s vs kubectl: The Speed Difference

Task: Find a failing pod and view its logs

kubectl way:

kubectl get pods -A | grep -v Running    # 3 seconds
kubectl logs -n production app-xyz-123 -f # 2 seconds
# Total: 5 seconds + mental overhead

k9s way:

k9s
:pods
/!Running     (inverse filter: hide Running pods)
<select pod>
l
# Total: 2 seconds, zero mental overhead

k9s Skins and Themes

# Skins are plain YAML files in the k9s config directory
# (exact location varies by version):
ls ~/.config/k9s/skins/

# Select the active skin in ~/.config/k9s/config.yaml (k9s.ui.skin),
# or tweak a skin file directly:
vim ~/.config/k9s/skins/default.yaml

3. Stern: Multi-Pod Log Streaming

GitHub: https://github.com/stern/stern
Stars: 8,000+
What it solves: Tailing logs from multiple pods simultaneously
One-liner: tail -f for Kubernetes on steroids

The Problem Stern Solves

You’re running a microservices application with multiple replicas:

kubectl get pods
# NAME                    READY   STATUS
# api-server-abc123       1/1     Running
# api-server-def456       1/1     Running
# api-server-ghi789       1/1     Running

You want to see logs from all three pods simultaneously. With kubectl, you’d need:

kubectl logs -f api-server-abc123 &
kubectl logs -f api-server-def456 &
kubectl logs -f api-server-ghi789 &
# Now you have three terminal windows... 🤦

Stern does this in one command:

stern api-server

Installation

# macOS
brew install stern

# Linux
curl -LO https://github.com/stern/stern/releases/download/v1.28.0/stern_1.28.0_linux_amd64.tar.gz
tar xvzf stern_1.28.0_linux_amd64.tar.gz
sudo mv stern /usr/local/bin/

# Windows
choco install stern

Basic Usage

# Tail all pods matching pattern
stern api-server

# Tail pods in specific namespace
stern api-server -n production

# Tail all pods in namespace
stern . -n production

# Include timestamps
stern api-server -t

# Since last 5 minutes
stern api-server --since 5m

# Exclude init containers
stern api-server --exclude-container istio-init

# Multiple containers
stern api-server --container app,sidecar

Advanced Filtering

Color-coded by pod name (default):

stern api-server
# api-server-abc123 › app › [timestamp] GET /health 200
# api-server-def456 › app › [timestamp] POST /api/users 201
# api-server-abc123 › app › [timestamp] GET /metrics 200

Filter by log level:

# Only ERROR logs
stern api-server | grep ERROR

# Or use --include flag
stern api-server --include 'ERROR|FATAL'

# Exclude INFO logs
stern api-server --exclude 'INFO|DEBUG'

Multiple namespaces:


# Across all namespaces
stern api-server --all-namespaces

# Multiple specific namespaces
stern api-server -n production,staging

Stern + jq: Power Combo

If your logs are JSON:

# Pretty print JSON logs
stern api-server --template '{{.Message}}' | jq

# Extract specific fields
stern api-server -o json | jq '.message, .timestamp, .level'

# Filter JSON logs
stern api-server -o json | jq 'select(.level == "error")'

Stern Templates

Customize output format:

# Custom template
stern api-server --template '{{.PodName}} | {{.ContainerName}} | {{.Message}}'

# Kubernetes labels
stern api-server --template '{{.PodName}} ({{index .PodLabels "version"}}) | {{.Message}}'

# Simplified output
stern api-server --template '{{.Message}}'

Real-World Stern Use Cases

Use Case 1: Debugging Distributed Request

You’re tracking a request across microservices:

# Follow request ID across all services
stern -n production . | grep "request-id-123"

# Result shows the request flowing through:
# api-gateway: Received request-id-123
# auth-service: Validated request-id-123
# user-service: Processing request-id-123
# database-proxy: Query for request-id-123
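The same filtering idea, shown self-contained on canned log lines (sample data standing in for real stern output, so you can try it without a cluster):

```shell
# Fake multi-service output, interleaved the way stern would emit it
cat <<'EOF' > /tmp/sample-logs.txt
api-gateway    Received request-id-123
auth-service   Validated request-id-999
auth-service   Validated request-id-123
user-service   Processing request-id-123
EOF

# The same grep used against stern above
grep "request-id-123" /tmp/sample-logs.txt
```

Each matching line keeps its pod prefix, so the request's path through the services reads top to bottom.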

Use Case 2: New Deployment Monitoring

# Watch logs of newly deployed pods
stern api-server --since 30s -t

# This shows:
# 1. New pods starting
# 2. Health checks passing
# 3. First requests coming in
# 4. Any startup errors

Use Case 3: Container-Specific Debugging

# Only sidecar logs (e.g., Istio proxy)
stern . --container istio-proxy -n production

# Init container logs are tailed by default; turn them off with:
stern . --init-containers=false

Stern Aliases I Use Daily

# In ~/.bashrc or ~/.zshrc
alias slogs='stern --all-namespaces --since 1h'
alias sprod='stern -n production'
alias serrors='stern --all-namespaces | grep -E "ERROR|FATAL|error|fatal"'
alias sfollow='stern --tail 0 --since 1s'

Stern + Other Tools

Stern + Loki:

# Pipe stern's raw output into promtail's stdin mode (promtail still needs its config file)
stern api-server -o raw | promtail --stdin --client.url="http://loki:3100/loki/api/v1/push"

Stern + Slack alerts:

# Alert on errors (escape the quotes so the payload is valid JSON)
stern api-server | grep --line-buffered ERROR | while read -r line; do
    curl -X POST https://hooks.slack.com/... -d "{\"text\": \"$line\"}"
done

4. Lens with Lens Prism: AI-Powered Kubernetes IDE

Website: https://k8slens.dev
What it solves: Visual cluster management + AI-powered debugging
Think of it as: VS Code for Kubernetes, with ChatGPT built-in

Why Lens is Different

Lens Desktop (formerly known as Lens IDE) is the most popular Kubernetes IDE, with millions of users. In 2025, they added Lens Prism—an AI copilot that can:

  • Diagnose issues: Just describe the problem in plain English
  • Generate kubectl commands: “Show me pods using more than 1GB RAM”
  • Explain errors: Paste an error, get human-readable explanation
  • Suggest fixes: AI-powered troubleshooting recommendations

Installation

# Download from https://k8slens.dev

# macOS
brew install --cask lens

# Windows
choco install lens

# Linux
# Download .AppImage from releases

Key Features

1. Multi-Cluster Dashboard

  • Manage multiple clusters from one interface
  • Real-time resource monitoring
  • Visual topology maps
  • Cluster health at a glance

2. Terminal Integration

  • Built-in terminal with kubectl autocomplete
  • Pod shell access with one click
  • Multi-terminal support

3. Resource Management

  • Edit YAML inline
  • Visual logs viewer with filtering
  • Port-forward with UI
  • Resource metrics (CPU/Memory) per pod

4. Lens Prism (AI Copilot)

The game-changer. Here’s what Lens Prism can do:

Example 1: Diagnose CrashLoopBackOff

You: "My pod api-server-abc123 is in CrashLoopBackOff"

Lens Prism: 
"I've analyzed the pod. The issue is:
1. Container exits with code 137 (OOMKilled)
2. Memory limit: 128Mi
3. Actual usage spike: 512Mi

Recommended fixes:
- Increase memory limit to 512Mi
- Add resource requests to prevent over-scheduling
- Check for memory leaks in application code

Would you like me to generate a patch to update the memory limits?"

Example 2: Performance Optimization

You: "Which pods are using the most CPU?"

Lens Prism:
"Top CPU consumers:
1. elasticsearch-0: 3.2 cores (320%)
2. api-server-xyz: 1.8 cores (180%)
3. worker-abc: 1.2 cores (120%)

Analysis:
- elasticsearch-0 is above its 2 core limit
- Recommend vertical scaling or optimization
- Can generate HPA configuration if needed"

Real-World Lens Workflows

Workflow 1: Visual Debugging

  1. Cluster Overview: See all resources at a glance
  2. Click on failing pod: Instant access to:
    • Events
    • Logs (with log level filtering)
    • YAML definition
    • Metrics
    • Shell access
  3. Ask Lens Prism: “Why is this failing?”
  4. Apply fix: Edit YAML inline or use generated kubectl command

Workflow 2: Port-Forward Management

Instead of managing multiple kubectl port-forward terminals:

  1. Click pod → Port Forward → Select port
  2. Lens manages the tunnel in background
  3. Access localhost:PORT in browser
  4. Stop/start with one click

Workflow 3: Helm Release Management

  • Visual Helm chart browser
  • One-click upgrades
  • Diff view before applying changes
  • Rollback with one click

Lens Extensions

Lens supports extensions for additional functionality:


Popular extensions:

  • @alebcay/openlens-node-pod-menu
  • @nevalla/kube-context-cluster-name
  • @spectrocloud/lens-extension

Lens vs k9s: When to Use Each

Use k9s when:

  • You live in the terminal
  • Need speed (keyboard shortcuts)
  • Working on remote SSH sessions
  • Prefer CLI workflow

Use Lens when:

  • Visual learner / prefer GUIs
  • Multi-cluster management
  • Need AI assistance (Lens Prism)
  • Collaborating with team (screenshots, sharing)
  • Managing Helm releases

5. K8sGPT: AI Kubernetes Diagnostics

GitHub: https://github.com/k8sgpt-ai/k8sgpt
Stars: 5,000+
What it solves: Automated Kubernetes cluster analysis
Powered by: OpenAI, Azure OpenAI, Claude, or local LLMs

What is K8sGPT?

K8sGPT is a CLI tool that scans your Kubernetes cluster for issues and uses AI to:

  • Identify problems
  • Explain root causes
  • Suggest fixes
  • Generate remediation commands

Think of it as having a Kubernetes expert review your cluster 24/7.

Installation


# macOS
brew install k8sgpt

# Linux
curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/latest/download/k8sgpt_linux_amd64.tar.gz
tar xvzf k8sgpt_linux_amd64.tar.gz
sudo mv k8sgpt /usr/local/bin/

# Windows
choco install k8sgpt

Setup


# Authenticate with OpenAI
k8sgpt auth add --backend openai --model gpt-4
# Enter your OpenAI API key when prompted

# Or use Azure OpenAI
k8sgpt auth add --backend azureopenai --model gpt-4

# Or use local LLM (privacy-focused)
k8sgpt auth add --backend localai --model ggml-gpt4all-j

Basic Usage


# Analyze cluster
k8sgpt analyze

# Example output:
# 0 api-server-abc123(production)
# - Error: CrashLoopBackOff
# - Analysis: Container is exiting with code 1. 
#   Logs show "Error: ECONNREFUSED - Cannot connect to database"
# - Solution: Check if database service is running and accessible.
#   Verify DATABASE_URL environment variable.

# 1 worker-def456(staging)  
# - Error: ImagePullBackOff
# - Analysis: Image "myapp:latest" not found in registry
# - Solution: Either push the image to registry or update deployment 
#   to use existing image tag.

Advanced Features

Explain Specific Resource


# Analyze pods, scoped to a specific namespace
k8sgpt analyze --filter Pod --namespace production

# Analyze deployment
k8sgpt analyze --filter Deployment

# Multiple filters
k8sgpt analyze --filter Pod,Service,Ingress

Generate Fixes


# Analyze and explain in detail
k8sgpt analyze --explain

# Anonymize resource names before they are sent to the AI backend
k8sgpt analyze --explain --anonymize

Integration with Existing Tools


# Output as JSON
k8sgpt analyze --output json

# Send to Slack
k8sgpt analyze --explain | slack-cli send

# Continuous monitoring
while true; do
  k8sgpt analyze --explain > cluster-health.txt
  sleep 300
done

K8sGPT Filters

Available analysis filters:

  • Pod: Analyze pod issues
  • Deployment: Deployment problems
  • ReplicaSet: ReplicaSet issues
  • Service: Service configuration
  • Ingress: Ingress problems
  • PersistentVolumeClaim: Storage issues
  • StatefulSet: StatefulSet analysis
  • Node: Node health


# Analyze specific resources
k8sgpt analyze --filter Pod,Service

# Remove a filter from the active set
k8sgpt filters remove Ingress

Real-World K8sGPT Examples

Example 1: Automated Morning Cluster Check


#!/bin/bash
# morning-check.sh

echo "🔍 Running K8sGPT cluster analysis..."
ISSUES=$(k8sgpt analyze --explain)

if [ -n "$ISSUES" ]; then
    echo "⚠️  Issues found!"
    echo "$ISSUES"
    # Send to Slack
    curl -X POST https://hooks.slack.com/... \
        -d "{\"text\": \"Cluster Issues:\n$ISSUES\"}"
else
    echo "✅ Cluster healthy!"
fi
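To run this check automatically, a crontab entry works well (the script path here is illustrative):

```shell
# Every weekday at 08:00
0 8 * * 1-5 /path/to/morning-check.sh
```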

Example 2: CI/CD Integration


# .github/workflows/k8s-health-check.yml
name: Kubernetes Health Check
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  
jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - name: Install K8sGPT
        run: brew install k8sgpt
      
      - name: Configure K8sGPT
        run: k8sgpt auth add --backend openai --model gpt-4 --password "$OPENAI_API_KEY"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      
      - name: Analyze Cluster
        run: |
          k8sgpt analyze --explain > cluster-report.txt
          cat cluster-report.txt
      
      - name: Upload Report
        uses: actions/upload-artifact@v4
        with:
          name: cluster-health-report
          path: cluster-report.txt

6. Prometheus + Grafana: The Observability Stack

Prometheus: https://prometheus.io
Grafana: https://grafana.com
What it solves: Metrics, monitoring, alerting
Status: Both are CNCF Graduated Projects

Why This Combo is Essential

Debugging isn’t just about logs—you need metrics to understand:

  • Is CPU/Memory actually a problem?
  • What happened right before the crash?
  • Are there patterns in failures?
  • Is performance degrading over time?

Prometheus + Grafana is the de facto standard for Kubernetes monitoring.

Quick Setup (Helm)


# Add Prometheus Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes both)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Port-forward Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Access Grafana at http://localhost:3000
# Default credentials: admin / prom-operator

Key Metrics for Debugging

Pod Metrics


# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage by pod
sum(container_memory_usage_bytes) by (pod)

# Pod restart count
kube_pod_container_status_restarts_total

# Pods not ready
kube_pod_status_ready{condition="false"}
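Two more queries I reach for when a pod keeps restarting (both metrics come from kube-state-metrics, which the stack above installs):

```promql
# Containers whose last termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Restart spikes over the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3
```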

Node Metrics


# Node CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100

Application Metrics


# HTTP request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Request latency (p95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Essential Grafana Dashboards

Pre-built dashboards (Dashboard ID):

  1. Kubernetes Cluster Monitoring: #7249
  2. Node Exporter Full: #1860
  3. Kubernetes Pod Resources: #6417
  4. Prometheus Stats: #2
  5. Istio Service Dashboard: #7645


To import a dashboard in Grafana:

  1. Click “+” → Import
  2. Enter the dashboard ID
  3. Select the Prometheus data source
  4. Click “Import”

Alerting Rules for Common Issues

Create prometheus-alerts.yaml:


groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          
      - alert: PodNotReady
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
        for: 5m
        labels:
          severity: warning
          
      - alert: HighMemoryUsage
        expr: (sum(container_memory_usage_bytes) by (pod) / 
               sum(container_spec_memory_limit_bytes) by (pod)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} using > 90% memory"
          
      - alert: NodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 5m
        labels:
          severity: critical

Grafana + Loki (Logs)

Enhance your observability with logs:


# Add the Grafana Helm repo, then install Loki (with promtail log shipping)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true

Now you can correlate metrics (Prometheus) with logs (Loki) in the same Grafana dashboard!
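With Loki wired in, you can query logs from Grafana's Explore view using LogQL. For example (label names depend on your promtail configuration):

```logql
# All production logs containing ERROR
{namespace="production"} |= "ERROR"

# Error rate per app over 5 minutes
sum by (app) (rate({namespace="production"} |= "ERROR" [5m]))
```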


7. Kubectx & Kubens: Context Switching

GitHub: https://github.com/ahmetb/kubectx
Stars: 17,000+
What it solves: Switching between clusters and namespaces
File size: 10KB (yes, seriously)

The Frustration It Solves


# Without kubectx/kubens:
kubectl config use-context production-cluster
kubectl config set-context --current --namespace=api-services
kubectl get pods

# With kubectx/kubens:
kubectx production
kubens api-services
kubectl get pods

Time saved: ~10 seconds per switch × 50 switches/day = 8 minutes daily

Installation


# macOS
brew install kubectx

# Linux (manual)
sudo git clone https://github.com/ahmetb/kubectx /opt/kubectx
sudo ln -s /opt/kubectx/kubectx /usr/local/bin/kubectx
sudo ln -s /opt/kubectx/kubens /usr/local/bin/kubens

# Add fzf for interactive mode
brew install fzf

Basic Usage

kubectx (cluster switching)


# List contexts
kubectx

# Switch context
kubectx production

# Switch to previous context
kubectx -

# Rename context
kubectx new-name=old-name

# Delete context
kubectx -d context-name

kubens (namespace switching)


# List namespaces
kubens

# Switch namespace
kubens production

# Switch to previous namespace
kubens -

# Show current namespace
kubens -c

Power User Tips

Aliases


# ~/.bashrc or ~/.zshrc
alias kx='kubectx'
alias kn='kubens'

# Now:
kx production
kn api-services

Interactive Fuzzy Search (with fzf)


# Just type kubectx (no args)
kubectx

# Get interactive menu:
# ─────────────────
# development
# staging
# production
# > production-east
# production-west
# ─────────────────
# Use arrow keys to select!

Bash/Zsh Completion


# Add to ~/.zshrc
source /opt/kubectx/completion/kubectx.zsh
source /opt/kubectx/completion/kubens.zsh

# Now tab completion works:
kubectx prod<TAB>
# Completes to: kubectx production

8. Telepresence: Local Development Bridge

Website: https://www.telepresence.io
What it solves: Debugging microservices locally while connected to cluster
Magic level: 🪄 High

The Problem

You’re developing a microservice that depends on 15 other services in your Kubernetes cluster. Options:

  1. Run everything locally: Impossible (database, cache, other services)
  2. Deploy to cluster for every change: Slow (build → push → deploy = 5 minutes)
  3. Mock everything: Tedious and unrealistic

Telepresence’s solution: Run your service locally while it appears to be in the cluster.

How It Works

Telepresence creates a bidirectional network proxy:

  • Your local code can call services in the cluster (as if it’s deployed)
  • Services in the cluster can call your local code (as if it’s deployed)

Installation


# macOS
brew install datawire/blackbird/telepresence

# Linux
sudo curl -fL https://app.getambassador.io/download/tel2/linux/amd64/latest/telepresence -o /usr/local/bin/telepresence
sudo chmod +x /usr/local/bin/telepresence

# Windows
choco install telepresence

Basic Usage


# Connect to cluster
telepresence connect

# Your laptop is now "inside" the cluster!
# You can access cluster services by their DNS names:
curl http://api-service.production.svc.cluster.local

# Intercept a deployment
telepresence intercept api-service --port 8080

# Now traffic to api-service goes to your localhost:8080
# Run your local code:
npm run dev  # or whatever your local command is

Intercept Patterns

1. Global Intercept (all traffic)


telepresence intercept api-service --port 8080

# ALL traffic to api-service → localhost:8080

2. Selective Intercept (only your traffic)


telepresence intercept api-service \
  --port 8080 \
  --http-header "x-user=ajeet"

# Only traffic with header x-user=ajeet → localhost:8080
# Other traffic → cluster as normal

3. Preview URLs (share with team)


telepresence intercept api-service \
  --port 8080 \
  --preview-url=true

# Generates URL: https://abc123.preview.edgestack.me
# Share with team to test your local changes!

Real-World Workflow


# Morning workflow:
telepresence connect

# Start intercepting
telepresence intercept my-service --port 3000

# Run local development server
npm run dev

# Now you can:
# 1. Debug with breakpoints
# 2. Hot reload on code changes
# 3. Test against real cluster services
# 4. Use production data

# When done:
telepresence leave my-service
telepresence quit

9. Pixie: eBPF-Based Observability

Website: https://px.dev
Status: CNCF Sandbox Project
What it solves: Zero-instrumentation observability
Superpower: See everything without changing code

What Makes Pixie Special

Traditional monitoring requires instrumentation—you modify your code to emit metrics/traces. Pixie uses eBPF (extended Berkeley Packet Filter) to capture data at the kernel level without any code changes.

What Pixie can see:

  • HTTP/gRPC requests and responses
  • Database queries (MySQL, PostgreSQL, Redis)
  • DNS requests
  • Network connections
  • CPU/Memory profiles
  • SSL/TLS data (even encrypted traffic!)

Quick Start


# Install Pixie CLI
bash -c "$(curl -fsSL https://withpixie.ai/install.sh)"

# Deploy to cluster
px deploy

# Open web UI
px live

Live Debugging with Pixie

Example 1: HTTP Traffic Analysis


# In Pixie UI, run PxL (Pixie Language) script:

import px

# Get all HTTP requests to api-service
df = px.DataFrame('http_events')
df = df[df.ctx['service'] == 'api-service']
df = df[['time_', 'remote_addr', 'req_method', 'req_path', 'resp_status', 'latency_ms']]
px.display(df)

Output:

time_               remote_addr     req_method  req_path        resp_status  latency_ms
2025-12-19 10:23:45 10.1.2.3       GET         /api/users      200          45
2025-12-19 10:23:47 10.1.2.5       POST        /api/orders     201          123
2025-12-19 10:23:48 10.1.2.3       GET         /api/products   500          2100

Example 2: Database Query Performance


# Analyze slow MySQL queries
df = px.DataFrame('mysql_events')
df = df[df.latency_ms > 1000]  # Queries > 1 second
df = df[['time_', 'query', 'latency_ms']]
df = df.sort_values('latency_ms', ascending=False)
px.display(df)

Example 3: Service Dependency Map

python

# Automatic service dependency graph
px.display(px.ServiceGraph())

Renders a visual map of which services call which, discovered automatically.

Pixie Use Cases

  1. Performance debugging: Find slow endpoints
  2. Security: Detect unusual network patterns
  3. Cost optimization: Identify chatty services
  4. Compliance: Audit all database queries

10. Devtron: Kubernetes Dashboard with AI {#devtron}

Website: https://devtron.ai
What it solves: End-to-end Kubernetes application management
Think of it as: Kubernetes + CI/CD + Security in one dashboard

Key Features

  1. Multi-cluster Management: Single pane of glass
  2. Application Store: One-click Helm deployments
  3. CI/CD Pipelines: Built-in automation
  4. Security Scanning: Image vulnerability detection
  5. Resource Browser: Visual cluster exploration
  6. AI-Assisted Debugging: Smart error analysis

Quick Setup

bash

helm repo add devtron https://helm.devtron.ai
helm install devtron devtron/devtron-operator \
  --create-namespace --namespace devtroncd

Dashboard Features

  • Live Manifest Editing: Edit YAML in production (carefully!)
  • Log Streaming: Multi-pod log aggregation
  • Terminal Access: Built-in shell
  • Event Monitoring: Real-time event viewer
  • Resource Topology: Visual relationship maps

Bonus Tools Worth Mentioning

Kubewatch

What: Slack/Teams notifications for cluster events
Use: Get alerts when pods crash, deployments fail, etc.

Kubent (Kube No Trouble)

What: Detect deprecated API versions
Use: Before upgrading Kubernetes, find breaking changes

Popeye

What: Cluster sanitizer
Use: Find misconfigurations and best practice violations

bash

# Install
brew install derailed/popeye/popeye

# Scan cluster
popeye

# Generates report with scores:
# Pods: 85/100 ✅
# Deployments: 72/100 ⚠️
# Services: 95/100 ✅

Building Your Debugging Toolkit

The Minimalist Setup (Start Here)

  1. kubectl + kubectl debug (built-in)
  2. k9s (terminal UI)
  3. stern (logs)

Why: These three cover 80% of debugging scenarios with minimal setup.

The Professional Setup

Add to the minimalist setup:

  4. Lens (visual + AI)
  5. kubectx/kubens (context switching)
  6. Prometheus + Grafana (metrics)

The Enterprise Setup

All professional tools, plus:

  7. K8sGPT (AI analysis)
  8. Telepresence (local dev)
  9. Pixie (deep observability)
  10. Devtron (unified platform)


Debugging Workflow: Putting It All Together

Here’s my actual workflow when debugging a production issue:

Step 1: Initial Triage (k9s)

bash

k9s
# Quick visual: which pods are failing?
# Check events: what's the error?
# View logs: is there a clear error message?

Step 2: Deep Dive (kubectl debug)

bash

# If minimal image or crashed container:
kubectl debug failing-pod -it --image=nicolaka/netshoot --target=app

# Debug node if needed:
kubectl debug node/worker-1 -it --image=ubuntu

Step 3: Log Analysis (stern)

bash

# Tail logs across all replicas:
stern api-service | grep ERROR

# Check last 30 minutes:
stern api-service --since 30m

Step 4: Metrics Review (Grafana)

bash

# Check dashboard:
# - CPU/Memory spikes?
# - Request rate changes?
# - Error rate increase?

Step 5: AI Analysis (K8sGPT)

bash

# Get AI recommendations:
k8sgpt analyze --explain --filter Pod

# Often provides insights I missed

Step 6: Fix and Verify

bash

# Apply fix
kubectl apply -f fix.yaml

# Monitor with k9s
# Verify with stern
# Check metrics in Grafana

Common Debugging Scenarios Solved

Scenario 1: CrashLoopBackOff

Symptoms: Pod keeps restarting

Debugging:

bash

# View recent logs (even from crashed containers)
kubectl logs pod-name --previous

# Use kubectl debug if container exits too fast
kubectl debug pod-name -it --image=busybox --copy-to=debug-pod

# Check events
kubectl describe pod pod-name | grep -A 10 Events

# AI analysis
k8sgpt analyze --filter Pod --name pod-name --explain

Common causes:

  • Application crashes on startup
  • Missing environment variables
  • Cannot connect to dependencies
  • OOMKilled (memory limit too low)
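
When scripting this triage, the signal that distinguishes an OOMKill from an application crash lives in the pod's status. A minimal Python sketch (the field paths follow the Kubernetes pod status schema; the sample JSON is invented):

```python
import json

def last_termination_reason(pod):
    """Return (container, reason, exit_code) for each container that restarted."""
    results = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        term = cs.get("lastState", {}).get("terminated")
        if term:
            results.append((cs["name"], term.get("reason"), term.get("exitCode")))
    return results

# Shape of `kubectl get pod <name> -o json` output (sample data is invented)
pod = json.loads("""
{"status": {"containerStatuses": [
  {"name": "app", "restartCount": 7,
   "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
]}}
""")

print(last_termination_reason(pod))  # [('app', 'OOMKilled', 137)]
```

Exit code 137 (128 + SIGKILL) plus reason `OOMKilled` means the memory limit, not your code, killed the container.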

Scenario 2: ImagePullBackOff

Symptoms: Can’t pull container image

Debugging:

bash

# Check image name
kubectl describe pod pod-name | grep Image

# Verify image exists
docker pull <image-name>

# Check image pull secrets
kubectl get secrets
kubectl describe secret <secret-name>

Common causes:

  • Typo in image name/tag
  • Private registry without credentials
  • Network issues reaching registry
  • Image doesn’t exist
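
Typos are easier to spot once the reference is split into its parts. A rough sketch of how an image reference decomposes, assuming defaults of `docker.io` and `latest` (simplified: real references may also carry digests and registry ports):

```python
def parse_image_ref(ref):
    """Split an image reference into (registry, repository, tag).
    Simplified sketch: ignores sha256 digests and port-bearing registries."""
    if "/" not in ref:
        registry, rest = "docker.io", ref
    else:
        registry, _, rest = ref.partition("/")
        if "." not in registry and registry != "localhost":
            registry, rest = "docker.io", ref  # 'acme/api' style: no host part
    repo, _, tag = rest.rpartition(":")
    if not repo:
        repo, tag = rest, "latest"  # no explicit tag given
    return registry, repo, tag

print(parse_image_ref("ghcr.io/acme/api-service:v1.2"))
# ('ghcr.io', 'acme/api-service', 'v1.2')
print(parse_image_ref("nginx"))
# ('docker.io', 'nginx', 'latest')
```

Comparing each part against what the registry actually hosts narrows the failure to name, tag, or credentials.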

Scenario 3: Pending Pod

Symptoms: Pod stuck in Pending state

Debugging:

bash

# Check why it's pending
kubectl describe pod pod-name | grep -A 10 Events

# Check node resources
kubectl top nodes

# Check for taints/tolerations
kubectl describe nodes | grep Taints

Common causes:

  • Insufficient CPU/memory on nodes
  • No nodes match pod’s node selector
  • Volume mount issues
  • Pod priority/preemption
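
The scheduler's core arithmetic for the "insufficient CPU" case is simple: allocatable minus the sum of already-scheduled requests. A sketch of that check (the real scheduler also weighs memory, taints, and affinity):

```python
def parse_cpu(q):
    """Parse a Kubernetes CPU quantity: '500m' = 0.5 cores, '2' = 2 cores."""
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)

def cpu_headroom(allocatable, scheduled_requests):
    """Cores left on a node after subtracting already-scheduled pod requests."""
    return parse_cpu(allocatable) - sum(parse_cpu(r) for r in scheduled_requests)

free = cpu_headroom("4", ["500m", "1", "250m"])
print(free)  # 2.25 cores free; a pod requesting more than this stays Pending
```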

Scenario 4: Networking Issues

Symptoms: Services can’t communicate

Debugging:

bash

# Use debug container with network tools
kubectl debug pod-name -it --image=nicolaka/netshoot

# Inside debug container:
# Test DNS
nslookup service-name
dig service-name.namespace.svc.cluster.local

# Test connectivity
curl http://service-name:port
telnet service-name port

# Check network policies
kubectl get networkpolicies -A
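
If a NetworkPolicy is in play, the first question is whether its podSelector even selects the pod you're debugging. A simplified matcher (matchLabels only; real selectors also support matchExpressions):

```python
def selector_matches(selector, pod_labels):
    """Does a NetworkPolicy podSelector's matchLabels select this pod?
    Simplified: an empty selector matches every pod in the namespace."""
    want = selector.get("matchLabels", {})
    return all(pod_labels.get(k) == v for k, v in want.items())

policy_selector = {"matchLabels": {"app": "api-service"}}
print(selector_matches(policy_selector, {"app": "api-service", "tier": "web"}))  # True
print(selector_matches(policy_selector, {"app": "frontend"}))  # False
print(selector_matches({}, {"app": "anything"}))  # True: empty selector = all pods
```

A pod matched by any policy is isolated for that direction; traffic must then be explicitly allowed by some policy's rules.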

Scenario 5: Performance Issues

Debugging:

bash

# Check resource usage
kubectl top pods
kubectl top nodes

# Use Pixie for deep analysis
px live

# Check metrics in Grafana
# Look for:
# - CPU throttling
# - Memory pressure
# - High request latency
# - Error rates
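
The plain-text output of `kubectl top pods` is easy to post-process when you want the worst offenders ranked. A sketch (the pod names and numbers here are invented):

```python
def top_cpu(top_output, n=3):
    """Rank pods by CPU (millicores) from `kubectl top pods` text output."""
    rows = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, mem = line.split()
        rows.append((name, int(cpu.rstrip("m"))))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

sample = """NAME           CPU(cores)   MEMORY(bytes)
api-service    850m         412Mi
worker         120m         256Mi
frontend       310m         198Mi"""

print(top_cpu(sample))  # [('api-service', 850), ('frontend', 310), ('worker', 120)]
```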

Pro Tips from 10 Years of Kubernetes

1. Always Use Resource Limits

yaml

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Why: Prevents one pod from starving others. Makes debugging resource issues easier.
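
When reasoning about OOMKills it helps to convert quantities like `512Mi` into bytes. A sketch covering a simplified subset of the Kubernetes quantity grammar:

```python
# Binary (Ki/Mi/Gi/Ti) and decimal (K/M/G/T) suffixes; order matters so that
# 'Mi' is tested before 'M'.
_SUFFIX = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
           "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_memory(q):
    """Convert a Kubernetes memory quantity ('512Mi', '1G', '1024') to bytes.
    Simplified subset of the full quantity grammar."""
    for suffix, factor in _SUFFIX.items():
        if q.endswith(suffix):
            return int(float(q[:-len(suffix)]) * factor)
    return int(q)

print(parse_memory("512Mi"))  # 536870912
print(parse_memory("128Mi") / parse_memory("512Mi"))  # request is 25% of the limit
```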

2. Enable Process Namespace Sharing When Debugging

yaml

spec:
  shareProcessNamespace: true

Why: Allows ephemeral containers to see processes from other containers.

3. Use Liveness/Readiness Probes

yaml

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Why: Kubernetes can detect and handle unhealthy pods automatically.
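
The endpoints those probes hit can be tiny. A stdlib-only Python sketch (a real readiness handler would check dependencies such as the database) showing why liveness and readiness are separate: the process can be alive before it is ready:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class Probes(BaseHTTPRequestHandler):
    ready = False  # flip once startup work (migrations, caches) completes

    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)  # liveness: the process is up
        elif self.path == "/ready":
            self.send_response(200 if Probes.ready else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Probes)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

health = urllib.request.urlopen(f"http://127.0.0.1:{port}/health").status
print(health)  # 200: alive immediately
Probes.ready = True  # startup finished
ready = urllib.request.urlopen(f"http://127.0.0.1:{port}/ready").status
print(ready)  # 200: now receiving traffic
server.shutdown()
```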

4. Log to stdout/stderr (not files)

javascript

// ✅ Good
console.log('Request received');

// ❌ Bad
fs.appendFile('/var/log/app.log', 'Request received');

Why: Makes logs accessible via kubectl logs and log aggregation tools.

5. Use Structured Logging (JSON)

javascript

console.log(JSON.stringify({
  level: 'info',
  timestamp: new Date().toISOString(),
  message: 'Request received',
  requestId: req.id,
  userId: req.user.id
}));

Why: Easier to parse, filter, and search in log aggregation tools.
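
The payoff shows up the moment you post-process logs: JSON lines filter reliably where grep on free text guesses. A Python sketch (field names mirror the example above; the sample lines are invented):

```python
import json

def errors(log_lines):
    """Keep only error-level records from structured (JSON-per-line) logs."""
    out = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray plain-text lines
        if record.get("level") == "error":
            out.append(record)
    return out

lines = [
    '{"level": "info", "message": "Request received", "requestId": "a1"}',
    '{"level": "error", "message": "DB timeout", "requestId": "a2"}',
    'plain text noise',
]
print(errors(lines))  # [{'level': 'error', 'message': 'DB timeout', 'requestId': 'a2'}]
```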


The Future of Kubernetes Debugging

AI-Powered Debugging (2025 and Beyond)

Tools like Lens Prism and K8sGPT are just the beginning. Expect:

  • Predictive debugging: AI predicts failures before they happen
  • Automated remediation: AI fixes issues without human intervention
  • Natural language queries: “Show me why latency increased”
  • Root cause analysis: AI traces issues across the entire stack

eBPF Everywhere

Tools like Pixie prove eBPF is the future:

  • Zero-instrumentation observability
  • Kernel-level visibility
  • Minimal performance overhead
  • Works with any language/framework

Quick Reference Cheat Sheet

kubectl debug

bash

# Debug distroless container
kubectl debug pod-name -it --image=busybox --target=container-name

# Debug crashed pod
kubectl debug pod-name -it --image=busybox --copy-to=debug-pod

# Debug node
kubectl debug node/node-name -it --image=ubuntu

k9s

:pods         # View pods
d             # Describe
l             # Logs
s             # Shell
Ctrl-d        # Delete
/             # Filter

stern

bash

stern app-name                    # Tail logs
stern app-name -n namespace       # Specific namespace
stern . -n namespace              # All pods in namespace
stern app-name --since 5m         # Last 5 minutes
stern app-name | grep ERROR       # Filter logs

kubectx/kubens

bash

kubectx production    # Switch cluster
kubens staging        # Switch namespace
kubectx -             # Previous cluster
kubens -              # Previous namespace

Conclusion: Your Debugging Superpowers

After years of Kubernetes debugging, here’s what I’ve learned:

The 80/20 Rule

80% of debugging can be done with:

  1. kubectl debug (ephemeral containers)
  2. k9s (visual exploration)
  3. stern (log aggregation)

The remaining 20% requires:

  • Metrics (Prometheus + Grafana)
  • AI assistance (K8sGPT, Lens Prism)
  • Deep observability (Pixie)

Start Small, Scale Up

Week 1: Master kubectl debug and k9s
Week 2: Add stern for log analysis
Week 3: Set up Prometheus + Grafana
Month 2: Explore AI tools (K8sGPT, Lens Prism)
Month 3: Add advanced tools based on your needs
