Quick Takeaways (TL;DR)
- Pod failures account for 60% of K8s issues; master kubectl describe and log analysis
- Network debugging requires understanding CNI plugins, DNS resolution, and service mesh configuration
- Resource constraints cause silent failures; implement proper requests/limits monitoring
- Container image problems are preventable with proper registry and pull policy configuration
- Node issues need systematic health checks and kubelet log analysis
- Use a structured troubleshooting methodology: Identify → Isolate → Diagnose → Resolve
The 2 AM Wake-Up Call Every DevOps Engineer Dreads
It’s 2:17 AM. Your phone buzzes with alerts. The production Kubernetes cluster is hemorrhaging pod failures. Users can’t access services. Your engineering lead is already on Slack, typing…
This scenario plays out in infrastructure teams worldwide, costing companies an average of $300,000 per hour of downtime. Yet most Kubernetes troubleshooting guides read like cryptic spell books, leaving engineers frantically Googling at the worst possible moment.
Here’s the truth: Kubernetes troubleshooting isn’t about memorizing 50 kubectl commands. It’s about understanding the system’s anatomy and knowing exactly where to look when things break.
After debugging hundreds of production K8s clusters, from startups running 10 nodes to enterprises managing 5,000+ pods, I’ve distilled the troubleshooting process into actionable techniques that work under pressure. Whether you’re dealing with CrashLoopBackOff nightmares or mysterious networking black holes, this guide will transform you from panic-driven debugging to systematic problem-solving.
Understanding the Kubernetes Troubleshooting Landscape
Before diving into specific techniques, let’s establish context. Kubernetes orchestrates containerized applications across distributed systems, which creates multiple failure points:
Common failure categories:
- Application-level issues (40%)
- Infrastructure and node problems (25%)
- Networking and connectivity failures (20%)
- Configuration and RBAC errors (15%)
The key to effective Kubernetes debugging lies in understanding which layer is failing and applying targeted diagnostic techniques.
1. Mastering Pod Lifecycle Troubleshooting
Pods are the fundamental execution unit in Kubernetes, and pod failures represent the majority of troubleshooting scenarios you’ll encounter.

The Critical First Command
When a pod fails, your first move should always be:
kubectl describe pod <pod-name> -n <namespace>
This command reveals the complete pod history, including events that explain why containers aren’t starting. Look specifically at the Events section at the bottom: this is where Kubernetes tells you exactly what went wrong.
Real-world example: A fintech client experienced random pod evictions during peak trading hours. The kubectl describe output revealed OOMKilled events, indicating insufficient memory allocation. Adjusting memory limits from 512Mi to 1Gi eliminated the issue entirely.
Decoding Common Pod States
Understanding pod states accelerates diagnosis:
- Pending: Scheduler can’t find suitable nodes (check resource availability and node selectors)
- CrashLoopBackOff: Container crashes immediately after starting (examine application logs and liveness probes)
- ImagePullBackOff: Cannot retrieve container image (verify registry authentication and image name)
- Error/Completed: Batch jobs or init containers with exit codes (check container logs for specific errors)
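CrashLoopBackOff is frequently triggered by liveness probes that fire before the application finishes starting. As a reference, here is a sketch of a container spec with more forgiving probe timings; the image name, ports, and health-check paths are illustrative placeholders:

```yaml
# Illustrative container spec; image, ports, and endpoint paths are placeholders.
containers:
  - name: app
    image: registry.example.com/team/application:v1.2.3
    livenessProbe:
      httpGet:
        path: /healthz         # assumes the app exposes a health endpoint here
        port: 8080
      initialDelaySeconds: 30  # give slow-starting apps time before the first check
      periodSeconds: 10
      failureThreshold: 3      # restart only after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

If kubectl describe shows liveness probe failures right before each restart, loosening initialDelaySeconds or failureThreshold is often the minimal fix.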
Container Log Analysis Strategy
Logs contain the smoking gun for most application failures:
kubectl logs <pod-name> -n <namespace> --previous
The --previous flag is crucial: it shows logs from the crashed container instance, not the current restart attempt. This reveals what actually caused the failure.
Pro tip: For multi-container pods, specify the container name: kubectl logs <pod-name> -c <container-name>. Follow logs in real-time with -f flag during active debugging.
2. Network Troubleshooting: Solving the Invisible Layer
Kubernetes networking is notoriously complex, involving CNI plugins, kube-proxy, DNS services, and potentially service meshes. Network issues manifest as timeout errors, connection refused messages, or mysterious “502 Bad Gateway” responses.
The Service Discovery Debug Process
When services can’t communicate, verify connectivity systematically:
Step 1: Test DNS resolution from inside a pod:
kubectl run debug-pod --image=nicolaka/netshoot --rm -it -- /bin/sh
nslookup <service-name>.<namespace>.svc.cluster.local
Step 2: Verify service endpoints are populated:
kubectl get endpoints <service-name> -n <namespace>
Empty endpoints mean your service selector doesn’t match any pod labels, a configuration mismatch.
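To see why endpoints come up empty, compare the Service selector against the pod template labels side by side. A minimal illustration, with hypothetical names:

```yaml
# The Service selector must match the pod template labels key for key.
apiVersion: v1
kind: Service
metadata:
  name: payments
spec:
  selector:
    app: payments          # must match the pod labels below exactly
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments      # a typo here (e.g. "payment") leaves the endpoints empty
    spec:
      containers:
        - name: app
          image: registry.example.com/team/payments:v1.0.0
```

Compare the output of kubectl get pods --show-labels against the selector in kubectl get svc <service-name> -o yaml to spot the mismatch.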
Step 3: Test direct pod-to-pod connectivity:
kubectl exec -it <source-pod> -- curl http://<destination-pod-ip>:<port>
This bypasses the service layer to isolate whether the issue is networking infrastructure or service configuration.
NetworkPolicy Debugging
NetworkPolicies act as firewall rules between pods. When implementing zero-trust networking, misconfigurations block legitimate traffic.
Diagnostic technique: Temporarily remove NetworkPolicies to test whether they’re causing connectivity issues. Export them first (kubectl get networkpolicy -n <namespace> -o yaml > policies-backup.yaml) so you can restore them afterward:
kubectl delete networkpolicy --all -n <namespace>
If connectivity restores, the issue is policy configuration. Review ingress/egress rules and pod selectors carefully.
Real-world case study: An e-commerce platform couldn’t connect their payment processing pods to the database after implementing network segmentation. The culprit? A NetworkPolicy missing the database port in the egress rules. Adding port 5432 to the egress specification resolved the issue immediately.
3. Resource Management and Node Health
Silent failures due to resource constraints are particularly insidious because Kubernetes may not generate obvious error messages.
Detecting Resource Exhaustion
Monitor resource utilization across your cluster:
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=memory
When nodes approach 90% CPU or memory utilization, the kubelet starts evicting pods based on QoS class priority.
Understanding Resource Requests vs Limits
Many engineers misunderstand the critical difference:
- Requests: Guaranteed resources for scheduling decisions
- Limits: Maximum resources before throttling (CPU) or termination (memory)
Configuration anti-pattern to avoid: Setting limits without requests creates unpredictable scheduling behavior. Always set both explicitly:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
Node Troubleshooting Workflow
When pods won’t schedule or nodes appear unhealthy:
- Check node status: kubectl get nodes
- Investigate node conditions: kubectl describe node <node-name>
- Examine kubelet logs: journalctl -u kubelet -n 100 (SSH to the node)
- Verify the container runtime: systemctl status containerd or systemctl status docker
Key insight: Look for “DiskPressure,” “MemoryPressure,” or “PIDPressure” conditions; these prevent new pod scheduling even when the node status shows “Ready.”
4. Configuration and Secret Management Issues
Misconfiguration causes approximately 15% of production incidents. These errors are preventable with proper validation.
ConfigMap and Secret Mounting Problems
When applications can’t find configuration files or environment variables:
Verification checklist:
- Confirm ConfigMap/Secret exists in the correct namespace
- Check volume mount paths match application expectations
- Verify file permissions allow container user to read mounted files
- Ensure key names in ConfigMap match environment variable names
Inspect mounted configuration:
kubectl exec <pod-name> -- ls -la /path/to/mounted/config
kubectl exec <pod-name> -- cat /path/to/mounted/config/file
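As a reference point for that checklist, here is a minimal sketch of a ConfigMap consumed both as an environment variable and as mounted files; every name and path is a placeholder:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"          # consumed as an env var below
  app.properties: |          # mounted as a file below
    feature.flag=true
---
# Pod spec fragment consuming the ConfigMap both ways.
containers:
  - name: app
    image: registry.example.com/team/application:v1.2.3
    env:
      - name: LOG_LEVEL
        valueFrom:
          configMapKeyRef:
            name: app-config
            key: LOG_LEVEL   # key name must match the ConfigMap exactly
    volumeMounts:
      - name: config
        mountPath: /etc/app  # path the application expects to read from
volumes:
  - name: config
    configMap:
      name: app-config
```

A mismatch at any of the marked points (ConfigMap name, key name, mount path) produces the “missing configuration” symptoms described above.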
RBAC Permission Debugging
“Forbidden” errors indicate insufficient permissions. Kubernetes RBAC involves ServiceAccounts, Roles, and RoleBindings.
Diagnostic command:
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<serviceaccount-name>
This tests whether a ServiceAccount has specific permissions without trial-and-error deployment attempts.
Example scenario: A monitoring agent couldn’t scrape metrics across namespaces. The issue? The ServiceAccount had a Role (namespace-scoped) instead of ClusterRole (cluster-wide). Converting to ClusterRoleBinding granted the necessary permissions.
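The fix in that scenario looks roughly like the following; the role, binding, and ServiceAccount names are illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes"]   # resources the agent needs across namespaces
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics-reader
subjects:
  - kind: ServiceAccount
    name: monitoring-agent         # hypothetical ServiceAccount name
    namespace: monitoring
```

After applying, re-run the kubectl auth can-i check above to confirm the permission took effect without redeploying the agent.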
5. Image and Registry Troubleshooting
Container image problems prevent pod startup and are often mistaken for application issues.
ImagePullBackOff Root Cause Analysis
This state indicates Kubernetes cannot retrieve your container image. Common causes:
Authentication failures: Verify image pull secrets are configured correctly:
kubectl get secret <imagepullsecret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
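If the secret decodes correctly but pulls still fail, confirm the pod actually references it. A minimal pod spec fragment, with a hypothetical secret name:

```yaml
spec:
  imagePullSecrets:
    - name: registry-credentials   # must exist in the same namespace as the pod
  containers:
    - name: app
      image: registry.example.com/team/application:v1.2.3
```

Alternatively, attach the secret to the pod’s ServiceAccount so every pod using that account inherits it.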
Image name typos: Confirm the full image path including registry, repository, and tag:
image: registry.example.com/team/application:v1.2.3
Registry connectivity: Test registry access from within the cluster:
kubectl run test --image=curlimages/curl --rm -it -- curl -I https://registry.example.com
Image Pull Policy Pitfalls
The imagePullPolicy field controls when Kubernetes pulls images:
- Always: Pulls on every pod creation (use for :latest tags)
- IfNotPresent: Only pulls if not cached locally (default for versioned tags)
- Never: Only uses locally cached images
Best practice: Use specific version tags (not :latest) and set imagePullPolicy: IfNotPresent to reduce registry load and improve pod startup time.
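Putting that best practice into a spec (the image path is illustrative):

```yaml
containers:
  - name: app
    image: registry.example.com/team/application:v1.2.3  # pinned version, never :latest
    imagePullPolicy: IfNotPresent  # skip registry round-trips when the image is cached
```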
Technical Deep Dive: Advanced Debugging Techniques
<details> <summary><strong>Click to expand advanced debugging methods for experienced engineers</strong></summary>
Ephemeral Debug Containers (Kubernetes 1.23+)
Debug containers allow you to attach debugging tools to running pods without modifying the original container image:
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>
This launches a debugging container in the same process namespace, allowing network and process inspection without pod restart.
API Server Audit Log Analysis
For security incidents or mysterious permission changes, API server audit logs record the complete history of API requests, including who made each one. Note that auditing must be enabled via an audit policy flag on the API server; it is not on by default. For a quick first pass without audit logging, cluster events show recent state changes:
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
In the audit logs, filter by specific resources or users to trace configuration changes back to their source.
Custom Metrics and Prometheus Queries
Beyond basic resource monitoring, query custom application metrics:
kubectl port-forward -n monitoring svc/prometheus 9090:9090
Access Prometheus UI and query pod restart rates, request latencies, or custom business metrics that indicate application health beyond Kubernetes status.
CoreDNS Troubleshooting
DNS resolution failures plague many clusters. Inspect CoreDNS configuration and logs:
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}'
Common fixes include adjusting timeout values, configuring upstream resolvers correctly, or scaling CoreDNS replicas during high query loads. </details>
Systematic Troubleshooting Methodology
Effective debugging follows a structured approach regardless of the specific issue:
The 5-Step Troubleshooting Framework
- Identify symptoms: Gather error messages, metrics, and user reports
- Isolate the component: Determine if the issue is application, network, node, or configuration
- Diagnose root cause: Use targeted commands to pinpoint the exact failure point
- Resolve the issue: Apply the minimal fix that addresses the root cause
- Verify and document: Confirm resolution and record the solution for future reference
Pro tip: Create runbooks for common scenarios your team encounters. Document the specific commands, expected outputs, and resolution steps. This transforms tribal knowledge into repeatable processes.
Essential Tools for Your Troubleshooting Toolkit
Beyond kubectl, these tools accelerate diagnosis:
- k9s: Terminal-based cluster navigator with real-time metrics
- Lens: Desktop GUI for multi-cluster management and debugging
- stern: Tail logs from multiple pods simultaneously with color-coded output
- kubectx/kubens: Quickly switch between clusters and namespaces
- Popeye: Automated cluster hygiene scanner that identifies configuration issues
Install these tools to reduce cognitive load during high-pressure incidents.
Preventive Measures: Stop Issues Before They Start
The best troubleshooting is prevention. Implement these practices:
Proactive Monitoring
- Deploy comprehensive monitoring (Prometheus + Grafana stack)
- Set up alerting for pod restarts, resource saturation, and failed deployments
- Monitor cluster-level metrics (API server latency, etcd health, scheduler queue depth)
Configuration Validation
- Use admission controllers (OPA/Gatekeeper) to enforce policies
- Implement CI/CD pipeline validation for manifests
- Adopt GitOps practices for auditable configuration changes
Chaos Engineering
Deliberately inject failures to verify your troubleshooting skills and cluster resilience:
- Randomly terminate pods
- Introduce network latency
- Simulate node failures
Tools like Chaos Mesh and Litmus make chaos engineering accessible for Kubernetes environments.
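As one hedged example, a Chaos Mesh experiment that kills a randomly selected pod might look like the following; the namespaces and label selector are placeholders, and the exact schema can vary between Chaos Mesh versions, so check the docs for yours:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one              # kill one randomly selected matching pod
  selector:
    namespaces:
      - staging          # run experiments outside production first
    labelSelectors:
      app: payments      # hypothetical target label
```

Start these experiments in staging and verify that your alerts fire and your runbooks hold up before graduating to production game days.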
Frequently Asked Questions
Q: How do I troubleshoot pods that are running but not responding? A: Check liveness and readiness probes first. Use kubectl exec to access the container and test the application directly. Verify network connectivity to the pod using the debug techniques outlined in section 2.
Q: What causes “Evicted” pod states? A: Resource pressure on nodes triggers pod eviction. The kubelet evicts pods when disk, memory, or PID resources are exhausted. Check kubectl describe node for pressure conditions and review your resource requests/limits configuration.
Q: How can I debug intermittent issues that don’t appear consistently? A: Enable verbose logging in your applications, increase metric collection frequency, and use distributed tracing (Jaeger or Zipkin). Intermittent issues often stem from race conditions, transient network failures, or load-dependent behavior.
Q: Should I restart pods when troubleshooting? A: Avoid restarting until you’ve collected diagnostic information. Restarting destroys logs and runtime state. Use kubectl logs --previous, describe commands, and exec into pods before restarting.
Q: What’s the fastest way to check if my cluster is healthy? A: Run these three commands: kubectl get nodes, kubectl get pods --all-namespaces, and kubectl get --raw='/readyz?verbose'. These provide a quick overview of node health, pod states, and control plane status. (kubectl get componentstatuses also works but has been deprecated since Kubernetes 1.19.)
Your Kubernetes Troubleshooting Action Plan
Kubernetes debugging mastery doesn’t happen overnight, but systematic practice with these techniques builds confidence and speed. The difference between junior and senior platform engineers isn’t knowing every possible error; it’s having a mental framework for diagnosing any error methodically.
Start here:
- Bookmark this guide for your next production incident
- Practice these commands in a test cluster before you need them under pressure
- Build runbooks for your three most common failure scenarios
- Share this knowledge with your team to elevate everyone’s troubleshooting capabilities
Remember: Every production incident is a learning opportunity. Document what you discover, refine your processes, and gradually reduce mean time to resolution (MTTR).
Join the Conversation
Have a Kubernetes troubleshooting war story? Share your most challenging debugging experience in the comments below. Let’s learn from each other’s production battles.
Questions about a specific scenario? Drop your question, and I’ll provide targeted guidance based on the symptoms you’re experiencing.
Found this guide helpful? Share it with your DevOps team and help them debug faster. Every engineer deserves to sleep through the night without PagerDuty alerts.