Kubernetes Network Debug: kubectl & tcpdump

Introduction

Kubernetes networking can be a labyrinth, even for seasoned professionals. Pods can’t talk to services, services can’t reach external endpoints, or external traffic isn’t hitting your applications. These scenarios are frustratingly common and can bring your deployments to a screeching halt. The distributed nature of Kubernetes, with its overlay networks, CNI plugins, and intricate Service abstractions, introduces numerous layers where communication can break down. Pinpointing the exact cause often feels like searching for a needle in a haystack.

This guide aims to demystify Kubernetes network troubleshooting by equipping you with powerful, readily available tools: kubectl and tcpdump. We’ll explore how to leverage kubectl for initial diagnostics, inspecting resources and logs, and then dive into the granular world of packet analysis with tcpdump directly within your cluster. By combining these two utilities, you’ll gain unparalleled visibility into your cluster’s network fabric, allowing you to quickly identify and resolve even the most elusive networking issues. Whether you’re debugging a simple DNS lookup failure or a complex NetworkPolicy blockage, this comprehensive guide will provide the step-by-step instructions and insights you need.

TL;DR: Kubernetes Network Troubleshooting Essentials

Kubernetes networking issues are complex, but kubectl and tcpdump are your best friends. Start with kubectl to inspect resources (Pods, Services, Endpoints, NetworkPolicies) and logs. When you need deeper insight into packet flow, use tcpdump directly in a debug container on the affected node or pod. Remember to check DNS, Service IPs, and Network Policies first. For advanced observability, consider tools like eBPF Observability with Hubble.


# Basic connectivity check from a debug pod
kubectl run -it --rm debug-pod --image=busybox --restart=Never -- sh
/ # ping <TARGET_IP>
/ # nslookup <SERVICE_NAME>

# Get pod details including IP and node
kubectl get pod <POD_NAME> -o wide

# Describe service and endpoints
kubectl describe service <SERVICE_NAME>
kubectl describe endpoints <SERVICE_NAME>

# Check network policies affecting a pod
kubectl get networkpolicy -A -o yaml | grep -B 5 -A 5 "<POD_LABEL>"

# Run tcpdump on a node (requires privileged access or host-level tools)
# Or, run tcpdump inside a debug container on the target node:
kubectl debug node/<NODE_NAME> -it --image=nicolaka/netshoot -- /bin/bash
# (Inside debug container)
tcpdump -i any -nn host <POD_IP> and port <PORT> -vvv

# Run tcpdump inside an existing application pod (if tcpdump is installed)
kubectl exec -it <POD_NAME> -- tcpdump -i eth0 -nn host <TARGET_IP>

Prerequisites

  • A running Kubernetes cluster (Minikube, Kind, or a cloud-managed cluster like GKE, EKS, AKS).
  • kubectl configured to communicate with your cluster.
  • Basic understanding of Kubernetes concepts: Pods, Services, Deployments, Namespaces.
  • Familiarity with basic networking concepts: IP addresses, ports, DNS.
  • For tcpdump on nodes, you might need SSH access to the node or privileged access to run a debug container.
  • Some examples will use netshoot or busybox images, which are excellent for debugging.

Step-by-Step Guide: Kubernetes Network Troubleshooting with kubectl and tcpdump

1. Initial Sanity Checks with kubectl

Before diving deep, always start with the basics. Verify that your core Kubernetes resources are in a healthy state. This involves checking the status of your pods, services, and deployments. Often, a simple misconfiguration or a pending pod can be the root cause of a perceived network issue. We’ll use kubectl get and kubectl describe to gather initial information.


# Get all pods in a namespace and their statuses
kubectl get pods -n default

# Example Output:
# NAME                             READY   STATUS    RESTARTS   AGE
# my-app-deployment-789c687d5-abcde   1/1     Running   0          5m
# my-db-pod-fghij                   1/1     Running   0          5m

# Get detailed information about a specific pod, including its IP and node
kubectl describe pod my-app-deployment-789c687d5-abcde -n default

# Example Output Snippet:
# ...
# IP:             10.42.0.10
# Node:           kube-worker-1/192.168.1.10
# ...
# Events:
#   Type    Reason     Age    From               Message
#   ----    ------     ----   ----               -------
#   Normal  Scheduled  5m     default-scheduler  Successfully assigned default/my-app-deployment-789c687d5-abcde to kube-worker-1
#   Normal  Pulled     5m     kubelet            Container image "nginx:latest" already present on machine
#   Normal  Created    5m     kubelet            Created container my-app
#   Normal  Started    5m     kubelet            Started container my-app
# ...

Verify: Ensure all relevant pods are in the Running state and that their READY column shows 1/1 (or appropriate for multi-container pods). Note down the Pod IP and the Node where it’s running. Check the Events section for any warnings or errors that might indicate issues like image pull failures or scheduling problems.
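If you only need the IP and node, kubectl's JSONPath and custom-columns output options (both standard kubectl features) avoid scanning the full describe output; the pod name below is the example one from above:

```shell
# Print just the pod's IP and the node it is scheduled on
kubectl get pod my-app-deployment-789c687d5-abcde -n default \
  -o jsonpath='{.status.podIP}{"\t"}{.spec.nodeName}{"\n"}'

# Or tabulate the whole namespace in one pass
kubectl get pods -n default \
  -o custom-columns='NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName'
```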

2. Inspecting Services and Endpoints

Kubernetes Services provide a stable IP address and DNS name for a set of pods. If your application can’t reach a backend, the Service or its backing Endpoints are prime suspects. Endpoints are crucial as they map the Service’s virtual IP to the actual Pod IPs. Misconfigured selectors or unhealthy pods can lead to an empty or incorrect Endpoints list.


# Get all services in a namespace
kubectl get services -n default

# Example Output:
# NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
# kubernetes   ClusterIP   10.96.0.1      <none>        443/TCP    1h
# my-app-svc   ClusterIP   10.100.0.50    <none>        80/TCP     5m
# my-db-svc    ClusterIP   10.100.0.60    <none>        5432/TCP   5m

# Describe a specific service to see its selector and port mappings
kubectl describe service my-app-svc -n default

# Example Output Snippet:
# Name:              my-app-svc
# Namespace:         default
# Labels:            <none>
# Annotations:       <none>
# Selector:          app=my-app
# Type:              ClusterIP
# IP Family Policy:  SingleStack
# IP:                10.100.0.50
# IPs:               10.100.0.50
# Port:              http  80/TCP
# TargetPort:        8080/TCP
# Endpoints:         10.42.0.10:8080
# Session Affinity:  None
# Events:            <none>

# Crucially, check the Endpoints object for the service
kubectl get endpoints my-app-svc -n default

# Example Output:
# NAME         ENDPOINTS           AGE
# my-app-svc   10.42.0.10:8080     5m

Verify:

  • Does the Selector in the Service description match the labels of your target pods?
  • Is the TargetPort correct for your application?
  • Are there actual IP addresses listed under Endpoints? If not, no pods are backing your service, which is a common issue.
  • Note the Cluster-IP of the service. This is the IP internal pods will use to communicate with it.
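One frequent culprit behind an empty Endpoints list is a selector/label mismatch. A quick sanity check is to turn the Service's selector back into a label query and see whether any pods match; this sketch assumes jq is available:

```shell
# jq filter that turns a selector map into a label-selector string (e.g. "app=my-app")
TO_SELECTOR='.spec.selector | to_entries | map("\(.key)=\(.value)") | join(",")'

# Live usage:
#   SELECTOR=$(kubectl get service my-app-svc -n default -o json | jq -r "$TO_SELECTOR")
#   kubectl get pods -n default -l "$SELECTOR" -o wide   # no output here = no backends

# Demonstration on the example Service's selector:
echo '{"spec":{"selector":{"app":"my-app"}}}' | jq -r "$TO_SELECTOR"
# → app=my-app
```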

3. Testing Connectivity from a Debug Pod

The quickest way to diagnose internal network reachability is to deploy a temporary debug pod within the same namespace or node. This pod can then be used to run basic network utilities like ping, nslookup, wget, or netcat. This helps isolate whether the issue is with the application pod itself or the network path.


# Create a busybox pod for debugging
kubectl run -it --rm debug-pod --image=busybox --restart=Never -- /bin/sh

# Inside the debug pod:
/ # ping 10.100.0.50  # Ping the ClusterIP of my-app-svc
PING 10.100.0.50 (10.100.0.50): 56 data bytes
64 bytes from 10.100.0.50: seq=0 ttl=62 time=0.088 ms
64 bytes from 10.100.0.50: seq=1 ttl=62 time=0.076 ms
^C
--- 10.100.0.50 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.076/0.082/0.088 ms

/ # nslookup my-app-svc.default.svc.cluster.local # Resolve the service DNS name
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      my-app-svc.default.svc.cluster.local
Address 1: 10.100.0.50

/ # wget -O- http://my-app-svc:80 # Test HTTP connectivity (assuming port 80)
Connecting to my-app-svc:80 (10.100.0.50:80)
index.html           100% |********************************|   612  0:00:00 ETA

Verify:

  • Can you ping the Service ClusterIP? Keep in mind that ClusterIPs are virtual; with kube-proxy in iptables mode they often won’t answer ICMP even when the service is healthy. Treat a failed ping as suggestive, not conclusive, and prefer a TCP-level test (wget or nc).
  • Does nslookup resolve the Service DNS name to the correct ClusterIP? DNS issues are a very common cause of connectivity problems. For more on networking, check out our Kubernetes Gateway API vs Ingress: The Complete Migration Guide.
  • Can you make a successful HTTP request with wget or curl? This tests the full application layer connectivity.
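Because many CNIs and cloud networks drop ICMP, a raw TCP probe is often more trustworthy than ping. Flag support in busybox's nc varies by build, so a netshoot debug pod is a safer place to run this:

```shell
# Connect-only probe of the Service port: -z probes without sending data, -w 2 sets a two-second timeout
nc -zv -w 2 10.100.0.50 80 && echo "TCP connect OK" || echo "TCP connect FAILED"

# Probe the backing pod directly to separate Service/kube-proxy problems from pod problems
nc -zv -w 2 10.42.0.10 8080
```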

4. Checking Network Policies

Kubernetes NetworkPolicies are crucial for securing your cluster, but they are also a frequent source of connectivity issues. A policy might inadvertently block traffic that your application relies on. It’s essential to understand which policies apply to your pods and what rules they enforce. For a deep dive, see our Kubernetes Network Policies: Complete Security Hardening Guide.


# Get all network policies in a namespace
kubectl get networkpolicy -n default

# Example Output:
# NAME                 POD-SELECTOR     AGE
# allow-frontend-only  app=backend      10m
# allow-egress-dns     <none>           10m

# Describe a specific network policy to understand its rules
kubectl describe networkpolicy allow-frontend-only -n default

# Example Output Snippet:
# Name:         allow-frontend-only
# Namespace:    default
# Labels:       <none>
# Annotations:  <none>
# Spec:
#   PodSelector: app=backend
#   Policy Types: Ingress
#   Ingress:
#     From:
#       PodSelector: app=frontend
#     Ports:
#       - Protocol: TCP
#         Port: 8080

# To see ALL policies and try to identify ones affecting a pod (e.g., app=my-app)
kubectl get networkpolicy -A -o yaml | grep -B 5 -A 5 "app: my-app"

Verify:

  • Identify the PodSelector for each NetworkPolicy. Does it match the labels of your affected pods?
  • If a policy applies, examine its Ingress and Egress rules. Is the traffic you expect to flow explicitly allowed? Remember, once a pod is selected by *any* NetworkPolicy for a given direction (Ingress or Egress), all traffic in that direction not explicitly allowed is implicitly denied.
  • Consider temporarily disabling NetworkPolicies (if safe to do so in a dev environment) to confirm if they are the cause.
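Instead of grepping raw YAML, a custom-columns view (a standard kubectl output option) summarizes which pods every policy selects and which directions it governs:

```shell
# One row per NetworkPolicy across all namespaces
kubectl get networkpolicy -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,POD-SELECTOR:.spec.podSelector.matchLabels,TYPES:.spec.policyTypes'
```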

5. Examining Container Logs

Application logs can provide valuable clues about network issues from the application’s perspective. Connection refused errors, timeouts, or DNS resolution failures reported by your application are direct indicators of problems.


# Get logs from a specific pod
kubectl logs my-app-deployment-789c687d5-abcde -n default

# Example Output Snippet (if application logging connection errors):
# 2023-10-27 10:30:05.123 ERROR [main] com.example.MyApp - Failed to connect to database: Connection refused (Connection refused)
# 2023-10-27 10:30:05.124 INFO  [main] com.example.MyApp - Attempting to reconnect in 5 seconds...
# 2023-10-27 10:30:10.125 ERROR [main] com.example.MyApp - DNS lookup failed for my-db-svc: Name or service not known

Verify: Look for keywords like “connection refused,” “timeout,” “DNS lookup failed,” “host unreachable,” or “network error.” These directly point to where the application believes the network problem lies. This can guide your next steps, whether it’s checking firewall rules, DNS, or service availability.
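Those keywords fold naturally into a single case-insensitive pattern. The kubectl invocations in the comments are the live usage; the pipeline at the end just demonstrates the pattern against sample lines like the ones above:

```shell
# The usual network failure signatures, as one extended regex
NET_ERR_PATTERN='refused|timed? ?out|unreachable|no route|name or service not known'

# Live usage against a pod:
#   kubectl logs my-app-deployment-789c687d5-abcde -n default --tail=500 | grep -Ei "$NET_ERR_PATTERN"
#   kubectl logs my-app-deployment-789c687d5-abcde -n default --previous   # a crashed container's last words

# Demonstration on sample log lines:
printf '%s\n' \
  'ERROR Failed to connect to database: Connection refused' \
  'ERROR DNS lookup failed for my-db-svc: Name or service not known' \
  'INFO  Attempting to reconnect in 5 seconds...' \
  | grep -Ei "$NET_ERR_PATTERN"
```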

6. Advanced Diagnostics with tcpdump

When kubectl and logs only get you so far, tcpdump provides a deep dive into the actual packets flowing (or not flowing) across your network interfaces. This tool can run inside a debug container on a node or even within your application pod if it has tcpdump installed. We’ll focus on running it from a debug container for broader access.

Option A: Running tcpdump on a Node (Recommended for broader visibility)

This method allows you to capture traffic on the host’s network interfaces, including the CNI bridge, veth pairs, and physical interfaces. This is powerful for diagnosing issues related to CNI plugins, Cilium WireGuard Encryption, or node-level routing.


# First, identify the node where your affected pod is running
kubectl get pod <POD_NAME> -o wide

# Example: my-app-deployment-789c687d5-abcde is on node 'kube-worker-1'

# Use kubectl debug to spawn a privileged debug container on the node
# We'll use nicolaka/netshoot as it comes with many network tools including tcpdump
kubectl debug node/kube-worker-1 -it --image=nicolaka/netshoot -- /bin/bash

# Inside the debug container on the node:
# List network interfaces
ip a

# Example Output Snippet:
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
# ...
# 3: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
# ...
# 5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
#     link/ether 0a:58:fd:1c:1d:01 brd ff:ff:ff:ff:ff:ff
#     inet 10.42.0.1/24 brd 10.42.0.255 scope global cni0
#        valid_lft forever preferred_lft forever
# ...
# 7: vethf4f6e@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default
# ...

# Now, run tcpdump.
# Replace <POD_IP> with the IP of the pod you are troubleshooting (e.g., 10.42.0.10)
# Replace <SERVICE_IP> with the ClusterIP of the service it's trying to reach (e.g., 10.100.0.50)
# Replace <PORT> with the target port (e.g., 8080 or 5432)

# Capture traffic to/from a specific pod IP
tcpdump -i any -nn host 10.42.0.10 -vvv

# Capture traffic between a pod and a service (e.g., pod 10.42.0.10 talking to service 10.100.0.50 on port 80)
tcpdump -i any -nn host 10.42.0.10 and host 10.100.0.50 and port 80 -vvv

# Capture traffic to a specific port on the node's CNI bridge (e.g., to see if traffic hits the CNI network)
tcpdump -i cni0 -nn port 8080 -vvv

# Capture only SYN packets (useful for connection attempts)
tcpdump -i any -nn 'tcp[tcpflags] & tcp-syn != 0 and host 10.42.0.10' -vvv

Verify:

  • Are you seeing packets? If not, the traffic isn’t even reaching the node, or it’s being dropped earlier (e.g., by NetworkPolicies on the source pod, or a CNI issue).
  • Are the source and destination IPs/ports correct?
  • Are you seeing SYN packets without corresponding SYN-ACKs? This indicates a firewall (NetworkPolicy, iptables) or a non-listening service.
  • Are you seeing ICMP “Destination Unreachable” messages? This points to routing issues.
  • For more advanced eBPF-based observability that can replace many tcpdump use cases, explore eBPF Observability: Building Custom Metrics with Hubble.
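For anything beyond a quick look, capturing to a pcap file and opening it in Wireshark beats scrolling live -vvv output. The debug pod name below is illustrative; kubectl debug prints the real one (node-debugger-<NODE_NAME>-<suffix>) when it starts:

```shell
# Inside the node debug container: keep full packets (-s 0), stop after 2000 (-c 2000)
tcpdump -i any -nn -s 0 -c 2000 -w /tmp/capture.pcap host 10.42.0.10

# Back on your workstation: copy the capture out of the (still running) debug pod
kubectl cp node-debugger-kube-worker-1-abcde:/tmp/capture.pcap ./capture.pcap
```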

Option B: Running tcpdump inside an Application Pod

This is less common as most application images don’t include tcpdump, but if yours does (or if you can install it temporarily), it provides the most granular view of what the *application container itself* sees.


# Check if tcpdump is available in your pod
kubectl exec -it my-app-deployment-789c687d5-abcde -n default -- which tcpdump

# If it's not found, you might need to install it (if the image allows and you have permissions)
# For example, on a Debian-based image:
# kubectl exec -it my-app-deployment-789c687d5-abcde -n default -- apt update && apt install -y tcpdump

# Once installed or if already present, run tcpdump
# Assuming the main network interface is eth0
kubectl exec -it my-app-deployment-789c687d5-abcde -n default -- tcpdump -i eth0 -nn host 10.100.0.60 and port 5432 -vvv

Verify: This is similar to the node-level tcpdump, but focuses on the container’s perspective. If you see outgoing SYN packets but no incoming SYN-ACKs, the problem is likely outside the container (e.g., NetworkPolicy, CNI, or the destination service itself). If you don’t even see outgoing SYNs, the application isn’t attempting to connect, or DNS resolution failed internally.
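On clusters with ephemeral containers (GA since Kubernetes 1.25), there is a middle ground that needs no changes to the application image: attach a netshoot debug container to the pod. All containers in a pod share one network namespace, so tcpdump there sees exactly what the app container sees. The --target container name (my-app, from the pod events earlier) is this example's assumption:

```shell
# Attach an ephemeral netshoot container to the running pod and sniff its traffic
kubectl debug -it my-app-deployment-789c687d5-abcde -n default \
  --image=nicolaka/netshoot --target=my-app \
  -- tcpdump -i eth0 -nn host 10.100.0.60 and port 5432 -vvv
```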

Production Considerations

  • Security: Running kubectl debug node/<NODE_NAME> creates a debug container with host-level access (it shares the node’s namespaces and mounts its filesystem). Be extremely cautious in production environments. Limit its use to authorized personnel and ensure it’s removed immediately after use. Consider using a dedicated debug image with minimal tools.
  • Performance Impact: tcpdump can be resource-intensive, especially with broad filters (e.g., -i any without host/port filters) on busy nodes. Use specific filters to minimize impact.
  • Observability Tools: While kubectl and tcpdump are invaluable, consider more comprehensive observability solutions for production. eBPF-based tools such as Cilium Hubble, Prometheus with network metrics, and service meshes like Istio Ambient Mesh provide deeper insight into network flows, latency, and errors without manual intervention.
  • Logging: Ensure your applications log network errors effectively. Centralized logging solutions (ELK, Grafana Loki) make it easier to spot trends and specific failure messages.
  • NetworkPolicy Management: In production, NetworkPolicies are critical. Use tools that visualize or validate your policies to prevent accidental blocks.
  • Ephemeral Nature: Remember that pods are ephemeral. Troubleshoot quickly or use persistent debug pods if you need to install tools.
  • Cloud Provider Networking: If using a cloud provider, remember that underlying cloud network configurations (VPCs, Security Groups, Network ACLs, Load Balancers) can also impact connectivity. Consult your cloud provider’s documentation (e.g., GKE Networking, EKS Networking, AKS Networking).

Troubleshooting (Common Issues and Solutions)

1. Pods Cannot Resolve DNS

Problem: Application logs show “Name or service not known” or “host not found” when trying to connect to other services by name.

Solution:

  1. Check CoreDNS: Ensure CoreDNS pods are running in the kube-system namespace.
    
    kubectl get pods -n kube-system -l k8s-app=kube-dns
            
  2. Test from Debug Pod:
    
    kubectl run -it --rm debug-dns --image=busybox --restart=Never -- sh
    / # nslookup <SERVICE_NAME>.<NAMESPACE>.svc.cluster.local
    / # cat /etc/resolv.conf
            

    Verify that /etc/resolv.conf points to the ClusterIP of CoreDNS (usually 10.96.0.10 by default). If not, check your CNI configuration or pod spec’s dnsPolicy.

  3. CoreDNS Logs: Check CoreDNS logs for errors:
    
    kubectl logs -n kube-system -l k8s-app=kube-dns
            
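If CoreDNS is running but misbehaving (stale cache, wedged upstream config), a rollout restart is a low-risk first remedy; this assumes the standard coredns Deployment name used by kubeadm and most managed distributions:

```shell
# Recreate the CoreDNS pods and wait for the rollout to settle
kubectl rollout restart deployment/coredns -n kube-system
kubectl rollout status deployment/coredns -n kube-system
```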

2. Pod Cannot Connect to a Service (Connection Refused)

Problem: An application pod tries to connect to a service, but gets “Connection Refused.”

Solution:

  1. Service Endpoints: Check if the service has active endpoints. If not, the service has no backing pods.
    
    kubectl get endpoints <SERVICE_NAME> -n <NAMESPACE>
            

    Ensure the service selector matches your pod labels and the pods are healthy.

  2. Target Port: Verify the targetPort in the Service definition matches the port your application is listening on inside the container.
    
    kubectl describe service <SERVICE_NAME> -n <NAMESPACE>
            
  3. Network Policies: Check for NetworkPolicies blocking ingress to the target pod. A common mistake is applying a NetworkPolicy that denies all traffic by default and forgets to explicitly allow internal cluster traffic. Refer to Network Policies Security Guide for details.
  4. Application Listening: Confirm the application inside the target pod is actually listening on the expected port. Use kubectl exec to run netstat -tulnp or ss -tulnp inside the pod.
    
    kubectl exec -it <POD_NAME> -n <NAMESPACE> -- netstat -tulnp
            

3. External Traffic Not Reaching Service (LoadBalancer/NodePort)

Problem: You have a LoadBalancer or NodePort service, but external requests time out or are refused.

Solution:

  1. Service Type and External-IP: Verify the service type is correct and an EXTERNAL-IP is assigned (for LoadBalancer).
    
    kubectl get service <SERVICE_NAME> -n <NAMESPACE>
            

    If EXTERNAL-IP is <pending> for a LoadBalancer, your cloud provider’s load balancer controller might be having issues.

  2. Firewall/Security Groups: For cloud providers, ensure the necessary ports are open in the security groups/firewalls associated with your cluster nodes or the LoadBalancer itself.
  3. NodePort Reachability: For NodePort services, ensure you are hitting the correct node IP and NodePort. Also, check any host-level firewalls (e.g., ufw, firewalld) on your nodes.
  4. Ingress Controller: If using an Ingress, ensure the Ingress controller (e.g., Nginx Ingress, Traefik) is running and correctly configured. See Kubernetes Gateway API vs Ingress for modern alternatives.
  5. tcpdump on Node: Use tcpdump on a node to see if traffic hits the node’s external interface and if it’s being forwarded to the pod.
    
    kubectl debug node/<NODE_NAME> -it --image=nicolaka/netshoot -- /bin/bash
    # (Inside debug container)
    tcpdump -i eth0 -nn port <NODE_PORT> -vvv
            

4. Pod-to-Pod Communication Issues Across Nodes

Problem: Pods on the same node can communicate, but pods on different nodes cannot.

Solution:

  1. CNI Plugin: This is almost always a CNI (Container Network Interface) plugin issue. Ensure your CNI (e.g., Calico, Flannel, Cilium) is healthy and running on all nodes.
    
    kubectl get pods -n kube-system -l k8s-app=<CNI_NAME>
            

    Check logs of CNI pods for errors. For example, for Cilium, see Cilium WireGuard Encryption for Pod-to-Pod Traffic.

  2. Node Routing Tables: On the nodes, check routing tables. The CNI should configure routes for pod CIDRs across nodes.
    
    # Inside kubectl debug node/<NODE_NAME>
    ip r
            

    You should see routes to the other nodes’ pod CIDRs pointing to the correct next hop (usually the other node’s IP).

  3. Firewalls: Check host-level firewalls (iptables, firewalld) on the nodes. They should be configured by the CNI or Kubernetes to allow inter-node pod traffic.

5. Pod Cannot Reach External Internet

Problem: Pods can communicate internally but cannot access external websites or services.

Solution:

  1. DNS Resolution: First, verify DNS resolution to external domains works from a debug pod.
    
    kubectl run -it --rm debug-internet --image=busybox --restart=Never -- sh
    / # nslookup google.com
            

    If DNS fails, troubleshoot CoreDNS as described above.

  2. Egress NetworkPolicy: Check if any NetworkPolicies are applied with Egress rules that restrict outbound access. If a policy selects your pod and has egress rules, all unlisted external traffic will be denied.
  3. NAT/Firewall on Node: Ensure the node where your pod is running has proper NAT (Masquerading) rules configured by your CNI or kube-proxy to allow outbound traffic. Check iptables -t nat -L POSTROUTING on the node (from kubectl debug node/).
  4. Cloud Provider Firewall: Verify your cloud provider’s network security groups or network ACLs allow outbound traffic from your worker nodes to the internet.

FAQ Section

Q1: What’s the difference between kubectl exec and kubectl debug node/<NODE_NAME>?
A1: kubectl exec runs a command inside an *existing* application container within a pod. kubectl debug node/<NODE_NAME> creates a *new, ephemeral, privileged* container directly on the specified node’s host OS. This debug container has access to the node’s file system, network interfaces, and processes, making it ideal for host-level troubleshooting with tools like tcpdump, ip, or netstat that might not be present in application containers.

Q2: My tcpdump output is overwhelming. How can I filter it effectively?
A2: Effective filtering is key. Use predicates like host <IP>, port <PORT>, src host <IP>, dst port <PORT>, tcp, udp, and icmp, combining them with and, or, and not. For example, tcpdump -i any -nn 'host 10.42.0.10 and port 8080 and not src host 10.42.0.10' shows traffic on port 8080 arriving at pod 10.42.0.10 while excluding packets the pod itself sends. Use -vvv for verbose output, or -w <FILE> to save packets for offline analysis.

Q3: Why can I ping a service ClusterIP but not connect via HTTP?
A3: ping uses ICMP, which only tests basic IP reachability. HTTP uses TCP. If ping works but HTTP fails (e.g., “Connection Refused”), it often points to:

  • The application in the target pod is not listening on the correct port.
  • A NetworkPolicy is blocking TCP traffic but allowing ICMP.
  • The targetPort in your Service definition is incorrect.
  • The application is crashing or not starting correctly.

Q4: How do I troubleshoot issues with Kubernetes Ingress?
A4: Ingress troubleshooting involves several layers:

  1. Ingress Controller: Ensure your Ingress controller (e.g., NGINX Ingress Controller, Traefik) pods are running and healthy. Check their logs.
  2. Ingress Resource: Verify your Ingress resource’s rules, host, and path are correctly defined and reference the right backend Service and port.
