Introduction
In the complex world of Kubernetes, understanding the underlying behavior of your applications and the cluster itself is paramount. Traditional monitoring tools often provide high-level metrics, but sometimes you need to dive deeper—right into the kernel—to diagnose elusive performance bottlenecks, security anomalies, or resource contention issues. This is where eBPF (extended Berkeley Packet Filter) shines. eBPF allows you to run sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules, providing unparalleled visibility and powerful extensibility.
While eBPF offers incredible power, writing raw eBPF programs can be challenging. This is where tools like bpftrace come into play. bpftrace is a high-level tracing language for Linux eBPF, simplifying the creation of custom eBPF programs. It enables you to write concise scripts to trace kernel functions, user-space functions, system calls, and more, extracting custom metrics and insights that are otherwise impossible to obtain. In this guide, we’ll explore how to harness bpftrace within a Kubernetes environment to craft your own kernel-level metrics, giving you a superpower in debugging and performance analysis.
TL;DR: Custom Kernel Metrics with bpftrace
Dive deep into your Kubernetes nodes by tracing kernel events with bpftrace. This guide shows you how to deploy bpftrace as a privileged DaemonSet to gather custom metrics directly from the Linux kernel. Gain unparalleled visibility into system calls, network events, and process behavior to diagnose complex issues.
Key Commands:
# Deploy bpftrace as a DaemonSet
kubectl apply -f https://raw.githubusercontent.com/kubezilla/bpftrace-daemonset/main/bpftrace-daemonset.yaml
# Access the bpftrace pod on a specific node
kubectl exec -it <bpftrace-pod-name> -n kube-system -- bash
# Example: Trace new process executions (prints the program being launched)
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("New process: %s\n", str(args->filename)); }'
# Example: Trace file opens by path and PID
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("Open file: %s (PID: %d)\n", str(args->filename), pid); }'
# Example: Sum transmitted bytes per process, printed every second
bpftrace -e 'tracepoint:net:net_dev_queue { @tx_bytes[comm] = sum(args->len); } interval:s:1 { print(@tx_bytes); clear(@tx_bytes); }'
Prerequisites
- A running Kubernetes cluster (v1.18+ recommended).
- kubectl installed and configured to connect to your cluster.
- Basic understanding of Kubernetes concepts (Pods, DaemonSets, RBAC).
- Familiarity with Linux command line and kernel concepts (system calls, tracepoints).
- Nodes should be running a relatively recent Linux kernel (5.x+ is ideal for full eBPF feature set).
- Root access or elevated privileges on the nodes for some direct bpftrace usage (handled by DaemonSet).
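The kernel version requirement above can be checked before deploying anything. The following is a minimal Python sketch, assuming the "5.x+" guidance translates to a minimum of (5, 0); the helper names are hypothetical and not part of bpftrace or Kubernetes.

```python
import platform
import re

MIN_KERNEL = (5, 0)  # assumption: mirrors the "5.x+" guidance in the prerequisites


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_recent(release: str, minimum: tuple[int, int] = MIN_KERNEL) -> bool:
    """True when the release is at or above the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


def running_kernel_is_recent() -> bool:
    """Check the kernel of the machine this script runs on."""
    return kernel_is_recent(platform.release())
```

Run it on a node (or inside a pod, since containers share the host kernel) and call running_kernel_is_recent() to get a quick yes/no answer.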
Step-by-Step Guide
Step 1: Understand eBPF and bpftrace for Kubernetes
Before we jump into deployment, it’s crucial to grasp why eBPF and bpftrace are so powerful in a Kubernetes context. Kubernetes abstracts away much of the underlying host infrastructure, which is great for portability but can be a nightmare for deep-dive debugging. When a pod is misbehaving, is it a container issue, a kernel issue, a network issue, or something else entirely?
eBPF allows you to attach custom programs to various kernel hooks (e.g., system calls, network events, function entries/exits) and collect data without modifying the kernel or rebooting. This data can then be used for monitoring, security, or networking. For instance, tools like Cilium heavily leverage eBPF for high-performance networking and security policies. Our guide on Cilium WireGuard Encryption provides a deeper dive into one such application.
bpftrace simplifies this by providing a C-like language to write eBPF programs. You specify probes (where to attach in the kernel), filters (when to trigger), and actions (what to do, like print data or count events). This makes it incredibly efficient for creating custom, on-demand observability tools without the overhead of traditional kernel modules or heavy agents. For more on eBPF’s observability capabilities, check out eBPF Observability: Building Custom Metrics with Hubble.
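To make the probe, filter, and action structure concrete, here is a small Python sketch that assembles a bpftrace clause from those three parts. The build_program helper is hypothetical and exists only for illustration; the probe and map names are examples, and the resulting string is what you would pass to bpftrace -e.

```python
def build_program(probe: str, action: str, predicate: str = "") -> str:
    """Assemble one bpftrace clause in the form: probe /predicate/ { action }."""
    parts = [probe]
    if predicate:
        # The filter is optional; when present it sits between slashes.
        parts.append(f"/{predicate}/")
    parts.append("{ " + action + " }")
    return " ".join(parts)


# Count openat() calls per PID, but only for processes named nginx.
filtered = build_program(
    probe="tracepoint:syscalls:sys_enter_openat",
    predicate='comm == "nginx"',
    action="@opens[pid] = count();",
)

# A clause without a filter: print the map every five seconds.
report = build_program(probe="interval:s:5", action="print(@opens);")

print(filtered)
print(report)
```

Joining both clauses with a space gives a complete program: the first clause aggregates in the kernel, and the second reports the aggregate periodically.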
Step 2: Deploy bpftrace as a Privileged DaemonSet
To trace kernel events across all your Kubernetes nodes, bpftrace needs to run with significant privileges, including access to the host’s kernel. The most effective way to deploy this in Kubernetes is as a DaemonSet. A DaemonSet ensures that a copy of the bpftrace pod runs on every (or selected) node in your cluster. This allows you to connect to any bpftrace pod and trace the kernel of its host node.
The DaemonSet configuration sets hostPID: true, which lets the pod see all processes on the host, and privileged: true, which grants it full capabilities. It also mounts /lib/modules (read-only), /sys/kernel/debug, and /sys/fs/bpf from the host, which bpftrace needs in order to interact with the kernel.
# bpftrace-daemonset.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bpftrace
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: bpftrace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"] # Read-only pod lookup from inside the pod; kubectl exec depends on the caller's RBAC, not this role
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: bpftrace
subjects:
  - kind: ServiceAccount
    name: bpftrace
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: bpftrace
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bpftrace
  namespace: kube-system
  labels:
    app: bpftrace
spec:
  selector:
    matchLabels:
      app: bpftrace
  template:
    metadata:
      labels:
        app: bpftrace
    spec:
      serviceAccountName: bpftrace
      hostPID: true
      hostNetwork: true # Optional, but can be useful for network tracing
      tolerations:
        - operator: Exists # Tolerates all taints, ensuring it runs on all nodes
      containers:
        - name: bpftrace
          image: quay.io/iovisor/bpftrace:latest # Or a specific version
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          command: ["sleep", "infinity"] # Keep the container running
          volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
              readOnly: true
            - name: sys-kernel-debug
              mountPath: /sys/kernel/debug
            - name: sys-fs-bpf
              mountPath: /sys/fs/bpf
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: sys-kernel-debug
          hostPath:
            path: /sys/kernel/debug
        - name: sys-fs-bpf
          hostPath:
            path: /sys/fs/bpf
Apply this manifest to your cluster:
kubectl apply -f bpftrace-daemonset.yaml
Verify
Check if the DaemonSet pods are running on your nodes:
kubectl get pods -n kube-system -l app=bpftrace
Expected Output:
NAME             READY   STATUS    RESTARTS   AGE
bpftrace-abcde   1/1     Running   0          2m
bpftrace-fghij   1/1     Running   0          2m
# ... one pod per node
Step 3: Access a bpftrace Pod and Start Tracing
Now that bpftrace is deployed, you can access any of the running pods to execute bpftrace commands. This allows you to trace the kernel of the specific node where that pod is running. Choose a node you want to investigate.
First, get the name of one of the bpftrace pods:
kubectl get pods -n kube-system -l app=bpftrace -o jsonpath='{.items[0].metadata.name}'
Example Output:
bpftrace-abcde
Now, exec into that pod:
kubectl exec -it bpftrace-abcde -n kube-system -- bash
You are now inside the bpftrace container, with access to the bpftrace command-line tool. You can start crafting your custom kernel metrics.
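The lookup above returns the first pod in the list, which may not be on the node you care about. The Python sketch below builds the kubectl argument lists for picking the bpftrace pod on a named node and for running a one-shot trace in it. It only constructs commands and does not call kubectl; the node name worker-1 and the helper names are hypothetical, while the namespace and label come from the manifest.

```python
import shlex

NAMESPACE = "kube-system"  # from the DaemonSet manifest
SELECTOR = "app=bpftrace"  # from the DaemonSet manifest


def pod_lookup_cmd(node: str) -> list[str]:
    """kubectl command that prints the name of the bpftrace pod on one node."""
    return [
        "kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR,
        "--field-selector", f"spec.nodeName={node}",
        "-o", "jsonpath={.items[0].metadata.name}",
    ]


def exec_cmd(pod: str, bpftrace_program: str) -> list[str]:
    """kubectl command that runs a bpftrace program in the given pod.

    No -it flags: this form is meant for scripts that have no terminal attached.
    """
    return [
        "kubectl", "exec", pod, "-n", NAMESPACE, "--",
        "bpftrace", "-e", bpftrace_program,
    ]


# Print shell-quoted versions that can be pasted into a terminal.
print(shlex.join(pod_lookup_cmd("worker-1")))
print(shlex.join(exec_cmd("bpftrace-abcde", 'BEGIN { printf("ok\\n"); exit(); }')))
```

Passing the lists to subprocess.run avoids shell quoting problems with the braces and quotes that bpftrace programs contain.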
Verify
Once inside the pod, try a simple bpftrace command to list available probes. This confirms bpftrace is working correctly.
bpftrace -l 'tracepoint:syscalls:sys_enter_*' | head -n 5
Expected Output (example, will vary by kernel):
tracepoint:syscalls:sys_enter_read
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_enter_open
tracepoint:syscalls:sys_enter_close
tracepoint:syscalls:sys_enter_stat
Step 4: Crafting Custom Kernel Traces
This is where the real power lies. bpftrace scripts follow a simple structure: probe /filter/ { action }. Let’s look at some practical examples to gather custom metrics.
Example 1: Tracing New Process Executions
See which programs are being started on your node. This can be useful for security auditing or identifying unexpected activity. The tracepoint:syscalls:sys_enter_execve probe fires whenever a process calls execve(). At that moment comm still holds the name of the calling process (often the shell or runtime that forked it), so the script prints args->filename to show the program actually being launched.
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("New process: %s (PID: %d) by User: %d\n", str(args->filename), pid, uid); }'
Expected Output (example; paths and PIDs will vary by node):
Attaching 1 probe...
New process: /usr/bin/date (PID: 12345) by User: 0
New process: /bin/bash (PID: 12346) by User: 1000
New process: /usr/local/bin/kubectl (PID: 12347) by User: 1000
To stop the trace, press Ctrl+C.
Example 2: Monitoring Disk I/O Latency for Specific System Calls
Measure the latency of read() and write() system calls. These calls cover files, sockets, and pipes alike, so the histograms reflect everything that goes through them, including reads that block while waiting for data, and not only disk access. A shift toward the slower buckets can still point to slow storage or to applications that are bottlenecked on I/O.
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ { @read_latency = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }
tracepoint:syscalls:sys_enter_write { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_write /@start[tid]/ { @write_latency = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }
interval:s:5 {
    printf("\n--- Read Latency (us) ---\n");
    print(@read_latency);
    printf("\n--- Write Latency (us) ---\n");
    print(@write_latency);
    clear(@read_latency);
    clear(@write_latency);
}'
Expected Output (every 5 seconds):
Attaching 5 probes...
--- Read Latency (us) ---
@read_latency:
[0, 1) 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 4 |@@@@@@@@@@@@@@@@@@@@@@|
[2, 4) 2 |@@@@@@@@@@|
[4, 8) 1 |@@@@@|
--- Write Latency (us) ---
@write_latency:
[0, 1) 5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 2 |@@@@@@@@@@@@@@@@|
[2, 4) 1 |@@@@@@@@|
This script uses bpftrace built-in functions like hist() to create a histogram of latencies in microseconds.
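The histogram rows are power-of-two buckets, so each row covers twice the range of the one before it. The following Python sketch reproduces that bucketing for a handful of made-up latency samples to show how values land in rows. It is an illustration of the idea, not bpftrace's implementation, and it ignores negative values.

```python
from collections import Counter


def log2_bucket(value: int) -> tuple[int, int]:
    """Return the [low, high) power-of-two bucket that contains a non-negative value."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    if value == 0:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (low, low * 2)


def histogram(samples_us: list[int]) -> dict[tuple[int, int], int]:
    """Count samples per bucket, ordered from the smallest bucket upward."""
    counts = Counter(log2_bucket(v) for v in samples_us)
    return dict(sorted(counts.items()))


# Hypothetical latencies in microseconds.
latencies_us = [0, 1, 3, 3, 5, 7, 900]
buckets = histogram(latencies_us)
for (low, high), count in buckets.items():
    print(f"[{low}, {high})".ljust(14), count)
```

Because bucket width doubles each time, a single outlier such as 900 us gets its own row far from the rest, which is what makes tail latency easy to spot in this format.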
Example 3: Counting Network Bytes Sent/Received per Process
Estimate which processes are generating the most network traffic, measured directly in the kernel network stack. This helps when troubleshooting network performance in Kubernetes, for example when looking for noisy neighbors or misconfigured applications.
bpftrace -e '
tracepoint:net:net_dev_queue { @tx_bytes[comm] = sum(args->len); }
tracepoint:net:netif_receive_skb { @rx_bytes[comm] = sum(args->len); }
interval:s:5 {
    printf("\n--- Network Activity (Bytes/5s) ---\n");
    printf("TX Bytes:\n");
    print(@tx_bytes);
    printf("RX Bytes:\n");
    print(@rx_bytes);
    clear(@tx_bytes);
    clear(@rx_bytes);
}'
Expected Output (every 5 seconds, will vary based on network activity):
Attaching 3 probes...
--- Network Activity (Bytes/5s) ---
TX Bytes:
@tx_bytes[kube-proxy]: 12000
@tx_bytes[nginx]: 850000
@tx_bytes[kubelet]: 3400
RX Bytes:
@rx_bytes[kube-proxy]: 15000
@rx_bytes[nginx]: 1200000
@rx_bytes[kubelet]: 4000
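This script uses tracepoints from the net subsystem to sum packet lengths. Treat the per-process attribution as approximate: receive processing usually runs in softirq context, so comm on the RX side is whichever task happened to be on the CPU at the time, not necessarily the process that will read the data. Kubernetes Network Policies control which traffic is allowed, while bpftrace shows what traffic actually occurs.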
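If you capture the printed maps, turning them into per-second rates takes only a few lines. The sketch below parses lines in the @map[key]: value form shown in the expected output and divides by the reporting window. The function names are hypothetical, and the regular expression assumes single-key maps with integer values.

```python
import re

# Matches lines such as "@tx_bytes[nginx]: 850000".
MAP_LINE = re.compile(r"^@(\w+)\[(.+)\]:\s+(\d+)$")


def parse_map_output(text: str) -> dict[str, dict[str, int]]:
    """Group printed map entries by map name, skipping headers and blank lines."""
    maps: dict[str, dict[str, int]] = {}
    for line in text.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            name, key, value = match.groups()
            maps.setdefault(name, {})[key] = int(value)
    return maps


def to_rate(totals: dict[str, int], window_s: int) -> dict[str, float]:
    """Convert per-window totals into per-second rates."""
    return {key: value / window_s for key, value in totals.items()}


sample = """\
TX Bytes:
@tx_bytes[kube-proxy]: 12000
@tx_bytes[nginx]: 850000
RX Bytes:
@rx_bytes[nginx]: 1200000
"""
parsed = parse_map_output(sample)
tx_rates = to_rate(parsed["tx_bytes"], window_s=5)
print(tx_rates)  # {'kube-proxy': 2400.0, 'nginx': 170000.0}
```

Dividing by the interval keeps the numbers comparable if you later change interval:s:5 to a different window.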
Step 5: Exporting Metrics and Integration (Optional)
While bpftrace is excellent for interactive debugging, for continuous monitoring, you might want to export these metrics to a time-series database like Prometheus. This usually involves piping bpftrace output to a script that parses it and exposes it via a Prometheus exporter.
A common pattern is to run a bpftrace script with an interval, parse its output using awk or python, and then push these metrics to a Pushgateway or expose them via a simple HTTP server that Prometheus can scrape.
For example, a simple Python script could wrap bpftrace:
# metrics_exporter.py
import re
import subprocess

from prometheus_client import Counter, start_http_server

# Counter rather than Gauge: the number of executions only ever increases.
# PID is deliberately not a label, since one time series per PID grows without bound.
process_exec_total = Counter(
    'process_exec_total', 'Total number of process executions', ['comm', 'uid'])

# Raw string so the \" and \n escapes reach bpftrace unchanged.
BPFTRACE_SCRIPT = r'''
tracepoint:syscalls:sys_enter_execve {
    printf("METRIC:process_exec_total{comm=\"%s\",uid=\"%d\"} 1\n", comm, uid);
}
'''

# Example line: METRIC:process_exec_total{comm="bash",uid="1000"} 1
METRIC_RE = re.compile(r'METRIC:(\w+)\{comm="(.*?)",uid="(\d+)"\}\s+(\d+)')

def run_bpftrace():
    # stderr is inherited so bpftrace errors appear in the container logs.
    process = subprocess.Popen(['bpftrace', '-e', BPFTRACE_SCRIPT],
                               stdout=subprocess.PIPE,
                               text=True)
    # Iterating blocks until bpftrace prints a line, so no sleep is needed.
    for line in process.stdout:
        match = METRIC_RE.match(line)
        if not match:
            continue
        metric_name, comm, uid, value = match.groups()
        if metric_name == 'process_exec_total':
            process_exec_total.labels(comm=comm, uid=uid).inc(int(value))

if __name__ == '__main__':
    start_http_server(8000)  # Expose metrics on port 8000
    print("Prometheus metrics server started on port 8000.")
    run_bpftrace()
You would then deploy this Python script alongside bpftrace in your DaemonSet, exposing port 8000 for Prometheus to scrape. This approach allows you to build sophisticated custom monitoring dashboards based on kernel events.
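It can help to see what Prometheus actually receives when it scrapes such an endpoint. The following stdlib-only Python sketch renders samples in the Prometheus text exposition format. It is a simplified illustration rather than a replacement for prometheus_client, and the metric name node_tx_bytes and the render_metric helper are hypothetical.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, help_text: str, metric_type: str,
                  samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one metric family: HELP line, TYPE line, then one line per sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "node_tx_bytes",
    "Bytes queued for transmit in the last window",
    "gauge",
    [({"comm": "nginx"}, 850000), ({"comm": "kubelet"}, 3400)],
)
print(body, end="")
```

Each distinct label combination becomes its own time series, which is why label values with unbounded variety, such as PIDs, are best kept out of metrics.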
Production Considerations
- Resource Overhead: While eBPF is designed to be efficient, poorly written bpftrace scripts can consume CPU. Be mindful of the probes you attach and the complexity of your actions. Test your scripts thoroughly in non-production environments.
- Security: Running containers with hostPID: true and privileged: true is a significant security risk. This should only be done for specific, well-understood debugging or monitoring purposes, and ideally, in a dedicated namespace with strict RBAC. Consider using Pod Security Standards or Kyverno/OPA to restrict such deployments. Our guide on Securing Container Supply Chains with Sigstore and Kyverno offers insights into policy enforcement.
- Kernel Version Compatibility: eBPF features evolve with the Linux kernel. Newer kernels (5.x+) offer more tracepoints and eBPF capabilities. Ensure your nodes have reasonably up-to-date kernels for the best experience.
- Data Volume and Latency: Interactive bpftrace is great, but for continuous collection, consider the volume of data. High-frequency events can generate a lot of output. If exporting, ensure your metric pipeline can handle the load.
- Integration with Existing Monitoring: For production, integrate your custom bpftrace metrics into your existing observability stack (Prometheus, Grafana, ELK). This provides a unified view of your system.
- Node Taints and Tolerations: The provided DaemonSet uses an Exists toleration to run on all nodes. If you have specific node groups or taints, adjust the tolerations accordingly. For example, if you’re tracing GPU-related events, you might only want it on nodes with GPUs, similar to how one might schedule LLM workloads with GPU scheduling.
- Cleanup: Always ensure you have a plan for cleaning up your DaemonSet when it’s no longer needed to minimize the security blast radius.
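Policy engines that restrict deployments like this one ultimately check a handful of pod spec fields. As a rough illustration of what such a rule looks at, the Python sketch below scans a pod spec (represented as a plain dict) for the host-level settings this guide relies on. It is not Kyverno or OPA syntax, and the risky_settings helper is hypothetical.

```python
def risky_settings(pod_spec: dict) -> list[str]:
    """List the host-level privileges a pod spec requests."""
    findings = []
    # Host namespace sharing.
    for field in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(field):
            findings.append(field)
    # Privileged containers.
    for container in pod_spec.get("containers", []):
        if container.get("securityContext", {}).get("privileged"):
            findings.append(f"privileged container: {container.get('name', '?')}")
    # Direct mounts of host directories.
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"hostPath volume: {volume['hostPath'].get('path', '?')}")
    return findings


# Mirrors the pod template of the bpftrace DaemonSet in this guide.
bpftrace_pod_spec = {
    "hostPID": True,
    "hostNetwork": True,
    "containers": [{"name": "bpftrace", "securityContext": {"privileged": True}}],
    "volumes": [
        {"name": "lib-modules", "hostPath": {"path": "/lib/modules"}},
        {"name": "sys-kernel-debug", "hostPath": {"path": "/sys/kernel/debug"}},
        {"name": "sys-fs-bpf", "hostPath": {"path": "/sys/fs/bpf"}},
    ],
}
findings = risky_settings(bpftrace_pod_spec)
for finding in findings:
    print(finding)
```

Every line it prints is something an admission policy would need to allow explicitly for the bpftrace namespace and deny elsewhere.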
Troubleshooting
- Issue: bpftrace: failed to attach probe: Invalid argument or similar errors.
  Solution: This often indicates that the specified kernel probe (tracepoint, kprobe, uprobe) does not exist on your kernel version, or bpftrace doesn’t have the necessary debug info.
  - Verify the probe name using bpftrace -l.
  - Ensure /sys/kernel/debug is mounted and accessible (check the DaemonSet volume mounts).
  - Check your kernel version. Some probes are kernel version-specific.
  - Ensure the kernel has CONFIG_BPF_EVENTS and CONFIG_KPROBES enabled.
- Issue: bpftrace pod is stuck in Pending or CrashLoopBackOff.
  Solution:
  - Check pod events: kubectl describe pod <bpftrace-pod-name> -n kube-system.
  - Ensure the image is pulling correctly.
  - Verify host paths for volumes exist (/lib/modules, /sys/kernel/debug, /sys/fs/bpf) on the node.
  - If hostPID: true or privileged: true is rejected by your cluster’s Pod Security admission settings or policy engine (Pod Security Policies on clusters older than v1.25), you might need to adjust them or use a different approach (e.g., node-level systemd service for bpftrace if K8s is too restrictive).
- Issue: bpftrace script runs but produces no output, even for common events.
  Solution:
  - Double-check your probe name for typos.
  - Ensure the event you’re tracing is actually occurring. For example, if tracing network activity, ensure there’s network traffic on the node.
  - Verify your filter condition (/filter/) is not too restrictive.
  - Sometimes, the kernel might not have debug symbols available, which can limit the information bpftrace can extract.
- Issue: High CPU usage on the node when running bpftrace.
  Solution:
  - Your bpftrace script might be too aggressive. Review the probes. Are you tracing very high-frequency events without sufficient filtering?
  - Reduce the frequency of probes, add more specific filters, or use aggregation functions like sum() or count() instead of printing every event.
  - Avoid expensive per-event work, such as str() copies and printf() calls, on hot paths if possible.
  - Increase the interval for interval:s:X probes.
- Issue: kubectl exec into bpftrace pod fails with permission errors.
  Solution:
  - kubectl exec is authorized against your own user, not the pod’s ServiceAccount. Verify that your user can exec into pods in that namespace, for example with kubectl auth can-i create pods/exec -n kube-system.
  - The ServiceAccount, ClusterRole, and ClusterRoleBinding for bpftrace only govern API calls made from inside the pod, so changing them will not fix exec permission errors.
- Issue: bpftrace doesn’t see all processes or network traffic from containers.
  Solution:
  - Ensure hostPID: true is set in your DaemonSet. Kernel probes fire for every process on the node either way, but without it the PIDs bpftrace reports will not line up with what the pod sees in /proc, and features that target host processes (such as -p filtering or uprobes) are unlikely to work as expected.
  - Kernel tracepoints fire for traffic in every network namespace, so missing network events usually point to the probe or filter rather than to isolation. Setting hostNetwork: true mainly helps when you also want to inspect the host’s interfaces and sockets from inside the pod.
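Several of the checks above come down to whether a path exists on the node. The Python sketch below bundles two of them: confirming the host paths the DaemonSet mounts, and mapping a tracepoint name to its directory under tracefs. It assumes tracefs is exposed at /sys/kernel/debug/tracing, and the helper names are hypothetical.

```python
import os

# Host paths the DaemonSet mounts into the pod (from the manifest).
REQUIRED_PATHS = ("/lib/modules", "/sys/kernel/debug", "/sys/fs/bpf")
# Assumption: tracefs is reachable through debugfs at this location.
TRACEFS_EVENTS = "/sys/kernel/debug/tracing/events"


def missing_paths(paths=REQUIRED_PATHS, exists=os.path.exists) -> list[str]:
    """Return the required paths that are absent; exists is injectable for testing."""
    return [path for path in paths if not exists(path)]


def tracepoint_dir(probe: str, events_root: str = TRACEFS_EVENTS) -> str:
    """Map 'tracepoint:category:name' to its events directory under tracefs."""
    kind, category, name = probe.split(":")
    if kind != "tracepoint":
        raise ValueError(f"not a tracepoint probe: {probe!r}")
    return f"{events_root}/{category}/{name}"


def tracepoint_exists(probe: str, exists=os.path.isdir) -> bool:
    """True when the kernel exposes a directory for this tracepoint."""
    return exists(tracepoint_dir(probe))


if __name__ == "__main__":
    print("missing paths:", missing_paths())
    probe = "tracepoint:syscalls:sys_enter_execve"
    print(probe, "present:", tracepoint_exists(probe))
```

Running it inside the bpftrace pod gives a quick first answer before digging into kernel configuration; bpftrace -l remains the authoritative way to list probes.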
FAQ Section
- Q: What is the difference between eBPF and bpftrace?
  A: eBPF is the underlying technology that allows programs to run in the Linux kernel. It’s a virtual machine that executes bytecode. bpftrace is a high-level language and tool that compiles its scripts into eBPF bytecode, making it much easier to write and deploy eBPF programs for tracing and monitoring without needing to write complex C code.
- Q: Is it safe to run privileged containers in Kubernetes?
  A: Generally, no. Running containers with hostPID: true and privileged: true gives the container nearly unrestricted access to the host system, which is a significant security risk. It should only be done when absolutely necessary for specific tasks like kernel tracing, and with strict security controls in place (e.g., dedicated namespaces, RBAC, network policies, and potentially restricted Pod Security Standards to prevent other pods from gaining similar privileges). For guidance on securing your cluster, refer to our Kubernetes Network Policies: Complete Security Hardening Guide.
- Q: Can bpftrace slow down my production system?
  A: Yes, it can, but typically only if used improperly. eBPF is designed for low overhead. However, if your bpftrace script attaches to very high-frequency events and performs complex actions (especially string manipulations or extensive data collection), it can introduce noticeable overhead. Always test your scripts in a staging environment and monitor CPU usage. Start with simple scripts and gradually increase complexity.
- Q: How does bpftrace compare to traditional Linux tracing tools like strace or perf?
  A: strace is great for process-specific system call tracing but has high overhead and can’t trace kernel internals. perf is a powerful profiling tool for kernel and user-space but can be complex to use for specific event tracing. bpftrace, powered by eBPF, offers a unique blend: it has much lower overhead than strace, can trace a wider range of kernel events than strace, and is often simpler to write custom scripts for than perf, especially for aggregation and custom metrics. It excels at custom, on-demand, kernel-level observability.
- Q: Can I use bpftrace to trace application-level functions (e.g., in Python or Java)?
  A: Yes, you can! bpftrace supports uprobes, which allow you to attach to functions within user-space applications. This requires debug symbols for the application or its libraries. For interpreted languages, tracing can be more challenging but still possible by attaching to the interpreter’s internal functions or specific library calls. This level of tracing goes beyond kernel metrics but demonstrates the full power of eBPF.
Cleanup Commands
When you are finished with bpftrace, it’s important to remove the privileged DaemonSet from your cluster.
kubectl delete -f bpftrace-daemonset.yaml
This command will delete the DaemonSet, ServiceAccount, ClusterRole, and ClusterRoleBinding, effectively removing bpftrace from your cluster. Always ensure you clean up resources you no longer need, especially privileged ones.
Next Steps / Further Reading
- Explore more bpftrace examples: The bpftrace tools directory on GitHub contains a wealth of pre-written scripts for various scenarios.
- Dive deeper into eBPF: The eBPF.io website is an excellent resource for understanding the underlying technology.
- Learn about other eBPF tools: Explore projects like BCC (BPF Compiler Collection) for more complex eBPF program development using Python or C.
- Kubernetes Observability: Integrate your custom metrics with tools like Prometheus and Grafana for comprehensive dashboards. Consider exploring tools like eBPF Observability: Building Custom Metrics with Hubble for network-focused eBPF insights.
- Service Mesh Integration: For advanced network traffic management and observability, look into service meshes like Istio. Our Istio Ambient Mesh Production Guide offers a deep dive into its capabilities.
- Cost Optimization: Understanding resource usage at the kernel level can inform decisions for cost optimization, similar to how Karpenter Cost Optimization helps manage node resources efficiently.
Conclusion
eBPF and bpftrace provide an unparalleled window into the Linux kernel, offering a level of observability that traditional tools often cannot match. By deploying bpftrace as a privileged DaemonSet in your Kubernetes clusters, you gain the ability to craft custom kernel metrics on demand, diagnose complex performance issues, and identify obscure security anomalies. While the power comes with the responsibility of careful security and resource management, the insights gained can be invaluable for maintaining high-performing and stable Kubernetes environments. Embrace the power of kernel tracing and elevate your Kubernetes debugging capabilities to the next level.