
bpftrace: Custom Kernel Metrics with eBPF

Introduction

In the complex world of Kubernetes, understanding the underlying behavior of your applications and the cluster itself is paramount. Traditional monitoring tools often provide high-level metrics, but sometimes you need to dive deeper—right into the kernel—to diagnose elusive performance bottlenecks, security anomalies, or resource contention issues. This is where eBPF (extended Berkeley Packet Filter) shines. eBPF allows you to run sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules, providing unparalleled visibility and powerful extensibility.

While eBPF offers incredible power, writing raw eBPF programs can be challenging. This is where tools like bpftrace come into play. bpftrace is a high-level tracing language for Linux eBPF that simplifies the creation of custom eBPF programs. It enables you to write concise scripts to trace kernel functions, user-space functions, system calls, and more, extracting custom metrics and insights that are otherwise hard to obtain. In this guide, we’ll explore how to harness bpftrace within a Kubernetes environment to craft your own kernel-level metrics for debugging and performance analysis.

TL;DR: Custom Kernel Metrics with bpftrace

Dive deep into your Kubernetes nodes by tracing kernel events with bpftrace. This guide shows you how to deploy bpftrace as a privileged DaemonSet to gather custom metrics directly from the Linux kernel. Gain unparalleled visibility into system calls, network events, and process behavior to diagnose complex issues.

Key Commands:


# Deploy bpftrace as a DaemonSet
kubectl apply -f https://raw.githubusercontent.com/kubezilla/bpftrace-daemonset/main/bpftrace-daemonset.yaml

# Access the bpftrace pod on a specific node
kubectl exec -it <bpftrace-pod-name> -n kube-system -- bash

# Example: Trace new process executions (args->filename is the program being launched)
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("New process: %s\n", str(args->filename)); }'

# Example: Trace file opens and the PID that requested them
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("Open file: %s (PID: %d)\n", str(args->filename), pid); }'

# Example: Sum transmitted bytes per process name, printed every second
bpftrace -e 'tracepoint:net:net_dev_queue { @tx_bytes[comm] = sum(args->len); } interval:s:1 { print(@tx_bytes); clear(@tx_bytes); }'
    

Prerequisites

  • A running Kubernetes cluster (v1.18+ recommended).
  • kubectl installed and configured to connect to your cluster.
  • Basic understanding of Kubernetes concepts (Pods, DaemonSets, RBAC).
  • Familiarity with Linux command line and kernel concepts (system calls, tracepoints).
  • Nodes should be running a relatively recent Linux kernel (5.x+ is ideal for full eBPF feature set).
  • Root access or elevated privileges on the nodes for some direct bpftrace usage (handled by DaemonSet).

Step-by-Step Guide

Step 1: Understand eBPF and bpftrace for Kubernetes

Before we jump into deployment, it’s crucial to grasp why eBPF and bpftrace are so powerful in a Kubernetes context. Kubernetes abstracts away much of the underlying host infrastructure, which is great for portability but can be a nightmare for deep-dive debugging. When a pod is misbehaving, is it a container issue, a kernel issue, a network issue, or something else entirely?

eBPF allows you to attach custom programs to various kernel hooks (e.g., system calls, network events, function entries/exits) and collect data without modifying the kernel or rebooting. This data can then be used for monitoring, security, or networking. For instance, tools like Cilium heavily leverage eBPF for high-performance networking and security policies. Our guide on Cilium WireGuard Encryption provides a deeper dive into one such application.

bpftrace simplifies this by providing a C-like language to write eBPF programs. You specify probes (where to attach in the kernel), filters (when to trigger), and actions (what to do, like print data or count events). This makes it incredibly efficient for creating custom, on-demand observability tools without the overhead of traditional kernel modules or heavy agents. For more on eBPF’s observability capabilities, check out eBPF Observability: Building Custom Metrics with Hubble.

Step 2: Deploy bpftrace as a Privileged DaemonSet

To trace kernel events across all your Kubernetes nodes, bpftrace needs to run with significant privileges, including access to the host’s kernel. The most effective way to deploy this in Kubernetes is as a DaemonSet. A DaemonSet ensures that a copy of the bpftrace pod runs on every (or selected) node in your cluster. This allows you to connect to any bpftrace pod and trace the kernel of its host node.

The DaemonSet configuration sets hostPID: true, which allows the pod to see all processes on the host, and privileged: true, which grants it full capabilities. It also mounts /sys/kernel/debug, /sys/fs/bpf, and /lib/modules from the host, which bpftrace uses to interact with the kernel.


# bpftrace-daemonset.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bpftrace
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: bpftrace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"] # Read-only pod lookups; kubectl exec is authorized against the calling user, not this role
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: bpftrace
subjects:
- kind: ServiceAccount
  name: bpftrace
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: bpftrace
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bpftrace
  namespace: kube-system
  labels:
    app: bpftrace
spec:
  selector:
    matchLabels:
      app: bpftrace
  template:
    metadata:
      labels:
        app: bpftrace
    spec:
      serviceAccountName: bpftrace
      hostPID: true
      hostNetwork: true # Optional, but can be useful for network tracing
      tolerations:
      - operator: Exists # Tolerates all taints, ensuring it runs on all nodes
      containers:
      - name: bpftrace
        image: quay.io/iovisor/bpftrace:latest # Or a specific version
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["sleep", "infinity"] # Keep the container running
        volumeMounts:
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: sys-kernel-debug
          mountPath: /sys/kernel/debug
        - name: sys-fs-bpf
          mountPath: /sys/fs/bpf
      volumes:
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: sys-kernel-debug
        hostPath:
          path: /sys/kernel/debug
      - name: sys-fs-bpf
        hostPath:
          path: /sys/fs/bpf

Apply this manifest to your cluster:


kubectl apply -f bpftrace-daemonset.yaml

Verify

Check if the DaemonSet pods are running on your nodes:


kubectl get pods -n kube-system -l app=bpftrace

Expected Output:


NAME             READY   STATUS    RESTARTS   AGE
bpftrace-abcde   1/1     Running   0          2m
bpftrace-fghij   1/1     Running   0          2m
# ... one pod per node

Step 3: Access a bpftrace Pod and Start Tracing

Now that bpftrace is deployed, you can access any of the running pods to execute bpftrace commands. This allows you to trace the kernel of the specific node where that pod is running. Choose a node you want to investigate.

First, get the name of one of the bpftrace pods:


kubectl get pods -n kube-system -l app=bpftrace -o jsonpath='{.items[0].metadata.name}'

Example Output:


bpftrace-abcde

Now, exec into that pod:


kubectl exec -it bpftrace-abcde -n kube-system -- bash

You are now inside the bpftrace container, with access to the bpftrace command-line tool. You can start crafting your custom kernel metrics.
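
Because each pod traces only its own node, it can help to look the pod up by node name rather than taking the first item in the list. The following is a minimal sketch, assuming the namespace and label from the DaemonSet in Step 2. The helper names pod_lookup_cmd and exec_cmd and the node name worker-1 are illustrative and not part of kubectl or bpftrace; --field-selector spec.nodeName=... is a standard kubectl option.

```python
import shlex


def pod_lookup_cmd(node_name, namespace="kube-system", label="app=bpftrace"):
    """Build a kubectl command that prints the bpftrace pod running on one node."""
    return [
        "kubectl", "get", "pods",
        "-n", namespace,
        "-l", label,
        # Restrict the result to pods scheduled on the given node.
        "--field-selector", f"spec.nodeName={node_name}",
        "-o", "jsonpath={.items[0].metadata.name}",
    ]


def exec_cmd(pod_name, namespace="kube-system"):
    """Build the kubectl exec command used in this step."""
    return ["kubectl", "exec", "-it", pod_name, "-n", namespace, "--", "bash"]


if __name__ == "__main__":
    # Print shell-quoted commands so they can be copied into a terminal.
    print(shlex.join(pod_lookup_cmd("worker-1")))  # hypothetical node name
    print(shlex.join(exec_cmd("bpftrace-abcde")))
```

Returning argument lists instead of strings keeps the commands safe to pass to subprocess.run without shell quoting issues.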

Verify

Once inside the pod, try a simple bpftrace command to list available probes. This confirms bpftrace is working correctly.


bpftrace -l 'tracepoint:syscalls:sys_enter_*' | head -n 5

Expected Output (example, will vary by kernel):


tracepoint:syscalls:sys_enter_read
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_enter_open
tracepoint:syscalls:sys_enter_close
tracepoint:syscalls:sys_enter_stat

Step 4: Crafting Custom Kernel Traces

This is where the real power lies. bpftrace scripts follow a simple structure: probe /filter/ { action }. Let’s look at some practical examples to gather custom metrics.
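
To make the probe /filter/ { action } structure concrete, here is a small sketch that assembles a one-line bpftrace program from its three parts. The function name bpftrace_program is hypothetical, and the nginx filter is only an example value; the resulting string is what you would pass to bpftrace -e.

```python
def bpftrace_program(probe, action, filter_expr=None):
    """Assemble one bpftrace clause in the form: probe /filter/ { action }."""
    parts = [probe]
    if filter_expr:
        # The filter is optional; when present it sits between slashes.
        parts.append(f"/{filter_expr}/")
    parts.append("{ " + action + " }")
    return " ".join(parts)


# Only report files opened by processes whose name is "nginx".
prog = bpftrace_program(
    probe="tracepoint:syscalls:sys_enter_openat",
    filter_expr='comm == "nginx"',
    action='printf("%s\\n", str(args->filename));',
)

if __name__ == "__main__":
    print(prog)
```

Filtering in the kernel like this is usually cheaper than printing every event and filtering afterwards with grep.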

Example 1: Tracing New Process Executions

See which programs are being started on your node. This can be useful for security auditing or identifying unexpected activity. The tracepoint:syscalls:sys_enter_execve probe fires whenever a process calls execve(). At that moment comm still holds the name of the calling process (often a shell or container runtime), so the script prints args->filename to show the program being launched.


bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("New process: %s (PID: %d) by User: %d\n", str(args->filename), pid, uid); }'

Expected Output (while new processes are started on the node):


Attaching 1 probe...
New process: /usr/bin/date (PID: 12345) by User: 0
New process: /bin/bash (PID: 12346) by User: 1000
New process: /usr/local/bin/kubectl (PID: 12347) by User: 1000

To stop the trace, press Ctrl+C.

Example 2: Monitoring Disk I/O Latency for Specific System Calls

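Measure the latency of read() and write() system calls. These calls cover every file descriptor type (files, sockets, pipes, terminals), so the histograms include more than disk I/O, and a read that blocks waiting for data counts its full wait time. Treat slow buckets as a starting point for identifying slow storage or applications that are bottlenecked by I/O.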


bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ { @read_latency = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }

tracepoint:syscalls:sys_enter_write { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_write /@start[tid]/ { @write_latency = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }

interval:s:5 {
  printf("\n--- Read Latency (us) ---\n");
  print(@read_latency);
  printf("\n--- Write Latency (us) ---\n");
  print(@write_latency);
  clear(@read_latency);
  clear(@write_latency);
}'

Expected Output (every 5 seconds):


Attaching 5 probes...
--- Read Latency (us) ---
@read_latency:
[0, 1)                8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)                4 |@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)                2 |@@@@@@@@@@|
[4, 8)                1 |@@@@@|
--- Write Latency (us) ---
@write_latency:
[0, 1)                5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)                2 |@@@@@@@@@@@@@@@@|
[2, 4)                1 |@@@@@@@@|

This script uses the bpftrace built-in function hist(), which groups values into power-of-two buckets. Dividing the nanosecond difference by 1000 yields latencies in microseconds.
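
The two ideas in this script, pairing enter and exit events by thread ID and bucketing the result by powers of two, can be sketched in plain Python. This is an illustration of the logic only, not bpftrace's implementation; the function names and the sample events are made up for the example.

```python
from collections import Counter


def latencies_us(events):
    """Pair ("enter"/"exit", tid, nsecs) events per thread, like @start[tid]."""
    start = {}
    result = []
    for kind, tid, nsecs in events:
        if kind == "enter":
            start[tid] = nsecs
        elif tid in start:
            # Exits without a recorded start are skipped, like the /@start[tid]/ filter.
            result.append((nsecs - start.pop(tid)) // 1000)
    return result


def hist_bucket(value):
    """Return the power-of-two bucket (low, high) with high exclusive."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    if value == 0:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, low * 2)


# Hypothetical events: thread 3 exits without a matching enter.
events = [
    ("enter", 1, 1000), ("enter", 2, 1500),
    ("exit", 1, 4500), ("exit", 2, 9500), ("exit", 3, 9999),
]
histogram = Counter(hist_bucket(v) for v in latencies_us(events))

if __name__ == "__main__":
    for (low, high), count in sorted(histogram.items()):
        print(f"[{low}, {high}) {count}")
```

Keying the start time by thread ID works because a thread can only be inside one system call at a time.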

Example 3: Counting Network Bytes Sent/Received per Process

Identify which processes are consuming the most network bandwidth directly from the kernel network stack. This is invaluable for troubleshooting network performance in Kubernetes, especially for identifying noisy neighbors or misconfigured applications.


bpftrace -e '
tracepoint:net:net_dev_queue { @tx_bytes[comm] = sum(args->len); }
tracepoint:net:net_dev_receive_skb { @rx_bytes[comm] = sum(args->len); }

interval:s:5 {
  printf("\n--- Network Activity (Bytes/5s) ---\n");
  printf("TX Bytes:\n");
  print(@tx_bytes);
  printf("RX Bytes:\n");
  print(@rx_bytes);
  clear(@tx_bytes);
  clear(@rx_bytes);
}'

Expected Output (every 5 seconds, will vary based on network activity):


Attaching 3 probes...
--- Network Activity (Bytes/5s) ---
TX Bytes:
@tx_bytes[kube-proxy]: 12000
@tx_bytes[nginx]: 850000
@tx_bytes[kubelet]: 3400
RX Bytes:
@rx_bytes[kube-proxy]: 15000
@rx_bytes[nginx]: 1200000
@rx_bytes[kubelet]: 4000

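This script uses tracepoints from the net subsystem to sum packet lengths. Treat the per-process attribution as approximate: comm is whichever task is running when the tracepoint fires, and receive processing often runs in softirq context, so RX bytes in particular may be credited to an unrelated process. Kubernetes Network Policies control which traffic is allowed; bpftrace complements them by showing what actually flows.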
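
If you capture this output to a file or pipe, a few lines of Python can turn the printed map entries into per-second rates. The sketch below assumes the text format shown in the expected output (one @map[key]: value entry per line) and the 5-second interval used by the script; parse_map_lines and top_talkers are hypothetical helper names.

```python
import re

# Matches lines such as: @tx_bytes[nginx]: 850000
MAP_LINE = re.compile(r"^@(\w+)\[(.+)\]: (\d+)$")


def parse_map_lines(text):
    """Collect printed map entries into {map_name: {key: value}}."""
    result = {}
    for line in text.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            name, key, value = match.groups()
            result.setdefault(name, {})[key] = int(value)
    return result


def top_talkers(per_key, interval_s=5, n=3):
    """Convert byte totals to bytes per second, largest first."""
    rates = {key: value / interval_s for key, value in per_key.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:n]


sample = """\
TX Bytes:
@tx_bytes[kube-proxy]: 12000
@tx_bytes[nginx]: 850000
@tx_bytes[kubelet]: 3400
"""

if __name__ == "__main__":
    maps = parse_map_lines(sample)
    for comm, rate in top_talkers(maps["tx_bytes"]):
        print(f"{comm}: {rate:.0f} B/s")
```

Lines that are not map entries, such as the section headers printed by printf, are ignored by the parser.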

Step 5: Exporting Metrics and Integration (Optional)

While bpftrace is excellent for interactive debugging, for continuous monitoring, you might want to export these metrics to a time-series database like Prometheus. This usually involves piping bpftrace output to a script that parses it and exposes it via a Prometheus exporter.

A common pattern is to run a bpftrace script with an interval, parse its output using awk or python, and then push these metrics to a Pushgateway or expose them via a simple HTTP server that Prometheus can scrape.

For example, a simple Python script could wrap bpftrace:


# metrics_exporter.py
import re
import subprocess

from prometheus_client import Counter, start_http_server

# A Counter fits a value that only increases. pid is deliberately not a label:
# one time series per PID would grow without bound on a busy node.
process_exec_total = Counter('process_exec_total', 'Total number of execve calls', ['comm', 'uid'])

# Raw string so that \" and \n reach bpftrace unchanged instead of being
# interpreted by Python. comm is the name of the process calling execve.
BPFTRACE_SCRIPT = r"""
tracepoint:syscalls:sys_enter_execve {
    printf("METRIC:process_exec_total{comm=\"%s\",uid=\"%d\"} 1\n", comm, uid);
}
"""

# Example line: METRIC:process_exec_total{comm="bash",uid="1000"} 1
METRIC_RE = re.compile(r'METRIC:(\w+)\{comm="(.*?)",uid="(\d+)"\}\s(\d+)')

def run_bpftrace():
    process = subprocess.Popen(['bpftrace', '-e', BPFTRACE_SCRIPT],
                               stdout=subprocess.PIPE,
                               # An unread stderr pipe could fill up and stall bpftrace.
                               stderr=subprocess.DEVNULL,
                               text=True)

    # Iterating over stdout blocks until a line arrives, so no sleep is needed.
    for line in process.stdout:
        match = METRIC_RE.match(line)
        if match:
            metric_name, comm, uid, value = match.groups()
            if metric_name == 'process_exec_total':
                process_exec_total.labels(comm=comm, uid=uid).inc(int(value))

if __name__ == '__main__':
    start_http_server(8000) # Expose metrics on port 8000
    print("Prometheus metrics server started on port 8000.")
    run_bpftrace()

You would then deploy this Python script alongside bpftrace in your DaemonSet, exposing port 8000 for Prometheus to scrape. This approach allows you to build sophisticated custom monitoring dashboards based on kernel events.
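
An alternative to regex-parsing printf lines is bpftrace's JSON output mode (the -f json flag), which emits one JSON object per line. The sketch below assumes map records have the shape {"type": "map", "data": {"@name": {key: value}}}; verify this against the bpftrace version in your image before relying on it. The function names and the metric name bpftrace_tx_bytes are illustrative.

```python
import json


def maps_from_json_lines(lines):
    """Collect map records from bpftrace -f json output (assumed record shape)."""
    maps = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if isinstance(record, dict) and record.get("type") == "map":
            maps.update(record["data"])
    return maps


def to_exposition(metric, per_key, label="comm"):
    """Render one map as Prometheus text exposition lines (no label escaping)."""
    lines = [f"# TYPE {metric} gauge"]
    for key, value in sorted(per_key.items()):
        lines.append(f'{metric}{{{label}="{key}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical sample of what the JSON stream is assumed to look like.
sample_lines = [
    '{"type": "attached_probes", "data": {"probes": 3}}',
    '{"type": "map", "data": {"@tx_bytes": {"nginx": 850000, "kubelet": 3400}}}',
]

if __name__ == "__main__":
    maps = maps_from_json_lines(sample_lines)
    print(to_exposition("bpftrace_tx_bytes", maps["@tx_bytes"]), end="")
```

The rendered text can be served from any HTTP endpoint that Prometheus scrapes. A production version would also need to escape quotes and backslashes in label values.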

Production Considerations

  • Resource Overhead: While eBPF is designed to be efficient, poorly written bpftrace scripts can consume CPU. Be mindful of the probes you attach and the complexity of your actions. Test your scripts thoroughly in non-production environments.
  • Security: Running privileged containers with hostPID: true and privileged: true is a significant security risk. This should only be done for specific, well-understood debugging or monitoring purposes, and ideally, in a dedicated namespace with strict RBAC. Consider using Pod Security Standards or Kyverno/OPA to restrict such deployments. Our guide on Securing Container Supply Chains with Sigstore and Kyverno offers insights into policy enforcement.
  • Kernel Version Compatibility: eBPF features evolve with the Linux kernel. Newer kernels (5.x+) offer more tracepoints and eBPF capabilities. Ensure your nodes have reasonably up-to-date kernels for the best experience.
  • Data Volume and Latency: Interactive bpftrace is great, but for continuous collection, consider the volume of data. High-frequency events can generate a lot of output. If exporting, ensure your metric pipeline can handle the load.
  • Integration with Existing Monitoring: For production, integrate your custom bpftrace metrics into your existing observability stack (Prometheus, Grafana, ELK). This provides a unified view of your system.
  • Node Taints and Tolerations: The provided DaemonSet uses an Exists toleration to run on all nodes. If you have specific node groups or taints, adjust the tolerations accordingly. For example, if you’re tracing GPU-related events, you might only want it on nodes with GPUs, similar to how one might schedule LLM workloads with GPU scheduling.
  • Cleanup: Always ensure you have a plan for cleaning up your DaemonSet when it’s no longer needed to minimize the security blast radius.
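
To make the security consideration concrete, the following sketch flags the host-level settings in a pod spec that deserve review before deployment. It works on a plain dictionary (for example, a manifest loaded from YAML) and is only an illustration; it is not a substitute for Pod Security Standards or a policy engine such as Kyverno or OPA. The function name risky_settings is hypothetical.

```python
def risky_settings(pod_spec):
    """List host-level settings in a pod spec that widen the blast radius."""
    findings = []
    # Host namespace sharing.
    for field in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(field):
            findings.append(field)
    # Privileged containers.
    for container in pod_spec.get("containers", []):
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged"):
            findings.append(f"container {container.get('name')}: privileged")
    # Host filesystem mounts.
    for volume in pod_spec.get("volumes", []):
        host_path = volume.get("hostPath")
        if host_path:
            findings.append(f"hostPath {host_path['path']}")
    return findings


# The pod template from the DaemonSet in Step 2, reduced to the relevant fields.
bpftrace_spec = {
    "hostPID": True,
    "hostNetwork": True,
    "containers": [{"name": "bpftrace", "securityContext": {"privileged": True}}],
    "volumes": [
        {"name": "lib-modules", "hostPath": {"path": "/lib/modules"}},
        {"name": "sys-kernel-debug", "hostPath": {"path": "/sys/kernel/debug"}},
        {"name": "sys-fs-bpf", "hostPath": {"path": "/sys/fs/bpf"}},
    ],
}

if __name__ == "__main__":
    for finding in risky_settings(bpftrace_spec):
        print(finding)
```

Every item this reports for the bpftrace DaemonSet is intentional, which is why the deployment should be short-lived and tightly scoped.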

Troubleshooting

  1. Issue: bpftrace: failed to attach probe: Invalid argument or similar errors.

    Solution: This often indicates that the specified kernel probe (tracepoint, kprobe, uprobe) does not exist on your kernel version, or bpftrace doesn’t have the necessary debug info.

    • Verify the probe name using bpftrace -l.
    • Ensure /sys/kernel/debug is mounted and accessible (check the DaemonSet volume mounts).
    • Check your kernel version. Some probes are kernel version-specific.
    • Ensure the kernel has CONFIG_BPF_EVENTS and CONFIG_KPROBES enabled.
  2. Issue: bpftrace pod is stuck in Pending or CrashLoopBackOff.

    Solution:

    • Check pod events: kubectl describe pod <bpftrace-pod-name> -n kube-system.
    • Ensure the image is pulling correctly.
    • Verify host paths for volumes exist (/lib/modules, /sys/kernel/debug, /sys/fs/bpf) on the node.
    • If hostPID: true or privileged: true is rejected by your cluster’s admission controls (Pod Security Admission, or Pod Security Policies on clusters older than v1.25, where that API was removed), you might need to adjust them or use a different approach (e.g., node-level systemd service for bpftrace if K8s is too restrictive).
  3. Issue: bpftrace script runs but produces no output, even for common events.

    Solution:

    • Double-check your probe name for typos.
    • Ensure the event you’re tracing is actually occurring. For example, if tracing network activity, ensure there’s network traffic on the node.
    • Verify your filter condition (/filter/) is not too restrictive.
    • Sometimes, the kernel might not have debug symbols available, which can limit the information bpftrace can extract.
  4. Issue: High CPU usage on the node when running bpftrace.

    Solution:

    • Your bpftrace script might be too aggressive. Review the probes. Are you tracing very high-frequency events without sufficient filtering?
    • Reduce the frequency of probes, add more specific filters, or use aggregation functions like sum() or count() instead of printing every event.
    • Avoid expensive per-event work on hot paths, such as copying strings with str() or calling printf() for every event.
    • Increase the interval for interval:s:X probes.
  5. Issue: kubectl exec into bpftrace pod fails with permission errors.

    Solution:

    • kubectl exec is authorized against the user running kubectl, not the pod’s ServiceAccount, so the bpftrace ClusterRole does not affect it.
    • Verify your own user has permission to exec into pods in that namespace, for example with kubectl auth can-i create pods --subresource=exec -n kube-system.
  6. Issue: bpftrace doesn’t see all processes or network traffic from containers.

    Solution:

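    • Kernel probes such as tracepoints and kprobes fire for every process on the node regardless of namespaces. Ensure hostPID: true is set in your DaemonSet anyway: the PIDs bpftrace reports are host PIDs, and without it the pod’s /proc shows only its own PID namespace, so you cannot look those PIDs up or target host processes by PID.
    • Ensure hostNetwork: true is set if you want to cross-check network traces against the host’s interfaces and sockets from inside the pod. This gives the bpftrace pod access to the host’s network stack directly.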

FAQ Section

  1. Q: What is the difference between eBPF and bpftrace?

    A: eBPF is the underlying technology that allows programs to run in the Linux kernel. It’s a virtual machine that executes bytecode. bpftrace is a high-level language and tool that compiles its scripts into eBPF bytecode, making it much easier to write and deploy eBPF programs for tracing and monitoring without needing to write complex C code.

  2. Q: Is it safe to run privileged containers in Kubernetes?

    A: Generally, no. Running privileged containers with hostPID: true and privileged: true gives the container nearly unrestricted access to the host system, which is a significant security risk. It should only be done when absolutely necessary for specific tasks like kernel tracing, and with strict security controls in place (e.g., dedicated namespaces, RBAC, network policies, and potentially restricted Pod Security Standards to prevent other pods from gaining similar privileges). For guidance on securing your cluster, refer to our Kubernetes Network Policies: Complete Security Hardening Guide.

  3. Q: Can bpftrace slow down my production system?

    A: Yes, it can, but typically only if used improperly. eBPF is designed for low overhead. However, if your bpftrace script attaches to very high-frequency events and performs complex actions (especially string manipulations or extensive data collection), it can introduce noticeable overhead. Always test your scripts in a staging environment and monitor CPU usage. Start with simple scripts and gradually increase complexity.

  4. Q: How does bpftrace compare to traditional Linux tracing tools like strace or perf?

    A: strace is great for process-specific system call tracing but has high overhead and can’t trace kernel internals. perf is a powerful profiling tool for kernel and user-space but can be complex to use for specific event tracing. bpftrace, powered by eBPF, offers a unique blend: it has much lower overhead than strace, can trace a wider range of kernel events than strace, and is often simpler to write custom scripts for than perf, especially for aggregation and custom metrics. It excels at custom, on-demand, kernel-level observability.

  5. Q: Can I use bpftrace to trace application-level functions (e.g., in Python or Java)?

    A: Yes, you can! bpftrace supports uprobes, which allow you to attach to functions within user-space applications. This requires the application or its libraries to expose the function’s symbol, so fully stripped binaries are harder to trace. For interpreted languages, tracing can be more challenging but still possible by attaching to the interpreter’s internal functions or specific library calls. This level of tracing goes beyond kernel metrics but demonstrates the full power of eBPF.

Cleanup Commands

When you are finished with bpftrace, it’s important to remove the privileged DaemonSet from your cluster.


kubectl delete -f bpftrace-daemonset.yaml

This command will delete the DaemonSet, ServiceAccount, ClusterRole, and ClusterRoleBinding, effectively removing bpftrace from your cluster. Always ensure you clean up resources you no longer need, especially privileged ones.

Next Steps / Further Reading

  • Explore more bpftrace examples: The bpftrace tools directory on GitHub contains a wealth of pre-written scripts for various scenarios.
  • Dive deeper into eBPF: The eBPF.io website is an excellent resource for understanding the underlying technology.
  • Learn about other eBPF tools: Explore projects like BCC (BPF Compiler Collection) for more complex eBPF program development using Python or C.
  • Kubernetes Observability: Integrate your custom metrics with tools like Prometheus and Grafana for comprehensive dashboards. Consider exploring tools like eBPF Observability: Building Custom Metrics with Hubble for network-focused eBPF insights.
  • Service Mesh Integration: For advanced network traffic management and observability, look into service meshes like Istio. Our Istio Ambient Mesh Production Guide offers a deep dive into its capabilities.
  • Cost Optimization: Understanding resource usage at the kernel level can inform decisions for cost optimization, similar to how Karpenter Cost Optimization helps manage node resources efficiently.

Conclusion

eBPF and bpftrace provide an unparalleled window into the Linux kernel, offering a level of observability that traditional tools often cannot match. By deploying bpftrace as a privileged DaemonSet in your Kubernetes clusters, you gain the ability to craft custom kernel metrics on demand, diagnose complex performance issues, and identify obscure security anomalies. While the power comes with the responsibility of careful security and resource management, the insights gained can be invaluable for maintaining high-performing and stable Kubernetes environments. Embrace the power of kernel tracing and elevate your Kubernetes debugging capabilities to the next level.
