
Kubernetes Tracing: Jaeger & OpenTelemetry

Introduction

In the complex, distributed landscape of modern microservice architectures running on Kubernetes, understanding the flow of requests and pinpointing performance bottlenecks can be a daunting challenge. A single user request might traverse dozens of services, databases, and external APIs, making traditional logging and monitoring insufficient for granular insights. This is where distributed tracing shines, providing an end-to-end view of a request’s journey through your system.

Distributed tracing helps you visualize the entire lifecycle of a request, breaking it down into individual operations (spans) and showing their relationships, timing, and any associated metadata. This invaluable observability data allows developers and operations teams to quickly identify latency issues, errors, and performance degradation across services. In this guide, we’ll delve into implementing distributed tracing within your Kubernetes clusters using two powerful CNCF projects: Jaeger for trace storage and visualization, and OpenTelemetry as the vendor-neutral instrumentation standard.

By the end of this tutorial, you’ll have a robust tracing infrastructure in place, enabling you to gain unprecedented visibility into your Kubernetes-native applications. This setup will not only enhance your ability to debug and optimize but also provide a clearer understanding of your microservices’ interactions, transforming how you approach performance analysis and issue resolution in your distributed environment.

TL;DR: Kubernetes Tracing with Jaeger & OpenTelemetry

Deploy Jaeger and OpenTelemetry Collector to trace microservices in Kubernetes. Instrument your applications with OpenTelemetry SDKs, configure them to send traces to the Collector, and then visualize end-to-end request flows in the Jaeger UI.

# 1. Install the Jaeger Operator (requires cert-manager)
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.54.0/jaeger-operator.yaml -n observability

# 2. Deploy a Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-jaeger
spec:
  strategy: allInOne
EOF

Prerequisites

Before we embark on our tracing journey, ensure you have the following in place:

  • Kubernetes Cluster: A running Kubernetes cluster (v1.20+ recommended). This can be a local cluster like Kind or Minikube, or a cloud-managed service such as GKE, EKS, or AKS. For production, consider a multi-node cluster.
  • kubectl: The Kubernetes command-line tool, configured to connect to your cluster. You can find installation instructions in the official Kubernetes documentation.
  • Helm (Optional but Recommended): While we'll use raw Kubernetes manifests for core components, Helm simplifies the deployment of complex applications. Download it from the Helm website.
  • Basic Understanding of Kubernetes Concepts: Familiarity with Deployments, Services, ConfigMaps, and Namespaces will be helpful. If you're new to Kubernetes, check out the Kubernetes Concepts overview.
  • Application to Trace: You'll need a sample application (or your own microservice) that you can modify to include OpenTelemetry instrumentation. We'll provide a simple Python example later.

Step-by-Step Guide: Kubernetes Tracing with Jaeger and OpenTelemetry

This guide will walk you through deploying Jaeger, setting up the OpenTelemetry Collector, and instrumenting a sample application to send traces.

Step 1: Deploy the Jaeger Operator

The Jaeger Operator simplifies the deployment and management of Jaeger instances in Kubernetes. It understands Jaeger's specific needs and acts as a controller, creating and configuring the necessary Kubernetes resources for you. This allows for easier scaling, updates, and maintenance of your tracing infrastructure.

By deploying the operator, you gain a declarative way to manage Jaeger. Instead of manually creating Deployments, Services, and PersistentVolumes, you define a custom resource (CR) of kind `Jaeger`, and the operator handles the rest. This aligns perfectly with the Kubernetes philosophy of desired state management. Note that recent operator releases depend on cert-manager; install it in your cluster before applying the operator manifest.

kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.54.0/jaeger-operator.yaml -n observability

Verify:

Check if the Jaeger Operator deployment is running in the `observability` namespace (or `default` if you're using an older version or different configuration). Look for a pod named `jaeger-operator-...`.

kubectl get deployments -n observability

Expected Output (name and namespace might vary slightly based on operator version):

NAME              READY   UP-TO-DATE   AVAILABLE   AGE
jaeger-operator   1/1     1            1           2m

kubectl get pods -n observability

Expected Output:

NAME                               READY   STATUS    RESTARTS   AGE
jaeger-operator-5dc9c7c4f4-abcde   1/1     Running   0          2m

Step 2: Deploy a Jaeger Instance

Now that the operator is running, we can deploy a Jaeger instance using a Custom Resource (CR). For simplicity, we'll start with an `allInOne` strategy, which combines all Jaeger components (collector, query, agent, and in-memory storage) into a single pod. This is suitable for development and testing environments. For production, you would typically use a `production` strategy with separate components and persistent storage.

The `allInOne` deployment is quick to set up and provides a fully functional Jaeger UI and collector endpoint. We specify an `image` for the Jaeger instance and add `query.base-path: /jaeger` to ensure the UI is accessible under a subpath, which is useful when exposing it via an Ingress or Gateway. For more advanced networking configurations, you might explore our guide on the Kubernetes Gateway API.

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-jaeger
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:1.54
    options:
      query.base-path: /jaeger

Save the manifest above as jaeger-instance.yaml, then apply it:

kubectl apply -f jaeger-instance.yaml

Verify:

Check if the Jaeger pod and associated services are created and running. The operator will create a Deployment, Service, and potentially other resources based on the `Jaeger` CR.

kubectl get pods -l app=jaeger

Expected Output:

NAME                             READY   STATUS    RESTARTS   AGE
simple-jaeger-674f849649-abcde   1/1     Running   0          1m

kubectl get svc -l app=jaeger

Expected Output:

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                  AGE
simple-jaeger-agent       ClusterIP   None             <none>        5775/UDP,5778/UDP,6831/UDP,6832/UDP      1m
simple-jaeger-collector   ClusterIP   10.96.100.101    <none>        4317/TCP,4318/TCP,9411/TCP,14250/TCP,14268/TCP,14269/TCP   1m
simple-jaeger-query       ClusterIP   10.96.200.202    <none>        16686/TCP                                1m

Step 3: Deploy OpenTelemetry Collector

The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data (traces, metrics, and logs). It's a crucial component in our setup, acting as an intermediary between your instrumented applications and Jaeger. This decouples your applications from the backend, allowing you to change tracing backends (e.g., from Jaeger to Zipkin or a commercial SaaS offering) without re-instrumenting your code.

We'll deploy the collector as a Kubernetes Deployment. It exposes OTLP (OpenTelemetry Protocol) gRPC and HTTP endpoints for applications to send their traces. The collector's configuration, defined in a ConfigMap, will instruct it to export these traces to our Jaeger collector service. This architecture promotes flexibility and reduces the resource footprint on your application pods.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector:0.99.0
        command: ["/otelcol"]
        args: ["--config=/etc/otelcol-config.yaml"]
        ports:
        - name: otlp-grpc
          containerPort: 4317
        - name: otlp-http
          containerPort: 4318
        volumeMounts:
        - name: otel-collector-config
          mountPath: /etc/otelcol-config.yaml
          subPath: otelcol-config.yaml
      volumes:
      - name: otel-collector-config
        configMap:
          name: otel-collector-config

Save the manifest above as otel-collector-deployment.yaml, then apply it:

kubectl apply -f otel-collector-deployment.yaml

Verify:

Ensure the OpenTelemetry Collector pod is running.

kubectl get pods -l app=otel-collector

Expected Output:

NAME                             READY   STATUS    RESTARTS   AGE
otel-collector-84f7b447c-vwxyz   1/1     Running   0          1m
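
The manifest above creates only a Deployment, but the applications in later steps send traces to the DNS name `otel-collector.default.svc.cluster.local`, which requires a matching Service. A minimal sketch (ports matching the container's OTLP ports):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
  - name: otlp-grpc
    protocol: TCP
    port: 4317
    targetPort: 4317
  - name: otlp-http
    protocol: TCP
    port: 4318
    targetPort: 4318
```

Apply it alongside the Deployment so the collector is reachable in-cluster.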

Step 4: Configure OpenTelemetry Collector with ConfigMap

The OpenTelemetry Collector needs a configuration to define how it receives, processes, and exports telemetry data. We'll provide this configuration via a Kubernetes ConfigMap, which is then mounted into the collector pod. This configuration specifies an OTLP receiver (for gRPC and HTTP) and a Jaeger exporter.

The collector forwards traces to Jaeger over OTLP. Note that recent collector releases removed the dedicated `jaeger` exporter; since Jaeger v1.35 ingests OTLP natively, we use the `otlp` exporter instead, pointed at the `simple-jaeger-collector` service created by the Jaeger Operator. The endpoint is `simple-jaeger-collector.default.svc.cluster.local:4317` (Jaeger's OTLP gRPC port), using standard Kubernetes DNS for service discovery. We set `insecure: true` for simplicity in this tutorial; in production, you would configure proper TLS. The `service.pipelines.traces` section connects the `otlp` receiver to the `otlp` exporter, ensuring all incoming traces are forwarded to Jaeger.

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  otelcol-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      otlp:
        endpoint: simple-jaeger-collector.default.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp]

Save the manifest as otel-collector-config.yaml, then apply it:

kubectl apply -f otel-collector-config.yaml

Verify:

Check if the ConfigMap was created. If you already applied the collector deployment, you might need to restart the collector pod for the new ConfigMap to take effect.

kubectl get configmap otel-collector-config

Expected Output:

NAME                    DATA   AGE
otel-collector-config   1      1m

To restart the collector so it picks up the new ConfigMap, trigger a rollout (deleting the pod also works, since the Deployment recreates it):

kubectl rollout restart deployment/otel-collector

Step 5: Expose Jaeger UI

To access the Jaeger UI from outside the cluster, we need to expose its `simple-jaeger-query` service. For development and testing, a `NodePort` service is a quick way to do this. In a production environment, you would typically use an Ingress Controller (like NGINX Ingress or Traefik) or the Kubernetes Gateway API for more robust and secure access.

The `NodePort` service will expose Jaeger's UI (port `16686`) on a high-numbered port on each node in your cluster. You can then access it via `http://<NODE_IP>:<NODE_PORT>`. Remember the `query.base-path: /jaeger` option we set earlier; you'll need to append `/jaeger` to the URL.

apiVersion: v1
kind: Service
metadata:
  name: simple-jaeger-query-nodeport
spec:
  type: NodePort
  selector:
    app: jaeger
    jaeger-query: simple-jaeger
  ports:
    - protocol: TCP
      port: 16686
      targetPort: 16686
      nodePort: 30080 # Choose an available NodePort, e.g., 30000-32767

Save the manifest above as jaeger-ui-nodeport.yaml, then apply it:

kubectl apply -f jaeger-ui-nodeport.yaml

Verify:

Get the NodePort and a node's IP to access the UI.

kubectl get svc simple-jaeger-query-nodeport

Expected Output:

NAME                           TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
simple-jaeger-query-nodeport   NodePort   10.96.150.150   <none>        16686:30080/TCP   1m

Find a node's IP address:

kubectl get nodes -o wide

Then, navigate to `http://<NODE_IP>:30080/jaeger` in your browser.

Step 6: Instrument Your Application with OpenTelemetry

This is the core step where you modify your application to generate traces. OpenTelemetry provides SDKs for various languages. We'll demonstrate with a simple Python Flask application. The key is to configure the SDK to send traces to the OpenTelemetry Collector we deployed.

The application needs to:

  1. Install OpenTelemetry SDKs/instrumentation libraries.
  2. Initialize a TracerProvider.
  3. Configure an OTLP Span Exporter pointing to the OpenTelemetry Collector.
  4. Apply automatic or manual instrumentation to generate spans.

Here's a sample Python Flask application and its Kubernetes Deployment:

Python Application (app.py):

from flask import Flask, request
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests
import os

# 1. Configure OpenTelemetry
resource = Resource.create({
    "service.name": os.environ.get("OTEL_SERVICE_NAME", "my-python-app"),
    "environment": "development"
})

trace.set_tracer_provider(
    TracerProvider(
        resource=resource
    )
)

# Configure the OTLP exporter to send traces to the OpenTelemetry Collector
otlp_exporter = OTLPSpanExporter(
    endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector.default.svc.cluster.local:4317"),
    insecure=True # Use insecure for simplicity in this tutorial
)

# Add the exporter to the TracerProvider
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(otlp_exporter)
)

app = Flask(__name__)

# 2. Instrument Flask and Requests libraries
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.route("/")
def hello_world():
    with tracer.start_as_current_span("hello-endpoint"):
        # Make an internal HTTP request to demonstrate chained traces
        try:
            response = requests.get("http://localhost:5000/internal")
            internal_status = response.status_code
        except requests.exceptions.ConnectionError:
            internal_status = "error"

        return f"Hello, World! Internal call status: {internal_status}"

@app.route("/internal")
def internal_endpoint():
    with tracer.start_as_current_span("internal-endpoint"):
        # Simulate some work
        import time
        time.sleep(0.05)
        return "This is an internal endpoint."

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

requirements.txt:

Flask==2.0.3
Werkzeug==2.0.3
opentelemetry-api==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-instrumentation-flask==0.45b0
opentelemetry-instrumentation-requests==0.45b0
requests==2.28.1

Dockerfile:

FROM python:3.9-slim-buster

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY app.py .

ENV OTEL_SERVICE_NAME=my-python-app
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.default.svc.cluster.local:4317

EXPOSE 5000

CMD ["python", "app.py"]

Build and push the Docker image (replace `your-docker-repo` with your actual repo):

docker build -t your-docker-repo/my-python-app:latest .
docker push your-docker-repo/my-python-app:latest

Kubernetes Deployment (my-app-deployment.yaml):

Note the environment variables `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` pointing to our collector.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-python-app
  labels:
    app: my-python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-python-app
  template:
    metadata:
      labels:
        app: my-python-app
    spec:
      containers:
      - name: my-python-app
        image: your-docker-repo/my-python-app:latest # REPLACE WITH YOUR IMAGE
        ports:
        - containerPort: 5000
        env:
        - name: OTEL_SERVICE_NAME
          value: my-python-app
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.default.svc.cluster.local:4317 # Ensure this matches your collector service
---
apiVersion: v1
kind: Service
metadata:
  name: my-python-app-service
spec:
  selector:
    app: my-python-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000
  type: ClusterIP
kubectl apply -f my-app-deployment.yaml

Verify:

Access your application and check Jaeger UI.

kubectl get pods -l app=my-python-app

Expected Output:

NAME                              READY   STATUS    RESTARTS   AGE
my-python-app-76c944f79c-pqrst    1/1     Running   0          1m

Now, send some requests to your application. If you have a NodePort or Ingress for it, use that. Otherwise, you can port-forward:

kubectl port-forward svc/my-python-app-service 8080:80

Then, in another terminal, hit the endpoint:

curl http://localhost:8080

Repeat this a few times. Now, open your Jaeger UI (`http://<NODE_IP>:30080/jaeger` or your Ingress URL). Select `my-python-app` from the "Service" dropdown and click "Find Traces". You should see traces representing your requests, with spans for "hello-endpoint" and "internal-endpoint".

This demonstrates end-to-end tracing, allowing you to visualize the flow of requests and identify performance characteristics of different parts of your application. For more advanced network security and isolation, consider implementing Kubernetes Network Policies around your tracing components and applications.
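
You can also sanity-check traces programmatically against the HTTP API that backs the Jaeger UI (an internal API, so treat it as best-effort; the `<NODE_IP>` and NodePort below are placeholders from the setup in this guide):

```python
def jaeger_traces_url(base: str, service: str, limit: int = 20) -> str:
    """Build the URL the Jaeger UI itself uses to list recent traces."""
    return f"{base}/api/traces?service={service}&limit={limit}"

# With the NodePort setup above (replace <NODE_IP> with a real node address):
url = jaeger_traces_url("http://<NODE_IP>:30080/jaeger", "my-python-app")
print(url)
# Fetch it with requests.get(url).json()["data"] for a list of raw traces.
```

An empty `data` list after sending requests is a quick signal that spans are being dropped somewhere between the app, the collector, and Jaeger.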

Production Considerations

Deploying tracing in production requires more robustness and scalability than our `allInOne` development setup.

  • Jaeger Deployment Strategy: Switch from `allInOne` to a `production` strategy for Jaeger. This deploys separate components:
    • Jaeger Collector: Receives traces.
    • Jaeger Query: Serves the UI and API.
    • Jaeger Agent: Can run as a DaemonSet on each node to collect traces from local applications and forward them to the collector. This reduces network overhead and provides a local buffer.
    • Storage Backend: For production, you must use a persistent storage backend like Elasticsearch, Cassandra, or an external database. The `allInOne` uses in-memory storage, which is not durable. Configure this in your `Jaeger` CR. For example, using Elasticsearch:
      spec:
        strategy: production
        storage:
          type: elasticsearch
          options:
            es:
              server-urls: http://elasticsearch-master:9200
      
  • OpenTelemetry Collector Deployment:
    • Deployment Strategy: For high-traffic services, consider deploying the collector as a `DaemonSet` on each node or as a sidecar alongside your application pods.
      • DaemonSet: Each node runs a collector. Applications on that node send traces to the local collector. This is efficient for host-level telemetry.
      • Sidecar: Each application pod has its own collector sidecar. Applications send traces to `localhost`. This provides strong isolation and resource management per application.
    • Resource Limits: Set appropriate CPU and memory limits for your collector pods to prevent resource exhaustion.
    • Horizontal Scaling: If using a Deployment for the collector, you might need to scale it horizontally with `HPA` based on CPU or memory usage, especially if it's a central bottleneck.
  • Network Security:
    • TLS: Enable TLS for OTLP endpoints on the OpenTelemetry Collector and for communication between the collector and Jaeger.
    • Network Policies: Implement Kubernetes Network Policies to restrict traffic to and from your tracing components, ensuring only authorized services can send or receive tracing data. For example, allow only application pods to send to the collector, and only the collector to send to Jaeger.
    • Ingress/Gateway: Expose the Jaeger UI securely using an Ingress controller or the Kubernetes Gateway API, with proper authentication and authorization.
  • Sampling: In high-volume environments, collecting 100% of traces can be costly and generate too much data. Implement sampling strategies in the OpenTelemetry Collector to reduce the volume of traces while still capturing representative samples. This can be configured in the collector's `processors` section.
  • Resource Management: Monitor the resource consumption of Jaeger and OpenTelemetry Collector components. Adjust `requests` and `limits` in your Kubernetes manifests. For cost optimization, especially with underlying nodes, tools like Karpenter can help manage node resources dynamically.
  • Observability of Observability: Monitor your Jaeger and OpenTelemetry Collector instances themselves. Are they healthy? Are they dropping traces? Integrate their metrics into your existing monitoring solution. You can leverage tools like eBPF Observability with Hubble to gain deeper insights into network interactions if you're using Cilium.
  • Authentication/Authorization: Secure access to the Jaeger UI and API with appropriate authentication and authorization mechanisms, especially if exposed externally.
  • Sidecar Injection with Service Mesh: If you're already using a service mesh like Istio (e.g., Istio Ambient Mesh), it can often handle automatic instrumentation and trace context propagation, reducing the need for manual application-level changes.
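
As a concrete example of the sampling point above, here is a hedged sketch of a collector config fragment using the `probabilistic_sampler` processor to keep roughly 10% of traces. Note this processor ships in the collector-contrib distribution, and you should adjust the exporter name to match your own pipeline:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # keep ~10% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
```

Head-based sampling like this is cheap but blind to outcomes; tail-based sampling (deciding after a trace completes, e.g. keeping all error traces) is also available in contrib at a higher resource cost.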

Troubleshooting

Here are some common issues you might encounter and their solutions:

  1. No Traces Appearing in Jaeger UI

    Issue: You've instrumented your app and sent requests, but Jaeger UI shows "No traces found."

    Solution:

    1. Check Application Logs: Look for errors in your application's logs related to OpenTelemetry or trace export.
    2. Verify Collector Connectivity: Ensure your application's `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable correctly points to the OpenTelemetry Collector service (e.g., `http://otel-collector.default.svc.cluster.local:4317`).
    3. Check Collector Logs: Inspect the logs of the `otel-collector` pod (`kubectl logs -l app=otel-collector`). Look for export errors or refused connections to the Jaeger endpoint.
