OpenTelemetry Auto-Instrumentation for Kubernetes

In the dynamic world of Kubernetes, understanding the behavior of your microservices is paramount. Traditional logging often falls short, providing only a fragmented view of complex distributed systems. This is where observability, and distributed tracing in particular, steps in. OpenTelemetry has emerged as the de facto standard for instrumenting applications, offering a vendor-neutral way to collect traces, metrics, and logs.

Manually instrumenting every application is a daunting and error-prone task, especially in a fast-paced development environment. Fortunately, OpenTelemetry provides powerful auto-instrumentation capabilities that let you gain deep insight into your Kubernetes workloads without touching application source code. This guide walks you through setting up OpenTelemetry auto-instrumentation in your cluster so you can automatically collect rich telemetry data from your applications.

TL;DR: OpenTelemetry Auto-Instrumentation

Automatically instrument your Kubernetes applications for OpenTelemetry by deploying the OpenTelemetry Operator and configuring auto-instrumentation injection. This guide covers deploying the collector, enabling injection via annotations, and verifying trace collection.


# Install the OpenTelemetry Operator
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

# Create an OpenTelemetry Collector instance (full manifest in Step 2 below)
kubectl apply -f otel-collector.yaml

Prerequisites

Before diving into auto-instrumentation, ensure you have the following:

  • A running Kubernetes cluster (v1.20+ recommended). You can use Minikube, Kind, or a cloud-managed service like GKE, EKS, or AKS.
  • kubectl configured to communicate with your cluster.
  • Basic understanding of Kubernetes Deployments, Services, and Custom Resource Definitions (CRDs).
  • Familiarity with OpenTelemetry concepts (traces, spans, collectors, exporters).

Step-by-Step Guide

1. Install the OpenTelemetry Operator

The OpenTelemetry Operator simplifies the deployment and management of OpenTelemetry components within your Kubernetes cluster. It provides CRDs for managing OpenTelemetry Collectors and auto-instrumentation configurations. The operator also registers a mutating admission webhook that injects the necessary init containers and environment variables for auto-instrumentation based on pod annotations.

First, install the operator from its official GitHub repository. This will create the necessary CRDs and deploy the operator itself into the opentelemetry-operator-system namespace. This is a crucial first step for enabling automatic injection of instrumentation agents into your application pods.


kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Verify: Check if the operator deployment is running and healthy.


kubectl -n opentelemetry-operator-system get pods

NAME                                             READY   STATUS    RESTARTS   AGE
opentelemetry-operator-controller-manager-f7...  1/1     Running   0          2m

2. Deploy the OpenTelemetry Collector

The OpenTelemetry Collector is a powerful, vendor-agnostic proxy that receives, processes, and exports telemetry data. It's a central component in your observability pipeline. For auto-instrumentation, applications will typically send their traces directly to the collector. Here, we'll deploy a basic collector configuration that receives OTLP (OpenTelemetry Protocol) data and exports it to the console using the logging exporter. In a production environment, you would configure it to export to your chosen backend (e.g., Jaeger, Prometheus, Splunk, Datadog).

We'll create an OpenTelemetryCollector custom resource, which the operator translates into a Kubernetes Deployment and Service. This gives your applications a stable endpoint to send their telemetry data to. Save the following manifest as otel-collector.yaml:


apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel # the operator appends "-collector", so the resulting Deployment and Service are named otel-collector
spec:
  mode: deployment
  image: otel/opentelemetry-collector-contrib:0.99.0 # Use contrib for more exporters/processors
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch: # Batch processor for efficiency
        send_batch_size: 100
        timeout: 1s
    exporters:
      logging: # Log traces to console for verification
        loglevel: debug
      # Example of a Jaeger exporter (uncomment for a real setup)
      # jaeger:
      #   endpoint: "jaeger-collector.jaeger.svc.cluster.local:14250"
      #   tls:
      #     insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Change to [jaeger] or other exporters as needed
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Add metrics exporters here
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Add logs exporters here

kubectl apply -f otel-collector.yaml

Verify: Ensure the collector deployment and service are up and running.


kubectl get deployments,services otel-collector

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/otel-collector   1/1     1            1           1m

NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/otel-collector   ClusterIP   10.96.123.45   <none>        4317/TCP,4318/TCP   1m

The collector is now listening on ports 4317 (gRPC) and 4318 (HTTP) for OTLP data.

3. Configure OpenTelemetry Auto-Instrumentation

The OpenTelemetry Operator uses an Instrumentation custom resource to define how auto-instrumentation should be applied to pods. This CR specifies which languages to instrument and how the agents should be configured. The operator then injects the necessary environment variables and agent JARs/libraries into your application pods based on these configurations and specific pod annotations.

We'll create an Instrumentation resource for Java, but similar configurations exist for other languages such as Python, Node.js, and .NET. This resource tells the operator how to set up the OpenTelemetry agent for Java applications. Save the following as instrumentation.yaml:


apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.default.svc.cluster.local:4317 # Collector endpoint
  propagators:
    - tracecontext
    - baggage
    - b3
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0 # Java agent image
  # python:
  #   image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:1.26.0
  # nodejs:
  #   image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:1.26.0
  # dotnet:
  #   image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.26.0

kubectl apply -f instrumentation.yaml

Verify: Check the status of the Instrumentation resource.


kubectl get instrumentation my-instrumentation

NAME               AGE
my-instrumentation 30s

4. Deploy a Sample Application with Auto-Instrumentation

Now, let's deploy a sample Java application and enable auto-instrumentation using Kubernetes annotations. The key annotation is instrumentation.opentelemetry.io/inject-java: "true". When the OpenTelemetry Operator sees this annotation on a new pod, it intercepts the pod creation request and injects the OpenTelemetry Java agent into the pod's container definition. This typically involves adding an initContainer to download the agent and setting environment variables like JAVA_TOOL_OPTIONS to load the agent.

We'll also set OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_ENDPOINT directly in the Deployment's environment variables. While the Instrumentation CR can provide a default endpoint, setting it explicitly here aids clarity. OTEL_RESOURCE_ATTRIBUTES is also useful for attaching service-level metadata to all collected telemetry. Save the following as sample-app.yaml:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-java-app
  labels:
    app: otel-java-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-java-app
  template:
    metadata:
      labels:
        app: otel-java-app
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true" # !! Critical annotation !!
    spec:
      containers:
      - name: java-app
        image: your-registry/your-java-app:latest # replace with an image of a Java HTTP app listening on port 8080
        ports:
        - containerPort: 8080
        env:
        - name: OTEL_SERVICE_NAME
          value: my-java-service # Name of the service for traces
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.default.svc.cluster.local:4317 # OTLP endpoint for the collector
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: service.namespace=default,environment=production # Additional resource attributes
---
apiVersion: v1
kind: Service
metadata:
  name: otel-java-app
spec:
  selector:
    app: otel-java-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

kubectl apply -f sample-app.yaml

Verify: Check the pod's description to see the injected environment variables and volumes related to OpenTelemetry. You should see an initContainer named opentelemetry-auto-instrumentation and environment variables like JAVA_TOOL_OPTIONS.


kubectl describe pod -l app=otel-java-app

...
Init Containers:
  opentelemetry-auto-instrumentation:
    Container ID:   containerd://...
    Image:          ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0
    Image ID:       ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java@sha256:...
    Port:           <none>
    Host Port:      <none>
    Command:
      cp
      /autoinstrumentation/opentelemetry-javaagent.jar
      /otel-auto-instrumentation-java/opentelemetry-javaagent.jar
    State:          Terminated
      Reason:       Completed
...
Containers:
  java-app:
    Container ID:  containerd://...
    Image:         your-registry/your-java-app:latest
    Image ID:      your-registry/your-java-app@sha256:...
    Port:          8080/TCP
    Host Port:     <none>
    Environment:
      OTEL_SERVICE_NAME:                   my-java-service
      OTEL_EXPORTER_OTLP_ENDPOINT:         http://otel-collector.default.svc.cluster.local:4317
      OTEL_RESOURCE_ATTRIBUTES:            service.namespace=default,environment=production
      JAVA_TOOL_OPTIONS:                   -javaagent:/otel-auto-instrumentation-java/opentelemetry-javaagent.jar
      OTEL_JAVAAGENT_AUTO_CONF_ENABLED:    "true"
      OTEL_JAVAAGENT_ARGS:                 
      OTEL_JAVAAGENT_DEBUG:                "false"
...

5. Generate Traffic and Verify Traces

Once the application is running, generate some traffic to it. The sample Java application we deployed has a simple HTTP endpoint. Making a request to this endpoint will trigger the auto-instrumented code, generating traces that are then sent to the OpenTelemetry Collector.

The collector, configured with the logging exporter, will print these traces to its standard output. This is an excellent way to confirm that auto-instrumentation is working correctly before integrating with a full-fledged observability backend.


# Forward a local port to the service
kubectl port-forward svc/otel-java-app 8080:80 &

# Make a request to the application
curl localhost:8080/hello

Hello from Java App!

Verify: Check the logs of the otel-collector deployment. You should see detailed trace information, including spans generated by your Java application.


kubectl logs -f deployment/otel-collector

...
2023-10-27T10:30:15.123Z        INFO    Traces  {"resource spans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-java-service"}}, {"key": "service.namespace", "value": {"stringValue": "default"}}, {"key": "environment", "value": {"stringValue": "production"}}, {"key": "host.arch", "value": {"stringValue": "amd64"}}, {"key": "os.type", "value": {"stringValue": "linux"}}, {"key": "os.description", "value": {"stringValue": "Linux 5.15.0-86-generic"}}, {"key": "process.pid", "value": {"intValue": 1}}, {"key": "process.runtime.name", "value": {"stringValue": "OpenJDK Runtime Environment"}}, {"key": "process.runtime.version", "value": {"stringValue": "17.0.8.1+1"}}, {"key": "process.command_args", "value": {"arrayValue": {"values": [{"stringValue": "java"}, {"stringValue": "-javaagent:/otel-auto-instrumentation-java/opentelemetry-javaagent.jar"}, {"stringValue": "-jar"}, {"stringValue": "/app/app.jar"}]}}}, {"key": "telemetry.sdk.name", "value": {"stringValue": "opentelemetry"}}, {"key": "telemetry.sdk.language", "value": {"stringValue": "java"}}, {"key": "telemetry.sdk.version", "value": {"stringValue": "1.26.0"}}, {"key": "k8s.pod.name", "value": {"stringValue": "otel-java-app-..."}}, {"key": "k8s.deployment.name", "value": {"stringValue": "otel-java-app"}}]}, "scope_spans": [{"scope": {"name": "io.opentelemetry.tomcat-10.0"}, "spans": [{"trace_id": "...", "span_id": "...", "parent_span_id": "...", "name": "HTTP GET /hello", "kind": "SPAN_KIND_SERVER", "start_time_unix_nano": "...", "end_time_unix_nano": "...", "attributes": [{"key": "http.method", "value": {"stringValue": "GET"}}, {"key": "http.target", "value": {"stringValue": "/hello"}}, {"key": "http.status_code", "value": {"intValue": 200}}, {"key": "net.host.port", "value": {"intValue": 8080}}], "status": {"code": "STATUS_CODE_OK"}}]}]}]}
...

The logs confirm that the Java application is successfully sending traces to the collector, and the traces contain relevant HTTP request information and resource attributes. This demonstrates the power of OpenTelemetry auto-instrumentation!

Production Considerations

While auto-instrumentation significantly reduces the effort to get started, deploying it in production requires careful planning:

  1. Collector Sizing and High Availability: For production, a single collector instance is a single point of failure. Deploy multiple collector instances, potentially as a StatefulSet or DaemonSet for node-local collection, and use Kubernetes Services for load balancing. Refer to the OpenTelemetry Collector sizing guidelines for resource allocation.
  2. Exporters: The logging exporter is for debugging. In production, configure the collector to export to your chosen observability backend (e.g., Jaeger, Prometheus, Loki, Datadog, New Relic, Splunk). Each backend has its specific exporter configuration.
  3. Resource Attributes and Naming Conventions: Standardize OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES across your applications. This ensures consistent tagging for easier filtering and analysis in your observability backend. Consider using a mutating webhook or policy engine like Kyverno to enforce these standards.
  4. Security:
    • Network Policies: Restrict communication to your OpenTelemetry Collector using Kubernetes Network Policies. Only allow application pods to send OTLP data to the collector.
    • TLS: Secure communication between applications and the collector, and between the collector and your backend, using TLS. The collector supports TLS configuration for both receivers and exporters.
    • Authentication: If your observability backend requires authentication, configure the appropriate authentication extensions in your collector.
  5. Performance Overhead: While auto-instrumentation is designed to be lightweight, it does introduce some overhead. Monitor your application's CPU and memory usage after instrumentation. Test thoroughly in non-production environments.
  6. Sampling: For high-volume applications, sending every trace can be overwhelming and costly. Implement sampling strategies in your OpenTelemetry Collector to reduce the amount of data exported without losing critical insights.
  7. Agent Updates: Keep your OpenTelemetry Operator and agent images updated to benefit from bug fixes, performance improvements, and new features.
  8. Custom Instrumentation: Auto-instrumentation provides a good baseline, but for business-critical logic or capturing specific application-level details, you might still need to add manual instrumentation to your code.
  9. Integration with other tools: Consider how OpenTelemetry data can enrich other observability tools. For instance, combining traces with metrics collected by Prometheus or logs centralized by Loki.
  10. Advanced Collector Configuration: The OpenTelemetry Collector is highly configurable. Explore processors like attributes, resourcedetection, memory_limiter, and k8sattributes to enrich, filter, and manage your telemetry data effectively. The k8sattributes processor is particularly useful for adding Kubernetes metadata (pod name, namespace, etc.) to your traces.
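Several of these recommendations can be combined in the collector pipeline itself. The following sketch assumes the contrib image used earlier, which ships the memory_limiter, k8sattributes, and probabilistic_sampler processors, and shows a more production-leaning traces pipeline than the minimal one from Step 2:

```yaml
# Sketch only: a hardened traces pipeline for otel/opentelemetry-collector-contrib.
processors:
  memory_limiter:           # refuse data before the collector itself runs out of memory
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes: {}         # adds k8s.pod.name, k8s.namespace.name, etc.
  probabilistic_sampler:    # keep roughly 10% of traces
    sampling_percentage: 10
  batch:
    send_batch_size: 100
    timeout: 1s
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter should run first in the chain; sample before batching
      processors: [memory_limiter, k8sattributes, probabilistic_sampler, batch]
      exporters: [logging]
```

Merge this into the config block of the OpenTelemetryCollector resource from Step 2. Note that the k8sattributes processor additionally requires RBAC rules granting the collector's service account read access to pod metadata; see the processor's documentation for the exact rules.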

Troubleshooting

  1. Issue: OpenTelemetry Operator Pod stuck in Pending state.

    Reason: Likely a resource constraint or incorrect RBAC permissions for the operator itself.

    Solution: Check the events for the operator pod and its deployment. Ensure your cluster has enough resources and that the operator's service account has the necessary permissions. Sometimes, a fresh install can resolve transient issues.

    
    kubectl -n opentelemetry-operator-system describe pod opentelemetry-operator-controller-manager-...
    kubectl -n opentelemetry-operator-system get role,rolebinding,clusterrole,clusterrolebinding
            
  2. Issue: Application pod not showing injected initContainer or environment variables.

    Reason: The admission webhook might not be working, the Instrumentation resource is misconfigured, or the pod annotation is incorrect.

    Solution:

    1. Verify the operator's webhook service (opentelemetry-operator-webhook-service) and the controller-manager deployment are running in the opentelemetry-operator-system namespace.
    2. Ensure the ValidatingWebhookConfiguration and MutatingWebhookConfiguration for the operator are correctly set up and targeting the right namespaces.
    3. Double-check the pod annotation, e.g., instrumentation.opentelemetry.io/inject-java: "true".
    4. Ensure the Instrumentation resource exists and specifies the correct language agent image.
    5. Restart the application pod (delete and recreate) after making changes to the Instrumentation resource or annotations, as webhooks only act on pod creation.
  3. Issue: OpenTelemetry Collector not receiving traces.

    Reason: Network connectivity issues, incorrect endpoint configuration, or the application isn't generating traces.

    Solution:

    1. Check collector logs for any errors (kubectl logs -f deployment/otel-collector).
    2. Verify the OTEL_EXPORTER_OTLP_ENDPOINT environment variable in your application pod matches the collector service's cluster IP and port (e.g., http://otel-collector.default.svc.cluster.local:4317).
    3. Ensure there are no Kubernetes Network Policies blocking traffic between the application and the collector.
    4. Confirm the application is actually making calls that would generate traces (e.g., HTTP requests, database queries).
  4. Issue: Traces are generated, but missing important details or context.

    Reason: Default auto-instrumentation might not cover all libraries, or context propagation is broken.

    Solution:

    1. Consult the OpenTelemetry documentation for your specific language agent to see which libraries are automatically instrumented.
    2. Ensure proper context propagation (e.g., tracecontext, baggage) is configured in your Instrumentation resource and is supported by all services in your trace path.
    3. Consider adding manual instrumentation for critical business logic or unsupported libraries.
    4. Verify OTEL_RESOURCE_ATTRIBUTES are set correctly to add useful metadata to your services.
  5. Issue: High resource consumption by application pods after instrumentation.

    Reason: The auto-instrumentation agent or the application itself might be generating too much telemetry data.

    Solution:

    1. Monitor CPU and memory usage of the instrumented pods.
    2. Implement sampling in the OpenTelemetry Collector to reduce the volume of traces exported.
    3. Review the application's code for excessive span creation or highly verbose logging that might trigger more traces.
    4. Ensure the OpenTelemetry agent version is optimized for performance.
  6. Issue: Collector logs show "unknown service" or similar errors when exporting to a backend.

    Reason: Incorrect exporter configuration in the OpenTelemetry Collector.

    Solution:

    1. Carefully review the documentation for your chosen observability backend and its OpenTelemetry exporter.
    2. Double-check the endpoint, authentication, and any TLS settings in the collector configuration.
    3. Ensure the backend service is reachable from the collector pod (e.g., correct hostname, port, and no network policies blocking traffic).
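For the NetworkPolicy checks above, it can help to write an explicit allow rule rather than debugging by elimination. This sketch assumes the collector pods carry the label app.kubernetes.io/name: otel-collector (verify with kubectl get pods --show-labels, as operator-managed labels can differ by version):

```yaml
# Sketch only: allow OTLP ingress to the collector from the sample app.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-to-collector
  namespace: default
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: otel-collector # assumed label; check your pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: otel-java-app
      ports:
        - protocol: TCP
          port: 4317 # OTLP gRPC
        - protocol: TCP
          port: 4318 # OTLP HTTP
```

Remember that once any NetworkPolicy selects the collector pods, all traffic not explicitly allowed is denied, so include every workload that should reach the collector.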

FAQ Section

  1. What is OpenTelemetry auto-instrumentation?

    OpenTelemetry auto-instrumentation refers to the process of automatically collecting telemetry data (traces, metrics, logs) from an application without modifying its source code. This is typically achieved by using language-specific agents (e.g., Java agent, Python agent) that hook into common libraries and frameworks at runtime to capture relevant events.

  2. Which programming languages are supported for auto-instrumentation by the OpenTelemetry Operator?

    The OpenTelemetry Operator commonly supports auto-instrumentation for popular languages like Java, Python, Node.js, and .NET. Support for other languages may vary or be under active development. Always check the official OpenTelemetry Operator GitHub repository for the latest supported languages and versions.

  3. How does the OpenTelemetry Operator inject the instrumentation agent?

    The OpenTelemetry Operator uses a Mutating Admission Webhook. When a new pod is created with specific annotations (e.g., instrumentation.opentelemetry.io/inject-java: "true"), the webhook intercepts the pod definition and modifies it. This modification typically involves injecting an initContainer to download the agent and setting environment variables (like JAVA_TOOL_OPTIONS for Java) to load the agent at application startup.

  4. Can I combine auto-instrumentation with manual instrumentation?

    Yes, absolutely! Auto-instrumentation provides a great baseline for common operations (HTTP requests, database calls). For deeper insights into specific business logic, custom functions, or to add more context to existing spans, you can add manual instrumentation to your application code. The data from both will be correlated into the same traces.

  5. What are the alternatives to the OpenTelemetry Operator for auto-instrumentation in Kubernetes?

    While the OpenTelemetry Operator is the standard for managing OpenTelemetry components in Kubernetes, alternative approaches for auto-instrumentation include:

    • Manual Injection: Modifying your Dockerfiles or Kubernetes manifests directly to include the agent and environment variables.
    • Service Mesh Integration: Some service meshes like Istio Ambient Mesh or Linkerd can generate telemetry data, though it might not always be OpenTelemetry native or as detailed as language-specific agents.
    • Cloud Provider Specific Solutions: Cloud providers sometimes offer their own agents or integrations for their observability platforms, which may or may not be OpenTelemetry-compatible.
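To make the "manual injection" alternative concrete, here is a hedged sketch of a Deployment pod template fragment that reproduces what the operator's webhook does: an init container copies the agent JAR (at the path shown in the kubectl describe output earlier) into a shared volume, and JAVA_TOOL_OPTIONS loads it at startup. The application image is a placeholder:

```yaml
# Sketch only: manual agent injection without the operator.
spec:
  volumes:
    - name: otel-agent
      emptyDir: {}
  initContainers:
    - name: copy-otel-agent
      image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0
      command:
        - cp
        - /autoinstrumentation/opentelemetry-javaagent.jar
        - /otel/opentelemetry-javaagent.jar
      volumeMounts:
        - name: otel-agent
          mountPath: /otel
  containers:
    - name: java-app
      image: your-registry/your-java-app:latest # placeholder app image
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/otel/opentelemetry-javaagent.jar"
        - name: OTEL_SERVICE_NAME
          value: my-java-service
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.default.svc.cluster.local:4317
      volumeMounts:
        - name: otel-agent
          mountPath: /otel
```

The trade-off is maintenance: you now own agent upgrades and endpoint configuration in every Deployment, which the Instrumentation CR would otherwise centralize.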

Cleanup Commands

To remove all resources created during this guide:


# Delete sample application
kubectl delete -f sample-app.yaml

# Delete instrumentation resource
kubectl delete -f instrumentation.yaml

# Delete OpenTelemetry Collector
kubectl delete -f otel-collector.yaml

# Uninstall the OpenTelemetry Operator
kubectl delete -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

# (Optional) Clean up the opentelemetry-operator-system namespace if it's empty
kubectl delete namespace opentelemetry-operator-system

Conclusion

OpenTelemetry auto-instrumentation in Kubernetes significantly streamlines the process of gaining deep observability into your microservices. By leveraging the OpenTelemetry Operator, you can effortlessly inject language-specific agents into your applications, collecting rich trace data without altering your source code. This empowers development and operations teams with critical insights into application behavior, performance bottlenecks, and distributed system interactions, ultimately leading to faster debugging and more reliable services. Embrace OpenTelemetry to build a truly observable and resilient Kubernetes environment.
