OpenTelemetry Auto-Instrumentation for Kubernetes
In the dynamic world of Kubernetes, understanding the behavior of your microservices is paramount. Traditional logging often falls short, providing only a fragmented view of complex distributed systems. This is where observability, particularly distributed tracing, steps in. OpenTelemetry has emerged as the de facto standard for instrumenting applications, offering a vendor-neutral way to collect traces, metrics, and logs.
Manually instrumenting every application can be a daunting and error-prone task, especially in a fast-paced development environment. Fortunately, OpenTelemetry provides powerful auto-instrumentation capabilities, allowing you to gain deep insights into your Kubernetes workloads with minimal code changes. This guide will walk you through setting up OpenTelemetry auto-instrumentation in your Kubernetes cluster, enabling you to automatically collect rich telemetry data from your applications without modifying their source code.
TL;DR: OpenTelemetry Auto-Instrumentation
Automatically instrument your Kubernetes applications for OpenTelemetry by deploying the OpenTelemetry Operator and configuring auto-instrumentation injection. This guide covers deploying the collector, enabling injection via annotations, and verifying trace collection.
# Install the OpenTelemetry Operator
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
# Create an OpenTelemetry Collector instance
kubectl apply -f otel-collector.yaml # manifest shown in Step 2
Prerequisites
Before diving into auto-instrumentation, ensure you have the following:
- A running Kubernetes cluster (v1.20+ recommended). You can use Minikube, Kind, or a cloud-managed service like GKE, EKS, or AKS.
- kubectl configured to communicate with your cluster.
- Basic understanding of Kubernetes Deployments, Services, and Custom Resource Definitions (CRDs).
- Familiarity with OpenTelemetry concepts (traces, spans, collectors, exporters).
Step-by-Step Guide
1. Install the OpenTelemetry Operator
The OpenTelemetry Operator simplifies the deployment and management of OpenTelemetry components within your Kubernetes cluster. It provides CRDs for managing OpenTelemetry Collectors and auto-instrumentation configurations. This operator acts as an admission controller, injecting the necessary sidecars or init containers for auto-instrumentation based on pod annotations.
First, install the operator from its official GitHub repository. Note that the operator's admission webhooks need TLS certificates, and the default manifest expects cert-manager to be installed in the cluster beforehand. Applying the manifest creates the necessary CRDs and deploys the operator itself into the opentelemetry-operator-system namespace. This is a crucial first step for enabling automatic injection of instrumentation agents into your application pods.
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
Verify: Check if the operator deployment is running and healthy.
kubectl -n opentelemetry-operator-system get pods
NAME READY STATUS RESTARTS AGE
opentelemetry-operator-controller-manager-f7... 1/1 Running 0 2m
2. Deploy the OpenTelemetry Collector
The OpenTelemetry Collector is a powerful, vendor-agnostic proxy that receives, processes, and exports telemetry data. It's a central component in your observability pipeline. For auto-instrumentation, applications will typically send their traces directly to the collector. Here, we'll deploy a basic collector configuration that receives OTLP (OpenTelemetry Protocol) data and exports it to the console using the logging exporter. In a production environment, you would configure it to export to your chosen backend (e.g., Jaeger, Prometheus, Splunk, Datadog).
We'll create an OpenTelemetryCollector custom resource, which the operator will then translate into a Kubernetes Deployment and Service. This setup ensures that your applications have a stable endpoint to send their telemetry data.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel # the operator appends "-collector", so the Deployment/Service are named otel-collector
spec:
  mode: deployment
  image: otel/opentelemetry-collector-contrib:0.99.0 # Use contrib for more exporters/processors
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch: # Batch processor for efficiency
        send_batch_size: 100
        timeout: 1s
    exporters:
      logging: # Log traces to console for verification
        verbosity: detailed
      # Example: export to Jaeger via OTLP (recent collector releases dropped the
      # dedicated jaeger exporter; Jaeger accepts OTLP natively)
      # otlp/jaeger:
      #   endpoint: "jaeger-collector.jaeger.svc.cluster.local:4317"
      #   tls:
      #     insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Change to [otlp/jaeger] or other exporters as needed
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Add metrics exporters here
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Add logs exporters here
# Save the manifest above as otel-collector.yaml, then apply it
kubectl apply -f otel-collector.yaml
Verify: Ensure the collector deployment and service are up and running.
kubectl get deployments,services otel-collector
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/otel-collector 1/1 1 1 1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/otel-collector ClusterIP 10.96.123.45 <none> 4317/TCP,4318/TCP 1m
The collector is now listening on ports 4317 (gRPC) and 4318 (HTTP) for OTLP data.
3. Configure OpenTelemetry Auto-Instrumentation
The OpenTelemetry Operator uses an Instrumentation custom resource to define how auto-instrumentation should be applied to pods. This CR specifies which languages to instrument and how the agents should be configured. The operator then injects the necessary environment variables and agent JARs/libraries into your application pods based on these configurations and specific pod annotations.
We'll create an Instrumentation resource for Java, but similar configurations exist for other languages like Python, Node.js, and .NET. This resource tells the operator how to set up the OpenTelemetry agent for Java applications.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.default.svc.cluster.local:4317 # Collector endpoint
  propagators:
    - tracecontext
    - baggage
    - b3
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0 # Java agent image
  # python:
  #   image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:1.26.0
  # nodejs:
  #   image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:1.26.0
  # dotnet:
  #   image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.26.0
# Save the manifest above as instrumentation.yaml, then apply it
kubectl apply -f instrumentation.yaml
Verify: Check the status of the Instrumentation resource.
kubectl get instrumentation my-instrumentation
NAME AGE
my-instrumentation 30s
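Beyond the exporter endpoint and propagators, the Instrumentation CR can also carry defaults that the operator injects into every instrumented pod, such as a sampler and extra environment variables. A sketch (the values here are illustrative, not recommendations):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  sampler:
    type: parentbased_traceidratio # honor the parent's decision, else sample by ratio
    argument: "0.25"               # keep roughly 25% of root traces
  env:
    - name: OTEL_TRACES_EXPORTER
      value: otlp
```

Settings defined here apply to every pod the operator instruments, which keeps per-Deployment configuration to a minimum.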
4. Deploy a Sample Application with Auto-Instrumentation
Now, let's deploy a sample Java application and enable auto-instrumentation using Kubernetes annotations. The key annotation is instrumentation.opentelemetry.io/inject-java: "true". When the OpenTelemetry Operator sees this annotation on a new pod, it intercepts the pod creation request and injects the OpenTelemetry Java agent into the pod's container definition. This typically involves adding an initContainer that copies the agent jar into a shared volume and setting environment variables like JAVA_TOOL_OPTIONS to load the agent.
We'll also set OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_ENDPOINT directly in the Deployment's environment variables. While the Instrumentation CR can provide a default endpoint, explicitly setting it here ensures clarity. The OTEL_RESOURCE_ATTRIBUTES are also useful for adding service-level metadata to all collected telemetry.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-java-app
  labels:
    app: otel-java-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-java-app
  template:
    metadata:
      labels:
        app: otel-java-app
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true" # !! Critical annotation !!
    spec:
      containers:
        - name: java-app
          image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0 # Placeholder only -- replace with your own HTTP-serving Java app image
          ports:
            - containerPort: 8080
          env:
            - name: OTEL_SERVICE_NAME
              value: my-java-service # Name of the service for traces
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.default.svc.cluster.local:4317 # OTLP endpoint for the collector
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: service.namespace=default,environment=production # Additional resource attributes
---
apiVersion: v1
kind: Service
metadata:
  name: otel-java-app
spec:
  selector:
    app: otel-java-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
# Save the manifests above as sample-app.yaml, then apply them
kubectl apply -f sample-app.yaml
Verify: Check the pod's description to see the injected environment variables and volumes related to OpenTelemetry. You should see an initContainer named opentelemetry-auto-instrumentation and environment variables like JAVA_TOOL_OPTIONS.
kubectl describe pod -l app=otel-java-app
...
Init Containers:
opentelemetry-auto-instrumentation:
Container ID: containerd://...
Image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0
Image ID: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java@sha256:...
Port: <none>
Host Port: <none>
Command:
cp
/autoinstrumentation/opentelemetry-javaagent.jar
/otel-auto-instrumentation-java/opentelemetry-javaagent.jar
State: Terminated
Reason: Completed
...
Containers:
java-app:
Container ID: containerd://...
Image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0
Image ID: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java@sha256:...
Port: 8080/TCP
Host Port: <none>
Environment:
OTEL_SERVICE_NAME: my-java-service
OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector.default.svc.cluster.local:4317
OTEL_RESOURCE_ATTRIBUTES: service.namespace=default,environment=production
JAVA_TOOL_OPTIONS: -javaagent:/otel-auto-instrumentation-java/opentelemetry-javaagent.jar
OTEL_JAVAAGENT_AUTO_CONF_ENABLED: "true"
OTEL_JAVAAGENT_ARGS:
OTEL_JAVAAGENT_DEBUG: "false"
...
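Note that "true" is only one of the accepted annotation values: the value can also name a specific Instrumentation resource, which is useful when several exist. A sketch of the variants (the observability namespace is a hypothetical example):

```yaml
metadata:
  annotations:
    # use the single Instrumentation resource in the pod's own namespace
    instrumentation.opentelemetry.io/inject-java: "true"
    # or select one by name in the same namespace:
    # instrumentation.opentelemetry.io/inject-java: "my-instrumentation"
    # or by namespace/name, e.g. a shared resource in a central namespace:
    # instrumentation.opentelemetry.io/inject-java: "observability/my-instrumentation"
```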
5. Generate Traffic and Verify Traces
Once the application is running, generate some traffic to it. The sample Java application we deployed has a simple HTTP endpoint. Making a request to this endpoint will trigger the auto-instrumented code, generating traces that are then sent to the OpenTelemetry Collector.
The collector, configured with the logging exporter, will print these traces to its standard output. This is an excellent way to confirm that auto-instrumentation is working correctly before integrating with a full-fledged observability backend.
# Forward a local port to the service
kubectl port-forward svc/otel-java-app 8080:80 &
# Make a request to the application
curl localhost:8080/hello
Hello from Java App!
Verify: Check the logs of the otel-collector deployment. You should see detailed trace information, including spans generated by your Java application.
kubectl logs -f deployment/otel-collector
...
2023-10-27T10:30:15.123Z INFO Traces {"resource spans": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "my-java-service"}}, {"key": "service.namespace", "value": {"stringValue": "default"}}, {"key": "environment", "value": {"stringValue": "production"}}, {"key": "host.arch", "value": {"stringValue": "amd64"}}, {"key": "os.type", "value": {"stringValue": "linux"}}, {"key": "os.description", "value": {"stringValue": "Linux 5.15.0-86-generic"}}, {"key": "process.pid", "value": {"intValue": 1}}, {"key": "process.runtime.name", "value": {"stringValue": "OpenJDK Runtime Environment"}}, {"key": "process.runtime.version", "value": {"stringValue": "17.0.8.1+1"}}, {"key": "process.command_args", "value": {"arrayValue": {"values": [{"stringValue": "java"}, {"stringValue": "-javaagent:/otel-auto-instrumentation-java/opentelemetry-javaagent.jar"}, {"stringValue": "-jar"}, {"stringValue": "/app/app.jar"}]}}}, {"key": "telemetry.sdk.name", "value": {"stringValue": "opentelemetry"}}, {"key": "telemetry.sdk.language", "value": {"stringValue": "java"}}, {"key": "telemetry.sdk.version", "value": {"stringValue": "1.26.0"}}, {"key": "k8s.pod.name", "value": {"stringValue": "otel-java-app-..."}}, {"key": "k8s.deployment.name", "value": {"stringValue": "otel-java-app"}}]}, "scope_spans": [{"scope": {"name": "io.opentelemetry.tomcat-10.0"}, "spans": [{"trace_id": "...", "span_id": "...", "parent_span_id": "...", "name": "HTTP GET /hello", "kind": "SPAN_KIND_SERVER", "start_time_unix_nano": "...", "end_time_unix_nano": "...", "attributes": [{"key": "http.method", "value": {"stringValue": "GET"}}, {"key": "http.target", "value": {"stringValue": "/hello"}}, {"key": "http.status_code", "value": {"intValue": 200}}, {"key": "net.host.port", "value": {"intValue": 8080}}], "status": {"code": "STATUS_CODE_OK"}}]}]}]}
...
The logs confirm that the Java application is successfully sending traces to the collector, and the traces contain relevant HTTP request information and resource attributes. This demonstrates the power of OpenTelemetry auto-instrumentation!
Production Considerations
While auto-instrumentation significantly reduces the effort to get started, deploying it in production requires careful planning:
- Collector Sizing and High Availability: For production, a single collector instance is a single point of failure. Deploy multiple collector instances, potentially as a StatefulSet or DaemonSet for node-local collection, and use Kubernetes Services for load balancing. Refer to the OpenTelemetry Collector sizing guidelines for resource allocation.
- Exporters: The logging exporter is for debugging. In production, configure the collector to export to your chosen observability backend (e.g., Jaeger, Prometheus, Loki, Datadog, New Relic, Splunk). Each backend has its specific exporter configuration.
- Resource Attributes and Naming Conventions: Standardize OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES across your applications. This ensures consistent tagging for easier filtering and analysis in your observability backend. Consider using a mutating webhook or policy engine like Kyverno to enforce these standards.
- Security:
- Network Policies: Restrict communication to your OpenTelemetry Collector using Kubernetes Network Policies. Only allow application pods to send OTLP data to the collector.
- TLS: Secure communication between applications and the collector, and between the collector and your backend, using TLS. The collector supports TLS configuration for both receivers and exporters.
- Authentication: If your observability backend requires authentication, configure the appropriate authentication extensions in your collector.
- Performance Overhead: While auto-instrumentation is designed to be lightweight, it does introduce some overhead. Monitor your application's CPU and memory usage after instrumentation. Test thoroughly in non-production environments.
- Sampling: For high-volume applications, sending every trace can be overwhelming and costly. Implement sampling strategies in your OpenTelemetry Collector to reduce the amount of data exported without losing critical insights.
- Agent Updates: Keep your OpenTelemetry Operator and agent images updated to benefit from bug fixes, performance improvements, and new features.
- Custom Instrumentation: Auto-instrumentation provides a good baseline, but for business-critical logic or capturing specific application-level details, you might still need to add manual instrumentation to your code.
- Integration with other tools: Consider how OpenTelemetry data can enrich other observability tools. For instance, combining traces with metrics collected by Prometheus or logs centralized by Loki.
- Advanced Collector Configuration: The OpenTelemetry Collector is highly configurable. Explore processors like attributes, resourcedetection, memory_limiter, and k8sattributes to enrich, filter, and manage your telemetry data effectively. The k8sattributes processor is particularly useful for adding Kubernetes metadata (pod name, namespace, etc.) to your traces. For advanced networking, you might integrate with solutions like Cilium WireGuard Encryption for secure data transport.
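Several of the points above — memory protection, Kubernetes metadata enrichment, and sampling — can be combined into a single traces pipeline. A sketch of a more production-leaning collector config fragment (the logging exporter stands in for your real backend, and k8sattributes requires RBAC permissions for the collector's service account to read pod metadata):

```yaml
processors:
  memory_limiter: # refuse data before the collector risks OOM; keep it first in the pipeline
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes: # enrich spans with pod/namespace/deployment metadata
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
  probabilistic_sampler: # keep roughly 10% of traces
    sampling_percentage: 10
  batch: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, probabilistic_sampler, batch]
      exporters: [logging] # replace with your backend's exporter
```

Processor order matters: memory_limiter should run first so overload is rejected before any expensive processing.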
Troubleshooting
- Issue: OpenTelemetry Operator pod stuck in Pending state.
  Reason: Likely a resource constraint or incorrect RBAC permissions for the operator itself.
  Solution: Check the events for the operator pod and its deployment. Ensure your cluster has enough resources and that the operator's service account has the necessary permissions. Sometimes, a fresh install can resolve transient issues.
  kubectl -n opentelemetry-operator-system describe pod opentelemetry-operator-controller-manager-...
  kubectl -n opentelemetry-operator-system get role,rolebinding,clusterrole,clusterrolebinding
- Issue: Application pod not showing injected initContainer or environment variables.
  Reason: The admission webhook might not be working, the Instrumentation resource is misconfigured, or the pod annotation is incorrect.
  Solution:
  - Verify the operator's webhook service (opentelemetry-operator-webhook-service) and deployment are running in the opentelemetry-operator-system namespace.
  - Ensure the ValidatingWebhookConfiguration and MutatingWebhookConfiguration for the operator are correctly set up and targeting the right namespaces.
  - Double-check the pod annotation, e.g., instrumentation.opentelemetry.io/inject-java: "true".
  - Ensure the Instrumentation resource exists and specifies the correct language agent image.
  - Restart the application pod (delete and recreate) after changing the Instrumentation resource or annotations, as webhooks only act on pod creation.
- Issue: OpenTelemetry Collector not receiving traces.
  Reason: Network connectivity issues, incorrect endpoint configuration, or the application isn't generating traces.
  Solution:
  - Check collector logs for any errors (kubectl logs -f deployment/otel-collector).
  - Verify the OTEL_EXPORTER_OTLP_ENDPOINT environment variable in your application pod matches the collector Service's DNS name and port (e.g., http://otel-collector.default.svc.cluster.local:4317).
  - Ensure there are no Kubernetes Network Policies blocking traffic between the application and the collector.
  - Confirm the application is actually making calls that would generate traces (e.g., HTTP requests, database queries).
- Issue: Traces are generated, but missing important details or context.
  Reason: Default auto-instrumentation might not cover all libraries, or context propagation is broken.
  Solution:
  - Consult the OpenTelemetry documentation for your specific language agent to see which libraries are automatically instrumented.
  - Ensure proper context propagation (e.g., tracecontext, baggage) is configured in your Instrumentation resource and is supported by all services in your trace path.
  - Consider adding manual instrumentation for critical business logic or unsupported libraries.
  - Verify OTEL_RESOURCE_ATTRIBUTES are set correctly to add useful metadata to your services.
- Issue: High resource consumption by application pods after instrumentation.
Reason: The auto-instrumentation agent or the application itself might be generating too much telemetry data.
Solution:
- Monitor CPU and memory usage of the instrumented pods.
- Implement sampling in the OpenTelemetry Collector to reduce the volume of traces exported.
- Review the application's code for excessive span creation or highly verbose logging that might trigger more traces.
- Keep the OpenTelemetry agent up to date; newer releases regularly include performance improvements.
- Issue: Collector logs show "unknown service" or similar errors when exporting to a backend.
Reason: Incorrect exporter configuration in the OpenTelemetry Collector.
Solution:
- Carefully review the documentation for your chosen observability backend and its OpenTelemetry exporter.
- Double-check the endpoint, authentication, and any TLS settings in the collector configuration.
- Ensure the backend service is reachable from the collector pod (e.g., correct hostname, port, and no network policies blocking traffic).
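When the collector itself is the suspect, the zPages extension (included in the contrib image) exposes live pipeline and trace diagnostics over HTTP, which you can reach with kubectl port-forward and browse at paths like /debug/tracez. A minimal config sketch:

```yaml
extensions:
  zpages:
    endpoint: 0.0.0.0:55679 # default zPages port
service:
  extensions: [zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
```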
FAQ Section
- What is OpenTelemetry auto-instrumentation?
OpenTelemetry auto-instrumentation refers to the process of automatically collecting telemetry data (traces, metrics, logs) from an application without modifying its source code. This is typically achieved by using language-specific agents (e.g., Java agent, Python agent) that hook into common libraries and frameworks at runtime to capture relevant events.
- Which programming languages are supported for auto-instrumentation by the OpenTelemetry Operator?
The OpenTelemetry Operator commonly supports auto-instrumentation for popular languages like Java, Python, Node.js, and .NET. Support for other languages may vary or be under active development. Always check the official OpenTelemetry Operator GitHub repository for the latest supported languages and versions.
- How does the OpenTelemetry Operator inject the instrumentation agent?
  The OpenTelemetry Operator uses a Mutating Admission Webhook. When a new pod is created with specific annotations (e.g., instrumentation.opentelemetry.io/inject-java: "true"), the webhook intercepts the pod definition and modifies it. This modification typically involves injecting an initContainer that copies the agent into a shared volume and setting environment variables (like JAVA_TOOL_OPTIONS for Java) to load the agent at application startup.
- Can I combine auto-instrumentation with manual instrumentation?
Yes, absolutely! Auto-instrumentation provides a great baseline for common operations (HTTP requests, database calls). For deeper insights into specific business logic, custom functions, or to add more context to existing spans, you can add manual instrumentation to your application code. The data from both will be correlated into the same traces.
- What are the alternatives to the OpenTelemetry Operator for auto-instrumentation in Kubernetes?
While the OpenTelemetry Operator is the standard for managing OpenTelemetry components in Kubernetes, alternative approaches for auto-instrumentation include:
- Manual Injection: Modifying your Dockerfiles or Kubernetes manifests directly to include the agent and environment variables.
- Service Mesh Integration: Some service meshes like Istio Ambient Mesh or Linkerd can generate telemetry data, though it might not always be OpenTelemetry native or as detailed as language-specific agents.
- Cloud Provider Specific Solutions: Cloud providers sometimes offer their own agents or integrations for their observability platforms, which may or may not be OpenTelemetry-compatible.
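The first alternative, manual injection, essentially reproduces by hand what the operator's webhook does: an initContainer copies the agent jar from the agent image into a shared volume, and JAVA_TOOL_OPTIONS loads it at startup. A rough pod-spec sketch (the application image name is a placeholder; the jar path matches the agent image used earlier in this guide):

```yaml
spec:
  volumes:
    - name: otel-agent
      emptyDir: {}
  initContainers:
    - name: copy-otel-agent
      image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.26.0
      command:
        - cp
        - /autoinstrumentation/opentelemetry-javaagent.jar
        - /otel/opentelemetry-javaagent.jar
      volumeMounts:
        - name: otel-agent
          mountPath: /otel
  containers:
    - name: java-app
      image: my-registry/my-java-app:latest # placeholder app image
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/otel/opentelemetry-javaagent.jar"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://otel-collector.default.svc.cluster.local:4317
      volumeMounts:
        - name: otel-agent
          mountPath: /otel
```

This works without the operator, but you take on agent version management and per-workload configuration yourself.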
Cleanup Commands
To remove all resources created during this guide:
# Delete sample application
kubectl delete -f sample-app.yaml
# Delete instrumentation resource
kubectl delete -f instrumentation.yaml
# Delete OpenTelemetry Collector
kubectl delete -f otel-collector.yaml
# Uninstall the OpenTelemetry Operator
kubectl delete -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
# (Optional) The operator manifest includes the namespace, so it is normally deleted above; remove it explicitly only if it lingers
kubectl delete namespace opentelemetry-operator-system
Next Steps / Further Reading
- Explore advanced OpenTelemetry Collector configurations, including processors and various exporters for your preferred observability backend.
- Learn about sampling strategies to manage the volume of telemetry data in production.
- Deep dive into OpenTelemetry Metrics and how to collect them alongside traces.
- Investigate how to integrate OpenTelemetry with your existing eBPF Observability tools like Hubble for a more comprehensive view of your network and application performance.
- Consider using tools like Karpenter in conjunction with your observability data to optimize cluster costs based on actual workload demands.
- Read the official Kubernetes Operator documentation to understand the underlying principles of the OpenTelemetry Operator.
Conclusion
OpenTelemetry auto-instrumentation in Kubernetes significantly streamlines the process of gaining deep observability into your microservices. By leveraging the OpenTelemetry Operator, you can effortlessly inject language-specific agents into your applications, collecting rich trace data without altering your source code. This empowers development and operations teams with critical insights into application behavior, performance bottlenecks, and distributed system interactions, ultimately leading to faster debugging and more reliable services. Embrace OpenTelemetry to build a truly observable and resilient Kubernetes environment.