Introduction
In the complex tapestry of modern microservices, observability is not just a buzzword; it’s the lifeline that keeps your applications healthy, performant, and resilient. As distributed systems grow, so does the sheer volume and variety of telemetry data—logs, metrics, and traces—generated by countless components. Collecting, processing, and exporting this data to various backend systems can quickly become a monumental challenge. This is where the OpenTelemetry Collector steps in as an indispensable tool, offering a standardized, vendor-agnostic solution for your observability pipeline.
The OpenTelemetry Collector is a powerful, flexible, and highly configurable agent that can receive, process, and export telemetry data in a multitude of formats. It acts as a central hub, decoupling the instrumentation of your applications from the specific backend systems you use for analysis and storage. This means you can switch between different observability platforms (e.g., Prometheus, Jaeger, Grafana, Datadog, Splunk) without re-instrumenting your code. For organizations operating Kubernetes, the Collector is particularly transformative, providing a unified approach to gather insights from ephemeral pods, services, and nodes. Whether you’re dealing with vast amounts of metrics for a GPU-intensive LLM workload or tracing requests through an Istio Ambient Mesh, the OpenTelemetry Collector simplifies your observability strategy.
TL;DR: OpenTelemetry Collector – Unified Observability Pipeline
The OpenTelemetry Collector is a vendor-agnostic agent for receiving, processing, and exporting telemetry data (logs, metrics, traces). It acts as a central hub, decoupling application instrumentation from backend observability systems. Deploy it as a DaemonSet for node-level collection or as a Deployment for a centralized gateway tier in Kubernetes.
Key Takeaways:
- Unified Data Collection: Gathers metrics, traces, and logs from diverse sources.
- Vendor Agnostic: Supports various data formats (OTLP, Jaeger, Prometheus, Zipkin) and exports to multiple backends (Prometheus, Loki, Jaeger, commercial SaaS).
- Processing Power: Filters, samples, transforms, and enriches telemetry data before export.
- Kubernetes Native: Deploys easily as a DaemonSet (per-node agent) or Deployment (centralized gateway).
Key Commands:
# Deploy the Collector as a DaemonSet
kubectl apply -f opentelemetry-collector-daemonset.yaml
# Deploy the Collector as a Deployment
kubectl apply -f opentelemetry-collector-deployment.yaml
# Verify Collector status
kubectl get pods -n opentelemetry-collector
kubectl logs -f -n opentelemetry-collector opentelemetry-collector-xxxx
# Apply a sample application with OTLP exporter
kubectl apply -f sample-app-with-otlp.yaml
Prerequisites
Before diving into the deployment and configuration of the OpenTelemetry Collector on Kubernetes, ensure you have the following:
- Kubernetes Cluster: A running Kubernetes cluster (v1.20+ recommended). You can use Minikube, Kind, or any cloud provider’s managed Kubernetes service (EKS, GKE, AKS).
- kubectl: The Kubernetes command-line tool, configured to connect to your cluster. Refer to the official Kubernetes documentation for kubectl installation.
- Helm (Optional but Recommended): For easier deployment of the Collector and its associated components, Helm is highly recommended. Install Helm by following the official Helm installation guide.
- Basic Kubernetes Knowledge: Familiarity with Kubernetes concepts like Pods, Deployments, DaemonSets, Services, ConfigMaps, and Namespaces.
- Understanding of Observability Concepts: Basic understanding of metrics, traces, and logs.
Step-by-Step Guide: Deploying and Configuring OpenTelemetry Collector on Kubernetes
Step 1: Set up the Namespace and RBAC
First, we’ll create a dedicated namespace for our OpenTelemetry Collector components to keep things organized. We’ll also define the necessary Role-Based Access Control (RBAC) permissions. The Collector, especially when deployed as a DaemonSet collecting host-level metrics or using service discovery, requires specific permissions to interact with the Kubernetes API, such as listing nodes, pods, and services. This ensures it can properly gather metadata and function efficiently. Without these permissions, the Collector might fail to start or collect comprehensive data.
# opentelemetry-namespace-rbac.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: opentelemetry-collector
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: opentelemetry-collector
  namespace: opentelemetry-collector
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opentelemetry-collector
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "nodes/proxy", "pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
# Note: daemonsets/deployments live in the "apps" API group; the legacy
# "extensions" group was removed in Kubernetes v1.16.
- apiGroups: ["apps"]
  resources: ["replicasets", "daemonsets", "deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: opentelemetry-collector
subjects:
- kind: ServiceAccount
  name: opentelemetry-collector
  namespace: opentelemetry-collector
roleRef:
  kind: ClusterRole
  name: opentelemetry-collector
  apiGroup: rbac.authorization.k8s.io
Apply these configurations:
kubectl apply -f opentelemetry-namespace-rbac.yaml
Verify: Check if the namespace and service account are created, and the ClusterRoleBinding is in place.
kubectl get namespace opentelemetry-collector
kubectl get serviceaccount -n opentelemetry-collector opentelemetry-collector
kubectl get clusterrolebinding opentelemetry-collector
Expected Output:
NAME STATUS AGE
opentelemetry-collector Active Xs
NAME SECRETS AGE
opentelemetry-collector 1 Xs
NAME ROLE AGE
opentelemetry-collector ClusterRole/opentelemetry-collector Xs
Step 2: Define the OpenTelemetry Collector Configuration
The core of the OpenTelemetry Collector is its configuration, typically provided via a ConfigMap. This configuration defines the receivers (how data is ingested), processors (how data is transformed), and exporters (where data is sent). For this example, we’ll set up a basic configuration to receive OTLP (OpenTelemetry Protocol) data, process it, and export it to a Prometheus backend for metrics and a Jaeger backend for traces. We’ll also include a simple logging exporter to see what data is being processed.
This configuration also includes a Prometheus receiver for scraping metrics from the Collector itself, and a memory limiter processor to prevent the Collector from consuming too much memory, which is crucial for stability in production environments. For more advanced configurations, you might consider adding extensions for health checks or service discovery.
# opentelemetry-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
  namespace: opentelemetry-collector
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 10s
              static_configs:
                - targets: ['0.0.0.0:8888'] # Collector's own metrics endpoint
    processors:
      batch:
        send_batch_size: 10000
        timeout: 10s
      memory_limiter:
        check_interval: 1s
        limit_mib: 200
        spike_limit_mib: 50
    exporters:
      logging:
        verbosity: detailed # the older `loglevel` field is deprecated
      prometheus:
        endpoint: "0.0.0.0:8889" # Pipeline metrics; 8888 is already taken by the
                                 # Collector's own telemetry below. Add this port to
                                 # the manifests/Service if you want to scrape it.
      # The dedicated `jaeger` exporter was removed from the Collector in v0.86.0.
      # Jaeger ingests OTLP natively (enabled by default since Jaeger v1.35),
      # so export over OTLP instead:
      otlp/jaeger:
        endpoint: "jaeger-collector.jaeger.svc.cluster.local:4317" # Assuming Jaeger is deployed
        tls:
          insecure: true
      # For local testing, you might use an external endpoint or a port-forward
      # Example of exporting to a commercial SaaS (requires the otelcol-contrib distribution):
      # datadog:
      #   api:
      #     key: ${env:DD_API_KEY}
      #   metrics:
      #     resource_attributes_as_tags: true
      # Most other vendors (New Relic, etc.) now accept OTLP directly via the
      # otlp / otlphttp exporters, so no vendor-specific exporter is needed.
    service:
      telemetry:
        metrics:
          address: 0.0.0.0:8888 # Collector's own metrics endpoint
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp/jaeger]
        metrics:
          receivers: [otlp, prometheus] # Receive OTLP metrics and scrape collector's own metrics
          processors: [memory_limiter, batch]
          exporters: [logging, prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging]
Apply this configuration:
kubectl apply -f opentelemetry-collector-config.yaml
Verify: Check if the ConfigMap is created.
kubectl get configmap -n opentelemetry-collector opentelemetry-collector-config -o yaml
Expected Output: The YAML content of your ConfigMap.
Step 3: Deploy the OpenTelemetry Collector
The OpenTelemetry Collector can be deployed in various modes, each suited for different use cases:
- Agent (DaemonSet): Deployed as a DaemonSet, one instance per node. Ideal for collecting host-level metrics, Kubernetes events, and forwarding telemetry from applications running on that node. This is often used for edge collection.
- Gateway (Deployment): Deployed as a regular Deployment, typically with multiple replicas, acting as a central processing and routing layer. It receives data from agents or directly from applications, performs aggregation, sampling, and enrichment, and then exports to various backends. This is often used for centralized processing.
For this guide, we’ll demonstrate both. We’ll start with a DaemonSet for node-level collection and then a Deployment for a centralized gateway.
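In a combined topology, the per-node agents typically forward everything to the gateway over OTLP rather than exporting straight to backends. A minimal agent-side sketch, assuming a hypothetical gateway Service named opentelemetry-collector-gateway in the opentelemetry-collector namespace:

```yaml
# Agent-side sketch: forward all telemetry to a central gateway over OTLP.
# "opentelemetry-collector-gateway" is a hypothetical Service name.
exporters:
  otlp:
    endpoint: "opentelemetry-collector-gateway.opentelemetry-collector.svc.cluster.local:4317"
    tls:
      insecure: true # in-cluster plaintext; enable TLS for production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

The gateway then owns backend credentials and heavy processing (sampling, enrichment), while the agents stay small and stateless.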
Option A: Deploy as a DaemonSet (Per Node Agent)
A DaemonSet ensures that a Collector instance runs on every (or selected) node in your cluster. This is excellent for collecting node-specific metrics, logs from host paths, and for acting as a local forwarder for applications running on the same node. This setup minimizes network hops and can reduce resource consumption on application pods by offloading collection logic.
# opentelemetry-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: opentelemetry-collector-agent
  namespace: opentelemetry-collector
  labels:
    app: opentelemetry-collector-agent
spec:
  selector:
    matchLabels:
      app: opentelemetry-collector-agent
  template:
    metadata:
      labels:
        app: opentelemetry-collector-agent
    spec:
      serviceAccountName: opentelemetry-collector
      containers:
      - name: opentelemetry-collector
        image: otel/opentelemetry-collector:0.96.0 # Use a specific version
        command: ["/otelcol", "--config=/conf/collector.yaml"]
        volumeMounts:
        - name: collector-config
          mountPath: /conf
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        ports:
        - name: otlp-grpc
          containerPort: 4317
          protocol: TCP
        - name: otlp-http
          containerPort: 4318
          protocol: TCP
        - name: prometheus
          containerPort: 8888
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:
          runAsUser: 0 # Needed for accessing host paths
      volumes:
      - name: collector-config
        configMap:
          name: opentelemetry-collector-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
Apply this configuration:
kubectl apply -f opentelemetry-collector-daemonset.yaml
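Note that the DaemonSet mounts /var/log and /var/lib/docker/containers, but the base configuration never reads them. Tailing node logs requires the filelog receiver, which ships only in the contrib distribution (so you would also swap the image to otel/opentelemetry-collector-contrib). A hedged sketch of that receiver:

```yaml
# Sketch: tail Kubernetes container logs from the hostPath mounts above.
# Requires otel/opentelemetry-collector-contrib; the filelog receiver is
# not included in the core distribution.
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/opentelemetry-collector_*/*/*.log # avoid self-ingestion loops
    start_at: end
service:
  pipelines:
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch]
      exporters: [logging]
```

The exclude pattern keeps the agent from re-ingesting its own log output, which would otherwise create a feedback loop when the logging exporter is enabled.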
Option B: Deploy as a Deployment (Gateway)
A Deployment is suitable for a centralized Collector instance (or multiple instances behind a load balancer) that acts as a gateway. It can receive data from agents, directly from applications (e.g., using OTLP), or from other Collectors, and then process and export it. This is often used for aggregation, sampling, and routing to different backends. For large-scale deployments, you might even consider Karpenter to dynamically provision nodes for your Collector Deployments, optimizing resource usage.
# opentelemetry-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opentelemetry-collector-gateway
  namespace: opentelemetry-collector
  labels:
    app: opentelemetry-collector-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opentelemetry-collector-gateway
  template:
    metadata:
      labels:
        app: opentelemetry-collector-gateway
    spec:
      serviceAccountName: opentelemetry-collector
      containers:
      - name: opentelemetry-collector
        image: otel/opentelemetry-collector:0.96.0 # Use a specific version
        command: ["/otelcol", "--config=/conf/collector.yaml"]
        volumeMounts:
        - name: collector-config
          mountPath: /conf
        ports:
        - name: otlp-grpc
          containerPort: 4317
          protocol: TCP
        - name: otlp-http
          containerPort: 4318
          protocol: TCP
        - name: prometheus
          containerPort: 8888
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 250m
            memory: 256Mi
      volumes:
      - name: collector-config
        configMap:
          name: opentelemetry-collector-config
Apply this configuration:
kubectl apply -f opentelemetry-collector-deployment.yaml
Verify: Check if the Collector pods are running.
kubectl get pods -n opentelemetry-collector -l app=opentelemetry-collector-agent # for DaemonSet
kubectl get pods -n opentelemetry-collector -l app=opentelemetry-collector-gateway # for Deployment
Expected Output (for DaemonSet, adjust for number of nodes):
NAME READY STATUS RESTARTS AGE
opentelemetry-collector-agent-xxxxx 1/1 Running 0 Xs
Expected Output (for Deployment):
NAME READY STATUS RESTARTS AGE
opentelemetry-collector-gateway-xxxxx-yyyyy 1/1 Running 0 Xs
Step 4: Expose the Collector with a Service
To allow applications and other Collectors to send telemetry data to our deployed Collector, we need to expose it via a Kubernetes Service. This Service will route traffic to the OTLP gRPC and HTTP endpoints, and optionally to the Prometheus metrics endpoint of the Collector itself.
# opentelemetry-collector-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: opentelemetry-collector
  namespace: opentelemetry-collector
  labels:
    app: opentelemetry-collector
spec:
  selector:
    app: opentelemetry-collector-agent # or app: opentelemetry-collector-gateway
  ports:
  - name: otlp-grpc
    protocol: TCP
    port: 4317
    targetPort: 4317
  - name: otlp-http
    protocol: TCP
    port: 4318
    targetPort: 4318
  - name: prometheus
    protocol: TCP
    port: 8888
    targetPort: 8888
Important: Adjust the selector based on whether you deployed a DaemonSet (app: opentelemetry-collector-agent) or a Deployment (app: opentelemetry-collector-gateway).
Apply this configuration:
kubectl apply -f opentelemetry-collector-service.yaml
Verify: Check if the Service is created.
kubectl get service -n opentelemetry-collector opentelemetry-collector
Expected Output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
opentelemetry-collector ClusterIP 10.xx.xx.xx <none> 4317/TCP,4318/TCP,8888/TCP Xs
Step 5: Deploy a Sample Application with OTLP Exporter
Now, let’s deploy a simple application that is instrumented with OpenTelemetry and configured to send its telemetry data to our Collector. This application will generate traces and metrics, which the Collector will then receive and process.
# sample-app-with-otlp.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-otlp-app
  namespace: default # Deploy in default namespace for simplicity
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-otlp-app
  template:
    metadata:
      labels:
        app: sample-otlp-app
    spec:
      containers:
      - name: sample-app
        image: otel/opentelemetry-collector-contrib:0.96.0 # stand-in "app": a Collector that forwards OTLP
        command: ["/otelcol-contrib"] # the contrib image's binary is /otelcol-contrib, not /otelcol
        args: ["--config=/etc/otelcol/config.yaml"]
        # A real SDK-instrumented app would honor these standard variables;
        # the Collector binary itself ignores them and uses the config below.
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4317" # Collector service endpoint
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "service.name=sample-otlp-app,service.version=1.0.0"
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otelcol
      volumes:
      - name: otel-config
        configMap:
          name: sample-app-otel-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-app-otel-config
  namespace: default
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
    exporters:
      otlp:
        endpoint: "opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
This example reuses the contrib Collector image as a stand-in "application": it receives OTLP data and forwards it to our main Collector, but it generates no traffic on its own. In a real-world scenario, this would be your actual application instrumented with OpenTelemetry SDKs (or a load generator sending OTLP to it).
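For comparison, a real SDK-instrumented service usually needs no per-app collector config at all; the standard OpenTelemetry environment variables are enough. A sketch of such a container spec fragment (the image name is a placeholder):

```yaml
# Sketch: pod spec fragment for a hypothetical SDK-instrumented app.
containers:
- name: my-instrumented-app
  image: registry.example.com/my-app:1.0.0 # placeholder image
  env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4318" # OTLP over HTTP
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.name=my-app,service.version=1.0.0"
```

Because the endpoint lives in the environment rather than in code, swapping backends later means changing only the Collector's exporter config, never the application.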
Apply this configuration:
kubectl apply -f sample-app-with-otlp.yaml
Verify: Check the logs of both the sample application and the OpenTelemetry Collector to see data flowing.
kubectl get pods -l app=sample-otlp-app
kubectl logs -f $(kubectl get pods -l app=sample-otlp-app -o jsonpath='{.items[0].metadata.name}')
# Check collector logs
kubectl logs -f $(kubectl get pods -n opentelemetry-collector -l app=opentelemetry-collector-agent -o jsonpath='{.items[0].metadata.name}') # or gateway
Expected Output (Collector logs): You should see debug messages indicating that traces and metrics are being received and exported to the logging exporter.
...
{"level":"debug","ts":"2023-10-27T10:00:00.000Z","caller":"loggingexporter/logging_exporter.go:73","msg":"TracesExporter","#spans":1}
{"level":"debug","ts":"2023-10-27T10:00:00.000Z","caller":"loggingexporter/logging_exporter.go:73","msg":"MetricsExporter","#metrics":1}
...
Step 6: Deploy Prometheus and Jaeger (Optional but Recommended)
To fully visualize the collected data, you’ll need backend systems like Prometheus for metrics and Jaeger for traces. These are crucial components for any comprehensive observability stack. For a more in-depth look at robust networking for such components, consider exploring Kubernetes Network Policies to secure communication between your observability stack.
We’ll use Helm for a quick deployment of these tools. First, add their respective Helm repositories:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
Deploy Prometheus:
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace \
  -f - <<EOF
server:
  service:
    type: NodePort
    nodePort: 30090 # Access Prometheus UI on http://<node-ip>:30090
extraScrapeConfigs: |
  - job_name: 'opentelemetry-collector'
    scrape_interval: 10s
    static_configs:
      - targets: ['opentelemetry-collector.opentelemetry-collector.svc.cluster.local:8888']
EOF
Deploy Jaeger:
helm install jaeger jaegertracing/jaeger \
  --namespace jaeger \
  --create-namespace \
  -f - <<EOF
agent:
  enabled: false # the Collector sends directly to the Jaeger collector service
collector:
  service:
    type: ClusterIP # Expose the Jaeger collector internally (OTLP/gRPC)
query:
  service:
    type: NodePort
    nodePort: 30080 # Access Jaeger UI on http://<node-ip>:30080
EOF
Verify: Access Prometheus and Jaeger UIs.
- Find a node IP: kubectl get nodes -o wide
- Prometheus UI: http://<node-ip>:30090 (go to Status -> Targets; you should see opentelemetry-collector as a target)
- Jaeger UI: http://<node-ip>:30080 (select sample-otlp-app from the Service dropdown and click Find Traces)
Production Considerations
Deploying the OpenTelemetry Collector in a production environment requires careful planning and consideration beyond the basic setup. Here are key aspects to address:
- Resource Management:
  - Limits & Requests: Always set appropriate CPU and memory requests and limits for Collector pods. Under-resourcing can lead to OOMKills and dropped telemetry, while over-resourcing wastes cluster resources. The memory_limiter processor in the Collector config is also crucial.
  - Horizontal Scaling: For Gateway deployments, scale out replicas based on telemetry volume and processing requirements. Use Horizontal Pod Autoscalers (HPA) based on CPU, memory, or custom metrics (e.g., queue size).
- High Availability & Reliability:
  - Multiple Replicas: Deploy multiple replicas for Gateway Collectors to ensure no single point of failure.
  - Anti-Affinity: Use pod anti-affinity rules to spread Collector replicas across different nodes and availability zones.
  - Persistent Queues (for Exporters): For critical data, configure exporters with persistent queues to buffer data to disk before sending it to backends. This prevents data loss during temporary network outages or backend unavailability.
- Security:
  - Network Policies: Implement Kubernetes Network Policies to restrict traffic to and from Collector pods. Only allow expected sources (applications, other collectors) to send data, and only allow the Collector to connect to its designated backends.
  - TLS/SSL: Secure communication between applications and the Collector, and between the Collector and backends, using TLS. The OTLP receiver and various exporters support TLS configuration.
  - Authentication: If connecting to commercial backends, use environment variables for API keys or integrate with Kubernetes Secrets for sensitive credentials.
  - Image Security: Use trusted, specific OpenTelemetry Collector image versions. Consider signing and scanning images with tools like Sigstore and enforcing policy with Kyverno for supply chain security.
- Observability of the Collector Itself:
  - Self-Scraping: Configure the Collector to expose its own metrics (as shown in the example on port 8888) and monitor them. Key metrics include receiver/exporter queue sizes, dropped batches, and memory usage.
  - Health Checks: Use liveness and readiness probes in your Kubernetes deployment to ensure the Collector is responsive and healthy.
  - Logging: Configure appropriate log levels. For production, info is usually sufficient; use debug only for troubleshooting.
- Configuration Management:
  - Version Control: Keep your Collector configurations in version control (Git).
  - Helm Charts: Use the official OpenTelemetry Helm chart for robust and configurable deployments. It handles many of these production considerations out of the box.
- Advanced Processing:
  - Data Sampling: Implement tail-based or head-based sampling processors to reduce trace volume, especially for high-traffic services.
  - Batching: Optimize batch sizes and timeouts for the batch processor to balance latency and throughput.
  - Attribute Processors: Use attribute processors to add, remove, or transform attributes (e.g., adding Kubernetes metadata via the k8sattributes processor).
- Network Performance:
  - Cilium: For high-performance networking and advanced observability, consider using Cilium with WireGuard encryption. Cilium’s eBPF capabilities can provide deep network insights and efficient traffic handling for your telemetry pipeline; Hubble can even observe the Collector’s own network traffic.
  - Load Balancing: For Gateway deployments, ensure your Kubernetes Service (e.g., LoadBalancer type) can handle the expected ingress traffic.
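Several of these recommendations map directly onto Collector configuration. A hedged sketch, assuming the contrib distribution (file_storage and probabilistic_sampler are contrib components) and a placeholder backend endpoint:

```yaml
# Sketch: production-oriented fragments, contrib distribution assumed.
extensions:
  health_check: # HTTP endpoint for liveness/readiness probes
    endpoint: 0.0.0.0:13133
  file_storage:
    directory: /var/lib/otelcol/queue # must exist and be writable (hostPath or PVC)
processors:
  probabilistic_sampler:
    sampling_percentage: 10 # keep roughly 10% of traces
exporters:
  otlp/backend:
    endpoint: "backend.example.com:4317" # placeholder backend
    sending_queue:
      enabled: true
      storage: file_storage # persist queued batches to disk across restarts/outages
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s
service:
  extensions: [health_check, file_storage]
```

On the Kubernetes side, point livenessProbe and readinessProbe httpGet checks at the health_check port (13133) so the kubelet restarts or de-routes unhealthy Collector pods.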
Troubleshooting
Here are some common issues you might encounter with the OpenTelemetry Collector on Kubernetes and how to resolve them.
- Collector Pods are stuck in Pending state.
  Issue: Pods aren’t scheduling.
  Solution: Check events for the pod to understand why it’s pending. Common reasons include insufficient resources (CPU/memory), taints/tolerations issues, or node selector mismatches.
  kubectl describe pod -n opentelemetry-collector <collector-pod-name>
  Adjust resource requests/limits in the DaemonSet/Deployment or fix node affinity/tolerations.
- Collector Pods are restarting frequently (CrashLoopBackOff).
  Issue: The Collector process is failing shortly after starting.
  Solution: The most common cause is an invalid configuration in the ConfigMap. Check the Collector’s logs for error messages related to parsing the configuration.
  kubectl logs -n opentelemetry-collector <collector-pod-name>
  Look for lines like “Error loading configuration” or “invalid configuration”. Correct the collector.yaml in the ConfigMap and reapply.
- No telemetry data is appearing