OpenTelemetry Collector: Unified Observability Pipeline
In the complex world of cloud-native applications, gaining comprehensive visibility into your systems is not just a luxury, but a necessity. Microservices, distributed architectures, and ephemeral containers introduce a dizzying array of components that generate vast amounts of telemetry data—logs, metrics, and traces. Correlating this data across disparate services and infrastructure components is a monumental challenge for even the most seasoned SREs and developers. Without a unified approach, debugging becomes a nightmare, performance bottlenecks remain hidden, and proactive issue detection is nearly impossible.
Enter the OpenTelemetry Collector, a powerful, vendor-agnostic agent that acts as the central nervous system for your observability data. It’s designed to process, transform, and export telemetry data from various sources to multiple destinations, all before it ever leaves your network. By standardizing data collection and processing, the OpenTelemetry Collector simplifies your observability stack, reduces operational overhead, and ensures that your valuable telemetry reaches the right backend in the right format. This guide will walk you through deploying and configuring the OpenTelemetry Collector on Kubernetes, transforming your fragmented observability landscape into a streamlined, high-performance pipeline.
TL;DR: Unified Observability with OpenTelemetry Collector
The OpenTelemetry Collector is your central hub for collecting, processing, and exporting logs, metrics, and traces in Kubernetes. Deploy it as a DaemonSet for node-level collection or a Deployment for application-level data. This guide covers its architecture, configuration, and deployment on Kubernetes.
Key Commands:
- Install Helm repo: `helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts`
- Update Helm repo: `helm repo update`
- Install Collector (DaemonSet example): `helm install otel-collector open-telemetry/opentelemetry-collector -f values-daemonset.yaml`
- Apply Collector ConfigMap: `kubectl apply -f collector-config.yaml`
- View Collector logs: `kubectl logs -l app.kubernetes.io/name=opentelemetry-collector`
- Cleanup: `helm uninstall otel-collector`
Prerequisites
Before diving into the deployment of the OpenTelemetry Collector, ensure you have the following:
- Kubernetes Cluster: A running Kubernetes cluster (v1.20+ recommended). You can use Minikube, Kind, or a managed service like EKS, GKE, or AKS.
- kubectl: The Kubernetes command-line tool, configured to connect to your cluster. Refer to the official Kubernetes documentation for installation instructions.
- Helm: The Kubernetes package manager, version 3.x or higher. Instructions available on the Helm website.
- Basic Kubernetes Knowledge: Familiarity with Deployments, DaemonSets, ConfigMaps, and Services.
- Basic Observability Concepts: Understanding of metrics, traces, and logs.
Step-by-Step Guide: Deploying OpenTelemetry Collector on Kubernetes
We’ll deploy the OpenTelemetry Collector using its official Helm chart, which provides a flexible way to manage its configuration and deployment strategy.
1. Add the OpenTelemetry Helm Repository
First, add the official OpenTelemetry Helm chart repository to your Helm configuration. This allows you to easily install and manage the Collector.
```shell
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
```

Verify: You should see a confirmation that the repository has been added and updated.

```
"open-telemetry" has been added to your repositories
Hang tight while we grab the latest from your repos...
...Successfully got an update from the "open-telemetry" chart repository
Update Complete. ⎈Happy Helming!⎈
```
2. Understand OpenTelemetry Collector Deployment Modes
The OpenTelemetry Collector can be deployed in several modes, each suited for different use cases:
- Agent (DaemonSet): Deployed as a Kubernetes DaemonSet on each node. This is ideal for collecting host-level metrics, logs from node agents (like Fluent Bit), and telemetry from applications running on that node, especially when you want to avoid network hops for initial collection. This model is often used for infrastructure-level observability.
- Gateway/Collector (Deployment): Deployed as a Kubernetes Deployment. This acts as a central aggregation point for telemetry data from multiple agents or directly from applications. It’s excellent for advanced processing, filtering, batching, and routing data to various backends. This provides a centralized point for managing your observability pipeline, similar to how an API Gateway centralizes traffic management.
For this guide, we’ll primarily focus on the Agent (DaemonSet) model for node-level collection, and then touch upon the Gateway (Deployment) for aggregation.
3. Configure the OpenTelemetry Collector
The core of the OpenTelemetry Collector is its configuration, typically provided via a ConfigMap. This configuration defines the Collector’s pipelines: receivers, processors, and exporters.
- Receivers: How data gets into the Collector (e.g., OTLP, Prometheus, Jaeger, Fluent Bit).
- Processors: How data is transformed, filtered, or enriched within the Collector (e.g., batching, attribute modification, resource detection).
- Exporters: Where data is sent from the Collector (e.g., OTLP, Prometheus, Jaeger, Loki, Datadog, New Relic).
Let’s create a basic configuration that receives OTLP (OpenTelemetry Protocol) traces, metrics, and logs, and exports them to a local console for demonstration purposes. In a real-world scenario, you’d export to a robust backend like Prometheus, Grafana Loki, or a commercial APM solution.
```yaml
# collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  labels:
    app.kubernetes.io/name: opentelemetry-collector
data:
  collector.yaml: |
    # Receivers: how the Collector gets data
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      # Example: Prometheus receiver for scraping metrics
      # prometheus:
      #   config:
      #     scrape_configs:
      #       - job_name: "otel-collector"
      #         scrape_interval: 10s
      #         static_configs:
      #           - targets: ["0.0.0.0:8888"] # Collector's own metrics endpoint

    # Processors: how the Collector processes data
    processors:
      batch:
        send_batch_size: 100
        timeout: 10s
      # Resource detection to add host and environment metadata.
      # (Container and Kubernetes attributes such as k8s.pod.name come from
      # the k8sattributes processor, not from the system detector.)
      resourcedetection:
        detectors: ["system", "env"]
        system:
          resource_attributes:
            os.type:
              enabled: true
            host.arch:
              enabled: true
            host.name:
              enabled: true

    # Exporters: where the Collector sends data
    exporters:
      # Debug exporter for demonstration (prints to stdout)
      debug:
        verbosity: detailed
      # Example: OTLP exporter to another Collector or backend
      # otlp:
      #   endpoint: "otel-collector-gateway:4317" # Target another collector or backend
      #   tls:
      #     insecure: true
      # Example: Prometheus Remote Write exporter
      # prometheusremotewrite:
      #   endpoint: "http://prometheus.kube-prometheus-stack.svc.cluster.local:9090/api/v1/write"
      # Example: Loki exporter for logs
      # loki:
      #   endpoint: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push

    # Service: defines the data pipelines
    service:
      telemetry:
        logs:
          level: "info"
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, resourcedetection]
          exporters: [debug] # Change to otlp or another backend
        metrics:
          receivers: [otlp]
          processors: [batch, resourcedetection]
          exporters: [debug] # Change to prometheusremotewrite or another backend
        logs:
          receivers: [otlp]
          processors: [batch, resourcedetection]
          exporters: [debug] # Change to loki or another backend
```
Apply this ConfigMap to your cluster:

```shell
kubectl apply -f collector-config.yaml
```

Verify:

```
configmap/otel-collector-config created
```
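A frequent failure mode is a pipeline referencing a component that was never defined (for example, switching an exporter name under `pipelines` but forgetting to declare it). As an illustrative, hypothetical sanity check — not part of the Collector itself — you can cross-check the references after parsing the YAML into a dict:

```python
# Hypothetical helper: verify that every component a pipeline references is
# defined in its corresponding top-level section. `cfg` is the collector
# config parsed into a dict (e.g. with yaml.safe_load).
def check_pipeline_refs(cfg):
    missing = []
    for name, pipeline in cfg.get("service", {}).get("pipelines", {}).items():
        for section in ("receivers", "processors", "exporters"):
            defined = set((cfg.get(section) or {}).keys())
            for ref in pipeline.get(section, []):
                if ref not in defined:
                    missing.append(f"{name}/{section}/{ref}")
    return missing

cfg = {
    "receivers": {"otlp": None},
    "processors": {"batch": None},
    "exporters": {"debug": None},
    "service": {"pipelines": {"traces": {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["debug", "loki"],  # "loki" is never defined above
    }}},
}
print(check_pipeline_refs(cfg))  # → ['traces/exporters/loki']
```

The Collector performs this validation itself at startup; the point of the helper is to catch the mistake in CI, before a rollout crash-loops the pods.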
4. Deploy the OpenTelemetry Collector as a DaemonSet
Now, let’s deploy the Collector using the Helm chart. We’ll use a values-daemonset.yaml file to customize the deployment, ensuring it runs as a DaemonSet and uses our custom configuration.
```yaml
# values-daemonset.yaml
mode: daemonset # Deploy as a DaemonSet

## Inline config overrides go under `config:`; any changes there override the
## chart's default config. Example:
# config:
#   receivers:
#     jaeger:
#       protocols:
#         grpc:
#         thrift_compact:
#         thrift_http:
#         thrift_binary:
#     zipkin:
#     prometheus:
#       config:
#         scrape_configs:
#           - job_name: 'otel-collector'
#             scrape_interval: 10s
#             static_configs:
#               - targets: ['0.0.0.0:8888']
#   processors:
#     batch:
#       send_batch_size: 1000
#       timeout: 10s
#   exporters:
#     logging:
#       verbosity: detailed
#   service:
#     telemetry:
#       logs:
#         level: "info"
#     pipelines:
#       traces:
#         receivers: [jaeger, zipkin, otlp]
#         processors: [batch]
#         exporters: [logging]
#       metrics:
#         receivers: [prometheus, otlp]
#         processors: [batch]
#         exporters: [logging]

# Use an external ConfigMap for the collector configuration.
# This references the ConfigMap we created in the previous step.
existingConfigMap: otel-collector-config

# Define ports for receivers. OTLP (gRPC and HTTP) are standard.
ports:
  - name: otlp-grpc
    containerPort: 4317
    protocol: TCP
  - name: otlp-http
    containerPort: 4318
    protocol: TCP
  # - name: prometheus
  #   containerPort: 8888
  #   protocol: TCP

# Expose the OTLP ports as a Service
service:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP

# Resource limits (adjust for your environment)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

# Assign tolerations if you have tainted nodes.
# (With `operator: Exists`, no `value` may be set.)
# tolerations:
#   - effect: NoSchedule
#     key: dedicated
#     operator: Exists

# Node affinity for specific nodes (optional)
# affinity:
#   nodeAffinity:
#     requiredDuringSchedulingIgnoredDuringExecution:
#       nodeSelectorTerms:
#         - matchExpressions:
#             - key: kubernetes.io/os
#               operator: In
#               values:
#                 - linux
```
Install the Collector using Helm:

```shell
helm install otel-collector open-telemetry/opentelemetry-collector -f values-daemonset.yaml
```

Verify: Check the deployed pods and services.

```shell
kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector
kubectl get svc -l app.kubernetes.io/name=opentelemetry-collector
```

You should see one collector pod running on each node, and a ClusterIP service exposing the OTLP ports.

```
NAME                                          READY   STATUS    RESTARTS   AGE
otel-collector-opentelemetry-collector-xxxx   1/1     Running   0          2m

NAME                                     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
otel-collector-opentelemetry-collector   ClusterIP   10.xx.xx.xx   <none>        4317/TCP,4318/TCP   2m
```
5. Deploy the OpenTelemetry Collector as a Gateway (Optional)
For more advanced scenarios, you might want a central gateway Collector. This would typically be a Deployment that receives data from the DaemonSet agents and then forwards it to your final backend(s).
```yaml
# values-deployment.yaml
mode: deployment # Deploy as a Deployment

# Re-use the same config for simplicity, or create a new one
existingConfigMap: otel-collector-config

# Define ports for receivers.
ports:
  - name: otlp-grpc
    containerPort: 4317
    protocol: TCP
  - name: otlp-http
    containerPort: 4318
    protocol: TCP

# Expose the OTLP ports as a Service
service:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP

# Replicas for high availability
replicas: 2

# Resource limits
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```
Install the Gateway Collector:

```shell
helm install otel-collector-gateway open-telemetry/opentelemetry-collector -f values-deployment.yaml
```

Verify:

```shell
kubectl get pods -l app.kubernetes.io/instance=otel-collector-gateway
kubectl get svc -l app.kubernetes.io/instance=otel-collector-gateway
```

You should see the specified number of gateway collector pods and the associated service.
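To actually chain the two tiers, point the DaemonSet agents' pipelines at the gateway Service instead of the `debug` exporter. A sketch of the agent-side change, assuming the gateway Service is named `otel-collector-gateway-opentelemetry-collector` (verify the real name with `kubectl get svc`):

```yaml
exporters:
  otlp:
    # Assumed Service name, derived from the Helm release above
    endpoint: "otel-collector-gateway-opentelemetry-collector:4317"
    tls:
      insecure: true # demo only; enable TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resourcedetection]
      exporters: [otlp]
```

Repeat the exporter swap for the metrics and logs pipelines; the gateway then does the heavy processing and fan-out to backends.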
6. Instrument an Application to Send Data to the Collector
Now that the Collector is running, you need an application to send it telemetry data. This usually involves instrumenting your application with OpenTelemetry SDKs.
Here’s a simple example of a Python application that sends a trace to the Collector. For more detailed instrumentation, refer to the OpenTelemetry documentation for various languages.
```python
# app.py
import logging
import time

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Note: the logs SDK still lives under the private `_logs` packages in
# current opentelemetry-python releases.
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

COLLECTOR_ENDPOINT = "otel-collector-opentelemetry-collector:4317"

# Shared resource attributes for traces, metrics, and logs
resource = Resource.create({
    "service.name": "my-python-app",
    "service.version": "1.0.0",
    "environment": "development",
})

# --- Tracing setup ---
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=COLLECTOR_ENDPOINT, insecure=True))
)
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer(__name__)

# --- Metrics setup ---
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint=COLLECTOR_ENDPOINT, insecure=True),
    export_interval_millis=5000,
)
# Readers are passed to the provider at construction time
metric_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(metric_provider)
meter = metrics.get_meter(__name__)
counter = meter.create_counter("my_counter", description="A simple counter", unit="1")

# --- Logging setup ---
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint=COLLECTOR_ENDPOINT, insecure=True))
)
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)
logger = logging.getLogger(__name__)

def do_work():
    with tracer.start_as_current_span("do_work_span"):
        logger.info("Doing some work...")
        counter.add(1, {"item": "widget"})
        time.sleep(0.1)
        with tracer.start_as_current_span("sub_work_span"):
            logger.debug("Doing sub-work...")
            time.sleep(0.05)

if __name__ == "__main__":
    print("Sending telemetry to OpenTelemetry Collector...")
    for _ in range(5):
        do_work()
        time.sleep(1)
    print("Telemetry sent. Check collector logs.")
    # Flush batched telemetry before exit
    trace_provider.shutdown()
    metric_provider.shutdown()
    logger_provider.shutdown()
```
Create a Dockerfile for this application:
```dockerfile
# Dockerfile
# python:3.9-slim-buster is EOL; use a currently supported slim base image.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
```
And `requirements.txt`:

```
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp-proto-grpc
opentelemetry-instrumentation
```
Build and push the image (replace `your-docker-repo` with your actual repo):

```shell
docker build -t your-docker-repo/my-python-app:latest .
docker push your-docker-repo/my-python-app:latest
```
Deploy the application to Kubernetes:

```yaml
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-python-app
  template:
    metadata:
      labels:
        app: my-python-app
    spec:
      containers:
        - name: my-python-app
          image: your-docker-repo/my-python-app:latest
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              # Point to the Collector Service (the spec expects a URL with scheme)
              value: "http://otel-collector-opentelemetry-collector:4317"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.name=my-python-app,service.version=1.0.0,environment=development"
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
```
Apply the deployment:

```shell
kubectl apply -f app-deployment.yaml
```

Verify: Check the logs of the OpenTelemetry Collector pods. You should see the traces, metrics, and logs from your Python application printed to stdout by the debug exporter.

```shell
kubectl logs -l app.kubernetes.io/name=opentelemetry-collector --tail 100
```
You’ll see output similar to this (truncated for brevity), confirming the Collector is receiving and processing data:
```
...
2023-10-27T10:30:05.123Z INFO TracesExporter {"kind": "exporter", "name": "debug", "data_type": "traces", "#spans": 1}
2023-10-27T10:30:05.123Z INFO TracesExporter {"kind": "exporter", "name": "debug", "data_type": "traces", "span_id": "...", "parent_span_id": "...", "trace_id": "...", "name": "do_work_span", "kind": 1, "start_time": "...", "end_time": "...", "status": {"code": 0}, "attributes": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}, "resource": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}}
...
2023-10-27T10:30:06.543Z INFO MetricsExporter {"kind": "exporter", "name": "debug", "data_type": "metrics", "#metrics": 1}
2023-10-27T10:30:06.543Z INFO MetricsExporter {"kind": "exporter", "name": "debug", "data_type": "metrics", "metric_name": "my_counter", "metric_type": "Sum", "attributes": {"item": "widget"}, "value": 1.0, "resource": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}}
...
2023-10-27T10:30:07.890Z INFO LogsExporter {"kind": "exporter", "name": "debug", "data_type": "logs", "#logs": 1}
2023-10-27T10:30:07.890Z INFO LogsExporter {"kind": "exporter", "name": "debug", "data_type": "logs", "timestamp": "...", "severity": "INFO", "body": "Doing some work...", "attributes": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}, "resource": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}}
...
```
Production Considerations
Deploying the OpenTelemetry Collector in production requires careful planning to ensure reliability, scalability, and security.
- High Availability: For Gateway Collectors, deploy multiple replicas (`replicas: N` in Helm values) to avoid a single point of failure. Use a StatefulSet if you need persistent storage for components that buffer data to disk.
- Resource Management: Set appropriate CPU and memory requests/limits for Collector pods. Over-provisioning wastes resources, while under-provisioning can lead to OOMKills or throttled performance. Monitor Collector resource usage closely. For cost optimization, consider tools like Karpenter to efficiently manage underlying node resources.
- Configuration Management: Use GitOps practices to manage your Collector configurations. Store your `collector-config.yaml` and Helm `values.yaml` in version control.
- Security:
  - Network Policies: Restrict network access to Collector ports using Kubernetes Network Policies. Only allow instrumented applications and other Collectors to send data.
  - TLS: Always enable TLS for OTLP communication, especially between Collectors or when sending data outside your cluster. The `insecure: true` flag used in the example is for demonstration only.
  - Authentication: Implement authentication for exporters (e.g., API keys, OAuth2) when sending data to commercial backends.
  - Least Privilege: Run Collector pods with minimal necessary permissions.
- Storage for Buffering: Exporters can buffer data to disk (via their `sending_queue` settings backed by the `file_storage` extension) to prevent data loss during transient network issues or backend unavailability. This often requires a PersistentVolumeClaim.
- Scalability:
  - Horizontal Scaling: Scale the number of Gateway Collector replicas based on the volume of telemetry data.
  - Vertical Scaling: Increase CPU/memory for individual Collector pods if they become bottlenecks.
  - Sharding: For extremely high volumes, consider sharding your Collector deployment, where different Collectors handle different types of data or data from specific services.
- Monitoring the Collector Itself: Export the Collector's own internal metrics (e.g., receiver throughput, exporter errors) to Prometheus or your monitoring system. The Collector exposes its own metrics on port 8888 by default.
- Backend Connectivity: Ensure Collectors have proper network connectivity to your chosen observability backends (Prometheus, Grafana Loki, Jaeger, commercial APM, etc.). This might involve configuring firewalls, VPC peering, or Cilium WireGuard encryption for secure connections.
- Advanced Processors: Leverage processors like `k8sattributes` to automatically enrich telemetry with Kubernetes metadata, `attributes` to rename/add/remove attributes, or `filter` to drop unwanted data. This reduces data volume and improves query performance in your backend.
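As a sketch of the processors mentioned above, a gateway pipeline that enriches spans with pod metadata and drops health-check spans might look like this (the `k8sattributes` processor also needs RBAC access to the Kubernetes API, and the `http.route` value here is an assumed attribute for illustration):

```yaml
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
  filter:
    traces:
      span:
        # Drop spans for a hypothetical health-check route
        - 'attributes["http.route"] == "/healthz"'

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, filter, batch]
      exporters: [otlp]
```

Order matters: enrich and filter before batching, so dropped spans never consume batch or export capacity.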
Troubleshooting
Here are common issues you might encounter with the OpenTelemetry Collector and their solutions.
- Collector Pods Not Running/Crashing

  Issue: Collector pods are in `Pending`, `CrashLoopBackOff`, or `Error` state.

  Solution:
  - Check events: `kubectl describe pod <collector-pod-name>`. Look for reasons like insufficient resources (OOMKilled), image pull errors, or volume mounting issues.
  - Check logs: `kubectl logs <collector-pod-name>`. Configuration errors in `collector.yaml` are a common cause; the Collector logs detailed parsing errors.
  - Resource limits: If OOMKilled, increase memory limits in your `values.yaml`.
- No Telemetry Data Reaching the Collector

  Issue: The application is sending data, but Collector logs don't show any received telemetry (e.g., no debug output).

  Solution:
  - Application configuration: Double-check that your application's OpenTelemetry SDK points to the correct Collector service endpoint (e.g., `otel-collector-opentelemetry-collector:4317`).
  - Service reachability: From inside your application pod, verify connectivity to the Collector service, for example: `kubectl exec -it <app-pod-name> -- curl -v telnet://otel-collector-opentelemetry-collector:4317`
  - Collector receiver configuration: Ensure the Collector's `receivers` section is configured for the protocol your application uses (e.g., `otlp` with `grpc:` and `http:`).
  - Network Policies: Verify no Network Policies are blocking traffic between your application and the Collector.
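If the application image lacks `curl` or `telnet`, a few lines of Python can run the same TCP reachability check from inside the pod (the host and port below are simply the values used in this guide):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False

# Inside the application pod, for example:
# print(can_connect("otel-collector-opentelemetry-collector", 4317))
```

A successful TCP connect only proves the Service and NetworkPolicy path is open; a gRPC-level failure after that points at receiver configuration or TLS settings instead.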
- Telemetry Data Not Reaching the Backend

  Issue: The Collector receives data, but it isn't appearing in your monitoring backend (e.g., Prometheus, Grafana, Jaeger).

  Solution:
  - Collector exporter configuration: Review the `exporters` section in your `collector.yaml`. Ensure the endpoint, credentials, and protocol are correct for your backend.
  - Collector logs: Look for exporter errors in the Collector logs (e.g., "connection refused", "unauthorized", "TLS handshake error").
  - Backend reachability: From the Collector pod, try to reach the backend endpoint (e.g., using `curl` or `telnet`).
  - Backend status: Check the status and logs of your observability backend. It might be down, misconfigured, or rejecting data.
- High Resource Consumption by Collector Pods

  Issue: Collector pods are consuming excessive CPU or memory.

  Solution:
  - Profile the Collector: Enable the Collector's own metrics (exposed on port 8888 by default) and scrape them with Prometheus to identify bottlenecks (e.g., high CPU in certain processors).
  - Batch processor: Adjust `send_batch_size` and `timeout` in the `batch` processor. Larger batches reduce CPU overhead but increase latency.
  - Filtering: Use `filter` processors to drop unneeded telemetry early in the pipeline, reducing processing load and export bandwidth.
  - Reduce verbosity: Lower logging verbosity (`service.telemetry.logs.level`) to reduce I/O and processing.
  - Scale out: For Gateway Collectors, increase the number of replicas.
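As a concrete starting point for batch tuning, a higher-volume gateway might use something like the following; the numbers are illustrative only and should be tuned against your own throughput and latency targets:

```yaml
processors:
  batch:
    send_batch_size: 8192      # flush once this many items are queued
    send_batch_max_size: 10000 # hard upper bound per outgoing batch
    timeout: 5s                # flush at least this often, even when under-filled
```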
- Data Loss / Backpressure

  Issue: Telemetry data is intermittently missing or delayed, especially during traffic spikes.

  Solution:
  - Batch processor: Ensure you have a `batch` processor configured in all pipelines; it helps absorb spikes.
  - Exporter queueing and retries: Enable the exporter-level `sending_queue` and `retry_on_failure` settings so data is buffered and retried instead of dropped when a backend is temporarily unavailable.
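The old `queued_retry` processor is deprecated; its functionality now lives in each exporter's `sending_queue` and `retry_on_failure` settings. A sketch, reusing the gateway endpoint from earlier examples (queue and interval values are illustrative):

```yaml
exporters:
  otlp:
    endpoint: "otel-collector-gateway:4317"
    sending_queue:
      enabled: true
      queue_size: 5000 # items buffered while the backend is unavailable
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```

For durability across Collector restarts, the queue can additionally be backed by the `file_storage` extension and a PersistentVolumeClaim, as noted in the production considerations above.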