OpenTelemetry Collector: Unified Observability Pipeline
In the complex world of cloud-native applications, gaining comprehensive visibility into your systems is not just a luxury—it’s a necessity. Microservices, distributed systems, and ephemeral infrastructure make traditional monitoring approaches insufficient. This is where OpenTelemetry steps in, providing a vendor-agnostic set of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces).
At the heart of the OpenTelemetry ecosystem lies the OpenTelemetry Collector. This powerful, flexible, and vendor-agnostic proxy can receive, process, and export telemetry data in various formats to multiple backends. Whether you’re dealing with Prometheus, Jaeger, Loki, or proprietary monitoring solutions, the Collector acts as a central hub, streamlining your observability pipeline and reducing the overhead on your application services. It’s the critical component for building a robust, scalable, and future-proof observability strategy in Kubernetes environments.
TL;DR: OpenTelemetry Collector in Kubernetes
The OpenTelemetry Collector is your central hub for all telemetry data (metrics, logs, traces) in Kubernetes. It receives data from applications, processes it (filtering, transforming, enriching), and exports it to various observability backends. Deploy it as a DaemonSet for host-level logs/metrics or a Deployment for application-level data. This guide shows you how to set up the Collector to scrape Prometheus metrics, receive OTLP traces, and forward them to a mock backend.
Key Commands:
- Deploy Collector: kubectl apply -f otel-collector-config.yaml -f otel-collector.yaml
- Deploy Sample App: kubectl apply -f sample-app.yaml
- View Collector Logs: kubectl logs -f <collector-pod-name>
- Port-forward to Collector: kubectl port-forward svc/otel-collector 8888:8888
Prerequisites
Before diving in, ensure you have the following:
- A running Kubernetes cluster (v1.20+ recommended). You can use Kind, Minikube, or a cloud provider’s managed Kubernetes service like AWS EKS, GKE, or Azure AKS.
- kubectl installed and configured to interact with your cluster.
- Basic understanding of Kubernetes concepts like Deployments, Services, and ConfigMaps.
- Familiarity with observability concepts (metrics, logs, traces).
Step-by-Step Guide: Setting Up the OpenTelemetry Collector in Kubernetes
We’ll walk through deploying an OpenTelemetry Collector to collect metrics and traces from a sample application and forward them to a simple logging exporter. This setup demonstrates the core functionality of the Collector.
1. Understand the OpenTelemetry Collector Configuration
The Collector’s behavior is defined by a YAML configuration file. This file specifies receivers (how data enters the Collector), processors (how data is transformed), exporters (where data is sent), and service (which pipelines are enabled). For a deep dive into configuration, refer to the official OpenTelemetry Collector documentation.
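To make the relationship between these four sections concrete, here is a minimal Python sketch that checks whether every component named in a pipeline is also defined in its section. The helper name `undefined_pipeline_components` and the dict-based config are illustrative assumptions; the Collector runs its own validation at startup, and this sketch does not reproduce it.

```python
from typing import List, Tuple

# Hypothetical helper for illustration; not part of the Collector.
SECTIONS = ("receivers", "processors", "exporters")


def undefined_pipeline_components(config: dict) -> List[Tuple[str, str, str]]:
    """Return (pipeline, section, component) for each reference with no definition."""
    missing = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in SECTIONS:
            defined = config.get(section) or {}
            for component in (pipeline or {}).get(section, []):
                if component not in defined:
                    missing.append((pipeline_name, section, component))
    return missing


# A config with the same shape as the YAML file, written as a Python dict.
config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}},
    "exporters": {"logging": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["logging"],
            },
            # This pipeline references an exporter that is never defined.
            "metrics": {
                "receivers": ["prometheus"],
                "processors": ["batch"],
                "exporters": ["jaeger"],
            },
        }
    },
}

print(undefined_pipeline_components(config))
# [('metrics', 'exporters', 'jaeger')]
```

The point of the sketch is that defining a component under receivers, processors, or exporters does nothing on its own; it only takes effect once a pipeline under service references it.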
Let’s create a ConfigMap for our Collector configuration. This configuration will enable an OTLP receiver for traces and a Prometheus receiver for metrics. It will then use a batch processor to efficiently send data to a logging exporter, which simply prints telemetry to the Collector’s standard output – perfect for demonstration.
Create a file named otel-collector-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: default
data:
  otel-collector-config.yaml: |
    extensions:
      health_check: # Backs the liveness/readiness probes on port 13133
        endpoint: 0.0.0.0:13133
        path: /health
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  action: replace
                  target_label: __metrics_path__
                  regex: (.+)
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  # "$$" escapes "$" so the Collector does not treat $1 and $2 as environment variables
                  replacement: $$1:$$2
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: kubernetes_namespace
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name
    processors:
      batch:
        send_batch_size: 100
        timeout: 10s
    exporters:
      logging:
        verbosity: detailed # Prints full telemetry to the Collector's logs; replaces the deprecated loglevel setting
      # You can add other exporters here, e.g., for Prometheus, Jaeger, Loki, etc.
      # prometheusremotewrite:
      #   endpoint: "http://prometheus-server:9090/api/v1/write"
      # otlp/jaeger: # Jaeger accepts OTLP natively; the dedicated jaeger exporter is no longer in contrib
      #   endpoint: "jaeger-collector:4317"
      #   tls:
      #     insecure: true
    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging]
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [logging]
        # To collect logs, add a logs pipeline. An empty "logs:" key fails validation,
        # so the whole block stays commented out until it has a receiver and an exporter.
        # logs:
        #   receivers: [filelog] # e.g., using filelog receiver for host logs
        #   processors: [batch]
        #   exporters: [logging]
Apply the ConfigMap:
kubectl apply -f otel-collector-config.yaml
Verify:
kubectl get configmap otel-collector-config -o yaml
You should see the YAML configuration embedded in the ConfigMap’s data section.
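The batch processor in this configuration sends a batch when either send_batch_size items have accumulated or the timeout has elapsed, whichever comes first. The following Python sketch is a simplified model of that size-or-timeout rule, not the Collector's implementation: the class name `SizeOrTimeoutBatcher` is hypothetical, the clock is passed in explicitly so the behavior is easy to follow, and the timeout is measured from the first item in the current batch, which is an assumption made for simplicity.

```python
from typing import Callable, List, Optional


class SizeOrTimeoutBatcher:
    """Simplified model of a size-or-timeout batching rule (not the Collector's code)."""

    def __init__(self, send_batch_size: int, timeout_s: float,
                 export: Callable[[List[str]], None]) -> None:
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export
        self._items: List[str] = []
        self._batch_started_at: Optional[float] = None

    def add(self, item: str, now: float) -> None:
        """Queue one item; flush immediately once the size threshold is reached."""
        if not self._items:
            self._batch_started_at = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            self._flush()

    def tick(self, now: float) -> None:
        """Called periodically; flush a partial batch once the timeout has elapsed."""
        if self._items and now - self._batch_started_at >= self.timeout_s:
            self._flush()

    def _flush(self) -> None:
        self.export(self._items)
        self._items = []
        self._batch_started_at = None


sent: List[List[str]] = []
batcher = SizeOrTimeoutBatcher(send_batch_size=3, timeout_s=10.0, export=sent.append)

for t, span in enumerate(["span-1", "span-2", "span-3"]):
    batcher.add(span, now=float(t))   # third add reaches the size threshold
batcher.add("span-4", now=3.0)        # starts a new, partial batch
batcher.tick(now=5.0)                 # only 2s old: nothing is sent
batcher.tick(now=13.0)                # 10s old: the partial batch is sent

print(sent)
# [['span-1', 'span-2', 'span-3'], ['span-4']]
```

With send_batch_size: 100 and timeout: 10s, a low-traffic demo like this one will mostly be flushed by the timeout, which is why telemetry can take several seconds to appear in the Collector's logs.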
2. Deploy the OpenTelemetry Collector
The OpenTelemetry Collector can be deployed in various modes: as a DaemonSet for host-level collection (e.g., node metrics, host logs), as a Deployment for application-level collection (e.g., receiving OTLP from instrumented apps), or as a sidecar. For this guide, we’ll use a Deployment to act as a central collector for our sample application, and a Service to expose its OTLP gRPC and HTTP endpoints.
Create a file named otel-collector.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: default
  labels:
    app: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0 # Pin a specific version
          command:
            - "/otelcol-contrib"
            - "--config=/conf/otel-collector-config.yaml"
          volumeMounts:
            - name: otel-collector-config-vol
              mountPath: /conf
          ports:
            - name: otlp-grpc
              containerPort: 4317 # Default OTLP gRPC port
            - name: otlp-http
              containerPort: 4318 # Default OTLP HTTP port
            - name: prometheus
              containerPort: 8888 # The Collector's own internal metrics (not the Prometheus receiver)
            - name: health
              containerPort: 13133 # health_check extension (must be enabled in the Collector config)
            - name: pprof
              containerPort: 1777 # pprof extension (profiling), if enabled
            - name: zpages
              containerPort: 55679 # zPages extension (diagnostics), if enabled
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: otel-collector-config-vol
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: default
  labels:
    app: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
    - name: otlp-http
      protocol: TCP
      port: 4318
      targetPort: 4318
    - name: prometheus
      protocol: TCP
      port: 8888
      targetPort: 8888
    - name: health
      protocol: TCP
      port: 13133
      targetPort: 13133
Apply the Deployment and Service:
kubectl apply -f otel-collector.yaml
Verify:
kubectl get pods -l app=otel-collector
kubectl get svc otel-collector
You should see the Collector pod running and the Service exposing its ports.
3. Deploy a Sample Application with OpenTelemetry Instrumentation
Now, let’s deploy a simple application that generates Prometheus metrics and sends OTLP traces. We’ll use a basic Python Flask application for this. The application will expose a /metrics endpoint for Prometheus scraping and will be configured to send traces to our OpenTelemetry Collector.
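The Prometheus receiver in our Collector configuration uses Kubernetes service discovery to find pods annotated with prometheus.io/scrape: "true". Our sample app will include these annotations. Note that pod discovery calls the Kubernetes API, so the service account the Collector runs under must be allowed to get, list, and watch pods; the manifests in this guide do not create that RBAC grant, so add one if the Collector logs authorization errors and scrapes nothing.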
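To see how the annotations turn into a scrape target, the following Python sketch imitates two of the relabel rules from the Collector configuration: the keep rule on the scrape annotation and the rule that rewrites __address__ with the annotated port. It relies on two documented Prometheus relabeling behaviors, namely that source label values are joined with ";" and that the regex must match the whole joined string. The function names `should_scrape` and `rewrite_address` are illustrative, and the sketch is a simplification rather than the receiver's actual code.

```python
import re
from typing import Optional

# Same pattern as the __address__ rule in the Collector config.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def should_scrape(annotations: dict) -> bool:
    """Model of the 'keep' rule: the scrape annotation must be exactly 'true'."""
    value = annotations.get("prometheus.io/scrape", "")
    return re.fullmatch("true", value) is not None


def rewrite_address(address: str, port_annotation: Optional[str]) -> str:
    """Model of the 'replace' rule that swaps in the annotated port."""
    joined = f"{address};{port_annotation or ''}"  # source labels joined with ';'
    match = ADDRESS_RULE.fullmatch(joined)         # relabel regexes are fully anchored
    if match is None:
        return address                             # no match: the label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"    # equivalent of the $1:$2 replacement


annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "8000",
}

print(should_scrape(annotations))
# True
print(rewrite_address("10.244.0.7:8080", annotations["prometheus.io/port"]))
# 10.244.0.7:8000
print(rewrite_address("10.244.0.7:8080", None))
# 10.244.0.7:8080
```

The IP address is made up for the example. The last call shows why the port annotation matters: without it the rule does not match and the discovered address is scraped as-is.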
Create a file named sample-app.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-otel-app
  namespace: default
  labels:
    app: sample-otel-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-otel-app
  template:
    metadata:
      labels:
        app: sample-otel-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000" # Port where Prometheus metrics are exposed
    spec:
      containers:
        - name: flask-app
          image: python:3.9-slim-buster
          command: ["/bin/bash", "-c"]
          args:
            - |
              pip install Flask prometheus_client opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-flask
              cat <<EOF > app.py
              from flask import Flask, request
              from prometheus_client import generate_latest, Counter, Histogram
              import time
              from opentelemetry import trace
              from opentelemetry.sdk.resources import Resource
              from opentelemetry.sdk.trace import TracerProvider
              from opentelemetry.sdk.trace.export import BatchSpanProcessor
              from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
              from opentelemetry.instrumentation.flask import FlaskInstrumentor

              app = Flask(__name__)

              # OpenTelemetry Tracing setup
              resource = Resource.create({
                  "service.name": "sample-otel-app",
                  "service.instance.id": "instance-1",
              })
              provider = TracerProvider(resource=resource)
              processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
              provider.add_span_processor(processor)
              trace.set_tracer_provider(provider)
              FlaskInstrumentor().instrument_app(app)
              tracer = trace.get_tracer(__name__)

              # Prometheus Metrics setup
              REQUEST_COUNT = Counter(
                  'app_request_count', 'Application Request Count',
                  ['method', 'endpoint']
              )
              REQUEST_LATENCY = Histogram(
                  'app_request_latency_seconds', 'Request latency in seconds',
                  ['method', 'endpoint']
              )

              @app.route('/')
              def hello():
                  with tracer.start_as_current_span("hello-endpoint"):
                      REQUEST_COUNT.labels(method='GET', endpoint='/').inc()
                      with REQUEST_LATENCY.labels(method='GET', endpoint='/').time():
                          time.sleep(0.05) # Simulate some work
                      return 'Hello, OpenTelemetry!'

              @app.route('/metrics')
              def metrics():
                  return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4; charset=utf-8'}

              if __name__ == '__main__':
                  app.run(host='0.0.0.0', port=8000)
              EOF
              python app.py
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.name=sample-otel-app,service.version=1.0.0"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317" # Target the Collector's gRPC endpoint
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"
---
apiVersion: v1
kind: Service
metadata:
  name: sample-otel-app
  namespace: default
  labels:
    app: sample-otel-app
spec:
  selector:
    app: sample-otel-app
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000
Apply the sample application:
kubectl apply -f sample-app.yaml
Verify:
kubectl get pods -l app=sample-otel-app
Wait for the pod to be in a Running state. It might take a moment for Python dependencies to install.
4. Generate Traffic and Observe Telemetry
Now that both the Collector and the sample application are running, let’s generate some traffic to the application to produce metrics and traces. We’ll then inspect the Collector’s logs to see the data being processed.
First, get the name of the sample application pod:
APP_POD=$(kubectl get pods -l app=sample-otel-app -o jsonpath='{.items[0].metadata.name}')
echo $APP_POD
Generate some requests to the application:
# The slim Python image does not ship curl, so use Python's standard library instead
for i in 1 2 3; do
  kubectl exec $APP_POD -- python -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/').read().decode())"
done
Now, let’s get the name of the OpenTelemetry Collector pod:
COLLECTOR_POD=$(kubectl get pods -l app=otel-collector -o jsonpath='{.items[0].metadata.name}')
echo $COLLECTOR_POD
View the logs of the OpenTelemetry Collector. You should see it receiving and exporting both metrics and traces:
kubectl logs -f $COLLECTOR_POD
Illustrative output (truncated; the exact layout depends on the Collector version and the exporter's verbosity):
...
2023-10-27T10:00:00.123Z INFO TracesExporter {"kind": "exporter", "name": "logging", "resource spans": 1}
2023-10-27T10:00:00.123Z DEBUG TracesExporter {"kind": "exporter", "name": "logging", "data": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "sample-otel-app"}}, {"key": "service.instance.id", "value": {"stringValue": "instance-1"}}]}, "scope_spans": [{"scope": {"name": "app.py", "version": "0.1.0"}, "spans": [{"trace_id": "...", "span_id": "...", "parent_span_id": "...", "name": "hello-endpoint", "kind": "SPAN_KIND_INTERNAL", "start_time_unix_nano": "...", "end_time_unix_nano": "...", "attributes": [], "status": {"code": "STATUS_CODE_UNSET"}}]}, {"scope": {"name": "opentelemetry.instrumentation.flask", "version": "0.41b0"}, "spans": [{"trace_id": "...", "span_id": "...", "...", "name": "GET /", "kind": "SPAN_KIND_SERVER", "start_time_unix_nano": "...", "end_time_unix_nano": "...", "attributes": [{"key": "http.method", "value": {"stringValue": "GET"}}, {"key": "http.scheme", "value": {"stringValue": "http"}}, {"key": "http.host", "value": {"stringValue": "localhost:8000"}}, {"key": "http.target", "value": {"stringValue": "/"}}, {"key": "http.flavor", "value": {"stringValue": "1.1"}}, {"key": "net.host.ip", "value": {"stringValue": "127.0.0.1"}}, {"key": "net.host.port", "value": {"intValue": 8000}}, {"key": "http.status_code", "value": {"intValue": 200}}], "status": {"code": "STATUS_CODE_UNSET"}}]}]}]}
...
2023-10-27T10:00:05.456Z INFO MetricsExporter {"kind": "exporter", "name": "logging", "resource metrics": 1}
2023-10-27T10:00:05.456Z DEBUG MetricsExporter {"kind": "exporter", "name": "logging", "data": [{"resource": {"attributes": [{"key": "kubernetes_namespace", "value": {"stringValue": "default"}}, {"key": "kubernetes_pod_name", "value": {"stringValue": "sample-otel-app-..."}}]}, "scope_metrics": [{"scope": {"name": "prometheus"}, "metrics": [{"name": "app_request_count_total", "description": "Application Request Count", "sum": {"data_points": [{"attributes": [{"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_int": "3"}]}}, {"name": "app_request_latency_seconds_bucket", "sum": {"data_points": [{"attributes": [{"key": "le", "value": {"stringValue": "0.075"}}, {"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_int": "3"}]}}, {"name": "app_request_latency_seconds_count", "sum": {"data_points": [{"attributes": [{"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_int": "3"}]}}, {"name": "app_request_latency_seconds_sum", "sum": {"data_points": [{"attributes": [{"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_double": "0.150..."}]}}]}]}
...
You can clearly see log entries indicating the receipt and export of traces (TracesExporter) and metrics (MetricsExporter) by the Collector. This confirms your unified observability pipeline is working!
Production Considerations
While the logging exporter is great for demonstration, a production setup requires more robust solutions. Here are key considerations:
- Scalability: For high-volume environments, consider deploying multiple Collector instances behind a load balancer or using a DaemonSet for node-level collection. The OpenTelemetry Operator can help manage Collector deployments.
- High Availability: Run multiple replicas of the Collector Deployment. Use the Horizontal Pod Autoscaler to scale replicas based on resource utilization, and a node autoscaler such as Karpenter or the Cluster Autoscaler to provide capacity for them.
- Resource Limits: Set appropriate CPU and memory requests/limits for Collector pods to prevent resource exhaustion and ensure stability.
- Storage: If using file-based receivers or processors (e.g., for disk-backed queues), ensure persistent storage is configured (e.g., PersistentVolumeClaims).
- Security:
  - Network Policies: Restrict network access to Collector endpoints using Kubernetes Network Policies. Only allow instrumented applications and monitoring backends to communicate with the Collector.
  - Authentication/Authorization: Configure authentication for OTLP endpoints if exposing them externally. Use TLS for all communication.
  - Secrets Management: Store API keys or credentials for exporters securely using Kubernetes Secrets.
  - For enhanced security, consider integration with tools like Sigstore and Kyverno for supply chain integrity and policy enforcement.
- Exporters: Replace the logging exporter with real backends:
  - Metrics: Prometheus Remote Write, OTLP (to another Collector or vendor solution), Datadog, Google Cloud Monitoring.
  - Traces: Jaeger, Zipkin, Datadog, OTLP.
  - Logs: Loki, Fluent Bit, OTLP.
- Processors: Leverage powerful processors for data manipulation:
  - batch: For efficient sending.
  - memory_limiter: Prevents OOM errors.
  - resourcedetection: Automatically adds resource attributes detected from the environment (e.g., host and cloud provider details).
  - k8sattributes: Adds Kubernetes metadata (e.g., K8s pod name, namespace).
  - attributes, spanmetrics, transform: For advanced data enrichment and transformation.
- Networking: For cross-cluster or hybrid cloud setups, consider advanced networking solutions like Cilium WireGuard encryption for secure and efficient data transfer. If you’re using a service mesh like Istio Ambient Mesh, the Collector can integrate seamlessly.
- Observability of the Collector itself: Monitor the Collector’s health and performance using its own Prometheus metrics endpoint (port 8888 by default) and zPages (port 55679). You can also use eBPF Observability with Hubble to gain deeper insights into network interactions.
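As a small illustration of monitoring the Collector through its own metrics endpoint, the Python sketch below totals counters from Prometheus text exposition output, such as what port 8888 serves. The sample text is made up for the example, the metric names shown are typical of the Collector's internal metrics but can differ between versions, and the parser is deliberately simplistic: it assumes no timestamps and no spaces inside label values.

```python
from typing import Dict


def sum_by_metric(exposition: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                               # skip blanks, HELP and TYPE lines
        series, value = line.rsplit(" ", 1)        # "<name>{labels} <value>"
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Illustrative sample only; real output has many more series.
SAMPLE = """\
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 6
otelcol_receiver_accepted_spans{receiver="otlp",transport="http"} 2
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="logging"} 8
# TYPE otelcol_exporter_send_failed_spans counter
otelcol_exporter_send_failed_spans{exporter="logging"} 0
"""

totals = sum_by_metric(SAMPLE)
accepted = totals["otelcol_receiver_accepted_spans"]
sent = totals["otelcol_exporter_sent_spans"]
print(f"accepted={accepted:.0f} sent={sent:.0f} in_flight_or_dropped={accepted - sent:.0f}")
# accepted=8 sent=8 in_flight_or_dropped=0
```

Comparing accepted against sent and failed counts is a quick way to spot a pipeline that receives data but cannot export it. In practice you would scrape this endpoint with Prometheus and alert on it rather than parse it by hand.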
Troubleshooting
- Collector Pod Not Running/Crashing:
  Issue: The otel-collector pod is in CrashLoopBackOff or Pending state.
  Solution:
  - Check pod events: kubectl describe pod <collector-pod-name>. Look for issues with image pull, volume mounts, or resource constraints.
  - Check collector logs: kubectl logs <collector-pod-name>. Configuration errors are often printed here first. Typographical errors in the otel-collector-config.yaml are common culprits.
  - Ensure the ConfigMap otel-collector-config exists and is correctly named.
- No Metrics/Traces in Collector Logs:
  Issue: The Collector pod is running, but its logs show no received metrics or traces.
  Solution:
  - For Traces (OTLP):
    - Verify the application is correctly configured to send OTLP data to the Collector’s service: otel-collector:4317 (gRPC) or otel-collector:4318 (HTTP).
    - Check if the application’s OpenTelemetry SDK is properly initialized and instrumented.
    - Ensure network connectivity: kubectl exec -it <app-pod-name> -- curl otel-collector:4317 (or 4318). The pod’s image must include curl for this to work.
  - For Metrics (Prometheus):
    - Verify the sample app pod has the correct Prometheus annotations: prometheus.io/scrape: "true", prometheus.io/path, prometheus.io/port.
    - Check if the Prometheus receiver in otel-collector-config.yaml has correct kubernetes_sd_configs and relabel_configs that match your pods.
    - Port-forward to the app and curl its metrics endpoint: kubectl port-forward <app-pod-name> 8000:8000, then curl localhost:8000/metrics to confirm the app is exposing metrics.
- Collector Logs Show “Error exporting data”:
  Issue: The Collector receives data but fails to send it to the configured exporter.
  Solution:
  - If using a real exporter (e.g., Jaeger, Prometheus Remote Write), check the endpoint URL in your otel-collector-config.yaml. Is it correct and reachable?
  - Check network policies. Are there any Network Policies blocking egress traffic from the Collector to the backend?
  - For external services, ensure DNS resolution is working.
  - Check authentication/authorization for the target backend.
- High Resource Usage by Collector:
  Issue: The Collector pod consumes excessive CPU or memory.
  Solution:
  - Increase send_batch_size and timeout in the batch processor to reduce export frequency.
  - Implement memory_limiter processor to gracefully handle memory pressure.
  - Use tail_sampling for traces to reduce data volume.
  - Consider sharding Collectors or deploying them as DaemonSets for specific types of data.
  - Optimize processing logic; complex transformations can be CPU intensive.
- Metrics/Traces Missing Attributes:
  Issue: Telemetry data arrives at the backend but lacks expected tags or attributes.
  Solution:
  - Verify the application’s OpenTelemetry SDK is correctly setting resource attributes and span attributes.
  - Check Collector processors like resourcedetection, attributes, or transform. Ensure they are configured to add/modify attributes as expected and are part of the correct pipeline.
  - If using Prometheus receiver, review relabel_configs in the Collector config to ensure labels are being correctly captured and mapped.
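Several of the checks above come down to whether one pod can open a TCP connection to another. When the application image has Python but no curl, a small script can do the same job. The sketch below uses only the standard library; the function name `can_connect` and the script name mentioned afterwards are hypothetical. It only confirms that the port accepts connections, not that the OTLP endpoint behind it is healthy.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    reachable = can_connect(target_host, target_port)
    print(f"{target_host}:{target_port} reachable={reachable}")
    sys.exit(0 if reachable else 1)
```

Saved as a script (for example check_collector.py, a name chosen here for illustration), it could be run from inside a pod as python check_collector.py otel-collector 4317. A False result for a Service name usually points to DNS, a NetworkPolicy, or a Collector that is not listening on that port.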
FAQ Section
- What is the difference between OpenTelemetry Collector and OpenTelemetry SDKs?
  OpenTelemetry SDKs are used within your application code to instrument it and generate telemetry data (metrics, traces, logs). The OpenTelemetry Collector is a standalone proxy that receives this data (and data from other sources), processes it, and exports it to various observability backends. SDKs are for data generation; the Collector is for data management and routing.
- Should I deploy the Collector as a DaemonSet or a Deployment?
  It depends on your use case:
  - DaemonSet: Ideal for node-level collection (e.g., host metrics, system logs, Kubelet metrics). Each node gets a Collector instance, reducing network hops for local data.
  - Deployment: Best for application-level collection (e.g., receiving OTLP from instrumented apps, scraping Prometheus endpoints). It acts as a central proxy, often with multiple replicas for high availability, and is typically exposed via a Kubernetes Service.
  - Sidecar: For very specific per-application needs, a Collector can run as a sidecar container in the same pod as the application. This ensures data is processed locally before being sent out, but adds overhead per pod.
- Can the OpenTelemetry Collector replace my existing Prometheus or Fluent Bit agents?
  Potentially, yes. The Collector has receivers for