Build Your Unified Observability Pipeline with OTel Collector

OpenTelemetry Collector: Unified Observability Pipeline

In the complex world of cloud-native applications, gaining comprehensive visibility into your systems is not just a luxury—it’s a necessity. Microservices, distributed systems, and ephemeral infrastructure make traditional monitoring approaches insufficient. This is where OpenTelemetry steps in, providing a vendor-agnostic set of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces).

At the heart of the OpenTelemetry ecosystem lies the OpenTelemetry Collector. This powerful, flexible, and vendor-agnostic proxy can receive, process, and export telemetry data in various formats to multiple backends. Whether you’re dealing with Prometheus, Jaeger, Loki, or proprietary monitoring solutions, the Collector acts as a central hub, streamlining your observability pipeline and reducing the overhead on your application services. It’s the critical component for building a robust, scalable, and future-proof observability strategy in Kubernetes environments.

TL;DR: OpenTelemetry Collector in Kubernetes

The OpenTelemetry Collector is your central hub for all telemetry data (metrics, logs, traces) in Kubernetes. It receives data from applications, processes it (filtering, transforming, enriching), and exports it to various observability backends. Deploy it as a DaemonSet for host-level logs/metrics or a Deployment for application-level data. This guide shows you how to set up the Collector to scrape Prometheus metrics, receive OTLP traces, and forward both to the logging exporter, which stands in for a real backend.

Key Commands:

  • Apply Collector Config: kubectl apply -f otel-collector-config.yaml
  • Deploy Collector: kubectl apply -f otel-collector.yaml
  • Deploy Sample App: kubectl apply -f sample-app.yaml
  • View Collector Logs: kubectl logs -f <collector-pod-name>
  • Port-forward to Collector: kubectl port-forward svc/otel-collector 8888:8888

Prerequisites

Before diving in, ensure you have the following:

  • A running Kubernetes cluster (v1.20+ recommended). You can use Kind, Minikube, or a cloud provider’s managed Kubernetes service like AWS EKS, GKE, or Azure AKS.
  • kubectl installed and configured to interact with your cluster.
  • Basic understanding of Kubernetes concepts like Deployments, Services, and ConfigMaps.
  • Familiarity with observability concepts (metrics, logs, traces).

Step-by-Step Guide: Setting Up the OpenTelemetry Collector in Kubernetes

We’ll walk through deploying an OpenTelemetry Collector to collect metrics and traces from a sample application and forward them to a simple logging exporter. This setup demonstrates the core functionality of the Collector.

1. Understand the OpenTelemetry Collector Configuration

The Collector’s behavior is defined by a YAML configuration file. This file specifies receivers (how data enters the Collector), processors (how data is transformed), exporters (where data is sent), and service (which pipelines are enabled). For a deep dive into configuration, refer to the official OpenTelemetry Collector documentation.
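To make the relationship between these sections concrete, here is a minimal Python sketch of one rule the configuration has to satisfy: every receiver, processor, and exporter named in a pipeline must also be defined in its top-level section. The helper name undefined_components is hypothetical and the check is a simplification; the Collector performs its own, more thorough validation at startup.

```python
# Minimal sketch: check that every component a pipeline references is defined
# in the matching top-level section. `undefined_components` is a hypothetical
# helper, not part of the Collector or any OpenTelemetry library.

def undefined_components(config: dict) -> list:
    """Return 'pipeline/section/name' entries that are referenced but not defined."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section) or {}
            for component in (pipeline or {}).get(section, []):
                if component not in defined:
                    problems.append(f"{pipeline_name}/{section}/{component}")
    return problems


# Same shape as the configuration used in this guide, reduced to component names.
config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}},
    "exporters": {"logging": {}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["logging"]},
            "metrics": {"receivers": ["prometheus"], "processors": ["batch"], "exporters": ["logging"]},
        }
    },
}

print(undefined_components(config))  # prints []
```

A pipeline that lists an exporter with no matching entry under exporters would show up in the returned list, which mirrors the kind of configuration error the Collector reports when it refuses to start.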

Let’s create a ConfigMap for our Collector configuration. This configuration will enable an OTLP receiver for traces and a Prometheus receiver for metrics. It will then use a batch processor to efficiently send data to a logging exporter, which simply prints telemetry to the Collector’s standard output – perfect for demonstration.

Create a file named otel-collector-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: default
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317 # Listen on all interfaces so other pods can connect
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  action: replace
                  target_label: __metrics_path__
                  regex: (.+)
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: $$1:$$2 # "$$" escapes "$", which the Collector otherwise treats as an environment variable reference
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: kubernetes_namespace
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name

    processors:
      batch:
        send_batch_size: 100
        timeout: 10s

    exporters:
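      logging:
        verbosity: detailed # Prints full telemetry payloads; replaces the deprecated loglevel setting
      
      # You can add other exporters here, e.g., for Prometheus, Jaeger, Loki, etc.
      # prometheusremotewrite:
      #   endpoint: "http://prometheus-server:9090/api/v1/write"
      # Recent contrib releases no longer ship a dedicated jaeger exporter.
      # Jaeger accepts OTLP directly, so use the otlp exporter instead:
      # otlp/jaeger:
      #   endpoint: "jaeger-collector:4317"
      #   tls:
      #     insecure: true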

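    extensions:
      health_check:
        endpoint: 0.0.0.0:13133 # Serves the endpoint used by the pod's liveness and readiness probes
        path: /health

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging]
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [logging]
        # To collect logs, add a logs pipeline. It is commented out as a whole
        # because an empty "logs:" key does not define a valid pipeline.
        # logs:
        #   receivers: [filelog] # e.g., using filelog receiver for host logs
        #   processors: [batch]
        #   exporters: [logging]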

Apply the ConfigMap:

kubectl apply -f otel-collector-config.yaml

Verify:

kubectl get configmap otel-collector-config -o yaml

You should see the YAML configuration embedded in the ConfigMap’s data section.
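The batch processor settings in the Collector configuration (send_batch_size: 100, timeout: 10s) amount to a "flush on size or on timeout, whichever comes first" rule. The following Python sketch is a toy model of that rule, not the Collector's implementation: the class name SizeOrTimeoutBatcher and its tick method are hypothetical, and the real processor runs its own timer rather than relying on a caller.

```python
import time
from typing import Callable, List, Optional


class SizeOrTimeoutBatcher:
    """Toy model of size-or-timeout batching (not the Collector's implementation)."""

    def __init__(self, send_batch_size: int, timeout_s: float,
                 export: Callable[[List[object]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.export = export
        self.clock = clock
        self.items: List[object] = []
        self.first_item_at: Optional[float] = None

    def add(self, item: object) -> None:
        # Remember when the current batch started so the timeout can be checked.
        if not self.items:
            self.first_item_at = self.clock()
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes a partial batch once the timeout has passed."""
        if self.items and self.clock() - self.first_item_at >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        if self.items:
            batch, self.items = self.items, []
            self.first_item_at = None
            self.export(batch)


exported = []
batcher = SizeOrTimeoutBatcher(send_batch_size=100, timeout_s=10.0, export=exported.append)
for span_number in range(250):
    batcher.add(span_number)
print([len(batch) for batch in exported], len(batcher.items))  # prints [100, 100] 50
```

With 250 items, two full batches are exported immediately and the remaining 50 wait until the timeout elapses, which is why a low-traffic demo can take up to ten seconds before anything appears in the Collector's output.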

2. Deploy the OpenTelemetry Collector

The OpenTelemetry Collector can be deployed in various modes: as a DaemonSet for host-level collection (e.g., node metrics, host logs), as a Deployment for application-level collection (e.g., receiving OTLP from instrumented apps), or as a sidecar. For this guide, we’ll use a Deployment to act as a central collector for our sample application, and a Service to expose its OTLP gRPC and HTTP endpoints.

Create a file named otel-collector.yaml:

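# The Prometheus receiver's kubernetes_sd_configs (role: pod) lists and watches
# pods through the Kubernetes API, so the Collector needs a service account
# with read access to pods.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: default
  labels:
    app: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers: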
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0 # Use a stable version
          command:
            - "/otelcol-contrib"
            - "--config=/conf/otel-collector-config.yaml"
          volumeMounts:
            - name: otel-collector-config-vol
              mountPath: /conf
          ports:
            - name: otlp-grpc
              containerPort: 4317 # Default OTLP gRPC port
            - name: otlp-http
              containerPort: 4318 # Default OTLP HTTP port
            - name: prometheus
              containerPort: 8888 # Collector's own internal metrics (Prometheus format)
            - name: health
              containerPort: 13133 # health_check extension (must be enabled in the config)
            - name: pprof
              containerPort: 1777 # pprof extension (profiling), only if enabled
            - name: zpages
              containerPort: 55679 # zPages extension (diagnostics), only if enabled
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: otel-collector-config-vol
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: default
  labels:
    app: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
    - name: otlp-http
      protocol: TCP
      port: 4318
      targetPort: 4318
    - name: prometheus
      protocol: TCP
      port: 8888
      targetPort: 8888
    - name: health
      protocol: TCP
      port: 13133
      targetPort: 13133

Apply the Deployment and Service:

kubectl apply -f otel-collector.yaml

Verify:

kubectl get pods -l app=otel-collector
kubectl get svc otel-collector

You should see the Collector pod running and the Service exposing its ports.

3. Deploy a Sample Application with OpenTelemetry Instrumentation

Now, let’s deploy a simple application that generates Prometheus metrics and sends OTLP traces. We’ll use a basic Python Flask application for this. The application will expose a /metrics endpoint for Prometheus scraping and will be configured to send traces to our OpenTelemetry Collector.

The Prometheus receiver in our Collector configuration uses Kubernetes service discovery to find pods annotated with prometheus.io/scrape: "true". Our sample app will include these annotations.
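The relabel rule that rewrites __address__ is the least obvious part of that discovery step. Prometheus joins the source label values with ";", matches the fully anchored regex against the result, and writes the replacement into the target label. The Python sketch below mimics that one rule with the same regex; the function name rewrite_address and the pod IP are illustrative assumptions, and Python's re module stands in for the RE2 engine Prometheus actually uses.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ";", annotated port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Mimic the __address__ relabel rule for one discovered pod target."""
    # Prometheus joins the source label values with ";" before matching.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)  # relabel regexes are fully anchored
    if match is None:
        return address  # no match: the label is left unchanged
    return match.expand(r"\1:\2")  # host from group 1, annotated port from group 2


print(rewrite_address("10.244.0.12:8080", "8000"))  # prints 10.244.0.12:8000
print(rewrite_address("10.244.0.12", "8000"))       # prints 10.244.0.12:8000
print(rewrite_address("10.244.0.12:8080", ""))      # prints 10.244.0.12:8080
```

The third call shows why the prometheus.io/port annotation matters: without it the rule does not match, and the scrape goes to whatever port service discovery reported.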

Create a file named sample-app.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-otel-app
  namespace: default
  labels:
    app: sample-otel-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-otel-app
  template:
    metadata:
      labels:
        app: sample-otel-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000" # Port where Prometheus metrics are exposed
    spec:
      containers:
        - name: flask-app
          image: python:3.9-slim-buster
          command: ["/bin/bash", "-c"]
          args:
            - |
              pip install Flask prometheus_client opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-flask
              cat <<'EOF' > app.py
              from flask import Flask, request
              from prometheus_client import generate_latest, Counter, Histogram
              import time
              from opentelemetry import trace
              from opentelemetry.sdk.resources import Resource
              from opentelemetry.sdk.trace import TracerProvider
              from opentelemetry.sdk.trace.export import BatchSpanProcessor
              from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
              from opentelemetry.instrumentation.flask import FlaskInstrumentor
              
              app = Flask(__name__)
              
              # OpenTelemetry Tracing setup
              resource = Resource.create({
                  "service.name": "sample-otel-app",
                  "service.instance.id": "instance-1",
              })
              
              provider = TracerProvider(resource=resource)
              processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
              provider.add_span_processor(processor)
              trace.set_tracer_provider(provider)
              
              FlaskInstrumentor().instrument_app(app)
              
              tracer = trace.get_tracer(__name__)
              
              # Prometheus Metrics setup
              REQUEST_COUNT = Counter(
                  'app_request_count', 'Application Request Count',
                  ['method', 'endpoint']
              )
              REQUEST_LATENCY = Histogram(
                  'app_request_latency_seconds', 'Request latency in seconds',
                  ['method', 'endpoint']
              )
              
              @app.route('/')
              def hello():
                  with tracer.start_as_current_span("hello-endpoint"):
                      REQUEST_COUNT.labels(method='GET', endpoint='/').inc()
                      with REQUEST_LATENCY.labels(method='GET', endpoint='/').time():
                          time.sleep(0.05) # Simulate some work
                          return 'Hello, OpenTelemetry!'
              
              @app.route('/metrics')
              def metrics():
                  return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4; charset=utf-8'}
              
              if __name__ == '__main__':
                  app.run(host='0.0.0.0', port=8000)
              EOF
              python app.py
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.name=sample-otel-app,service.version=1.0.0"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317" # Target the Collector's gRPC endpoint
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: "grpc"
---
apiVersion: v1
kind: Service
metadata:
  name: sample-otel-app
  namespace: default
  labels:
    app: sample-otel-app
spec:
  selector:
    app: sample-otel-app
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000

Apply the sample application:

kubectl apply -f sample-app.yaml

Verify:

kubectl get pods -l app=sample-otel-app

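Wait for the pod to be in a Running state. The pod can report Running while pip is still installing the Python dependencies, so check kubectl logs -l app=sample-otel-app for Flask's "Running on" startup message before sending traffic.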

4. Generate Traffic and Observe Telemetry

Now that both the Collector and the sample application are running, let’s generate some traffic to the application to produce metrics and traces. We’ll then inspect the Collector’s logs to see the data being processed.

First, get the name of the sample application pod:

APP_POD=$(kubectl get pods -l app=sample-otel-app -o jsonpath='{.items[0].metadata.name}')
echo $APP_POD

Generate some requests to the application:

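# The slim Python image does not include curl, so use Python's standard library instead.
kubectl exec $APP_POD -- python -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/').read().decode())"
kubectl exec $APP_POD -- python -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/').read().decode())"
kubectl exec $APP_POD -- python -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/').read().decode())"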
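For more than a handful of requests, a small script run from your workstation is more convenient. The sketch below assumes you have forwarded the sample app's Service locally with kubectl port-forward svc/sample-otel-app 8000:80; the function name generate_traffic is hypothetical, and the fetch parameter exists only so the function can be exercised without a network.

```python
import urllib.request
from collections import Counter
from typing import Callable


def generate_traffic(url: str, count: int,
                     fetch: Callable = urllib.request.urlopen) -> Counter:
    """Send `count` GET requests to `url` and tally the outcomes."""
    outcomes = Counter()
    for _ in range(count):
        try:
            with fetch(url, timeout=5) as response:
                outcomes[response.status] += 1
        except OSError as error:
            # URLError, HTTPError and connection errors are all OSError subclasses;
            # tally them by exception name instead of aborting the run.
            outcomes[type(error).__name__] += 1
    return outcomes


# Example call (assumes the port-forward described above is running):
# print(generate_traffic("http://localhost:8000/", 20))
```

Each successful request produces one trace and increments the app_request_count counter, so twenty requests give the Collector noticeably more to show than three.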

Now, let’s get the name of the OpenTelemetry Collector pod:

COLLECTOR_POD=$(kubectl get pods -l app=otel-collector -o jsonpath='{.items[0].metadata.name}')
echo $COLLECTOR_POD

View the logs of the OpenTelemetry Collector. You should see it receiving and exporting both metrics and traces:

kubectl logs -f $COLLECTOR_POD

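Expected Output (illustrative and truncated; the exact log format depends on the Collector version and the exporter's verbosity):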

...
2023-10-27T10:00:00.123Z        INFO    TracesExporter  {"kind": "exporter", "name": "logging", "resource spans": 1}
2023-10-27T10:00:00.123Z        DEBUG   TracesExporter  {"kind": "exporter", "name": "logging", "data": [{"resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "sample-otel-app"}}, {"key": "service.instance.id", "value": {"stringValue": "instance-1"}}]}, "scope_spans": [{"scope": {"name": "app.py", "version": "0.1.0"}, "spans": [{"trace_id": "...", "span_id": "...", "parent_span_id": "...", "name": "hello-endpoint", "kind": "SPAN_KIND_INTERNAL", "start_time_unix_nano": "...", "end_time_unix_nano": "...", "attributes": [], "status": {"code": "STATUS_CODE_UNSET"}}]}, {"scope": {"name": "opentelemetry.instrumentation.flask", "version": "0.41b0"}, "spans": [{"trace_id": "...", "span_id": "...", "...", "name": "GET /", "kind": "SPAN_KIND_SERVER", "start_time_unix_nano": "...", "end_time_unix_nano": "...", "attributes": [{"key": "http.method", "value": {"stringValue": "GET"}}, {"key": "http.scheme", "value": {"stringValue": "http"}}, {"key": "http.host", "value": {"stringValue": "localhost:8000"}}, {"key": "http.target", "value": {"stringValue": "/"}}, {"key": "http.flavor", "value": {"stringValue": "1.1"}}, {"key": "net.host.ip", "value": {"stringValue": "127.0.0.1"}}, {"key": "net.host.port", "value": {"intValue": 8000}}, {"key": "http.status_code", "value": {"intValue": 200}}], "status": {"code": "STATUS_CODE_UNSET"}}]}]}]}
...
2023-10-27T10:00:05.456Z        INFO    MetricsExporter {"kind": "exporter", "name": "logging", "resource metrics": 1}
2023-10-27T10:00:05.456Z        DEBUG   MetricsExporter {"kind": "exporter", "name": "logging", "data": [{"resource": {"attributes": [{"key": "kubernetes_namespace", "value": {"stringValue": "default"}}, {"key": "kubernetes_pod_name", "value": {"stringValue": "sample-otel-app-..."}}]}, "scope_metrics": [{"scope": {"name": "prometheus"}, "metrics": [{"name": "app_request_count_total", "description": "Application Request Count", "sum": {"data_points": [{"attributes": [{"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_int": "3"}]}}, {"name": "app_request_latency_seconds_bucket", "sum": {"data_points": [{"attributes": [{"key": "le", "value": {"stringValue": "0.075"}}, {"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_int": "3"}]}}, {"name": "app_request_latency_seconds_count", "sum": {"data_points": [{"attributes": [{"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_int": "3"}]}}, {"name": "app_request_latency_seconds_sum", "sum": {"data_points": [{"attributes": [{"key": "method", "value": {"stringValue": "GET"}}, {"key": "endpoint", "value": {"stringValue": "/"}}], "start_time_unix_nano": "...", "time_unix_nano": "...", "as_double": "0.150..."}]}}]}]}
...

You can clearly see log entries indicating the receipt and export of traces (TracesExporter) and metrics (MetricsExporter) by the Collector. This confirms your unified observability pipeline is working!

Production Considerations

While the logging exporter is great for demonstration, a production setup requires more robust solutions. Here are key considerations:

  • Scalability: For high-volume environments, consider deploying multiple Collector instances behind a load balancer or using a DaemonSet for node-level collection. The OpenTelemetry Operator can help manage Collector deployments.
  • High Availability: Run multiple replicas of the Collector Deployment. Use the Horizontal Pod Autoscaler to scale replicas on resource utilization, and a node autoscaler such as Karpenter or Cluster Autoscaler to add node capacity when needed.
  • Resource Limits: Set appropriate CPU and memory requests/limits for Collector pods to prevent resource exhaustion and ensure stability.
  • Storage: If using file-based receivers or processors (e.g., for disk-backed queues), ensure persistent storage is configured (e.g., PersistentVolumeClaims).
  • Security:
    • Network Policies: Restrict network access to Collector endpoints using Kubernetes Network Policies. Only allow instrumented applications and monitoring backends to communicate with the Collector.
    • Authentication/Authorization: Configure authentication for OTLP endpoints if exposing them externally. Use TLS for all communication.
    • Secrets Management: Store API keys or credentials for exporters securely using Kubernetes Secrets.
    • For enhanced security, consider integration with tools like Sigstore and Kyverno for supply chain integrity and policy enforcement.
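  • Exporters: Replace the logging exporter with exporters for real backends, for example prometheusremotewrite for Prometheus-compatible storage or otlp for backends that accept OTLP, such as Jaeger.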
  • Processors: Leverage powerful processors for data manipulation:
    • batch: For efficient sending.
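    • memory_limiter: Reduces the risk of OOM kills by refusing incoming data once memory usage crosses a configured limit. Place it first in each pipeline.
    • k8sattributes: Adds Kubernetes metadata (e.g., pod name, namespace) to telemetry.
    • resourcedetection: Detects resource attributes from the environment (e.g., host or cloud provider details).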
    • attributes, spanmetrics, transform: For advanced data enrichment and transformation.
  • Networking: For cross-cluster or hybrid cloud setups, consider advanced networking solutions like Cilium WireGuard encryption for secure and efficient data transfer. If you’re using a service mesh like Istio Ambient Mesh, the Collector can integrate seamlessly.
  • Observability of the Collector itself: Monitor the Collector’s health and performance using its own Prometheus metrics endpoint (port 8888 by default) and zPages (port 55679). You can also use eBPF Observability with Hubble to gain deeper insights into network interactions.
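As a starting point for monitoring the Collector itself, the sketch below sums one metric from the Prometheus text that the Collector serves on port 8888. It is a deliberately naive parser (no timestamps, no escaping rules), the function name sum_metric is hypothetical, and the sample text is made up for illustration. The metric names shown are typical Collector internal metrics, but names and suffixes vary between Collector versions, so check your own /metrics output.

```python
def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum the values of all series of one metric in Prometheus text format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name == metric_name:
            total += float(value)
    return total


# Hypothetical sample; real output comes from http://localhost:8888/metrics
# after running: kubectl port-forward svc/otel-collector 8888:8888
sample = """\
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="logging"} 6
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 6
"""

print(sum_metric(sample, "otelcol_exporter_sent_spans"))  # prints 6.0
```

Comparing accepted and sent counts over time is a simple way to notice that a pipeline is dropping or queueing data; in production you would scrape these metrics with your monitoring system rather than parse them by hand.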

Troubleshooting

  1. Collector Pod Not Running/Crashing:

    Issue: The otel-collector pod is in CrashLoopBackOff or Pending state.

    Solution:

    • Check pod events: kubectl describe pod <collector-pod-name>. Look for issues with image pull, volume mounts, or resource constraints.
    • Check collector logs: kubectl logs <collector-pod-name>. Configuration errors are often printed here first. Typographical errors in the otel-collector-config.yaml are common culprits.
    • Ensure the ConfigMap otel-collector-config exists and is correctly named.
  2. No Metrics/Traces in Collector Logs:

    Issue: The Collector pod is running, but its logs show no received metrics or traces.

    Solution:

    • For Traces (OTLP):
      • Verify the application is correctly configured to send OTLP data to the Collector’s service: otel-collector:4317 (gRPC) or otel-collector:4318 (HTTP).
      • Check if the application’s OpenTelemetry SDK is properly initialized and instrumented.
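      • Ensure network connectivity: kubectl exec -it <app-pod-name> -- curl otel-collector:4317 (or 4318). This requires curl in the application image, which the sample app's slim Python image does not include. On the gRPC port, any response, even a protocol error, typically shows that the port is reachable.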
    • For Metrics (Prometheus):
      • Verify the sample app pod has the correct Prometheus annotations: prometheus.io/scrape: "true", prometheus.io/path, prometheus.io/port.
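      • Check if the Prometheus receiver in otel-collector-config.yaml has correct kubernetes_sd_configs and relabel_configs that match your pods. Also confirm that the Collector's service account is allowed to list and watch pods; without that RBAC permission, pod discovery finds no targets.
      • Port-forward to the app and fetch its metrics endpoint: kubectl port-forward <app-pod-name> 8000:8000, then curl localhost:8000/metrics from your workstation to confirm the app is exposing metrics.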
  3. Collector Logs Show “Error exporting data”:

    Issue: The Collector receives data but fails to send it to the configured exporter.

    Solution:

    • If using a real exporter (e.g., Jaeger, Prometheus Remote Write), check the endpoint URL in your otel-collector-config.yaml. Is it correct and reachable?
    • Check network policies. Are there any Network Policies blocking egress traffic from the Collector to the backend?
    • For external services, ensure DNS resolution is working.
    • Check authentication/authorization for the target backend.
  4. High Resource Usage by Collector:

    Issue: The Collector pod consumes excessive CPU or memory.

    Solution:

    • Tune send_batch_size and timeout in the batch processor. Larger batches reduce export frequency and per-export CPU cost, but they hold more data in memory, so raise them gradually.
    • Implement memory_limiter processor to gracefully handle memory pressure.
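    • Sample traces to reduce data volume. probabilistic_sampler is cheap; tail_sampling reduces what is exported but buffers whole traces in memory, so budget for it.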
    • Consider sharding Collectors or deploying them as DaemonSets for specific types of data.
    • Optimize processing logic; complex transformations can be CPU intensive.
  5. Metrics/Traces Missing Attributes:

    Issue: Telemetry data arrives at the backend but lacks expected tags or attributes.

    Solution:

    • Verify the application’s OpenTelemetry SDK is correctly setting resource attributes and span attributes.
    • Check Collector processors like resourcedetection, attributes, or transform. Ensure they are configured to add/modify attributes as expected and are part of the correct pipeline.
    • If using Prometheus receiver, review relabel_configs in the Collector config to ensure labels are being correctly captured and mapped.
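Several of the checks above come down to "can this pod reach that port?". Where curl is not available in the image, as in the sample app's slim Python image, a plain TCP check in Python gives the same answer. This is a minimal sketch: port_is_open is a hypothetical helper, and a successful TCP connection only shows the port is reachable, not that OTLP requests are accepted.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


# Example, run inside the application pod (ports as used in this guide):
# for port in (4317, 4318, 13133):
#     print(port, port_is_open("otel-collector", port))
```

A False result for every port usually points at DNS, the Service selector, or a Network Policy; a False result for a single port suggests the matching receiver or extension is not enabled in the Collector configuration.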

FAQ Section

  1. What is the difference between OpenTelemetry Collector and OpenTelemetry SDKs?

    OpenTelemetry SDKs are used within your application code to instrument it and generate telemetry data (metrics, traces, logs). The OpenTelemetry Collector is a standalone proxy that receives this data (and data from other sources), processes it, and exports it to various observability backends. SDKs are for data generation; the Collector is for data management and routing.

  2. Should I deploy the Collector as a DaemonSet or a Deployment?

    It depends on your use case:

    • DaemonSet: Ideal for node-level collection (e.g., host metrics, system logs, Kubelet metrics). Each node gets a Collector instance, reducing network hops for local data.
    • Deployment: Best for application-level collection (e.g., receiving OTLP from instrumented apps, scraping Prometheus endpoints). It acts as a central proxy, often with multiple replicas for high availability, and is typically exposed via a Kubernetes Service.
    • Sidecar: For very specific per-application needs, a Collector can run as a sidecar container in the same pod as the application. This ensures data is processed locally before being sent out, but adds overhead per pod.
  3. Can the OpenTelemetry Collector replace my existing Prometheus or Fluent Bit agents?

    Potentially, yes. The Collector has receivers for
