
OpenTelemetry Collector: Your Unified Observability Pipeline

In the complex world of cloud-native applications, gaining comprehensive visibility into your systems is not just a luxury, but a necessity. Microservices, distributed architectures, and ephemeral containers introduce a dizzying array of components that generate vast amounts of telemetry data—logs, metrics, and traces. Correlating this data across disparate services and infrastructure components is a monumental challenge for even the most seasoned SREs and developers. Without a unified approach, debugging becomes a nightmare, performance bottlenecks remain hidden, and proactive issue detection is nearly impossible.

Enter the OpenTelemetry Collector, a powerful, vendor-agnostic agent that acts as the central nervous system for your observability data. It’s designed to process, transform, and export telemetry data from various sources to multiple destinations, all before it ever leaves your network. By standardizing data collection and processing, the OpenTelemetry Collector simplifies your observability stack, reduces operational overhead, and ensures that your valuable telemetry reaches the right backend in the right format. This guide will walk you through deploying and configuring the OpenTelemetry Collector on Kubernetes, transforming your fragmented observability landscape into a streamlined, high-performance pipeline.

TL;DR: Unified Observability with OpenTelemetry Collector

The OpenTelemetry Collector is your central hub for collecting, processing, and exporting logs, metrics, and traces in Kubernetes. Deploy it as a DaemonSet for node-level collection or as a Deployment to act as a central gateway that aggregates and routes data. This guide covers its architecture, configuration, and deployment on Kubernetes.

Key Commands:

  • Install Helm Repo:
    helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
  • Update Helm Repo:
    helm repo update
  • Install Collector (DaemonSet example):
    helm install otel-collector open-telemetry/opentelemetry-collector -f values-daemonset.yaml
  • Apply Collector ConfigMap:
    kubectl apply -f collector-config.yaml
  • View Collector Logs:
    kubectl logs -l app.kubernetes.io/name=opentelemetry-collector
  • Cleanup:
    helm uninstall otel-collector

Prerequisites

Before diving into the deployment of the OpenTelemetry Collector, ensure you have the following:

  • Kubernetes Cluster: A running Kubernetes cluster (v1.20+ recommended). You can use Minikube, Kind, or a managed service like EKS, GKE, or AKS.
  • kubectl: The Kubernetes command-line tool, configured to connect to your cluster. Refer to the official Kubernetes documentation for installation instructions.
  • Helm: The Kubernetes package manager, version 3.x or higher. Instructions available on the Helm website.
  • Basic Kubernetes Knowledge: Familiarity with Deployments, DaemonSets, ConfigMaps, and Services.
  • Basic Observability Concepts: Understanding of metrics, traces, and logs.

Step-by-Step Guide: Deploying OpenTelemetry Collector on Kubernetes

We’ll deploy the OpenTelemetry Collector using its official Helm chart, which provides a flexible way to manage its configuration and deployment strategy.

1. Add the OpenTelemetry Helm Repository

First, add the official OpenTelemetry Helm chart repository to your Helm configuration. This allows you to easily install and manage the Collector.

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Verify: You should see a confirmation that the repository has been added and updated.

"open-telemetry" has been added to your repositories
Hang tight while we grab the latest from your repos...
...Successfully got an update from the "open-telemetry" chart repository
Update Complete. ⎈Happy Helming!⎈

2. Understand OpenTelemetry Collector Deployment Modes

The OpenTelemetry Collector can be deployed in several modes, each suited for different use cases:

  • Agent (DaemonSet): Deployed as a Kubernetes DaemonSet on each node. This is ideal for collecting host-level metrics, logs from node agents (like Fluent Bit), and telemetry from applications running on that node, especially when you want to avoid network hops for initial collection. This model is often used for infrastructure-level observability.
  • Gateway/Collector (Deployment): Deployed as a Kubernetes Deployment. This acts as a central aggregation point for telemetry data from multiple agents or directly from applications. It’s excellent for advanced processing, filtering, batching, and routing data to various backends. This provides a centralized point for managing your observability pipeline, similar to how an API Gateway centralizes traffic management.

For this guide, we’ll primarily focus on the Agent (DaemonSet) model for node-level collection, and then touch upon the Gateway (Deployment) for aggregation.
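In the agent-to-gateway pattern, each DaemonSet agent forwards over OTLP to the gateway's Service instead of exporting directly to a backend. Here is a minimal sketch of the agent-side exporter change; the gateway Service name is an assumption based on the Helm chart's usual `<release>-opentelemetry-collector` naming:

```yaml
# Sketch: agent-side config forwarding to a central gateway Collector
exporters:
  otlp:
    # Assumed Service name for a Helm release called "otel-collector-gateway"
    endpoint: "otel-collector-gateway-opentelemetry-collector:4317"
    tls:
      insecure: true  # demo only; enable TLS between Collectors in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]  # forward to the gateway instead of a backend
```

The gateway then owns the heavier processing (filtering, enrichment, routing), keeping the node agents lightweight.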

3. Configure the OpenTelemetry Collector

The core of the OpenTelemetry Collector is its configuration, typically provided via a ConfigMap. This configuration defines the Collector’s pipelines: receivers, processors, and exporters.

  • Receivers: How data gets into the Collector (e.g., OTLP, Prometheus, Jaeger, Fluent Bit).
  • Processors: How data is transformed, filtered, or enriched within the Collector (e.g., batching, attribute modification, resource detection).
  • Exporters: Where data is sent from the Collector (e.g., OTLP, Prometheus, Jaeger, Loki, Datadog, New Relic).

Let’s create a basic configuration that receives OTLP (OpenTelemetry Protocol) traces, metrics, and logs, and exports them to a local console for demonstration purposes. In a real-world scenario, you’d export to a robust backend like Prometheus, Grafana Loki, or a commercial APM solution.

# collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  labels:
    app.kubernetes.io/name: opentelemetry-collector
data:
  collector.yaml: |
    # Receivers: How the Collector gets data
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      # Example: Prometheus receiver for scraping metrics
      # prometheus:
      #   config:
      #     scrape_configs:
      #       - job_name: "otel-collector"
      #         scrape_interval: 10s
      #         static_configs:
      #           - targets: ["0.0.0.0:8888"] # Collector's own metrics endpoint

    # Processors: How the Collector processes data
    processors:
      batch:
        send_batch_size: 100
        timeout: 10s
      # Example: Resource detection to add host and OS metadata
      resourcedetection:
        detectors: ["system", "env"]
        system:
          resource_attributes:
            os.type:
              enabled: true
            host.arch:
              enabled: true
            host.name:
              enabled: true
      # Note: Kubernetes metadata (k8s.pod.name, k8s.node.name, container.id, ...)
      # is added by the k8sattributes processor, not by the system detector.

    # Exporters: Where the Collector sends data
    exporters:
      # Console exporter for demonstration (prints to stdout)
      debug:
        verbosity: detailed
      # Example: OTLP exporter to another Collector or backend
      # otlp:
      #   endpoint: "otel-collector-gateway:4317" # Target another collector or backend
      #   tls:
      #     insecure: true
      # Example: Prometheus Remote Write exporter
      # prometheusremotewrite:
      #   endpoint: "http://prometheus.kube-prometheus-stack.svc.cluster.local:9090/api/v1/write"
      # Example: Loki exporter for logs
      # loki:
      #   endpoint: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push

    # Service: Defines the data pipelines
    service:
      telemetry:
        logs:
          level: "info"
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, resourcedetection]
          exporters: [debug] # Change to otlp or another backend
        metrics:
          receivers: [otlp]
          processors: [batch, resourcedetection]
          exporters: [debug] # Change to prometheusremotewrite or another backend
        logs:
          receivers: [otlp]
          processors: [batch, resourcedetection]
          exporters: [debug] # Change to loki or another backend

Apply this ConfigMap to your cluster:

kubectl apply -f collector-config.yaml

Verify:

configmap/otel-collector-config created

4. Deploy the OpenTelemetry Collector as a DaemonSet

Now, let’s deploy the Collector using the Helm chart. We’ll use a values-daemonset.yaml file to customize the deployment, ensuring it runs as a DaemonSet and uses our custom configuration.

# values-daemonset.yaml
mode: daemonset # Deploy as a DaemonSet

config:
  ## Refer to ./internal/config.yaml for default config.
  ## Any changes here will override the default config.
  ## Example:
  # receivers:
  #   jaeger:
  #     protocols:
  #       grpc:
  #       thrift_compact:
  #       thrift_http:
  #       thrift_binary:
  #   zipkin:
  #   prometheus:
  #     config:
  #       scrape_configs:
  #         - job_name: 'otel-collector'
  #           scrape_interval: 10s
  #           static_configs:
  #             - targets: ['0.0.0.0:8888']
  # processors:
  #   batch:
  #     send_batch_size: 1000
  #     timeout: 10s
  # exporters:
  #   logging:
  #     verbosity: detailed
  # service:
  #   telemetry:
  #     logs:
  #       level: "info"
  #   pipelines:
  #     traces:
  #       receivers: [jaeger, zipkin, otlp]
  #       processors: [batch]
  #       exporters: [logging]
  #     metrics:
  #       receivers: [prometheus, otlp]
  #       processors: [batch]
  #       exporters: [logging]

  # Use an external ConfigMap for the collector configuration
  # This references the ConfigMap we created in the previous step
  existingConfigMap: otel-collector-config

# Define ports for receivers. OTLP (gRPC and HTTP) are standard.
ports:
  - name: otlp-grpc
    containerPort: 4317
    protocol: TCP
  - name: otlp-http
    containerPort: 4318
    protocol: TCP
  # - name: prometheus
  #   containerPort: 8888
  #   protocol: TCP

# Expose the OTLP ports as a Service
service:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP

# Resource limits (adjust for your environment)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

# Assign tolerations if you have tainted nodes
# tolerations:
#   - effect: NoSchedule
#     key: dedicated
#     operator: Exists
#     value: infra-node

# Node affinity for specific nodes (optional)
# affinity:
#   nodeAffinity:
#     requiredDuringSchedulingIgnoredDuringExecution:
#       nodeSelectorTerms:
#         - matchExpressions:
#           - key: kubernetes.io/os
#             operator: In
#             values:
#               - linux

Install the Collector using Helm:

helm install otel-collector open-telemetry/opentelemetry-collector -f values-daemonset.yaml

Verify: Check the deployed pods and services.

kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector
kubectl get svc -l app.kubernetes.io/name=opentelemetry-collector

You should see one collector pod running on each node, and a ClusterIP service exposing the OTLP ports.

NAME                                READY   STATUS    RESTARTS   AGE
otel-collector-opentelemetry-collector-xxxx   1/1     Running   0          2m

NAME                                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                         AGE
otel-collector-opentelemetry-collector   ClusterIP   10.xx.xx.xx   <none>        4317/TCP,4318/TCP   2m

5. Deploy the OpenTelemetry Collector as a Gateway (Optional)

For more advanced scenarios, you might want a central gateway Collector. This would typically be a Deployment that receives data from the DaemonSet agents and then forwards it to your final backend(s).

# values-deployment.yaml
mode: deployment # Deploy as a Deployment

config:
  existingConfigMap: otel-collector-config # Re-use the same config for simplicity, or create a new one

# Define ports for receivers.
ports:
  - name: otlp-grpc
    containerPort: 4317
    protocol: TCP
  - name: otlp-http
    containerPort: 4318
    protocol: TCP

# Expose the OTLP ports as a Service
service:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP

# Replicas for high availability
replicas: 2 

# Resource limits
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 1Gi

Install the Gateway Collector:

helm install otel-collector-gateway open-telemetry/opentelemetry-collector -f values-deployment.yaml

Verify:

kubectl get pods -l app.kubernetes.io/instance=otel-collector-gateway
kubectl get svc -l app.kubernetes.io/instance=otel-collector-gateway

You should see the specified number of gateway collector pods and its associated service.

6. Instrument an Application to Send Data to the Collector

Now that the Collector is running, you need an application to send it telemetry data. This usually involves instrumenting your application with OpenTelemetry SDKs.

Here’s a simple example of a Python application that sends a trace to the Collector. For more detailed instrumentation, refer to the OpenTelemetry documentation for various languages.

# app.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
# Note: the logs SDK modules are underscore-prefixed (still marked provisional)
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
import logging
import time

# Configure resource for traces, metrics, and logs
resource = Resource.create({
    "service.name": "my-python-app",
    "service.version": "1.0.0",
    "environment": "development"
})

# --- Tracing Setup ---
trace_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(trace_provider)
span_exporter = OTLPSpanExporter(endpoint="otel-collector-opentelemetry-collector:4317", insecure=True)
span_processor = BatchSpanProcessor(span_exporter)
trace_provider.add_span_processor(span_processor)
tracer = trace.get_tracer(__name__)

# --- Metrics Setup ---
metric_exporter = OTLPMetricExporter(endpoint="otel-collector-opentelemetry-collector:4317", insecure=True)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=5000)
# Readers are passed to the MeterProvider constructor (there is no add_metric_reader method)
metric_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
meter = metric_provider.get_meter(__name__)
counter = meter.create_counter(
    "my_counter",
    description="A simple counter",
    unit="1",
)

# --- Logging Setup ---
logger_provider = LoggerProvider(resource=resource)
log_exporter = OTLPLogExporter(endpoint="otel-collector-opentelemetry-collector:4317", insecure=True)
log_processor = BatchLogRecordProcessor(log_exporter)
logger_provider.add_log_record_processor(log_processor)
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(logging.INFO)  # allow INFO records through to the OTLP handler
logger = logging.getLogger(__name__)

def do_work():
    with tracer.start_as_current_span("do_work_span"):
        logger.info("Doing some work...")
        counter.add(1, {"item": "widget"})
        time.sleep(0.1)
        with tracer.start_as_current_span("sub_work_span"):
            logger.debug("Doing sub-work...")
            time.sleep(0.05)
    
if __name__ == "__main__":
    print("Sending telemetry to OpenTelemetry Collector...")
    for i in range(5):
        do_work()
        time.sleep(1)
    print("Telemetry sent. Check collector logs.")
    # Flush batched telemetry before the process exits
    trace_provider.shutdown()
    metric_provider.shutdown()
    logger_provider.shutdown()

Create a Dockerfile for this application:

# Dockerfile
FROM python:3.9-slim-buster
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]

And requirements.txt:

opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp-proto-grpc
opentelemetry-instrumentation

Build and push the image (replace your-docker-repo with your actual repo):

docker build -t your-docker-repo/my-python-app:latest .
docker push your-docker-repo/my-python-app:latest

Deploy the application to Kubernetes:

# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-python-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-python-app
  template:
    metadata:
      labels:
        app: my-python-app
    spec:
      containers:
      - name: my-python-app
        image: your-docker-repo/my-python-app:latest
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "otel-collector-opentelemetry-collector:4317" # Point to the Collector Service
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "service.name=my-python-app,service.version=1.0.0,environment=development"
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 100m
            memory: 128Mi

Apply the deployment:

kubectl apply -f app-deployment.yaml

Verify: Check the logs of the OpenTelemetry Collector pods. You should see the traces, metrics, and logs from your Python application being printed to stdout by the debug exporter.

kubectl logs -l app.kubernetes.io/name=opentelemetry-collector --tail 100

You’ll see output similar to this (truncated for brevity), confirming the Collector is receiving and processing data:

...
2023-10-27T10:30:05.123Z        INFO    TracesExporter  {"kind": "exporter", "name": "debug", "data_type": "traces", "#spans": 1}
2023-10-27T10:30:05.123Z        INFO    TracesExporter  {"kind": "exporter", "name": "debug", "data_type": "traces", "span_id": "...", "parent_span_id": "...", "trace_id": "...", "name": "do_work_span", "kind": 1, "start_time": "...", "end_time": "...", "status": {"code": 0}, "attributes": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}, "resource": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}}
...
2023-10-27T10:30:06.543Z        INFO    MetricsExporter {"kind": "exporter", "name": "debug", "data_type": "metrics", "#metrics": 1}
2023-10-27T10:30:06.543Z        INFO    MetricsExporter {"kind": "exporter", "name": "debug", "data_type": "metrics", "metric_name": "my_counter", "metric_type": "Sum", "attributes": {"item": "widget"}, "value": 1.0, "resource": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}}
...
2023-10-27T10:30:07.890Z        INFO    LogsExporter    {"kind": "exporter", "name": "debug", "data_type": "logs", "#logs": 1}
2023-10-27T10:30:07.890Z        INFO    LogsExporter    {"kind": "exporter", "name": "debug", "data_type": "logs", "timestamp": "...", "severity": "INFO", "body": "Doing some work...", "attributes": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}, "resource": {"service.name": "my-python-app", "service.version": "1.0.0", "environment": "development"}}
...

Production Considerations

Deploying the OpenTelemetry Collector in production requires careful planning to ensure reliability, scalability, and security.

  1. High Availability: For Gateway Collectors, deploy multiple replicas (replicas: N in Helm values) to ensure no single point of failure. Use a StatefulSet if you need persistent storage for certain processors (e.g., those that buffer data to disk).
  2. Resource Management: Set appropriate CPU and memory requests/limits for Collector pods. Over-provisioning wastes resources, while under-provisioning can lead to OOMKills or throttled performance. Monitor Collector resource usage closely. For cost optimization, consider tools like Karpenter to efficiently manage underlying node resources.
  3. Configuration Management: Use GitOps practices to manage your Collector configurations. Store your collector-config.yaml and Helm values.yaml in version control.
  4. Security:
    • Network Policies: Restrict network access to Collector ports using Kubernetes Network Policies. Only allow instrumented applications and other Collectors to send data.
    • TLS: Always enable TLS for OTLP communication, especially between Collectors or when sending data outside your cluster. The insecure: true flag used in the example is for demonstration only.
    • Authentication: Implement authentication for exporters (e.g., API keys, OAuth2) when sending data to commercial backends.
    • Least Privilege: Run Collector pods with minimal necessary permissions.
  5. Storage for Buffering: Exporters can buffer data to disk (via the sending_queue backed by a file_storage extension) to prevent data loss during transient network issues or backend unavailability. This often requires a PersistentVolumeClaim. Note that the legacy queued_retry processor is deprecated in favor of per-exporter queuing.
  6. Scalability:
    • Horizontal Scaling: Scale the number of Gateway Collector replicas based on the volume of telemetry data.
    • Vertical Scaling: Increase CPU/memory for individual Collector pods if they become bottlenecks.
    • Sharding: For extremely high volumes, consider sharding your Collector deployment, where different Collectors handle different types of data or data from specific services.
  7. Monitoring the Collector Itself: Export the Collector’s own internal metrics (e.g., receiver throughput, exporter errors) to Prometheus or your monitoring system. The Collector exposes its own metrics on port 8888 by default.
  8. Backend Connectivity: Ensure Collectors have proper network connectivity to your chosen observability backends (Prometheus, Grafana Loki, Jaeger, commercial APM, etc.). This might involve configuring firewalls, VPC peering, or Cilium WireGuard Encryption for secure connections.
  9. Advanced Processors: Leverage advanced processors like k8sattributes to automatically enrich telemetry with Kubernetes metadata, attributes to rename/add/remove attributes, or filter to drop unwanted data. This reduces data volume and improves query performance in your backend.
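As an illustration of point 9, the k8sattributes processor can be wired into a pipeline as sketched below. This is a minimal fragment, not a complete config; the processor also needs RBAC permissions to query the Kubernetes API, which the official Helm chart can provision via its presets:

```yaml
# Sketch: enrich telemetry with Kubernetes metadata before batching
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]  # enrich first, then batch
      exporters: [debug]
```

Placing k8sattributes before batch ensures metadata is attached while pod context is still resolvable.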

Troubleshooting

Here are common issues you might encounter with the OpenTelemetry Collector and their solutions.

  1. Collector Pods Not Running/Crashing

    Issue: Collector pods are in Pending, CrashLoopBackOff, or Error state.

    Solution:

    • Check Events:
      kubectl describe pod <collector-pod-name>

      Look for reasons like insufficient resources (OOMKilled), image pull errors, or volume mounting issues.

    • Check Logs:
      kubectl logs <collector-pod-name>

      Configuration errors in collector.yaml are a common cause. The Collector will log detailed parsing errors.

    • Resource Limits: If OOMKilled, increase memory limits in your values.yaml.
  2. No Telemetry Data Reaching the Collector

    Issue: Application is sending data, but Collector logs don’t show any received telemetry (e.g., no debug output).

    Solution:

    • Application Configuration: Double-check that your application’s OpenTelemetry SDK is configured to point to the correct Collector service endpoint (e.g., otel-collector-opentelemetry-collector:4317).
    • Service Reachability: From inside your application pod, verify connectivity to the Collector service:
      kubectl exec -it <app-pod-name> -- curl -v telnet://otel-collector-opentelemetry-collector:4317
    • Collector Receiver Configuration: Ensure the Collector’s receivers section is correctly configured for the protocol your application is using (e.g., otlp: with grpc: and http:).
    • Network Policies: Verify no Network Policies are blocking traffic between your application and the Collector.
  3. Telemetry Data Not Reaching the Backend

    Issue: Collector receives data, but it’s not appearing in your monitoring backend (e.g., Prometheus, Grafana, Jaeger).

    Solution:

    • Collector Exporter Configuration: Review the exporters section in your collector.yaml. Ensure the endpoint, credentials, and protocol are correct for your backend.
    • Collector Logs: Look for errors in the Collector logs related to exporters (e.g., “connection refused”, “unauthorized”, “TLS handshake error”).
    • Backend Reachability: From the Collector pod, try to reach the backend endpoint (e.g., using curl or telnet).
    • Backend Status: Check the status and logs of your observability backend. It might be down, misconfigured, or rejecting data.
  4. High Resource Consumption by Collector Pods

    Issue: Collector pods are consuming excessive CPU or memory.

    Solution:

    • Profile the Collector: Enable the Collector’s own metrics (exposed on port 8888 by default) and scrape them with Prometheus to identify bottlenecks (e.g., high CPU on certain processors).
    • Batch Processor: Adjust send_batch_size and timeout in the batch processor. Larger batches reduce CPU overhead but increase latency.
    • Filtering: Use filter processors to drop unneeded telemetry data early in the pipeline, reducing processing load and export bandwidth.
    • Reduce Verbosity: Lower logging verbosity (service.telemetry.logs.level) to reduce I/O and processing.
    • Scale Out: For Gateway Collectors, increase the number of replicas.
  5. Data Loss / Backpressure

    Issue: Telemetry data is intermittently missing or delayed, especially during spikes.

    Solution:

    • Batch Processor: Ensure you have a batch processor configured for all pipelines. This helps absorb spikes.
    • Sending Queue and Retries: Enable each exporter's sending_queue and retry_on_failure settings (the legacy queued_retry processor is deprecated) so the Collector buffers and retries data during transient backend failures instead of dropping it.
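The exporter-side queuing and retry options can be sketched as follows. This is an illustrative fragment using the exporterhelper defaults as a starting point; tune the sizes and intervals for your telemetry volume:

```yaml
# Sketch: buffering and retry settings on an OTLP exporter
exporters:
  otlp:
    endpoint: "otel-collector-gateway:4317"
    sending_queue:
      enabled: true
      queue_size: 5000      # spans/metric points/log records held in memory
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s  # give up after 5 minutes of failed retries
```

Combined with the batch processor, this absorbs short backend outages without data loss.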