
Unify Observability: OpenTelemetry Collector Pipeline

Introduction

In the complex tapestry of modern microservices, observability is not just a buzzword; it’s the lifeline that keeps your applications healthy, performant, and resilient. As distributed systems grow, so does the sheer volume and variety of telemetry data—logs, metrics, and traces—generated by countless components. Collecting, processing, and exporting this data to various backend systems can quickly become a monumental challenge. This is where the OpenTelemetry Collector steps in as an indispensable tool, offering a standardized, vendor-agnostic solution for your observability pipeline.

The OpenTelemetry Collector is a powerful, flexible, and highly configurable agent that can receive, process, and export telemetry data in a multitude of formats. It acts as a central hub, decoupling the instrumentation of your applications from the specific backend systems you use for analysis and storage. This means you can switch between different observability platforms (e.g., Prometheus, Jaeger, Grafana, Datadog, Splunk) without re-instrumenting your code. For organizations operating Kubernetes, the Collector is particularly transformative, providing a unified approach to gather insights from ephemeral pods, services, and nodes. Whether you’re dealing with vast amounts of metrics for a GPU-intensive LLM workload or tracing requests through an Istio Ambient Mesh, the OpenTelemetry Collector simplifies your observability strategy.

TL;DR: OpenTelemetry Collector – Unified Observability Pipeline

The OpenTelemetry Collector is a vendor-agnostic agent for receiving, processing, and exporting telemetry data (logs, metrics, traces). It acts as a central hub, decoupling application instrumentation from backend observability systems. Deploy it as a DaemonSet for node-level collection or as a Deployment acting as a centralized gateway in Kubernetes.

Key Takeaways:

  • Unified Data Collection: Gathers metrics, traces, and logs from diverse sources.
  • Vendor Agnostic: Supports various data formats (OTLP, Jaeger, Prometheus, Zipkin) and exports to multiple backends (Prometheus, Loki, Jaeger, commercial SaaS).
  • Processing Power: Filters, samples, transforms, and enriches telemetry data before export.
  • Kubernetes Native: Deploys easily as a DaemonSet (agent per node) or a Deployment (centralized gateway).

Key Commands:


# Deploy the Collector as a DaemonSet
kubectl apply -f opentelemetry-collector-daemonset.yaml

# Deploy the Collector as a Deployment
kubectl apply -f opentelemetry-collector-deployment.yaml

# Verify Collector status
kubectl get pods -n opentelemetry-collector
kubectl logs -f -n opentelemetry-collector <collector-pod-name>

# Apply a sample application with OTLP exporter
kubectl apply -f sample-app-with-otlp.yaml
    

Prerequisites

Before diving into the deployment and configuration of the OpenTelemetry Collector on Kubernetes, ensure you have the following:

  • Kubernetes Cluster: A running Kubernetes cluster (v1.20+ recommended). You can use Minikube, Kind, or any cloud provider’s managed Kubernetes service (EKS, GKE, AKS).
  • kubectl: The Kubernetes command-line tool, configured to connect to your cluster. Refer to the official Kubernetes documentation for kubectl installation.
  • Helm (Optional but Recommended): For easier deployment of the Collector and its associated components, Helm is highly recommended. Install Helm by following the official Helm installation guide.
  • Basic Kubernetes Knowledge: Familiarity with Kubernetes concepts like Pods, Deployments, DaemonSets, Services, ConfigMaps, and Namespaces.
  • Understanding of Observability Concepts: Basic understanding of metrics, traces, and logs.
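
If you want to sanity-check your environment before starting, these commands (using only the tools listed above) confirm cluster connectivity and tool versions:


kubectl version
kubectl get nodes
helm version   # only needed if you plan to use Helm in Step 6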

Step-by-Step Guide: Deploying and Configuring OpenTelemetry Collector on Kubernetes

Step 1: Set up the Namespace and RBAC

First, we’ll create a dedicated namespace for our OpenTelemetry Collector components to keep things organized. We’ll also define the necessary Role-Based Access Control (RBAC) permissions. The Collector, especially when deployed as a DaemonSet collecting host-level metrics or using service discovery, requires specific permissions to interact with the Kubernetes API, such as listing nodes, pods, and services. This ensures it can properly gather metadata and function efficiently. Without these permissions, the Collector might fail to start or collect comprehensive data.


# opentelemetry-namespace-rbac.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: opentelemetry-collector
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: opentelemetry-collector
  namespace: opentelemetry-collector
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opentelemetry-collector
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "nodes/proxy", "pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["replicasets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: opentelemetry-collector
subjects:
- kind: ServiceAccount
  name: opentelemetry-collector
  namespace: opentelemetry-collector
roleRef:
  kind: ClusterRole
  name: opentelemetry-collector
  apiGroup: rbac.authorization.k8s.io

Apply these configurations:


kubectl apply -f opentelemetry-namespace-rbac.yaml

Verify: Check if the namespace and service account are created, and the ClusterRoleBinding is in place.


kubectl get namespace opentelemetry-collector
kubectl get serviceaccount -n opentelemetry-collector opentelemetry-collector
kubectl get clusterrolebinding opentelemetry-collector

Expected Output:


NAME                        STATUS   AGE
opentelemetry-collector     Active   Xs

NAME                        SECRETS   AGE
opentelemetry-collector     1         Xs

NAME                        ROLE                        AGE
opentelemetry-collector     ClusterRole/opentelemetry-collector   Xs

Step 2: Define the OpenTelemetry Collector Configuration

The core of the OpenTelemetry Collector is its configuration, typically provided via a ConfigMap. This configuration defines the receivers (how data is ingested), processors (how data is transformed), and exporters (where data is sent). For this example, we’ll set up a basic configuration to receive OTLP (OpenTelemetry Protocol) data, process it, and export it to a Prometheus backend for metrics and a Jaeger backend for traces (the traces are sent as OTLP, since recent Collector releases no longer ship a dedicated Jaeger exporter). We’ll also include a simple logging exporter to see what data is being processed.

This configuration also includes a Prometheus receiver for scraping metrics from the Collector itself, and a memory limiter processor to prevent the Collector from consuming too much memory, which is crucial for stability in production environments. For more advanced configurations, you might consider adding extensions for health checks or service discovery.
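
As a small taste of what such an extension looks like, here is a sketch of the health_check extension, which exposes an HTTP health endpoint (port 13133 by default) that Kubernetes liveness and readiness probes can target. It is not part of the ConfigMap below; you would merge it into collector.yaml if you want it:


# Sketch: health_check extension (merge into collector.yaml if desired)
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  # ...pipelines as defined below...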


# opentelemetry-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
  namespace: opentelemetry-collector
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 10s
              static_configs:
                - targets: ['0.0.0.0:8888'] # Collector's own metrics endpoint

    processors:
      batch:
        send_batch_size: 10000
        timeout: 10s
      memory_limiter:
        check_interval: 1s
        limit_mib: 200
        spike_limit_mib: 50

    exporters:
      logging:
        verbosity: detailed # 'verbosity' supersedes the deprecated 'loglevel' field
      prometheus:
        endpoint: "0.0.0.0:8889" # Scrape endpoint for metrics the Collector has received.
                                 # Port 8889 avoids clashing with the Collector's own telemetry on 8888.
      otlp/jaeger:
        endpoint: "jaeger-collector.jaeger.svc.cluster.local:4317" # Assuming Jaeger is deployed
        tls:
          insecure: true
        # Recent Collector releases no longer include a dedicated 'jaeger' exporter;
        # Jaeger accepts OTLP natively, so we send OTLP to its OTLP gRPC port instead.
        # For local testing, you might use an external endpoint or a port-forward.
      # Example of exporting to a commercial SaaS backend. Vendor-specific exporters such as
      # 'datadog' ship in the contrib distribution; many vendors also accept plain OTLP.
      # datadog:
      #   api:
      #     key: ${env:DD_API_KEY}
      #   metrics:
      #     resource_attributes_as_tags: true
      # otlphttp/vendor:
      #   endpoint: https://otlp.example-vendor.com # hypothetical vendor OTLP endpoint
      #   headers:
      #     api-key: ${env:VENDOR_API_KEY}

    service:
      telemetry:
        metrics:
          address: 0.0.0.0:8888 # Collector's own metrics endpoint
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp/jaeger]
        metrics:
          receivers: [otlp, prometheus] # Receive OTLP metrics and scrape collector's own metrics
          processors: [memory_limiter, batch]
          exporters: [logging, prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging]

Apply this configuration:


kubectl apply -f opentelemetry-collector-config.yaml

Verify: Check if the ConfigMap is created.


kubectl get configmap -n opentelemetry-collector opentelemetry-collector-config -o yaml

Expected Output: The YAML content of your ConfigMap.

Step 3: Deploy the OpenTelemetry Collector

The OpenTelemetry Collector can be deployed in various modes, each suited for different use cases:

  • Agent (DaemonSet): Deployed as a DaemonSet, one instance per node. Ideal for collecting host-level metrics, Kubernetes events, and forwarding telemetry from applications running on that node. This is often used for edge collection.
  • Gateway (Deployment): Deployed as a regular Deployment, typically with multiple replicas, acting as a central processing and routing layer. It receives data from agents or directly from applications, performs aggregation, sampling, and enrichment, and then exports to various backends. This is often used for centralized processing.

For this guide, we’ll demonstrate both. We’ll start with a DaemonSet for node-level collection and then a Deployment for a centralized gateway.

Option A: Deploy as a DaemonSet (Per Node Agent)

A DaemonSet ensures that a Collector instance runs on every (or selected) node in your cluster. This is excellent for collecting node-specific metrics, logs from host paths, and for acting as a local forwarder for applications running on the same node. This setup minimizes network hops and can reduce resource consumption on application pods by offloading collection logic.
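
Note that the manifest below mounts /var/log and /var/lib/docker/containers, but the ConfigMap from Step 2 does not read them yet; a filelog receiver is what would consume those paths. A rough sketch is shown here for reference (the filelog receiver ships in the contrib distribution, so you would also switch the image to otel/opentelemetry-collector-contrib to use it):


# Sketch: filelog receiver for node-level log collection (contrib distribution only)
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: end

service:
  pipelines:
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch]
      exporters: [logging]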


# opentelemetry-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: opentelemetry-collector-agent
  namespace: opentelemetry-collector
  labels:
    app: opentelemetry-collector-agent
spec:
  selector:
    matchLabels:
      app: opentelemetry-collector-agent
  template:
    metadata:
      labels:
        app: opentelemetry-collector-agent
    spec:
      serviceAccountName: opentelemetry-collector
      containers:
      - name: opentelemetry-collector
        image: otel/opentelemetry-collector:0.96.0 # Use a specific version
        command: ["/otelcol", "--config=/conf/collector.yaml"]
        volumeMounts:
        - name: collector-config
          mountPath: /conf
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        ports:
        - name: otlp-grpc
          containerPort: 4317
          protocol: TCP
        - name: otlp-http
          containerPort: 4318
          protocol: TCP
        - name: prometheus
          containerPort: 8888
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:
          runAsUser: 0 # Needed for accessing host paths
      volumes:
      - name: collector-config
        configMap:
          name: opentelemetry-collector-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Apply this configuration:


kubectl apply -f opentelemetry-collector-daemonset.yaml
Option B: Deploy as a Deployment (Gateway)

A Deployment is suitable for a centralized Collector instance (or multiple instances behind a load balancer) that acts as a gateway. It can receive data from agents, directly from applications (e.g., using OTLP), or from other Collectors, and then process and export it. This is often used for aggregation, sampling, and routing to different backends. For large-scale deployments, you might even consider Karpenter to dynamically provision nodes for your Collector Deployments, optimizing resource usage.


# opentelemetry-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opentelemetry-collector-gateway
  namespace: opentelemetry-collector
  labels:
    app: opentelemetry-collector-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opentelemetry-collector-gateway
  template:
    metadata:
      labels:
        app: opentelemetry-collector-gateway
    spec:
      serviceAccountName: opentelemetry-collector
      containers:
      - name: opentelemetry-collector
        image: otel/opentelemetry-collector:0.96.0 # Use a specific version
        command: ["/otelcol", "--config=/conf/collector.yaml"]
        volumeMounts:
        - name: collector-config
          mountPath: /conf
        ports:
        - name: otlp-grpc
          containerPort: 4317
          protocol: TCP
        - name: otlp-http
          containerPort: 4318
          protocol: TCP
        - name: prometheus
          containerPort: 8888
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 250m
            memory: 256Mi
      volumes:
      - name: collector-config
        configMap:
          name: opentelemetry-collector-config

Apply this configuration:


kubectl apply -f opentelemetry-collector-deployment.yaml

Verify: Check if the Collector pods are running.


kubectl get pods -n opentelemetry-collector -l app=opentelemetry-collector-agent # for DaemonSet
kubectl get pods -n opentelemetry-collector -l app=opentelemetry-collector-gateway # for Deployment

Expected Output (for DaemonSet, adjust for number of nodes):


NAME                                        READY   STATUS    RESTARTS   AGE
opentelemetry-collector-agent-xxxxx         1/1     Running   0          Xs

Expected Output (for Deployment):


NAME                                        READY   STATUS    RESTARTS   AGE
opentelemetry-collector-gateway-xxxxx-yyyyy 1/1     Running   0          Xs

Step 4: Expose the Collector with a Service

To allow applications and other Collectors to send telemetry data to our deployed Collector, we need to expose it via a Kubernetes Service. This Service will route traffic to the OTLP gRPC and HTTP endpoints, and optionally to the Prometheus metrics endpoint of the Collector itself.


# opentelemetry-collector-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: opentelemetry-collector
  namespace: opentelemetry-collector
  labels:
    app: opentelemetry-collector
spec:
  selector:
    app: opentelemetry-collector-agent # or app: opentelemetry-collector-gateway
  ports:
    - name: otlp-grpc
      protocol: TCP
      port: 4317
      targetPort: 4317
    - name: otlp-http
      protocol: TCP
      port: 4318
      targetPort: 4318
    - name: prometheus
      protocol: TCP
      port: 8888
      targetPort: 8888
    - name: prom-exporter # metrics exposed by the 'prometheus' exporter (port 8889 in the ConfigMap)
      protocol: TCP
      port: 8889
      targetPort: 8889

Important: Adjust the selector based on whether you deployed a DaemonSet (app: opentelemetry-collector-agent) or a Deployment (app: opentelemetry-collector-gateway).

Apply this configuration:


kubectl apply -f opentelemetry-collector-service.yaml

Verify: Check if the Service is created.


kubectl get service -n opentelemetry-collector opentelemetry-collector

Expected Output:


NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                               AGE
opentelemetry-collector   ClusterIP   10.xx.xx.xx      <none>        4317/TCP,4318/TCP,8888/TCP            Xs
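
To quickly confirm that the Service is accepting OTLP over HTTP, you can port-forward it and post an empty payload with curl; a 2xx response means the OTLP/HTTP receiver is listening (the JSON body here is just a minimal probe, not real telemetry):


kubectl port-forward -n opentelemetry-collector svc/opentelemetry-collector 4318:4318

# In another terminal:
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'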

Step 5: Deploy a Sample Application with OTLP Exporter

Now, let’s deploy a simple workload that is configured to send telemetry data to our Collector. To keep the example self-contained, it reuses the Collector contrib image as a stand-in application: a host-metrics receiver generates real metrics, and any OTLP traces sent to it are forwarded on to our main Collector, which receives and processes them.


# sample-app-with-otlp.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-otlp-app
  namespace: default # Deploy in default namespace for simplicity
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-otlp-app
  template:
    metadata:
      labels:
        app: sample-otlp-app
    spec:
      containers:
      - name: sample-app
        image: otel/opentelemetry-collector-contrib:0.96.0 # Contrib image used as a stand-in for an instrumented app
        command: ["/otelcol-contrib"] # The contrib image's binary is /otelcol-contrib
        args: ["--config=/etc/otelcol/config.yaml"]
        env:
        # These variables are what an SDK-instrumented application would read; the stand-in
        # collector below is driven by the mounted config file instead.
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4317" # Collector service endpoint
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "service.name=sample-otlp-app,service.version=1.0.0"
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: otel-config
          mountPath: /etc/otelcol
      volumes:
      - name: otel-config
        configMap:
          name: sample-app-otel-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-app-otel-config
  namespace: default
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      hostmetrics: # generates some real metrics so there is data to observe
        collection_interval: 10s
        scrapers:
          cpu:
          memory:
    processors:
      batch:
    exporters:
      otlp:
        endpoint: "opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp, hostmetrics]
          processors: [batch]
          exporters: [otlp]

This example uses the Collector contrib image itself as a simple stand-in application: its host-metrics receiver produces metrics, and any OTLP data sent to it is forwarded to our main Collector. In a real-world scenario, this would be your actual application instrumented with OpenTelemetry SDKs, which would also generate traces.
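
For comparison, an application instrumented with an OpenTelemetry SDK usually needs little more than the standard OTLP environment variables. A hypothetical container snippet (image and names are placeholders) might look like this:


# Hypothetical container spec for an SDK-instrumented application
containers:
- name: my-instrumented-app                      # placeholder
  image: registry.example.com/my-app:1.2.3       # placeholder
  env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "my-instrumented-app"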

Apply this configuration:


kubectl apply -f sample-app-with-otlp.yaml

Verify: Check the logs of both the sample application and the OpenTelemetry Collector to see data flowing.


kubectl get pods -l app=sample-otlp-app
kubectl logs -f $(kubectl get pods -l app=sample-otlp-app -o jsonpath='{.items[0].metadata.name}')

# Check collector logs
kubectl logs -f $(kubectl get pods -n opentelemetry-collector -l app=opentelemetry-collector-agent -o jsonpath='{.items[0].metadata.name}') # or gateway

Expected Output (Collector logs): You should see debug messages indicating that metrics (and, once a trace-producing application is sending OTLP, traces) are being received and exported via the logging exporter.


...
{"level":"debug","ts":"2023-10-27T10:00:00.000Z","caller":"loggingexporter/logging_exporter.go:73","msg":"TracesExporter","#traces":1}
{"level":"debug","ts":"2023-10-27T10:00:00.000Z","caller":"loggingexporter/logging_exporter.go:73","msg":"MetricsExporter","#metrics":1}
...

Step 6: Deploy Prometheus and Jaeger (Optional but Recommended)

To fully visualize the collected data, you’ll need backend systems like Prometheus for metrics and Jaeger for traces. These are crucial components for any comprehensive observability stack. For a more in-depth look at robust networking for such components, consider exploring Kubernetes Network Policies to secure communication between your observability stack.

We’ll use Helm for a quick deployment of these tools. First, add their respective Helm repositories:


helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

Deploy Prometheus:


helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace \
  -f - <<EOF
server:
  service:
    type: NodePort
    nodePort: 30090 # Access Prometheus UI on http://<node-ip>:30090
  extraScrapeConfigs: |
    - job_name: 'opentelemetry-collector'
      scrape_interval: 10s
      static_configs:
        - targets:
            - 'opentelemetry-collector.opentelemetry-collector.svc.cluster.local:8888' # Collector's own telemetry
            - 'opentelemetry-collector.opentelemetry-collector.svc.cluster.local:8889' # metrics exported by the prometheus exporter
EOF

Deploy Jaeger:


helm install jaeger jaegertracing/jaeger \
  --namespace jaeger \
  --create-namespace \
  -f - <<EOF
agent:
  enabled: false # The OpenTelemetry Collector sends straight to the Jaeger collector, so the Jaeger agent is not needed
collector:
  service:
    type: ClusterIP # Expose the Jaeger collector internally
    # Depending on the chart version, you may also need to expose the collector's OTLP gRPC port (4317),
    # e.g. via the chart's collector.service.otlp values, so the OpenTelemetry Collector can reach it.
query:
  service:
    type: NodePort
    nodePort: 30080 # Access Jaeger UI on http://<node-ip>:30080
EOF

Verify: Access Prometheus and Jaeger UIs.

  • Find a node IP: kubectl get nodes -o wide
  • Prometheus UI: http://<node-ip>:30090 (Go to Status -> Targets, you should see opentelemetry-collector as a target)
  • Jaeger UI: http://<node-ip>:30080 (once a trace-producing application is sending OTLP through the Collector, select its service name from the Service dropdown and click Find Traces)
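
If NodePort access is not available in your environment (common with managed clusters or restrictive firewalls), port-forwarding the UIs works just as well. The Service names below are the defaults these charts typically create for the release names used above and may differ with other chart versions:


kubectl port-forward -n monitoring svc/prometheus-server 9090:80    # then open http://localhost:9090
kubectl port-forward -n jaeger svc/jaeger-query 8080:80             # then open http://localhost:8080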

Production Considerations

Deploying the OpenTelemetry Collector in a production environment requires careful planning and consideration beyond the basic setup. Here are key aspects to address:

  1. Resource Management:
    • Limits & Requests: Always set appropriate CPU and memory requests and limits for Collector pods. Under-resourcing can lead to OOMKills and dropped telemetry, while over-resourcing wastes cluster resources. The memory_limiter processor in the Collector config is also crucial.
    • Horizontal Scaling: For Gateway deployments, scale out replicas based on telemetry volume and processing requirements. Use Horizontal Pod Autoscalers (HPA) based on CPU, memory, or custom metrics (e.g., queue size). A minimal HPA sketch follows this list.
  2. High Availability & Reliability:
    • Multiple Replicas: Deploy multiple replicas for Gateway Collectors to ensure no single point of failure.
    • Anti-Affinity: Use pod anti-affinity rules to ensure Collector replicas are spread across different nodes and availability zones.
    • Persistent Queues (for Exporters): For critical data, configure exporters with persistent queues to buffer data to disk before sending it to backends. This prevents data loss during temporary network outages or backend unavailability. A configuration sketch follows this list.
  3. Security:
    • Network Policies: Implement Kubernetes Network Policies to restrict traffic to and from Collector pods. Only allow expected sources (applications, other collectors) to send data and only allow the Collector to connect to its designated backends.
    • TLS/SSL: Secure communication between applications and the Collector, and between the Collector and backends using TLS. The OTLP receiver and various exporters support TLS configuration.
    • Authentication: If connecting to commercial backends, use environment variables for API keys or integrate with Kubernetes Secrets for sensitive credentials.
    • Image Security: Use trusted, specific OpenTelemetry Collector image versions. Consider scanning images with tools like Sigstore and Kyverno for supply chain security.
  4. Observability of the Collector Itself:
    • Self-Scraping: Configure the Collector to expose its own metrics (as shown in the example via the Prometheus receiver on port 8888) and monitor these. Key metrics include receiver/exporter queue sizes, dropped batches, and memory usage.
    • Health Checks: Use liveness and readiness probes in your Kubernetes deployment to ensure the Collector is responsive and healthy.
    • Logging: Configure appropriate log levels. For production, info is usually sufficient; use debug only for troubleshooting.
  5. Configuration Management:
    • Version Control: Keep your Collector configurations in version control (Git).
    • Helm Charts: Use the official OpenTelemetry Helm chart for robust and configurable deployments. It handles many of the production considerations out of the box.
  6. Advanced Processing:
    • Data Sampling: Implement tail-based or head-based sampling processors to reduce trace volume, especially for high-traffic services.
    • Batching: Optimize batch sizes and timeouts for processors and exporters to balance latency and throughput.
    • Attribute Processors: Use attribute processors to add, remove, or transform attributes (e.g., adding Kubernetes metadata via the k8sattributes processor).
  7. Network Performance:
    • Cilium: For high-performance networking and advanced observability, consider using Cilium with WireGuard encryption. Cilium’s eBPF capabilities can provide deep network insights and efficient traffic handling for your telemetry pipeline. You can even use eBPF Observability with Hubble to monitor the Collector’s network traffic.
    • Load Balancing: For Gateway deployments, ensure your Kubernetes Service (e.g., LoadBalancer type) can handle the expected ingress traffic.
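
Two of these recommendations lend themselves to short sketches. First, a minimal HorizontalPodAutoscaler for the gateway Deployment created in Step 3, scaling on CPU utilization (the name and namespace match this guide; tune the thresholds to your traffic):


# opentelemetry-collector-hpa.yaml (sketch)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: opentelemetry-collector-gateway
  namespace: opentelemetry-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: opentelemetry-collector-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Second, a sketch of a persistent sending queue on the OTLP exporter, backed by the file_storage extension. Note that file_storage ships in the contrib distribution (otel/opentelemetry-collector-contrib) and needs a writable volume mounted at the configured directory:


# Excerpt to merge into collector.yaml (sketch; requires the contrib image and a writable volume)
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  otlp/jaeger:
    endpoint: "jaeger-collector.jaeger.svc.cluster.local:4317"
    tls:
      insecure: true
    sending_queue:
      enabled: true
      storage: file_storage # buffer to disk instead of memory

service:
  extensions: [file_storage]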

Troubleshooting

Here are some common issues you might encounter with the OpenTelemetry Collector on Kubernetes and how to resolve them.

  1. Collector Pods are stuck in Pending state.

    Issue: Pods aren’t scheduling.

    Solution: Check events for the pod to understand why it’s pending. Common reasons include insufficient resources (CPU/memory), taints/tolerations issues, or node selector mismatches.

    
    kubectl describe pod -n opentelemetry-collector <collector-pod-name>
            

    Adjust resource requests/limits in the DaemonSet/Deployment or fix node affinity/tolerations.

  2. Collector Pods are restarting frequently (CrashLoopBackOff).

    Issue: The Collector process is failing shortly after starting.

    Solution: The most common cause is an invalid configuration in the ConfigMap. Check the Collector’s logs for error messages related to parsing the configuration.

    
    kubectl logs -n opentelemetry-collector <collector-pod-name>
            

    Look for lines like “Error loading configuration” or “invalid configuration”. Correct the collector.yaml in the ConfigMap, reapply it, and restart the Collector pods (for example with kubectl rollout restart) so the new configuration is picked up; the Collector does not reload its configuration automatically.

  3. No telemetry data is appearing in your backends.

    Issue: The Collector pods are running, but nothing shows up in Prometheus or Jaeger.

    Solution: Work backwards along the pipeline. Confirm the application points at the correct OTLP endpoint (the Service from Step 4), that the Service selector matches the mode you deployed (agent or gateway), and that the Collector’s logging exporter shows data arriving in its logs. If data reaches the Collector but not a backend, re-check the exporter endpoints in the ConfigMap and the backend’s own logs.
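
    A couple of commands can help confirm each hop; they assume the resource names used earlier in this guide.

    
    kubectl get endpoints -n opentelemetry-collector opentelemetry-collector
    kubectl logs -n opentelemetry-collector <collector-pod-name> | grep -iE "exporter|error"
            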
