Top 5 Alert and Monitoring Tools for Kubernetes: A Hands-On Guide (2025)

Hey there! 👋 So you’ve got your Kubernetes cluster running, containers are orchestrating beautifully, and then… something breaks at 3 AM. Sound familiar? I’ve been there, frantically trying to figure out which pod died and why, wishing I’d set up proper monitoring weeks ago.

Let me save you from that nightmare. After spending the last few months testing various monitoring solutions (and debugging my fair share of incidents), I’m here to walk you through the top 5 Kubernetes monitoring and alerting tools that actually work in 2025.

What We’re Building (and Why It Matters)

Here’s the thing about Kubernetes – it’s amazing at scaling and managing your apps, but it’s also incredibly complex. You’ve got multiple layers: the control plane, nodes, pods, containers, and your actual applications. When something goes wrong (and it will), you need visibility across all these layers.

In this guide, we’ll explore:

  1. Prometheus + Grafana (The battle-tested open-source combo)
  2. Datadog (Enterprise-grade with AI-powered insights)
  3. New Relic (Observability platform with Pixie integration)
  4. SigNoz (OpenTelemetry-native open-source solution)
  5. ELK Stack (For log-centric monitoring)

By the end, you’ll know exactly which tool fits your needs and how to set it up.

Prerequisites

Before we dive in, make sure you have:

  • A running Kubernetes cluster (v1.28+ recommended)
    • Minikube, kind, or a cloud cluster (EKS/GKE/AKS) all work
  • kubectl CLI installed and configured
  • helm 3.x installed
  • Basic understanding of Kubernetes concepts (pods, services, deployments)
  • About 30-60 minutes per tool setup

Pro Tip: I’m using a 3-node cluster for testing, but you can follow along with even a single-node Minikube setup.


Tool #1: Prometheus + Grafana – The Gold Standard

Why This Combo Rocks

Prometheus and Grafana together are like peanut butter and jelly for Kubernetes monitoring. Prometheus collects and stores your metrics, while Grafana makes them actually readable with beautiful dashboards. They’re open-source, widely adopted, and have a massive community.

Here’s where I got stuck initially: I thought Prometheus would give me pretty dashboards out of the box. Nope! Prometheus is all about data collection and storage. You need Grafana for visualization. Once I understood this separation, everything clicked.

Step 1: Install Prometheus Using kube-prometheus-stack

The easiest way to get started is with the kube-prometheus-stack Helm chart. It bundles Prometheus, Grafana, Alertmanager, and essential exporters.

# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a namespace for monitoring
kubectl create namespace monitoring

# Install the stack with customizations
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

What’s happening here?

  • retention=30d: Keeps 30 days of metrics (longer than the chart's default)
  • storage=50Gi: Allocates persistent storage for metrics

Step 2: Verify the Installation

# Check all pods are running
kubectl get pods -n monitoring

# You should see pods like:
# - prometheus-kube-prometheus-prometheus-0
# - prometheus-grafana-xxx
# - prometheus-kube-state-metrics-xxx
# - prometheus-operator-xxx

Step 3: Access Grafana Dashboard

# Port-forward Grafana to your local machine
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Default credentials:
# Username: admin
# Password: prom-operator (get it with this command)
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode

Now open http://localhost:3000 in your browser!

Step 4: Explore Pre-built Dashboards

One of the coolest things about kube-prometheus-stack is the pre-configured dashboards. Navigate to:

  • Dashboards → Browse → Kubernetes / Compute Resources / Cluster

This shows you cluster-wide CPU, memory, and network usage. Pretty neat, right?
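
Under the hood, those dashboard panels are just PromQL queries against Prometheus's HTTP API, so you can pull the same numbers from a script. Here's a minimal sketch using only the standard library — it assumes you've port-forwarded Prometheus itself (e.g. `kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090`; the service name is the chart's default, adjust if yours differs):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumes the port-forward above is running

def parse_instant_result(body: dict) -> list:
    """Extract the result series from an /api/v1/query response."""
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the matching series."""
    url = f"{PROM_URL}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"
    with urllib.request.urlopen(url) as resp:
        return parse_instant_result(json.load(resp))

if __name__ == "__main__":
    # Cluster-wide CPU usage, same idea as the pre-built dashboard panel
    q = 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'
    for series in instant_query(q):
        print(series["metric"].get("namespace", "<none>"), series["value"][1])
```

Handy for quick sanity checks or wiring metrics into your own scripts without opening Grafana.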

Step 5: Set Up Your First Alert

Let’s create an alert for high pod CPU usage. Create a file called high-cpu-alert.yaml:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-cpu-alerts
  namespace: monitoring
  labels:
    release: prometheus  # must match your Helm release name so the operator discovers this rule
    role: alert-rules
spec:
  groups:
    - name: pod-cpu
      interval: 30s
      rules:
        - alert: PodHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod, namespace) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} has high CPU usage"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has averaged more than 0.8 CPU cores over the last 5 minutes."

Apply it:

kubectl apply -f high-cpu-alert.yaml
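
If the PromQL feels opaque: `container_cpu_usage_seconds_total` is a monotonically increasing counter of CPU-seconds consumed, and `rate(...[5m])` converts it into average cores used per second over the window. A toy illustration of that conversion (not how Prometheus actually implements it — real `rate()` also handles counter resets and extrapolation):

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate PromQL rate(): per-second increase of a counter
    across a window of (timestamp, value) samples."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    if tn == t0:
        return 0.0
    return (vn - v0) / (tn - t0)

# A pod burning ~0.9 cores: the counter grows 0.9 CPU-seconds per second.
samples = [(0, 100.0), (150, 235.0), (300, 370.0)]
cores = simple_rate(samples)
print(f"{cores:.2f} cores")  # 0.90 cores -> would trip the > 0.8 alert
```

So a threshold of `> 0.8` fires when a pod sustains more than 0.8 cores, regardless of its requests or limits.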

What if you want email alerts? You’ll need to configure Alertmanager. Here’s a quick snippet for the values file:

# alertmanager-config.yaml
alertmanager:
  config:
    route:
      group_by: ['alertname', 'cluster']
      receiver: 'email-notifications'
    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'your-email@example.com'
            from: 'alertmanager@example.com'
            smarthost: 'smtp.gmail.com:587'
            auth_username: 'your-email@gmail.com'
            auth_password: 'your-app-password'

Gotcha I ran into: Gmail requires an app-specific password, not your regular password. Generate one in your Google Account settings.

What About Long-term Storage?

Prometheus stores data locally by default, which works great for 30-60 days. But for longer retention or multi-cluster setups, you’ll want Thanos or Cortex. They extend Prometheus with cheap object storage (S3, GCS) for long-term data.


Tool #2: Datadog – Enterprise Monitoring Made Easy

Why Choose Datadog?

If you want a fully-managed solution with minimal setup and powerful AI-driven anomaly detection, Datadog is your friend. It’s commercial (starts around $15/host/month), but the time savings and features can justify the cost for production workloads.

Step 1: Get Your API Key

  1. Sign up at datadoghq.com (free trial available)
  2. Navigate to Organization Settings → API Keys
  3. Copy your API key

Step 2: Install Datadog Operator

# Add Datadog Helm repo
helm repo add datadog https://helm.datadoghq.com
helm repo update

# Create namespace
kubectl create namespace datadog

# Create secret with your API key
kubectl create secret generic datadog-secret \
  --from-literal=api-key=YOUR_API_KEY_HERE \
  -n datadog

# Install Datadog Operator
helm install datadog-operator datadog/datadog-operator \
  --namespace datadog

Step 3: Deploy the Datadog Agent

Create datadog-agent.yaml:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  global:
    clusterName: my-k8s-cluster
    site: datadoghq.com  # or datadoghq.eu for EU
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
  features:
    apm:
      enabled: true  # Application Performance Monitoring
    logCollection:
      enabled: true
      containerCollectAll: true
    liveContainerCollection:
      enabled: true

Apply it:

kubectl apply -f datadog-agent.yaml

Step 4: Verify Data Flow

Within 5 minutes, head to your Datadog dashboard:

  • Go to Infrastructure → Kubernetes
  • You should see your cluster, nodes, and pods

The Live Containers view is particularly impressive – it shows real-time resource usage with minimal lag.

Step 5: Set Up Anomaly Detection

One of Datadog’s killer features is ML-powered anomaly detection. Here’s how:

  1. Go to Monitors → New Monitor
  2. Select Anomaly
  3. Define metric: kubernetes.memory.usage by pod_name
  4. Configure: “Alert if memory usage deviates from normal by 3 standard deviations”
  5. Set notification channel (Slack, PagerDuty, email)

My experience: This caught a memory leak in our app before it became critical. The ML baseline learns your normal patterns over a week, then alerts on deviations.
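
The core idea behind that monitor — flag points that sit more than three standard deviations from a learned baseline — is easy to sketch. This is a simplified rolling z-score, not Datadog's actual algorithm (theirs also models seasonality and trend):

```python
import statistics

def anomalies(values, window=5, threshold=3.0):
    """Return (index, value) pairs where a point deviates from the mean
    of the preceding `window` points by more than `threshold` stdevs."""
    out = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev > 0 and abs(values[i] - mean) / stdev > threshold:
            out.append((i, values[i]))
    return out

# Steady memory usage, then a leak-like jump
usage = [100, 102, 101, 99, 100, 101, 100, 180]
print(anomalies(usage))  # [(7, 180)]
```

The advantage over a fixed threshold is exactly what I saw in practice: the baseline adapts to each workload's normal, so a quiet service and a busy one get different effective thresholds.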


Tool #3: New Relic with Pixie – Zero-Instrumentation Observability

The Pixie Advantage

New Relic acquired Pixie, which uses eBPF (extended Berkeley Packet Filter) to collect telemetry without changing your application code. No SDK, no agents in your app containers – it just works.

Step 1: Install via Guided Setup

# Install New Relic CLI
curl -s https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash

# Run guided install (interactive)
sudo newrelic install

Follow the prompts to select:

  • Kubernetes infrastructure monitoring
  • Pixie integration (select Yes)
  • Your cluster name

Step 2: Verify Installation

kubectl get pods -n newrelic

# You should see:
# - newrelic-bundle pods
# - nri-kube-events
# - nri-prometheus
# - pixie pods (if enabled)

Step 3: Explore the Kubernetes Navigator

Log into one.newrelic.com:

  • Go to Kubernetes → Cluster Explorer
  • Click through: Cluster → Nodes → Pods → Containers

The drill-down experience is smooth, and the Pixie Live Debugging tab lets you run Python-like PxL queries on live traffic:

# See HTTP requests from the last 5 minutes
df = px.DataFrame(table='http_events', start_time='-5m')
px.display(df)

What blew my mind: You can see actual HTTP headers, SQL queries, and Redis commands without instrumenting your app. It’s like Wireshark for Kubernetes.

Step 4: Create a Workload Alert

# In New Relic UI: Alerts & AI → Create Alert
Name: Pod Restart Alert
Condition: kubernetes.podRestart > 3 in 5 minutes
Threshold: Critical
Notification: Slack channel #alerts

Tool #4: SigNoz – Open-Source OpenTelemetry Platform

Why SigNoz?

If you want vendor-neutral observability with full control over your data, SigNoz is fantastic. It’s built on OpenTelemetry (the CNCF standard), so you can instrument once and switch backends later if needed.

Step 1: Install SigNoz via Helm

# Add SigNoz Helm repo
helm repo add signoz https://charts.signoz.io
helm repo update

# Install with default config
kubectl create namespace platform
helm install signoz signoz/signoz -n platform

Step 2: Access the UI

kubectl port-forward -n platform svc/signoz-frontend 3301:3301

# Open http://localhost:3301
# First-time setup will ask you to create an account

Step 3: Deploy OpenTelemetry Collector

SigNoz uses the OTel Collector to receive metrics, logs, and traces. Here’s a basic config:

# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: platform
data:
  otel-collector-config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
    
    exporters:
      otlp:
        endpoint: signoz-otel-collector:4317
        tls:
          insecure: true
    
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]

Step 4: Instrument a Sample App

Let’s monitor a Python Flask app:

# app.py
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="signoz-otel-collector:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route('/')
def hello():
    with tracer.start_as_current_span("hello-span"):
        return "Hello from SigNoz!"

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Deploy it and watch traces appear in SigNoz!

Pro Tip: SigNoz’s query builder is way friendlier than PromQL if you’re just getting started. You can build queries visually without memorizing syntax.


Tool #5: ELK Stack – For Log-Centric Monitoring

When to Use ELK

If your debugging workflow heavily relies on logs (think microservices with detailed logging), Elasticsearch + Logstash + Kibana is a powerhouse. It’s not as metrics-focused as Prometheus, but unbeatable for log analysis.

Step 1: Deploy Elastic Cloud on Kubernetes (ECK)

# Install ECK operator
kubectl create -f https://download.elastic.co/downloads/eck/2.10.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.10.0/operator.yaml

# Verify operator is running
kubectl get pods -n elastic-system

Step 2: Deploy Elasticsearch Cluster

# elasticsearch.yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  namespace: default
spec:
  version: 8.11.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi

kubectl apply -f elasticsearch.yaml

# Get password for 'elastic' user
kubectl get secret quickstart-es-elastic-user -o=jsonpath='{.data.elastic}' | base64 --decode

Step 3: Deploy Fluent Bit for Log Collection

# Add Fluent Bit Helm repo
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

# The fluent/fluent-bit chart takes its output config through values (there
# are no backend.* flags), and ECK enables TLS plus basic auth by default.
# Point the es output at the quickstart service with credentials:

# fluent-bit-values.yaml
config:
  outputs: |
    [OUTPUT]
        Name                es
        Match               kube.*
        Host                quickstart-es-http
        Port                9200
        HTTP_User           elastic
        HTTP_Passwd         <elastic-password>
        tls                 On
        tls.verify          Off
        Suppress_Type_Name  On

helm install fluent-bit fluent/fluent-bit -f fluent-bit-values.yaml

Step 4: Access Kibana

# kibana.yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: quickstart
spec:
  version: 8.11.0
  count: 1
  elasticsearchRef:
    name: quickstart

kubectl apply -f kibana.yaml

# Port-forward Kibana
kubectl port-forward svc/quickstart-kb-http 5601:5601

Open https://localhost:5601 (accept the self-signed certificate warning) and log in with the elastic user credentials.

Step 5: Create a Log Dashboard

In Kibana:

  1. Go to Management → Stack Management → Index Patterns
  2. Create pattern: logstash-* or fluent-bit-*
  3. Go to Discover to search logs
  4. Build visualizations in Dashboard

My favorite query: Finding all ERROR logs from a specific namespace:

kubernetes.namespace_name:"production" AND log:"ERROR"
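
That Kibana search translates to a standard Elasticsearch bool query, which you can send straight to the `_search` API from a script. A sketch of building the request body (the field names match the Fluent Bit Kubernetes metadata above; the index name depends on what your output writes to):

```python
import json

def error_logs_query(namespace: str, term: str = "ERROR") -> dict:
    """Build an Elasticsearch bool query equivalent to the Kibana search:
    kubernetes.namespace_name:"<ns>" AND log:"<term>"."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"kubernetes.namespace_name": namespace}},
                    {"match_phrase": {"log": term}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 100,
    }

# POST this as JSON to https://quickstart-es-http:9200/<index>/_search
print(json.dumps(error_logs_query("production"), indent=2))
```

Useful for cron-driven reports or piping recent errors into a Slack message without opening Kibana.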

Going Further: Experiments to Try

Now that you’ve got the basics down, here are some advanced scenarios:

Experiment #1: Multi-Cluster Monitoring with Thanos

Set up Thanos to aggregate metrics from multiple Kubernetes clusters into a single Prometheus query interface.

Hint: You’ll need:

  • Thanos Sidecar on each Prometheus
  • Thanos Query for global queries
  • Object storage (S3/GCS) for long-term storage

Experiment #2: Custom Metrics with ServiceMonitor

Create a custom exporter for your app and let Prometheus auto-discover it:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s

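For the ServiceMonitor to find anything, your app must expose a `/metrics` endpoint in the Prometheus text exposition format on a port named `metrics`. In real code you'd use an official client library (`prometheus_client` for Python); this dependency-free sketch just shows what the format looks like on the wire:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counters your app would increment; a stand-in for a real client library
METRICS = {"myapp_requests_total": 0}

def render_exposition(metrics: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_exposition(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve on the port your Service exposes under the name "metrics"
    HTTPServer(("0.0.0.0", 8080), MetricsHandler).serve_forever()
```

Once the matching Service carries the `app: my-app` label and a port named `metrics`, Prometheus discovers and scrapes it automatically every 30s.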
Experiment #3: Cost Optimization with Kubecost

Install Kubecost to track which namespaces/pods cost you the most:

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update

helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set prometheus.server.global.external_labels.cluster_id=my-cluster

Experiment #4: Compare Alert Fatigue

Set up the same alerts in Prometheus, Datadog, and New Relic. Track:

  • False positive rate
  • Time to acknowledge
  • Context provided in alert notifications

My results: Datadog’s anomaly detection had fewer false positives but higher latency. Prometheus was faster but noisier.
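
If you run this comparison yourself, score it with a script rather than gut feel. A sketch that computes false-positive rate and median time-to-acknowledge from an exported alert log (the record fields here are made up — map them to whatever your tools actually export):

```python
import statistics

def score_alerts(alerts: list[dict]) -> dict:
    """alerts: [{"actionable": bool, "ack_seconds": float}, ...]
    An alert is a false positive if nobody needed to act on it."""
    total = len(alerts)
    false_positives = sum(1 for a in alerts if not a["actionable"])
    acks = [a["ack_seconds"] for a in alerts if a["actionable"]]
    return {
        "false_positive_rate": false_positives / total if total else 0.0,
        "median_ack_seconds": statistics.median(acks) if acks else None,
    }

log = [
    {"actionable": True, "ack_seconds": 120},
    {"actionable": False, "ack_seconds": 0},
    {"actionable": True, "ack_seconds": 300},
    {"actionable": False, "ack_seconds": 0},
]
print(score_alerts(log))  # false_positive_rate 0.5, median ack 210 s
```

Run the same log through each tool's alerts for a week and the noisiest one becomes obvious fast.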


Comparison Table: Which Tool Should You Choose?

| Feature | Prometheus + Grafana | Datadog | New Relic | SigNoz | ELK Stack |
|---|---|---|---|---|---|
| Cost | Free (OSS) | $15+/host/month | $100+/month | Free (OSS) | Free (OSS) |
| Learning Curve | Medium | Low | Low | Medium | High |
| Best For | Metrics | All-in-one | APM + K8s | OTel Standard | Logs |
| Scalability | High (with Thanos) | Very High | Very High | Medium | High |
| APM | Via Exporters | Built-in | Built-in | Built-in | Limited |
| Log Management | Loki addon | Built-in | Built-in | Built-in | Excellent |
| Alerting | Alertmanager | Advanced | Advanced | Good | Good |
| Community Support | Excellent | Enterprise | Enterprise | Growing | Excellent |

Recommendation

For startups/small teams: Start with Prometheus + Grafana. It’s free, well-documented, and you’ll learn fundamental concepts. Upgrade to SigNoz if you want modern UX without vendor lock-in.

For established companies: Datadog or New Relic if budget allows. The time savings and advanced features (anomaly detection, distributed tracing, incident management) pay for themselves.

For log-heavy workloads: ELK Stack. If your debugging relies on grepping through logs, nothing beats Elasticsearch’s search capabilities.

For vendor-neutral future: SigNoz. OpenTelemetry is the future, and SigNoz implements it beautifully.


Wrapping Up

Kubernetes monitoring doesn’t have to be overwhelming. Start small:

  1. Pick one tool from this guide (I’d start with Prometheus + Grafana)
  2. Get basic cluster metrics flowing
  3. Add application-specific metrics
  4. Refine your alerts (start with high-severity only!)
  5. Iterate and improve

Got questions? Hit me up in the comments below; I'd love to hear which tool you ended up choosing and why!
