Hey there! 👋 So you’ve got your Kubernetes cluster running, containers are orchestrating beautifully, and then… something breaks at 3 AM. Sound familiar? I’ve been there, frantically trying to figure out which pod died and why, wishing I’d set up proper monitoring weeks ago.
Let me save you from that nightmare. After spending the last few months testing various monitoring solutions (and debugging my fair share of incidents), I’m here to walk you through the top 5 Kubernetes monitoring and alerting tools that actually work in 2025.
What We’re Building (and Why It Matters)
Here’s the thing about Kubernetes – it’s amazing at scaling and managing your apps, but it’s also incredibly complex. You’ve got multiple layers: the control plane, nodes, pods, containers, and your actual applications. When something goes wrong (and it will), you need visibility across all these layers.
In this guide, we’ll explore:
- Prometheus + Grafana (The battle-tested open-source combo)
- Datadog (Enterprise-grade with AI-powered insights)
- New Relic (Observability platform with Pixie integration)
- SigNoz (OpenTelemetry-native open-source solution)
- ELK Stack (For log-centric monitoring)
By the end, you’ll know exactly which tool fits your needs and how to set it up.
Prerequisites
Before we dive in, make sure you have:
- A running Kubernetes cluster (v1.28+ recommended)
- Minikube, kind, or a cloud cluster (EKS/GKE/AKS) all work
- `kubectl` CLI installed and configured
- `helm` 3.x installed
- Basic understanding of Kubernetes concepts (pods, services, deployments)
- About 30-60 minutes per tool setup
Pro Tip: I’m using a 3-node cluster for testing, but you can follow along with even a single-node Minikube setup.
Tool #1: Prometheus + Grafana – The Gold Standard
Why This Combo Rocks
Prometheus and Grafana together are like peanut butter and jelly for Kubernetes monitoring. Prometheus collects and stores your metrics, while Grafana makes them actually readable with beautiful dashboards. They’re open-source, widely adopted, and have a massive community.
Here’s where I got stuck initially: I thought Prometheus would give me pretty dashboards out of the box. Nope! Prometheus is all about data collection and storage. You need Grafana for visualization. Once I understood this separation, everything clicked.
Step 1: Install Prometheus Using kube-prometheus-stack
The easiest way to get started is with the kube-prometheus-stack Helm chart. It bundles Prometheus, Grafana, Alertmanager, and essential exporters.
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create a namespace for monitoring
kubectl create namespace monitoring
# Install the stack with customizations
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
What’s happening here?
- retention=30d: Keeps 30 days of metrics (the default is 15 days)
- storage=50Gi: Allocates persistent storage for metrics
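If you prefer a values file over a pile of --set flags, the same settings can be expressed like this (a sketch; the key paths mirror the --set flags above, which follow the kube-prometheus-stack chart's structure):

```yaml
# values.yaml — pass with:
# helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
```

A values file is easier to review in version control, which matters once teammates start tweaking retention and storage.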
Step 2: Verify the Installation
# Check all pods are running
kubectl get pods -n monitoring
# You should see pods like:
# - prometheus-kube-prometheus-prometheus-0
# - prometheus-grafana-xxx
# - prometheus-kube-state-metrics-xxx
# - prometheus-operator-xxx
Step 3: Access Grafana Dashboard
# Port-forward Grafana to your local machine
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials:
# Username: admin
# Password: prom-operator (get it with this command)
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
Now open http://localhost:3000 in your browser!
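The password command above pipes through base64 --decode because Kubernetes stores Secret values base64-encoded. Here's that decoding step in isolation, using a hard-coded encoding of the default prom-operator password purely for illustration:

```shell
# Decode a base64-encoded Secret value. The string below is just
# "prom-operator" pre-encoded, so you can see the round trip.
encoded="cHJvbS1vcGVyYXRvcg=="
echo "$encoded" | base64 --decode
# prints: prom-operator
```

Note: on some systems (notably older macOS) the decode flag is `-D` instead of `--decode`.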
Step 4: Explore Pre-built Dashboards
One of the coolest things about kube-prometheus-stack is the pre-configured dashboards. Navigate to:
- Dashboards → Browse → Kubernetes / Compute Resources / Cluster
This shows you cluster-wide CPU, memory, and network usage. Pretty neat, right?
Step 5: Set Up Your First Alert
Let’s create an alert for high pod CPU usage. Create a file called high-cpu-alert.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: pod-cpu-alerts
namespace: monitoring
labels:
prometheus: kube-prometheus-prometheus
role: alert-rules
spec:
groups:
- name: pod-cpu
interval: 30s
rules:
- alert: PodHighCPU
expr: |
sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod, namespace) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} has high CPU usage"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has used more than 0.8 CPU cores (80% of one core) for 5 minutes."
Apply it:
kubectl apply -f high-cpu-alert.yaml
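To demystify the expr in that rule: container_cpu_usage_seconds_total is a counter of cumulative CPU-seconds consumed, and rate(...[5m]) converts it into average cores used per second over the window. Here's a toy Python sketch (not Prometheus code, just the arithmetic) showing why a result above 0.8 means more than 0.8 cores busy:

```python
# Counter samples: (timestamp in seconds, cumulative CPU-seconds consumed)
samples = [(0, 100.0), (60, 130.0), (120, 160.0), (240, 220.0), (300, 250.0)]

def simple_rate(samples):
    """Per-second increase across the window — roughly what PromQL's rate()
    computes (real rate() also handles counter resets and extrapolates to
    the window edges)."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

cores = simple_rate(samples)
print(cores)  # 0.5 — this pod averaged half a CPU core; above 0.8 would fire the alert
```

The sum(...) by (pod, namespace) wrapper in the rule just adds these per-container rates together per pod.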
What if you want email alerts? You’ll need to configure Alertmanager. Here’s a quick snippet for the values file:
# alertmanager-config.yaml
alertmanager:
config:
route:
group_by: ['alertname', 'cluster']
receiver: 'email-notifications'
receivers:
- name: 'email-notifications'
email_configs:
- to: 'your-email@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.gmail.com:587'
auth_username: 'your-email@gmail.com'
auth_password: 'your-app-password'
Gotcha I ran into: Gmail requires an app-specific password, not your regular password. Generate one in your Google Account settings.
What About Long-term Storage?
Prometheus stores data locally by default, which works great for 30-60 days. But for longer retention or multi-cluster setups, you’ll want Thanos or Cortex. They extend Prometheus with cheap object storage (S3, GCS) for long-term data.
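With kube-prometheus-stack, enabling the Thanos sidecar is mostly a chart-values change. The sketch below is illustrative only — exact field names (especially around the object-storage secret) vary between chart versions, so treat the secret name thanos-objstore and key objstore.yml as placeholders and check your chart's documentation:

```yaml
# values-thanos.yaml (sketch; verify fields against your kube-prometheus-stack version)
prometheus:
  prometheusSpec:
    thanos:
      # Points the sidecar at a Secret holding your S3/GCS bucket config
      objectStorageConfig:
        name: thanos-objstore   # placeholder secret name
        key: objstore.yml       # placeholder key within the secret
```

The sidecar ships completed TSDB blocks to the bucket, so local retention can stay short while historical queries go through Thanos Query.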
Tool #2: Datadog – Enterprise Monitoring Made Easy
Why Choose Datadog?
If you want a fully-managed solution with minimal setup and powerful AI-driven anomaly detection, Datadog is your friend. It’s commercial (starts around $15/host/month), but the time savings and features can justify the cost for production workloads.
Step 1: Get Your API Key
- Sign up at datadoghq.com (free trial available)
- Navigate to Organization Settings → API Keys
- Copy your API key
Step 2: Install Datadog Operator
# Add Datadog Helm repo
helm repo add datadog https://helm.datadoghq.com
helm repo update
# Create namespace
kubectl create namespace datadog
# Create secret with your API key
kubectl create secret generic datadog-secret \
--from-literal=api-key=YOUR_API_KEY_HERE \
-n datadog
# Install Datadog Operator
helm install datadog-operator datadog/datadog-operator \
--namespace datadog
Step 3: Deploy the Datadog Agent
Create datadog-agent.yaml:
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
name: datadog
namespace: datadog
spec:
global:
clusterName: my-k8s-cluster
site: datadoghq.com # or datadoghq.eu for EU
credentials:
apiSecret:
secretName: datadog-secret
keyName: api-key
features:
apm:
enabled: true # Application Performance Monitoring
logCollection:
enabled: true
containerCollectAll: true
liveContainerCollection:
enabled: true
Apply it:
kubectl apply -f datadog-agent.yaml
Step 4: Verify Data Flow
Within 5 minutes, head to your Datadog dashboard:
- Go to Infrastructure → Kubernetes
- You should see your cluster, nodes, and pods
The Live Containers view is particularly impressive – it shows real-time resource usage with minimal lag.
Step 5: Set Up Anomaly Detection
One of Datadog’s killer features is ML-powered anomaly detection. Here’s how:
- Go to Monitors → New Monitor
- Select Anomaly
- Define the metric: kubernetes.memory.usage grouped by pod_name
- Configure: “Alert if memory usage deviates from normal by 3 standard deviations”
- Set notification channel (Slack, PagerDuty, email)
My experience: This caught a memory leak in our app before it became critical. The ML baseline learns your normal patterns over a week, then alerts on deviations.
Tool #3: New Relic with Pixie – Zero-Instrumentation Observability
The Pixie Advantage
New Relic acquired Pixie, which uses eBPF (extended Berkeley Packet Filter) to collect telemetry without changing your application code. No SDK, no agents in your app containers – it just works.
Step 1: Install via Guided Setup
# Install New Relic CLI
curl -s https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
# Run guided install (interactive)
sudo newrelic install
Follow the prompts to select:
- Kubernetes infrastructure monitoring
- Pixie integration (select Yes)
- Your cluster name
Step 2: Verify Installation
kubectl get pods -n newrelic
# You should see:
# - newrelic-bundle pods
# - nri-kube-events
# - nri-prometheus
# - pixie pods (if enabled)
Step 3: Explore the Kubernetes Navigator
Log into one.newrelic.com:
- Go to Kubernetes → Cluster Explorer
- Click through: Cluster → Nodes → Pods → Containers
The drill-down experience is smooth, and the Pixie Live Debugging tab lets you run Python-like PxL scripts against live traffic:

# Stream recent HTTP requests (PxL script)
import px
df = px.DataFrame(table='http_events', start_time='-5m')
px.display(df)
What blew my mind: You can see actual HTTP headers, SQL queries, and Redis commands without instrumenting your app. It’s like Wireshark for Kubernetes.
Step 4: Create a Workload Alert
# In New Relic UI: Alerts & AI → Create Alert
Name: Pod Restart Alert
Condition: kubernetes.podRestart > 3 in 5 minutes
Threshold: Critical
Notification: Slack channel #alerts
Tool #4: SigNoz – Open-Source OpenTelemetry Platform
Why SigNoz?
If you want vendor-neutral observability with full control over your data, SigNoz is fantastic. It’s built on OpenTelemetry (the CNCF standard), so you can instrument once and switch backends later if needed.
Step 1: Install SigNoz via Helm
# Add SigNoz Helm repo
helm repo add signoz https://charts.signoz.io
helm repo update
# Install with default config
kubectl create namespace platform
helm install signoz signoz/signoz -n platform
Step 2: Access the UI
kubectl port-forward -n platform svc/signoz-frontend 3301:3301
# Open http://localhost:3301
# First-time setup will ask you to create an account
Step 3: Deploy OpenTelemetry Collector
SigNoz uses the OTel Collector to receive metrics, logs, and traces. Here’s a basic config:
# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: platform
data:
otel-collector-config.yaml: |
receivers:
prometheus:
config:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
exporters:
otlp:
endpoint: signoz-otel-collector:4317
tls:
insecure: true
service:
pipelines:
metrics:
receivers: [prometheus]
exporters: [otlp]
Step 4: Instrument a Sample App
Let’s monitor a Python Flask app:
# app.py
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Initialize tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="signoz-otel-collector:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))
app = Flask(__name__)
tracer = trace.get_tracer(__name__)
@app.route('/')
def hello():
with tracer.start_as_current_span("hello-span"):
return "Hello from SigNoz!"
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
Deploy it and watch traces appear in SigNoz!
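To actually run that app in the cluster, you'd build an image and deploy it alongside SigNoz. A minimal Deployment sketch — the image name your-registry/flask-otel-demo:latest is a placeholder for wherever you push your build:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-otel-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flask-otel-demo
  template:
    metadata:
      labels:
        app: flask-otel-demo
    spec:
      containers:
        - name: app
          image: your-registry/flask-otel-demo:latest  # placeholder image
          ports:
            - containerPort: 5000
```

Because app.py points its exporter at signoz-otel-collector:4317, the pod must run in a namespace where that service name resolves (or use the fully qualified signoz-otel-collector.platform.svc address).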
Pro Tip: SigNoz’s query builder is way friendlier than PromQL if you’re just getting started. You can build queries visually without memorizing syntax.
Tool #5: ELK Stack – For Log-Centric Monitoring
When to Use ELK
If your debugging workflow heavily relies on logs (think microservices with detailed logging), Elasticsearch + Logstash + Kibana is a powerhouse. It’s not as metrics-focused as Prometheus, but unbeatable for log analysis.
Step 1: Deploy Elastic Cloud on Kubernetes (ECK)
# Install ECK operator
kubectl create -f https://download.elastic.co/downloads/eck/2.10.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.10.0/operator.yaml
# Verify operator is running
kubectl get pods -n elastic-system
Step 2: Deploy Elasticsearch Cluster
# elasticsearch.yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: quickstart
namespace: default
spec:
version: 8.11.0
nodeSets:
- name: default
count: 3
config:
node.store.allow_mmap: false
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
kubectl apply -f elasticsearch.yaml
# Get password for 'elastic' user
kubectl get secret quickstart-es-elastic-user -o=jsonpath='{.data.elastic}' | base64 --decode
Step 3: Deploy Fluent Bit for Log Collection
# Add Fluent Bit Helm repo
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluent-bit fluent/fluent-bit \
--set backend.type=es \
--set backend.es.host=quickstart-es-http \
--set backend.es.port=9200
Heads-up: these backend.* values match older chart versions; newer releases of the fluent/fluent-bit chart configure outputs via a raw config.outputs block instead, so check the values.yaml that ships with your chart version.
Step 4: Access Kibana
# kibana.yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
name: quickstart
spec:
version: 8.11.0
count: 1
elasticsearchRef:
name: quickstart
kubectl apply -f kibana.yaml
# Port-forward Kibana
kubectl port-forward svc/quickstart-kb-http 5601:5601
Open https://localhost:5601 (a self-signed certificate warning is expected with the default ECK setup) and log in as the elastic user with the password you retrieved earlier.
Step 5: Create a Log Dashboard
In Kibana:
- Go to Management → Stack Management → Index Patterns
- Create pattern: logstash-* or fluent-bit-*
- Go to Discover to search logs
- Build visualizations in Dashboard
My favorite query: Finding all ERROR logs from a specific namespace:
kubernetes.namespace_name:"production" AND log:"ERROR"
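Under the hood, Kibana translates that query into Elasticsearch Query DSL. Here's roughly the equivalent JSON body, built in Python — the field names assume default Fluent Bit Kubernetes metadata mappings, so adjust them to your index:

```python
import json

# Roughly equivalent Query DSL for:
# kubernetes.namespace_name:"production" AND log:"ERROR"
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"kubernetes.namespace_name": "production"}},
                {"match_phrase": {"log": "ERROR"}},
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```

Knowing the DSL shape helps when you graduate from Discover to saved searches or calling the Elasticsearch API directly.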
Going Further: Experiments to Try
Now that you’ve got the basics down, here are some advanced scenarios:
Experiment #1: Multi-Cluster Monitoring with Thanos
Set up Thanos to aggregate metrics from multiple Kubernetes clusters into a single Prometheus query interface.
Hint: You’ll need:
- Thanos Sidecar on each Prometheus
- Thanos Query for global queries
- Object storage (S3/GCS) for long-term storage
Experiment #2: Custom Metrics with ServiceMonitor
Create a custom exporter for your app and let Prometheus auto-discover it:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app-monitor
spec:
selector:
matchLabels:
app: my-app
endpoints:
- port: metrics
interval: 30s
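For that ServiceMonitor to find anything, your app's Service needs a port (named metrics above) serving the Prometheus text exposition format. In production you'd use a client library such as prometheus_client, but here's a dependency-free Python sketch of what such an endpoint returns:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Prometheus exposition format: HELP/TYPE comments, then metric samples
METRICS = (
    "# HELP my_app_requests_total Total HTTP requests handled.\n"
    "# TYPE my_app_requests_total counter\n"
    "my_app_requests_total 42\n"
)

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = METRICS.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Prometheus scrapes this endpoint on the ServiceMonitor's interval; the counter value here is hard-coded just to show the wire format.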
Experiment #3: Cost Optimization with Kubecost
Install Kubecost to track which namespaces/pods cost you the most:
# Add the Kubecost Helm repo first
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost --create-namespace \
--set prometheus.server.global.external_labels.cluster_id=my-cluster
Experiment #4: Compare Alert Fatigue
Set up the same alerts in Prometheus, Datadog, and New Relic. Track:
- False positive rate
- Time to acknowledge
- Context provided in alert notifications
My results: Datadog’s anomaly detection had fewer false positives but higher latency. Prometheus was faster but noisier.
Comparison Table: Which Tool Should You Choose?
| Feature | Prometheus + Grafana | Datadog | New Relic | SigNoz | ELK Stack |
|---|---|---|---|---|---|
| Cost | Free (OSS) | $15+/host/month | $100+/month | Free (OSS) | Free (OSS) |
| Learning Curve | Medium | Low | Low | Medium | High |
| Best For | Metrics | All-in-one | APM + K8s | OTel Standard | Logs |
| Scalability | High (with Thanos) | Very High | Very High | Medium | High |
| APM | Via Exporters | Built-in | Built-in | Built-in | Limited |
| Log Management | Loki addon | Built-in | Built-in | Built-in | Excellent |
| Alerting | Alertmanager | Advanced | Advanced | Good | Good |
| Community Support | Excellent | Enterprise | Enterprise | Growing | Excellent |
Recommendation
For startups/small teams: Start with Prometheus + Grafana. It’s free, well-documented, and you’ll learn fundamental concepts. Upgrade to SigNoz if you want modern UX without vendor lock-in.
For established companies: Datadog or New Relic if budget allows. The time savings and advanced features (anomaly detection, distributed tracing, incident management) pay for themselves.
For log-heavy workloads: ELK Stack. If your debugging relies on grepping through logs, nothing beats Elasticsearch’s search capabilities.
For vendor-neutral future: SigNoz. OpenTelemetry is the future, and SigNoz implements it beautifully.
Wrapping Up
Kubernetes monitoring doesn’t have to be overwhelming. Start small:
- Pick one tool from this guide (I’d start with Prometheus + Grafana)
- Get basic cluster metrics flowing
- Add application-specific metrics
- Refine your alerts (start with high-severity only!)
- Iterate and improve
Next Learning Steps:
- Dive deeper into PromQL: Prometheus Query Basics
- Learn about SLIs and SLOs: Google SRE Book
- Explore distributed tracing: OpenTelemetry Tracing
- Check out my other guides on Kubernetes security and performance tuning
Got questions? Hit me up in the comments below; I’d love to hear which tool you ended up choosing and why!