Introduction
In the complex, distributed landscape of modern microservices architectures, understanding how requests propagate through various services is paramount. When a user experiences a slow response, identifying the bottleneck often feels like searching for a needle in a haystack. Traditional logging and metrics provide valuable insights into individual service health, but they struggle to stitch together the complete journey of a single request across multiple, independent components. This is precisely where distributed tracing shines, offering an end-to-end view that illuminates latency, errors, and performance bottlenecks.
Grafana Tempo emerges as a powerful, high-volume distributed tracing backend designed to ingest traces from various formats (Jaeger, Zipkin, OpenTelemetry, OpenCensus) and integrate seamlessly with other Grafana products like Loki for logs and Prometheus for metrics. By leveraging Tempo, developers and operations teams can gain unparalleled visibility into their applications’ behavior, allowing for faster debugging, root cause analysis, and performance optimization. This guide will walk you through the process of deploying Grafana Tempo on Kubernetes, configuring your applications to send traces, and visualizing these traces to unlock deep insights into your distributed systems.
TL;DR: Deploying Grafana Tempo on Kubernetes
Get Grafana Tempo up and running quickly on your Kubernetes cluster to start tracing your applications. This summary covers the essential steps:
- Add the Grafana Helm chart repository:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

- Install Tempo (monolithic mode for simplicity; the values below enable the common ingestion protocols):

helm install tempo grafana/tempo -f - <<EOF
tempo:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
    zipkin:
      endpoint: 0.0.0.0:9411
EOF

- Install Grafana (for visualization), pre-configured with a Tempo datasource:

helm install grafana grafana/grafana \
  --set persistence.enabled=true \
  --set adminPassword='yourStrongPassword' \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Tempo \
  --set datasources."datasources\.yaml".datasources[0].type=tempo \
  --set datasources."datasources\.yaml".datasources[0].url=http://tempo.default.svc.cluster.local:3100 \
  --set datasources."datasources\.yaml".datasources[0].uid=tempo \
  --set datasources."datasources\.yaml".datasources[0].jsonData.nodeGraph.enabled=true \
  --set datasources."datasources\.yaml".datasources[0].jsonData.serviceMap.enabled=false \
  --set datasources."datasources\.yaml".datasources[0].jsonData.search.hide=false \
  --set datasources."datasources\.yaml".datasources[0].jsonData.traceQuery.groupBy="resource.service.name" \
  --set datasources."datasources\.yaml".datasources[0].jsonData.traceQuery.filterBy=""

- Expose the Grafana UI (e.g., via port-forward):

kubectl port-forward service/grafana 3000:80

- Configure your application to send OpenTelemetry traces to Tempo:
Point your application's OpenTelemetry exporter to tempo.default.svc.cluster.local:4317 (or the appropriate service name and port).
Access Grafana at http://localhost:3000, log in with admin and yourStrongPassword, then navigate to Explore and select the Tempo datasource to start querying traces!
Prerequisites
Before embarking on this tracing journey, ensure you have the following tools and knowledge:
- Kubernetes Cluster: A running Kubernetes cluster (e.g., Kind, Minikube, or a cloud-managed cluster like GKE, EKS, AKS).
- kubectl: The Kubernetes command-line tool, configured to connect to your cluster. Refer to the official Kubernetes documentation for installation.
- Helm 3: The package manager for Kubernetes. Install it by following the instructions on the Helm website.
- Basic Kubernetes Knowledge: Familiarity with Kubernetes concepts such as Pods, Deployments, Services, and Namespaces.
- Basic Tracing Concepts: An understanding of what distributed tracing is, including concepts like traces, spans, and instrumentation.
Step-by-Step Guide: Deploying and Using Grafana Tempo
This guide will walk you through setting up Grafana Tempo, deploying a sample application, and visualizing traces in Grafana.
Step 1: Add Grafana Helm Repository
First, we need to add the official Grafana Helm chart repository, which contains charts for Tempo, Grafana itself, and other related tools. This ensures we can easily install and manage these components.
Adding the repository allows Helm to fetch the latest stable versions of the charts. Subsequently, updating the repository ensures your local Helm cache has the most recent chart information, including any new releases or updates. This is a standard practice before installing any new software via Helm.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
Verify: Helm Repositories
You should see output indicating that the repositories have been updated.
"grafana" has been added to your repositories
Hang tight while we grab the latest from your configured repos...
...Successfully got an update from the "grafana" chart repository
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈Happy Helming!⎈
Step 2: Install Grafana Tempo
Now, let's install Grafana Tempo. For simplicity in a tutorial environment, we'll deploy Tempo in its monolithic (single binary) mode using its Helm chart. This configuration bundles all Tempo components (distributor, ingester, querier, compactor) into a single StatefulSet, making it easier to manage for smaller deployments or testing. For production, you would typically deploy Tempo in its fully distributed mode, with separate components for scalability and resilience.
We're enabling various trace ingestion protocols (OTLP gRPC, Jaeger gRPC, Jaeger Thrift, Zipkin HTTP) to ensure maximum compatibility with different application instrumentation libraries. OpenTelemetry (OTLP) is the recommended standard for new applications.
helm install tempo grafana/tempo -f - <<EOF
tempo:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    jaeger:
      protocols:
        grpc:
          endpoint: 0.0.0.0:14250
        thrift_http:
          endpoint: 0.0.0.0:14268
    zipkin:
      endpoint: 0.0.0.0:9411
EOF
Verify: Tempo Deployment
Check that the Tempo Pods and Services are running.
kubectl get pods -l app.kubernetes.io/name=tempo
Expected output:
NAME READY STATUS RESTARTS AGE
tempo-tempo-0 1/1 Running 0 2m
And check the Tempo services:
kubectl get svc -l app.kubernetes.io/name=tempo
Expected output (ports might vary, but look for the main tempo service and its exposed ports):
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                                    AGE
tempo         ClusterIP   10.96.145.195   <none>        3100/TCP,4317/TCP,4318/TCP,14250/TCP,14268/TCP,9411/TCP,80/TCP,55679/TCP   2m
tempo-agent   ClusterIP   10.96.223.107   <none>        6831/UDP,6832/UDP,4317/TCP,4318/TCP,55679/TCP                              2m
Note the `tempo` service's ClusterIP and ports. Specifically, `4317/TCP` is for OTLP gRPC, `14250/TCP` for Jaeger gRPC, `14268/TCP` for Jaeger Thrift HTTP, and `9411/TCP` for Zipkin HTTP. These are the endpoints your applications will use to send traces.
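For example, an OpenTelemetry SDK honors a handful of standard environment variables, so pointing an application at the OTLP gRPC endpoint above typically needs no code changes (the values below assume Tempo runs in the default namespace):

```shell
# Standard OpenTelemetry exporter configuration via environment variables
export OTEL_SERVICE_NAME=my-flask-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.default.svc.cluster.local:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

Apps instrumented for Jaeger or Zipkin can instead target ports 14250/14268 or 9411, respectively.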
Step 3: Install Grafana (for Visualization)
To visualize the traces stored in Tempo, we need Grafana. We'll install Grafana via Helm and pre-configure it with a Tempo datasource. This streamlines the setup by automatically adding Tempo as a data source when Grafana starts.
The `datasources.yaml` configuration defines a Tempo datasource, pointing to the `tempo` service within the Kubernetes cluster. We enable `nodeGraph` for better visualization of trace dependencies. Remember to choose a strong password for the Grafana admin user. For more on observability, consider exploring eBPF Observability with Hubble, which complements tracing by providing deeper network insights.
helm install grafana grafana/grafana \
--set persistence.enabled=true \
--set adminPassword='yourStrongPassword' \
--set datasources."datasources\.yaml".apiVersion=1 \
--set datasources."datasources\.yaml".datasources[0].name=Tempo \
--set datasources."datasources\.yaml".datasources[0].type=tempo \
--set datasources."datasources\.yaml".datasources[0].url=http://tempo.default.svc.cluster.local:3100 \
--set datasources."datasources\.yaml".datasources[0].uid=tempo \
--set datasources."datasources\.yaml".datasources[0].jsonData.nodeGraph.enabled=true \
--set datasources."datasources\.yaml".datasources[0].jsonData.serviceMap.enabled=false \
--set datasources."datasources\.yaml".datasources[0].jsonData.search.hide=false \
--set datasources."datasources\.yaml".datasources[0].jsonData.traceQuery.groupBy="resource.service.name" \
--set datasources."datasources\.yaml".datasources[0].jsonData.traceQuery.filterBy=""
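If you prefer a values file over a long chain of `--set` flags, the core datasource flags above correspond roughly to this sketch (field names mirror the flags; adjust the URL to your namespace):

```yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Tempo
        type: tempo
        uid: tempo
        url: http://tempo.default.svc.cluster.local:3100
        jsonData:
          nodeGraph:
            enabled: true
          serviceMap:
            enabled: false
          search:
            hide: false
```

Save it as grafana-values.yaml and install with `helm install grafana grafana/grafana -f grafana-values.yaml --set persistence.enabled=true --set adminPassword='yourStrongPassword'`.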
Verify: Grafana Deployment
Check if Grafana Pods are running.
kubectl get pods -l app.kubernetes.io/name=grafana
Expected output:
NAME READY STATUS RESTARTS AGE
grafana-7b98d9756b-abcde 1/1 Running 0 1m
Step 4: Deploy a Sample Instrumented Application
To generate traces, we need an application that is instrumented to send them. We'll use a simple Python Flask application configured with OpenTelemetry. On each request to its root endpoint, the app calls itself, producing a small multi-span trace (automatic server and client spans plus two custom spans).
This example uses the OpenTelemetry Python SDK to instrument a basic Flask application. The `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable is crucial here, as it directs the traces to our Tempo service. We're using `tempo.default.svc.cluster.local:4317`, which is the OTLP gRPC endpoint of the Tempo service in the `default` namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-flask-app
  labels:
    app: otel-flask-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-flask-app
  template:
    metadata:
      labels:
        app: otel-flask-app
    spec:
      containers:
      - name: otel-flask-app
        image: python:3.9-slim-buster
        ports:
        - containerPort: 5000
        env:
        - name: OTEL_SERVICE_NAME
          value: my-flask-service
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://tempo.default.svc.cluster.local:4317 # Tempo OTLP gRPC endpoint
        - name: FLASK_APP
          value: app.py
        command: ["/bin/bash", "-c"]
        args:
        - |
          pip install Flask opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests
          cat > app.py <<'EOF'
          from flask import Flask
          import requests

          from opentelemetry import trace
          from opentelemetry.sdk.resources import Resource
          from opentelemetry.sdk.trace import TracerProvider
          from opentelemetry.sdk.trace.export import BatchSpanProcessor
          from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
          from opentelemetry.instrumentation.flask import FlaskInstrumentor
          from opentelemetry.instrumentation.requests import RequestsInstrumentor

          # Set up OpenTelemetry
          resource = Resource.create({
              "service.name": "my-flask-service",
              "application": "my-flask-app",
              "environment": "demo",
          })
          provider = TracerProvider(resource=resource)
          processor = BatchSpanProcessor(OTLPSpanExporter())
          provider.add_span_processor(processor)
          trace.set_tracer_provider(provider)

          app = Flask(__name__)
          FlaskInstrumentor().instrument_app(app)
          RequestsInstrumentor().instrument()
          tracer = trace.get_tracer(__name__)

          @app.route('/')
          def hello():
              with tracer.start_as_current_span("hello-request"):
                  response = requests.get("http://localhost:5000/internal")
                  return f"Hello, World! Internal call status: {response.status_code}"

          @app.route('/internal')
          def internal():
              with tracer.start_as_current_span("internal-call"):
                  return "Internal call successful!"

          if __name__ == '__main__':
              app.run(host='0.0.0.0', port=5000)
          EOF
          flask run --host=0.0.0.0 --port=5000
---
apiVersion: v1
kind: Service
metadata:
  name: otel-flask-app
  labels:
    app: otel-flask-app
spec:
  selector:
    app: otel-flask-app
  ports:
  - protocol: TCP
    port: 5000
    targetPort: 5000
  type: ClusterIP
Apply this manifest:
kubectl apply -f otel-flask-app.yaml
Verify: Application Deployment
Check that the application Pod is running.
kubectl get pods -l app=otel-flask-app
Expected output:
NAME READY STATUS RESTARTS AGE
otel-flask-app-7c7c7c7c7-xyz12 1/1 Running 0 1m
Step 5: Generate Traces
With the application deployed, let's generate some traffic to create traces. We'll port-forward the application service and hit its endpoint a few times.
This step simulates user requests to your application. Each request will trigger the Flask app, which in turn generates OpenTelemetry spans and sends them to the Tempo backend via the configured OTLP endpoint.
kubectl port-forward service/otel-flask-app 8080:5000 &
Now, hit the application multiple times:
curl http://localhost:8080
curl http://localhost:8080
curl http://localhost:8080/internal
curl http://localhost:8080
You should see output like:
Hello, World! Internal call status: 200
Hello, World! Internal call status: 200
Internal call successful!
Hello, World! Internal call status: 200
Don't forget to kill the port-forward process when done:
kill %1
Step 6: Access Grafana and Visualize Traces
Finally, let's access Grafana and explore the traces.
Port-forward the Grafana service to your local machine:
kubectl port-forward service/grafana 3000:80
Open your web browser and navigate to `http://localhost:3000`. Log in with username `admin` and the password you set (`yourStrongPassword`).
Verify: Grafana Data Source
Once logged in, navigate to `Connections -> Data sources`. You should see the `Tempo` datasource pre-configured.
Now, go to the `Explore` section (the compass icon on the left sidebar).
- Select `Tempo` from the datasource dropdown.
- In the Query editor, you can search for traces. For instance, to find traces from our Flask application, you could search by `service.name="my-flask-service"`.
- Click `Run Query`.
You should now see a list of traces. Click on any trace ID to view its detailed span breakdown, including the `hello-request` and `internal-call` spans, their durations, and dependencies. The node graph visualization should show the flow of calls.
This end-to-end view is incredibly powerful for debugging. If you see a long-running trace, you can quickly identify which span (and thus which service or specific operation) is contributing most to the latency. For managing network traffic and services, you might also find our Kubernetes Gateway API Migration Guide useful, especially as tracing often goes hand-in-hand with traffic management.
Production Considerations
Deploying Grafana Tempo in a production environment requires careful planning beyond a simple Helm install.
- Scalability: For high-volume tracing, Tempo should be deployed in its distributed mode, with separate components (Distributor, Ingester, Querier, Compactor, etc.) scaled independently. This allows you to handle massive trace ingestion and querying loads.
- Storage Backend: Tempo supports various storage backends like AWS S3, Google Cloud Storage, Azure Blob Storage, and MinIO. For production, choose a robust, scalable, and cost-effective object storage solution. Avoid local storage for production deployments.
- High Availability: Ensure your Tempo deployment is highly available. This means running multiple replicas of each component and configuring appropriate anti-affinity rules in Kubernetes.
- Resource Limits: Set appropriate CPU and memory limits and requests for all Tempo components to prevent resource exhaustion and ensure stable operation. Use tools like Karpenter for Cost Optimization to efficiently manage node resources for your tracing infrastructure.
- Monitoring and Alerting: Integrate Tempo's metrics (available via Prometheus) into your monitoring system. Set up alerts for ingestion errors, high latency, or storage issues.
- Security:
- Network Policies: Implement Kubernetes Network Policies to restrict traffic to and from Tempo components, ensuring only authorized services can send traces or query data.
- Authentication/Authorization: Secure access to Grafana using appropriate authentication methods (e.g., OAuth, LDAP) and configure role-based access control (RBAC).
- Encryption: Ensure data in transit (e.g., between services and Tempo, or between Tempo components) is encrypted using TLS. Consider solutions like Cilium WireGuard Encryption for pod-to-pod traffic within your cluster.
- Trace Sampling: For very high-volume applications, sending every single trace might be prohibitively expensive or resource-intensive. Implement intelligent trace sampling strategies at the application level or using an OpenTelemetry Collector to reduce data volume while retaining valuable insights.
- OpenTelemetry Collector: Consider deploying an OpenTelemetry Collector as an intermediary. It can perform various functions like batching, filtering, transforming, and sending traces to Tempo, reducing the load on application services and providing more control over trace pipelines.
- Integration with Logs & Metrics: Leverage Grafana's capabilities to link traces with relevant logs (from Loki) and metrics (from Prometheus) for a holistic observability experience.
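To make the Collector and sampling points concrete, a minimal OpenTelemetry Collector pipeline that batches, probabilistically samples, and forwards traces to Tempo might look like this sketch (the endpoint and sampling percentage are assumptions to adapt):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                      # batch spans before export
  probabilistic_sampler:
    sampling_percentage: 10      # keep roughly 10% of traces

exporters:
  otlp:
    endpoint: tempo.default.svc.cluster.local:4317
    tls:
      insecure: true             # in-cluster traffic; use TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```

Applications then export to the Collector's address rather than Tempo directly, giving you one place to tune sampling and filtering.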
Troubleshooting
Here are common issues you might encounter and their solutions.
- Issue: Traces not appearing in Grafana.
Solution: This is the most common issue. Follow these steps:
- Check Application Logs: Look for errors in your application Pod logs related to OpenTelemetry or exporting traces. Ensure the `OTEL_EXPORTER_OTLP_ENDPOINT` is correctly set and reachable.
- Verify Tempo Service: Ensure the `tempo` service is running and its OTLP gRPC port (4317) is accessible from your application.
kubectl get svc tempo
kubectl exec -it <your-app-pod> -- curl -v telnet://tempo.default.svc.cluster.local:4317
The connection attempt should succeed (this assumes curl is available in the pod). If not, check network policies or the service definition.
- Check Tempo Pod Logs: Look for ingestion errors in the Tempo distributor/ingester logs.
kubectl logs tempo-tempo-0
In this single-StatefulSet deployment, the distributor and ingester run in the same container, so one logs command covers both.
- Grafana Data Source Configuration: Double-check the Tempo datasource URL in Grafana (`http://tempo.default.svc.cluster.local:3100`).
- Issue: Grafana UI not accessible via port-forward.
Solution: Ensure your `kubectl port-forward` command is correct and the Grafana Pod is `Running`.
kubectl get pods -l app.kubernetes.io/name=grafana
kubectl port-forward service/grafana 3000:80
Also, check for any local firewall rules blocking port 3000.
- Issue: "Failed to connect" error in Grafana when querying Tempo.
Solution: This usually means Grafana cannot reach the Tempo query endpoint.
- Verify the `tempo` service is running and healthy: `kubectl get svc tempo`.
- Ensure the `tempo` pod is healthy: `kubectl get pods -l app.kubernetes.io/name=tempo`.
- Check Grafana pod logs for connection errors: `kubectl logs <grafana-pod-name>`.
- If you're using a custom namespace, ensure the Tempo datasource URL in Grafana reflects that (e.g., `http://tempo.<namespace>.svc.cluster.local:3100`).
- Issue: High resource usage by Tempo components.
Solution:
- Ingester/Distributor: If trace volume is very high, these components will consume more resources. Consider implementing client-side sampling in your applications or using an OpenTelemetry Collector to pre-process and sample traces before sending them to Tempo.
- Storage: If your storage backend is slow or has high latency, Tempo's ingesters might buffer more data, leading to higher memory usage. Ensure your object storage is performant.
- Compactor: High CPU/memory for compactor might indicate large trace batches or inefficient compaction. Review Tempo's configuration for compaction settings.
- For production, transition to a fully distributed Tempo deployment to scale components independently.
- Issue: Traces are too short or missing spans.
Solution:
- Instrumentation: Double-check your application's instrumentation. Are all relevant functions and external calls being instrumented? For Python, ensure `FlaskInstrumentor().instrument_app(app)` and `RequestsInstrumentor().instrument()` (or equivalents for your language) are correctly called.
- Context Propagation: Ensure trace context is being propagated correctly across service boundaries. If you're making HTTP calls, the OpenTelemetry HTTP instrumentation should handle this automatically. If using other protocols (e.g., Kafka, gRPC), you might need explicit context propagation.
- Sampling: Are you accidentally sampling out too many traces or spans at the application level or in an OpenTelemetry Collector?
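To make the context-propagation point concrete: what OpenTelemetry's HTTP instrumentation forwards between services is a W3C `traceparent` header; other transports need it carried by hand. A small stdlib-only sketch of its layout (the example value comes from the W3C Trace Context specification):

```python
# W3C "traceparent" format: version-traceid-spanid-flags, all lowercase hex.
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    # trace IDs are 16 bytes (32 hex chars), span IDs 8 bytes (16 hex chars)
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # bit 0 = sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

If spans from two services never join into one trace, inspecting whether this header actually crosses the boundary is usually the fastest diagnostic.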
- Issue: Difficulty correlating traces with logs and metrics.
Solution:
- Standardized Identifiers: Ensure your logging and metric systems include `trace_id` and `span_id` (if applicable) in their output.
- Grafana Configuration: In Grafana, configure your Tempo (and Loki/Prometheus) datasources to link to one another, e.g. via the Tempo datasource's trace-to-logs and trace-to-metrics settings.
- Application Instrumentation: Your application must enrich logs and metrics with trace context. The OpenTelemetry SDKs usually provide utilities for this.
- Explore Grafana's Trace to Logs/Metrics linking features.
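On the Grafana side, the Tempo datasource's `jsonData` accepts a `tracesToLogsV2` block for this linking; a hedged provisioning sketch (the Loki `datasourceUid` and tag mapping are assumptions for your setup):

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo.default.svc.cluster.local:3100
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # UID of your Loki datasource
        spanStartTimeShift: "-5m"    # widen the log search window
        spanEndTimeShift: "5m"
        filterByTraceID: true
        tags:
          - key: service.name
            value: service_name      # map span attribute to Loki label
```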
FAQ Section
- What is distributed tracing and why do I need it?
Distributed tracing is a method used to monitor requests as they flow through a distributed system, such as a microservices architecture. It stitches together individual operations (spans) into a single end-to-end view (trace). You need it to understand the flow of requests, identify performance bottlenecks, debug errors across service boundaries, and optimize the overall performance of your distributed applications. Without it, pinpointing issues in a complex system can be incredibly challenging and time-consuming.
- How does Grafana Tempo differ from Jaeger or Zipkin?
Grafana Tempo is primarily a high-volume, cost-efficient trace storage backend. Unlike Jaeger or Zipkin, which are full-stack tracing solutions (including agents, collectors, and UIs), Tempo focuses specifically on ingesting and storing traces, relying on Grafana for visualization. It's designed to integrate seamlessly with OpenTelemetry, Loki (for logs), and Prometheus (for metrics) within the Grafana ecosystem, offering a unified observability platform. Tempo's architecture is optimized for object storage, making it very scalable and cost-effective for long-term trace retention.
- What is OpenTelemetry and how does it relate to Tempo?
OpenTelemetry is a CNCF project that provides a set of APIs, SDKs, and tools for instrumenting, generating, collecting, and exporting telemetry data (traces, metrics, and logs). It aims to standardize telemetry data collection across various languages and frameworks. Grafana Tempo is a backend that can natively ingest OpenTelemetry traces (via OTLP), making it an ideal storage solution for applications instrumented with OpenTelemetry. OpenTelemetry acts as the "source" of your trace data, and Tempo is where that data is stored and made queryable.
- How can I reduce the cost of storing traces in Tempo?
The primary cost driver for Tempo is usually the storage backend (e.g., S3). To reduce costs:
- Trace Sampling: Implement intelligent sampling strategies. Not every request needs to be fully traced. You can sample based on a fixed rate, error rate, or specific attributes. The OpenTelemetry Collector can help with advanced sampling.
- Shorten Retention: Configure Tempo to retain traces for a shorter period if not all historical data is needed.
- Attribute Filtering: Use the OpenTelemetry Collector to filter out unnecessary attributes from spans before they are sent to Tempo, reducing the size of each trace.
- Efficient Storage: Choose a cost-effective object storage tier if your cloud provider offers different options (e.g., infrequent access tiers for older traces).
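For the attribute-filtering point, the OpenTelemetry Collector's `attributes` processor can drop or transform bulky and sensitive span attributes before export; a sketch (the attribute keys are illustrative):

```yaml
processors:
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete               # never store credentials in traces
      - key: http.url
        action: hash                 # keep cardinality, drop the raw value
```

Add the processor to your traces pipeline alongside batching to shrink every span before it reaches Tempo.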
- Can I use Grafana Tempo with an existing service mesh like Istio?
Absolutely! Service meshes like Istio Ambient Mesh can automatically inject tracing headers and generate spans for inter-service communication, often using formats like Zipkin or Jaeger. You can configure Istio's telemetry to send these traces to an OpenTelemetry Collector, which then forwards them to Tempo. This allows you to get traces for both instrumented application code and the network layer provided by the mesh, offering a comprehensive view. The OpenTelemetry Collector can act as a bridge between Istio's native tracing output and Tempo's OTLP ingestion endpoint.
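A minimal Collector bridge for this scenario might accept Zipkin-format spans from the mesh and forward them to Tempo over OTLP (a sketch; endpoints assume the default namespace):

```yaml
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411           # point Istio's Zipkin tracer here

exporters:
  otlp:
    endpoint: tempo.default.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [zipkin]
      exporters: [otlp]
```

Since Tempo also ingests Zipkin directly on port 9411, the Collector is optional here, but it gives you a single place to sample and filter mesh-generated spans.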
Cleanup Commands
To remove all the components deployed in this tutorial:
helm uninstall grafana
helm uninstall tempo
kubectl delete deployment otel-flask-app
kubectl delete service otel-flask-app
kubectl delete pvc grafana # If persistence was enabled and the PVC was retained after uninstall
Next Steps / Further Reading
Congratulations! You've successfully deployed Grafana Tempo and visualized your first distributed traces. Here are some next steps to deepen your understanding and enhance your observability setup:
- Explore OpenTelemetry Collector: Learn how to deploy and configure the OpenTelemetry Collector to centralize trace collection, perform sampling, batching, and routing. This is a critical component for production tracing setups.
- Integrate with Prometheus and Loki: Configure Grafana to connect to Prometheus for metrics and Loki for logs. Then, explore Grafana's linking features to jump from a trace to relevant logs and metrics for a truly unified observability experience.
- Advanced Instrumentation: Instrument a more complex application, perhaps one with multiple services communicating via different protocols (HTTP, gRPC, Kafka) to see how traces propagate across these boundaries.
- Production Deployment of Tempo: Dive into the official Grafana Tempo documentation for details on deploying Tempo in a highly available and scalable distributed mode.
- Learn more about Kubernetes Networking: Understanding how services communicate is key to effective tracing. Check out our Kubernetes Network Policies: Complete Security Hardening Guide for deeper insights into network security and isolation.
- Cloud-native Tracing: Explore how cloud providers offer managed tracing services (e.g., AWS X-Ray, Google Cloud Trace) and how Tempo can complement or integrate with these.
Conclusion
Distributed tracing with Grafana Tempo is an indispensable tool for anyone operating modern, distributed applications on Kubernetes. By providing deep, end-to-end visibility into request flows, Tempo empowers teams to quickly identify and resolve performance issues, understand complex inter-service dependencies, and ultimately deliver a better user experience. While the initial setup might seem daunting, the insights gained are invaluable. Embrace tracing, and transform your debugging process from a frustrating guessing game into an efficient, data-driven investigation.