In the dynamic world of cloud-native applications, workloads rarely remain constant. Traffic spikes, batch jobs, and seasonal demand can quickly overwhelm a fixed number of pods, leading to degraded performance or even outages. Conversely, over-provisioning resources wastes valuable cloud budget.
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ answer to this challenge. It automatically scales the number of pods in a Deployment, ReplicaSet, StatefulSet, or ReplicationController based on observed metrics like CPU utilization, memory consumption, or even custom metrics from your applications. This tutorial will guide you through setting up and configuring HPA for various scenarios, from basic CPU and memory scaling to advanced custom metric integration, ensuring your applications always have the right amount of resources to meet demand.
By the end of this comprehensive guide, you’ll not only understand how HPA works but also be equipped to implement it effectively in your production environments. We’ll cover everything from deploying a sample application to configuring HPA with different metric sources, troubleshooting common issues, and best practices for robust auto-scaling. Let’s get scaling!
Prerequisites
Before we embark on our HPA journey, ensure you have the following:
- A running Kubernetes cluster: This can be a local cluster like Minikube or Kind, or a cloud-managed cluster (EKS, GKE, AKS).
- kubectl configured: Your kubectl command-line tool should be configured to connect to your Kubernetes cluster.
- Basic Kubernetes knowledge: Familiarity with Deployments, Services, Pods, and YAML syntax is essential.
- Metrics Server installed: For CPU and memory-based HPA, the Kubernetes Metrics Server must be running in your cluster. Most managed Kubernetes services include this by default. If not, you’ll need to install it.
- Prometheus and Prometheus Adapter (for custom metrics): If you plan to explore custom metrics, you’ll need a Prometheus instance scraping your application metrics and the Prometheus Adapter installed to expose these metrics to the HPA.
Step-by-Step Guide
1. Install Metrics Server (if not already present)
The Kubernetes Metrics Server is a cluster-wide aggregator of resource usage data. It collects CPU and memory metrics from Kubelets and exposes them via the Kubernetes API. The HPA relies on this data for CPU and memory-based scaling. If you’re on a managed Kubernetes service, it’s likely already installed. For local clusters or self-managed ones, you might need to install it.
You can check if the Metrics Server is running by attempting to get node or pod metrics. If this fails, you’ll need to install it. The installation typically involves applying a manifest from its GitHub repository.
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
# Expected output if Metrics Server is running:
# {
# "kind": "NodeMetricsList",
# "apiVersion": "metrics.k8s.io/v1beta1",
# "metadata": {
# "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes"
# },
# "items": [
# {
# "metadata": {
# "name": "minikube",
# "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/minikube",
# "creationTimestamp": "2023-10-27T10:00:00Z"
# },
# "timestamp": "2023-10-27T10:01:00Z",
# "window": "30s",
# "usage": {
# "cpu": "123m",
# "memory": "1234Mi"
# }
# }
# # ... more nodes
# ]
# }
# If it's not running or you get an error like "Error from server (NotFound): the server could not find the requested resource", install it:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# For Minikube, you can enable it directly:
# minikube addons enable metrics-server
Verify: After installation, wait a minute or two for the pods to start and metrics to propagate. Then, check if you can retrieve metrics.
kubectl get pod -n kube-system -l k8s-app=metrics-server
kubectl top nodes
kubectl top pods -A
# Expected output for `kubectl top nodes`:
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# minikube 123m 6% 1234Mi 12%
# Expected output for `kubectl top pods -A` (showing metrics for some pods):
# NAMESPACE NAME CPU(cores) MEMORY(bytes)
# kube-system coredns-6789f8947f-abcde 2m 8Mi
# kube-system metrics-server-67c65897c-fghij 4m 12Mi
# ...
2. Deploy a Sample Application
To demonstrate HPA, we need an application that can consume resources. We’ll use a simple Nginx deployment configured with resource requests. Resource requests are crucial because HPA uses them to calculate target utilization. Without requests, HPA cannot accurately determine the percentage of CPU or memory being consumed.
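To see why requests matter, note that utilization is always measured against the pod’s request, not its limit. A quick Python illustration (not part of Kubernetes itself):

```python
def utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """HPA measures resource utilization relative to the pod's *request*,
    not its limit: utilization = usage / request * 100."""
    return usage_millicores / request_millicores * 100

# A pod requesting 100m CPU and using 60m is at 60% utilization.
print(utilization_percent(60, 100))   # 60.0
# Usage can exceed the request (up to the limit), giving utilization above 100%.
print(utilization_percent(150, 100))  # 150.0
```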
This deployment will create three Nginx pods. We’ll also expose it via a ClusterIP service so we can easily access it from within the cluster for testing.
# hpa-demo-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: hpa-demo
labels:
app: hpa-demo
spec:
replicas: 3
selector:
matchLabels:
app: hpa-demo
template:
metadata:
labels:
app: hpa-demo
spec:
containers:
- name: hpa-demo
image: nginx:latest
ports:
- containerPort: 80
resources:
requests:
cpu: "100m" # Request 100 millicores of CPU
memory: "100Mi" # Request 100 MiB of memory
limits:
cpu: "200m" # Limit to 200 millicores of CPU
memory: "200Mi" # Limit to 200 MiB of memory
---
apiVersion: v1
kind: Service
metadata:
name: hpa-demo-service
spec:
selector:
app: hpa-demo
ports:
- protocol: TCP
port: 80
targetPort: 80
type: ClusterIP
Verify: Apply the manifest and check the deployment and service status.
kubectl apply -f hpa-demo-deployment.yaml
kubectl get deployment hpa-demo
kubectl get service hpa-demo-service
kubectl get pods -l app=hpa-demo
# Expected output:
# NAME READY UP-TO-DATE AVAILABLE AGE
# hpa-demo 3/3 3 3 Xs
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
# hpa-demo-service ClusterIP 10.96.X.X <none> 80/TCP Xs
# NAME READY STATUS RESTARTS AGE
# hpa-demo-7f7f7f7f7f-abcde 1/1 Running 0 Xs
# hpa-demo-7f7f7f7f7f-fghij 1/1 Running 0 Xs
# hpa-demo-7f7f7f7f7f-ijklm 1/1 Running 0 Xs
3. Configure HPA Based on CPU Utilization
This is the most common and straightforward HPA configuration. We’ll configure HPA to scale our hpa-demo deployment when the average CPU utilization across all pods exceeds a certain percentage of their requested CPU.
The HPA controller periodically fetches metrics from the Metrics Server. If the average CPU utilization of the target pods (relative to their CPU requests) crosses the threshold, the HPA calculates the desired number of replicas and updates the target deployment.
# hpa-cpu.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: hpa-cpu-demo
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: hpa-demo
minReplicas: 3 # Minimum number of pods
maxReplicas: 10 # Maximum number of pods
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50 # Target average CPU utilization at 50% of requested CPU
Explanation:
- scaleTargetRef: Specifies the resource HPA should scale (our hpa-demo deployment).
- minReplicas: The minimum number of pods HPA will maintain, even under zero load.
- maxReplicas: The maximum number of pods HPA will scale up to. This is a critical safeguard against runaway scaling.
- metrics: An array defining the metrics to watch.
- type: Resource: Indicates we’re using a built-in resource metric (CPU or memory).
- resource.name: cpu: Specifies CPU as the metric.
- target.type: Utilization: Means we’re targeting a percentage utilization.
- averageUtilization: 50: The HPA will try to keep the average CPU utilization across all pods at 50% of the requested CPU (100m in our deployment, so 50m).
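Under the hood, the HPA controller uses a simple formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max range. A small Python sketch of that calculation (an illustration, not the controller’s actual code):

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int, max_replicas: int) -> int:
    """Sketch of the HPA scaling formula:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue),
    clamped to the [minReplicas, maxReplicas] range."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(desired, max_replicas))

# 3 pods averaging 75% CPU against a 50% target -> ceil(3 * 75 / 50) = 5 pods
print(desired_replicas(3, 75, 50, min_replicas=3, max_replicas=10))  # 5
# 3 pods averaging 20% CPU -> ceil(3 * 20 / 50) = 2, clamped up to minReplicas
print(desired_replicas(3, 20, 50, min_replicas=3, max_replicas=10))  # 3
```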
Verify: Apply the HPA and watch its status.
kubectl apply -f hpa-cpu.yaml
kubectl get hpa hpa-cpu-demo --watch
# Expected initial output:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# hpa-cpu-demo Deployment/hpa-demo 0%/50% 3 10 3 Xs
The TARGETS column shows 0%/50%, meaning current CPU utilization is 0%, and the target is 50%. The REPLICAS column shows 3, matching our minReplicas.
Generate Load to Trigger HPA
Now, let’s generate some CPU load on our Nginx pods to see the HPA in action. We’ll use a temporary pod with wget to continuously hit our service, causing the Nginx pods to consume CPU.
# Create a busybox pod to generate load
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh
# Once inside the busybox pod, run this command:
# (Replace 'hpa-demo-service' with your service name if different)
# while true; do wget -q -O - http://hpa-demo-service; done
# To exit the busybox pod and stop generating load, press Ctrl+C and type 'exit'.
Verify: While the load generator is running, observe the HPA status in a separate terminal.
kubectl get hpa hpa-cpu-demo --watch
# You should see the TARGETS rise and REPLICAS increase:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# hpa-cpu-demo Deployment/hpa-demo 60%/50% 3 10 3 1m
# hpa-cpu-demo Deployment/hpa-demo 75%/50% 3 10 4 1m
# hpa-cpu-demo Deployment/hpa-demo 82%/50% 3 10 6 2m
# hpa-cpu-demo Deployment/hpa-demo 48%/50% 3 10 6 3m # HPA might scale down if load decreases
You can also check the number of pods in your deployment:
kubectl get deployment hpa-demo
# Observe the 'AVAILABLE' column increasing
Once you stop the load generator, the HPA will eventually scale down the pods back to minReplicas (3 in our case) after a cooldown period.
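That cooldown is the scale-down stabilization window: the controller keeps the replica recommendations computed over the window (5 minutes by default) and only scales down to the highest of them, so a brief dip in load doesn’t immediately remove pods. A rough Python illustration of the idea (not the controller’s actual implementation):

```python
def stabilized_scale_down(recent_recommendations: list[int]) -> int:
    """The HPA scales down only to the *highest* replica recommendation
    seen within the scale-down stabilization window, preventing flapping
    when load briefly dips."""
    return max(recent_recommendations)

# Recommendations over the last 5 minutes: load has dropped, but one
# recent sample still wanted 6 replicas, so the HPA stays at 6 for now.
print(stabilized_scale_down([6, 4, 3, 3, 3]))  # 6
```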
4. Configure HPA Based on Memory Utilization
Scaling based on memory utilization is similar to CPU, but it often requires a more careful approach. Unlike CPU, which can be throttled, exceeding memory limits usually leads to Out-Of-Memory (OOM) errors and pod termination. HPA can help prevent this by adding more pods before memory limits are hit.
For memory-based scaling, the target.type should be AverageValue instead of Utilization if you want to target an absolute memory value (e.g., 50Mi per pod). If you want to target a percentage of the requested memory, use Utilization.
# hpa-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: hpa-memory-demo
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: hpa-demo
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: memory
target:
type: AverageValue # Target an absolute memory value per pod
averageValue: 70Mi # Target average memory usage at 70 MiB per pod
# OR, if you want to target a percentage of requested memory (100Mi in our case):
#- type: Resource
# resource:
# name: memory
# target:
# type: Utilization
# averageUtilization: 70 # Target average memory utilization at 70% of requested memory
Explanation:
- We’ve changed resource.name to memory.
- We’re using target.type: AverageValue with averageValue: 70Mi. This means HPA will try to keep the average memory usage of each pod at 70 MiB. If the average exceeds this, HPA scales up; if it falls below, HPA scales down.
- If you chose Utilization instead, the target would be 70% of the 100Mi requested memory, which is also 70Mi. Both are valid; choose based on whether you prefer absolute or percentage targets.
Verify: Apply the HPA and check its status. For this example, we’ll assume we’re replacing the CPU HPA.
kubectl delete hpa hpa-cpu-demo # Delete previous HPA
kubectl apply -f hpa-memory.yaml
kubectl get hpa hpa-memory-demo --watch
# Expected initial output:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# hpa-memory-demo Deployment/hpa-demo 0/70Mi 3 10 3 Xs
Generate Memory Load (Example)
Generating memory load with Nginx is less straightforward than CPU. For a real application, you’d typically see memory usage increase with more active sessions or larger data sets. For a simple demonstration, we’ll exec directly into one of the hpa-demo pods and run a command that holds a large buffer in memory.
# Get the name of one of your hpa-demo pods:
POD_NAME=$(kubectl get pods -l app=hpa-demo -o jsonpath='{.items[0].metadata.name}')
# Run a command *inside* that pod to consume memory. This assumes the Nginx image
# provides the 'dd' command (the official Debian-based image does).
# dd allocates an input buffer of the block size, so bs=150M holds roughly 150 MiB
# for as long as the command runs; press Ctrl+C to stop it and release the memory.
kubectl exec -it $POD_NAME -- /bin/sh -c "dd if=/dev/zero of=/dev/null bs=150M"
# 150 MiB is above our 70Mi HPA target but below the 200Mi limit, so it should
# trigger scaling without an OOM kill. This is a hack for demonstration purposes;
# Nginx itself consumes little memory on its own. In a real scenario, you'd observe
# your application's memory usage, e.g. under high traffic with large responses.
Verify: Observe the HPA status and pod counts.
kubectl get hpa hpa-memory-demo --watch
kubectl get deployment hpa-demo
# You should see the TARGETS rise (e.g., from 0/70Mi to 150Mi/70Mi) and REPLICAS increase.
5. Configure HPA Based on Multiple Metrics (CPU & Memory)
HPA can scale based on multiple metrics simultaneously. When multiple metrics are defined, HPA calculates the desired number of replicas for each metric independently and then takes the maximum of these desired replica counts. This ensures that the application scales up if *any* of the configured metrics exceed their target.
# hpa-multi-metric.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: hpa-multi-metric-demo
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: hpa-demo
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50 # Target 50% CPU utilization
- type: Resource
resource:
name: memory
target:
type: AverageValue
averageValue: 70Mi # Target 70 MiB average memory usage
Explanation:
- We’ve simply added both the CPU and memory metric definitions under the metrics array.
- If CPU utilization hits 60% (above the 50% target), HPA will calculate a scale-up.
- If memory hits 80Mi (above 70Mi target), HPA will calculate a scale-up.
- The HPA will then choose the higher of the two calculated replica counts to scale the deployment.
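The max-over-metrics rule can be sketched in a few lines of Python (an illustration only, not the controller’s code):

```python
import math

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    # Per-metric HPA formula: ceil(currentReplicas * current / target)
    return math.ceil(current_replicas * current / target)

def combined_desired(current_replicas: int, metrics: list[tuple[float, float]],
                     min_replicas: int, max_replicas: int) -> int:
    # Each (current, target) pair yields its own recommendation;
    # the HPA scales the target to the maximum of them.
    desired = max(desired_for_metric(current_replicas, c, t) for c, t in metrics)
    return max(min_replicas, min(desired, max_replicas))

# CPU at 60% (target 50%) wants ceil(3*60/50)=4 replicas; memory at 80Mi
# (target 70Mi) wants ceil(3*80/70)=4 -> the deployment scales to 4.
print(combined_desired(3, [(60, 50), (80, 70)], 3, 10))  # 4
```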
Verify: Apply the HPA and observe its status.
kubectl delete hpa hpa-memory-demo # Delete previous HPA
kubectl apply -f hpa-multi-metric.yaml
kubectl get hpa hpa-multi-metric-demo --watch
# Expected initial output:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# hpa-multi-metric-demo Deployment/hpa-demo 0%/50%, 0/70Mi 3 10 3 Xs
You’ll see two targets listed in the TARGETS column. You can then generate CPU or memory load independently, and the HPA should react to whichever metric requires scaling.
6. Configure HPA Based on Custom Metrics
Resource metrics (CPU, memory) are great, but many applications have other critical performance indicators. For example, a message queue processing application might need to scale based on the number of messages in a queue, or a web service based on requests per second. This is where custom metrics come in.
To use custom metrics, you typically need:
- An application that exposes custom metrics (e.g., via Prometheus exposition format).
- A Prometheus instance scraping these metrics.
- The Prometheus Adapter installed in your cluster, which translates Prometheus queries into a format the HPA can consume via the custom.metrics.k8s.io API.
We’ll assume you have Prometheus and Prometheus Adapter already set up. If not, refer to their respective documentation for installation.
Let’s imagine our Nginx application (for demonstration purposes) exposed a metric called nginx_http_requests_total, and we want to scale based on the average number of requests per second per pod.
# hpa-custom-metric.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: hpa-custom-metric-demo
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: hpa-demo
minReplicas: 3
maxReplicas: 10
metrics:
- type: Pods # Or Object, depending on the metric
pods:
metric:
name: http_requests_per_second # The name of your custom metric as exposed by Prometheus Adapter
target:
type: AverageValue
averageValue: 1000m # Target 1 request per second per pod (1000 milli-requests)
Explanation:
- type: Pods: Indicates that this is a custom metric aggregated across pods. If you have a single metric for a specific object (like a Service or Ingress), you might use type: Object.
- pods.metric.name: http_requests_per_second: The name of the custom metric that the Prometheus Adapter exposes. The adapter’s configuration maps Prometheus queries to these metric names. For example, a Prometheus query like sum(rate(nginx_http_requests_total[1m])) by (pod) might be exposed as http_requests_per_second.
- target.type: AverageValue: We’re targeting an absolute average value per pod.
- averageValue: 1000m: Kubernetes’ internal representation for fractional values; 1000m means 1. This HPA will aim for 1 request per second per pod, scaling up if the average goes above that.
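Kubernetes quantity suffixes can be confusing: m means milli, so 1000m is 1 and 500m is 0.5. A tiny Python helper to illustrate the conversion (not a Kubernetes library; it handles only plain numbers and the milli suffix, not binary suffixes like Mi):

```python
def parse_milli_quantity(q: str) -> float:
    """Convert a Kubernetes-style quantity with an optional milli
    suffix ('m') into a plain float: '1000m' -> 1.0, '500m' -> 0.5.
    Binary suffixes such as 'Mi' are intentionally not handled here."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000.0
    return float(q)

print(parse_milli_quantity("1000m"))  # 1.0  (1 request per second per pod)
print(parse_milli_quantity("500m"))   # 0.5
print(parse_milli_quantity("2"))      # 2.0
```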
Verify: Apply the HPA and observe its status.
kubectl delete hpa hpa-multi-metric-demo # Delete previous HPA
kubectl apply -f hpa-custom-metric.yaml
kubectl get hpa hpa-custom-metric-demo --watch
# Expected initial output (assuming no traffic and Prometheus Adapter is configured):
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# hpa-custom-metric-demo Deployment/hpa-demo 0/1 3 10 3 Xs
To test this, you would need to generate HTTP traffic to your Nginx service, ensure Prometheus is scraping the Nginx metrics, and confirm the Prometheus Adapter is correctly configured to expose http_requests_per_second.
Example Prometheus Adapter config snippet (conceptual, not a full install):
# Part of a Prometheus Adapter configmap
rules:
- seriesQuery: '{__name__="nginx_http_requests_total",container="nginx"}'
resources:
overrides:
kubernetes_pod_name: {resource: "pod"}
kubernetes_namespace: {resource: "namespace"}
name:
matches: "nginx_http_requests_total"
as: "http_requests_per_second"
metricsQuery: 'sum by (pod) (rate(<<.Series>>{<<.LabelMatchers>>}[1m]))'
# This rule would expose 'http_requests_per_second' aggregated by pod.
Production Considerations
- Resource Requests and Limits are Crucial: HPA for CPU and memory relies on resource requests. Without them, HPA cannot calculate utilization percentages. Always set appropriate requests and limits for your pods.
- Cooldown and Stabilization Windows: HPA has default stabilization windows (no delay for scale-up, 5 minutes for scale-down). These prevent flapping (rapid scaling up and down). You can tune them via the behavior field in autoscaling/v2 if needed for specific use cases.
- Choose the Right Metrics:
- CPU: Good for stateless applications where CPU scales linearly with load.
- Memory: Use carefully. Spikes can cause OOM kills before HPA reacts. Often better to scale based on CPU or custom metrics that correlate with memory pressure.
- Custom Metrics: The most flexible. Scale on business-specific metrics like queue length, concurrent users, or transactions per second. These often provide a more accurate signal for application health and performance.
- Min and Max Replicas: Set these wisely. minReplicas ensures your application has baseline capacity, even during low traffic. maxReplicas prevents runaway costs and resource exhaustion in case of unexpected load or misconfigured metrics.
- Metrics Server Latency: Be aware that Metrics Server data has a slight delay. HPA reactions won’t be instantaneous.
- Thorough Testing: Always test your HPA configurations under realistic load conditions in a staging environment before deploying to production. Simulate peak loads and sudden drops.
- Combine with Vertical Pod Autoscaler (VPA): For optimal resource management, HPA (scaling out) can be combined with VPA (scaling up/down individual pod resources). However, they cannot directly manage the same pods simultaneously in a fully automated way for resource requests/limits. One common strategy is to use VPA in “recommender” mode to get optimal resource requests, then manually apply those to your deployments, and use HPA for horizontal scaling.
- Monitoring and Alerting: Monitor HPA events, current replica counts, and target metrics. Set up alerts if pods are consistently at maxReplicas or if scaling events are not happening as expected.
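As an illustration of tuning those stabilization windows, the behavior field in autoscaling/v2 lets you override the defaults; the values below are a sketch, not recommendations for any particular workload:

```yaml
# Optional 'behavior' section inside an autoscaling/v2 HorizontalPodAutoscaler spec
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to load spikes immediately (the default)
    policies:
    - type: Percent
      value: 100                     # at most double the replica count...
      periodSeconds: 60              # ...per minute
  scaleDown:
    stabilizationWindowSeconds: 300  # require 5 minutes of low load (the default)
    policies:
    - type: Pods
      value: 1                       # remove at most one pod...
      periodSeconds: 120             # ...every two minutes
```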
Troubleshooting
- HPA status shows <unknown>/<target> or <not available> for TARGETS.
Issue: HPA cannot retrieve metrics for the target. This is common for CPU/memory metrics if the Metrics Server is not running or healthy, and for custom metrics if the Prometheus Adapter is misconfigured.
Solution:
- For Resource Metrics (CPU/Memory):
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | head
kubectl top pods -A
kubectl get pod -n kube-system -l k8s-app=metrics-server # Check Metrics Server pod status
kubectl logs -n kube-system -l k8s-app=metrics-server # Check Metrics Server logs
Ensure the Metrics Server pods are Running and their logs show no errors. Check for firewall rules blocking communication between Kubelets and the Metrics Server, or between the HPA controller and the Metrics Server.
- For Custom Metrics:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" # Check custom metrics API availability
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second" # Replace with your namespace and metric name
kubectl get pod -n <prometheus-adapter-namespace> -l app=prometheus-adapter # Check Prometheus Adapter pod status
kubectl logs -n <prometheus-adapter-namespace> -l app=prometheus-adapter # Check Prometheus Adapter logs
Verify the Prometheus Adapter is running and its logs indicate successful metric discovery and exposure. Ensure Prometheus is scraping your application metrics correctly.
- HPA is not scaling up even when metrics are high.
Issue: The REPLICAS count stays at minReplicas despite TARGETS being above the threshold.
Solution:
- Check HPA events: Run kubectl describe hpa <hpa-name> and look at the “Events” section. It will often explain why scaling isn’t happening (e.g., “too few pods available”, “stabilization window active”, “failed to get cpu utilization”).
- Verify resource requests: For CPU/memory HPA, ensure your deployment’s pods have resources.requests.cpu and resources.requests.memory defined. Without requests, HPA cannot calculate utilization.
- Check minReplicas and maxReplicas: Ensure the current replica count is not already at maxReplicas, and that minReplicas is not equal to maxReplicas.
- Insufficient resources: Your cluster might not have enough available nodes or resources to schedule new pods. Check node statuses.
- Metrics are not high enough: Double-check the actual metric values against your HPA target.
- HPA is not scaling down even when metrics are low.
Issue: The REPLICAS count stays high despite TARGETS being below the threshold.
Solution:
- Check HPA events: Run kubectl describe hpa <hpa-name> and look for “stabilization window active” events. HPA has a default 5-minute scale-down stabilization window to prevent rapid scaling fluctuations; you might simply need to wait longer.
- Minimum replicas: If the current replica count is already at minReplicas, HPA cannot scale down further.
- Multiple metrics: If using multiple metrics, another metric might still be above its threshold, preventing a scale-down. HPA scales down only if *all* metrics are below their targets.
- HPA scales to maxReplicas too quickly or not high enough.
Issue: The HPA is too aggressive or too conservative.
Solution:
- Adjust target utilization/value: Lowering averageUtilization or averageValue makes HPA more aggressive (it scales up sooner); raising it makes HPA more conservative.
- Tune minReplicas and maxReplicas: Ensure these ranges are appropriate for your application’s expected load.
- Review stabilization windows: For very bursty traffic, you might adjust behavior.scaleUp.stabilizationWindowSeconds or behavior.scaleDown.stabilizationWindowSeconds (though this is less common).
- Metric accuracy: Ensure your metrics accurately reflect the actual load on your application.