Introduction
Deploying new versions of applications is a critical, yet often nerve-wracking, operation in the world of cloud-native development. The fear of downtime, service interruptions, or a broken user experience can keep even the most seasoned engineers on edge. Traditional deployment strategies often involve taking services offline, leading to frustrating outages and lost revenue. In today’s always-on world, such disruptions are simply unacceptable.
Enter Kubernetes Rolling Updates – a powerful, built-in mechanism designed to facilitate zero-downtime deployments. By carefully orchestrating the replacement of old application instances with new ones, Kubernetes ensures that your users never experience a hiccup. This guide will delve deep into the mechanics of rolling updates, demonstrating how to leverage them effectively to achieve seamless, resilient application deployments. We’ll explore the underlying concepts, practical implementations, and best practices to keep your services humming without interruption.
TL;DR: Kubernetes Rolling Updates
Kubernetes Rolling Updates enable zero-downtime deployments by gradually replacing old Pods with new ones. Key parameters like maxSurge and maxUnavailable control the pace and impact of the update. Always define readiness and liveness probes for robust deployments.
Key Commands:
# Create a deployment
kubectl apply -f deployment-v1.yaml
# Update an image (triggers rolling update)
kubectl set image deployment/my-app my-app=nginx:1.21.0
# Check deployment status
kubectl rollout status deployment/my-app
# View rollout history
kubectl rollout history deployment/my-app
# Undo the last rollout
kubectl rollout undo deployment/my-app
Prerequisites
To follow this guide, you’ll need the following:
- A Kubernetes Cluster: Access to a functional Kubernetes cluster (e.g., Minikube, Kind, or a cloud-managed cluster like GKE, EKS, AKS).
- kubectl: The Kubernetes command-line tool, configured to connect to your cluster. You can find installation instructions in the official Kubernetes documentation.
- Basic Kubernetes Knowledge: Familiarity with fundamental Kubernetes concepts like Pods, Deployments, Services, and YAML manifests.
- Text Editor: Any text editor for creating Kubernetes manifest files.
Step-by-Step Guide: Implementing Zero-Downtime Rolling Updates
Step 1: Create an Initial Deployment (v1)
First, let’s create a basic NGINX deployment. This deployment will serve as our initial application version (v1). We’ll expose it via a Service to make it accessible.
A Kubernetes Deployment manages a set of identical Pods, ensuring that a specified number of replicas are always running. When we perform a rolling update, the Deployment controller is responsible for orchestrating the creation of new Pods and the termination of old ones. We’ll also define a Service of type NodePort so we can easily access our application from outside the cluster.
# deployment-v1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        version: v1
    spec:
      containers:
      - name: my-app
        image: nginx:1.20.0 # Our initial version
        ports:
        - containerPort: 80
        readinessProbe: # Essential for rolling updates
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe: # Ensures healthy pods are kept running
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 20
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: NodePort
Apply these manifests to your cluster:
kubectl apply -f deployment-v1.yaml
kubectl apply -f service.yaml
Verify Step 1
Check if the Deployment and Service are running correctly. You should see three Pods with the nginx:1.20.0 image.
kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
my-app 3/3 3 3 2m
kubectl get pods -l app=my-app
NAME READY STATUS RESTARTS AGE
my-app-76d75574c8-2pghx 1/1 Running 0 2m
my-app-76d75574c8-7c8lq 1/1 Running 0 2m
my-app-76d75574c8-wxcf9 1/1 Running 0 2m
kubectl describe service my-app-service | grep NodePort
NodePort: <unset> 30000/TCP
Note: The specific NodePort might vary. You can access your application using http://<your-node-ip>:<NodePort>. For Minikube, use minikube service my-app-service --url.
Step 2: Perform a Rolling Update to v2
Now, let’s update our application to a new version (v2). We’ll change the NGINX image from 1.20.0 to 1.21.0. Kubernetes will automatically trigger a rolling update when it detects a change in the Pod template of the Deployment.
During a rolling update, Kubernetes ensures that the application remains available. It does this by creating new Pods with the updated configuration before terminating the old ones. The pace and overlap of this process are controlled by the strategy field within the Deployment spec, specifically maxSurge and maxUnavailable. We’ll discuss these in more detail in the next step.
You can update the image directly using kubectl set image or by modifying your YAML file and reapplying it. We’ll use kubectl set image for simplicity.
kubectl set image deployment/my-app my-app=nginx:1.21.0
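Alternatively, the same rollout can be triggered declaratively by editing the image tag in your manifest and reapplying it; any change to the Pod template (spec.template) triggers a rolling update. A sketch of the relevant fragment:

```yaml
# In deployment-v1.yaml, change only the image tag, then run:
#   kubectl apply -f deployment-v1.yaml
    spec:
      containers:
      - name: my-app
        image: nginx:1.21.0 # was nginx:1.20.0; this Pod-template change triggers the rollout
```

The declarative route is usually preferred in version-controlled setups, since the manifest in your repository stays the source of truth.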
Verify Step 2
Monitor the rollout status. You’ll see new Pods being created and old ones terminating. The UP-TO-DATE count will gradually increase until all Pods are running the new version.
kubectl rollout status deployment/my-app
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "my-app" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "my-app" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "my-app" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "my-app" rollout to finish: 2 of 3 updated replicas are available...
deployment "my-app" successfully rolled out
kubectl get pods -l app=my-app
NAME READY STATUS RESTARTS AGE
my-app-699778749d-btv66 1/1 Running 0 1m
my-app-699778749d-j6x7b 1/1 Running 0 1m
my-app-699778749d-pfj2p 1/1 Running 0 1m
Notice the new Pod names and the new image version. The old Pods are gone. You can also check the rollout history:
kubectl rollout history deployment/my-app
deployment.apps/my-app
REVISION CHANGE-CAUSE
1 <none>
2 kubectl set image deployment/my-app my-app=nginx:1.21.0
Step 3: Understanding Rolling Update Strategy Parameters
The core of zero-downtime rolling updates lies in the strategy field of the Deployment spec. By default, Kubernetes uses RollingUpdate with specific values for maxSurge and maxUnavailable.
- maxUnavailable: An optional field that specifies the maximum number of Pods that can be unavailable during the update process. It can be an absolute number (e.g., 1) or a percentage (e.g., 25%); percentages are rounded down to the nearest whole Pod. Setting it to 0 (or 0%) means no Pods may be unavailable at any time, which requires maxSurge to be greater than zero.
- maxSurge: An optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. It can be an absolute number (e.g., 1) or a percentage (e.g., 25%); percentages are rounded up. This allows Kubernetes to bring up new Pods before taking down old ones, ensuring continuous service availability.
The default values are maxUnavailable: 25% and maxSurge: 25%. This means during an update, Kubernetes will ensure that at most 25% of your Pods are unavailable, and it can create up to 25% more Pods than your desired replica count.
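As a quick worked example with our replicas: 3 (keeping in mind that maxSurge percentages round up while maxUnavailable percentages round down):

```yaml
# Effective limits for replicas: 3 with the default strategy values:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # 3 * 0.25 = 0.75, rounded up   -> at most 1 extra Pod (4 total)
    maxUnavailable: 25%  # 3 * 0.25 = 0.75, rounded down -> 0 Pods may be unavailable
```

In other words, with three replicas the defaults behave like maxSurge: 1, maxUnavailable: 0 — Kubernetes adds one new Pod, waits for it to become ready, then removes an old one.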
Let’s modify our deployment to be more aggressive with maxSurge and less aggressive with maxUnavailable:
# deployment-v3.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # At most 1 Pod unavailable at any time
      maxSurge: 2       # Can create up to 2 extra Pods
  template:
    metadata:
      labels:
        app: my-app
        version: v3
    spec:
      containers:
      - name: my-app
        image: nginx:1.22.0 # Our new version
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 20
Apply this updated manifest:
kubectl apply -f deployment-v3.yaml
Verify Step 3
Observe the rollout. With maxSurge: 2 and maxUnavailable: 1, you might briefly see up to 5 Pods (3 desired + 2 surge) during the transition, and a minimum of 2 Pods (3 desired – 1 unavailable) will always be running the old or new version. The rollout should still be smooth, but perhaps faster due to the higher surge capacity.
kubectl rollout status deployment/my-app
Waiting for deployment "my-app" rollout to finish: 1 out of 3 new replicas have been updated...
...
deployment "my-app" successfully rolled out
kubectl get pods -l app=my-app
NAME READY STATUS RESTARTS AGE
my-app-85db96979d-2k3d4 1/1 Running 0 1m
my-app-85db96979d-l5m6q 1/1 Running 0 1m
my-app-85db96979d-q8w9p 1/1 Running 0 1m
Check the history again:
kubectl rollout history deployment/my-app
deployment.apps/my-app
REVISION CHANGE-CAUSE
1 <none>
2 kubectl set image deployment/my-app my-app=nginx:1.21.0
3 <none> # Change cause is <none> because we applied the entire manifest, not just `set image`
Step 4: Rollback to a Previous Version
One of the most powerful features of Kubernetes Deployments is the ability to easily roll back to a previous stable version if an issue is discovered with the new deployment. This capability is crucial for maintaining high availability and reducing the impact of faulty deployments.
If your v3 deployment (nginx:1.22.0) has issues, you can quickly revert to v2 (nginx:1.21.0) or even v1 (nginx:1.20.0).
To roll back to the immediately previous revision:
kubectl rollout undo deployment/my-app
To roll back to a specific revision (e.g., revision 1):
kubectl rollout undo deployment/my-app --to-revision=1
Verify Step 4
Monitor the rollback process, which is essentially another rolling update in reverse. Verify that the Pods are now running the image from the target revision.
kubectl rollout status deployment/my-app
Waiting for deployment "my-app" rollout to finish: 2 of 3 updated replicas are available...
deployment "my-app" successfully rolled out
kubectl get pods -l app=my-app
NAME READY STATUS RESTARTS AGE
my-app-699778749d-abcde 1/1 Running 0 1m # This will be the v2 image (nginx:1.21.0) if we rolled back from v3
my-app-699778749d-fghij 1/1 Running 0 1m
my-app-699778749d-klmno 1/1 Running 0 1m
kubectl rollout history deployment/my-app
deployment.apps/my-app
REVISION CHANGE-CAUSE
1 <none>
3 <none>
4 kubectl set image deployment/my-app my-app=nginx:1.21.0
Note: rolling back to revision 2 renumbers it as revision 4, so revision 2 disappears from the history and its change-cause moves to the new revision.
You can also inspect the details of a specific revision:
kubectl rollout history deployment/my-app --revision=4
deployment.apps/my-app with revision 4
Pod Template:
  Labels: app=my-app
    pod-template-hash=699778749d
    version=v1
  Containers:
   my-app:
    Image: nginx:1.21.0
    Port: 80/TCP
    Host Port: 0/TCP
    Liveness: http-get http://:80/ delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness: http-get http://:80/ delay=5s timeout=1s period=5s #success=1 #failure=3
  Volumes: <none>
Production Considerations
While rolling updates provide a robust mechanism for zero-downtime deployments, several factors need careful consideration in a production environment:
- Readiness and Liveness Probes: These are absolutely crucial.
- Readiness Probes tell Kubernetes when a Pod is ready to serve traffic. A Pod will not be considered “ready” and added to the Service’s endpoints until its readiness probe succeeds. This prevents traffic from being routed to an uninitialized or broken Pod during a rollout.
- Liveness Probes tell Kubernetes when a Pod is unhealthy and should be restarted. This ensures that even after a successful rollout, your application remains responsive.
Without proper probes, your "zero-downtime" deployment could still lead to service degradation. For advanced networking and traffic management in production, consider tools like Istio Ambient Mesh or the Kubernetes Gateway API for more granular control.
- Resource Requests and Limits: Define appropriate CPU and memory requests and limits for your containers. During a rolling update, particularly with maxSurge, your cluster might temporarily need more resources. Insufficient resources can lead to Pods failing to schedule or being evicted, hindering the rollout. For optimizing node resource utilization and cost, tools like Karpenter can be invaluable.
- Pod Disruption Budgets (PDBs): For critical applications, PDBs ensure that a minimum number of Pods remains available during voluntary disruptions (like rolling updates or node drains). This adds an extra layer of safety on top of maxUnavailable, especially when combined with cluster auto-scaling or maintenance operations. Refer to the official PDB documentation.
- Pre-Stop Hooks and Graceful Shutdown: Applications should handle SIGTERM signals gracefully. Kubernetes sends a SIGTERM to a container before terminating it. A pre-stop hook can ensure that the application finishes processing in-flight requests and closes connections before shutting down, preventing data loss and errors during Pod termination.
- Monitoring and Alerting: Implement robust monitoring and alerting for your deployments. Track key metrics like error rates, latency, and resource utilization, and set up alerts to notify you immediately if a new deployment degrades performance or increases error rates. Tools leveraging eBPF observability with Hubble can provide deep insight into network and application behavior during rollouts.
- Immutable Deployments: Always use immutable container images. Never patch an existing image; each new deployment should use a new, uniquely tagged image (e.g., nginx:1.21.0, not nginx:latest). This ensures reproducibility and simplifies rollbacks.
- Testing Strategy: Beyond basic readiness/liveness probes, implement comprehensive integration and end-to-end tests against your deployed application. Automated canary or blue/green deployments (which can be built on top of rolling updates or with sophisticated traffic management) can further reduce risk by exposing new versions to a small subset of users first.
- Network Policies: Ensure your Kubernetes Network Policies allow communication between old and new Pods, and any external services, during the transition period. Misconfigured policies can block traffic and cause rollout failures.
- Security Considerations: Regularly scan your container images for vulnerabilities. Integrating tools like Sigstore and Kyverno can enforce image signing and policy-based admission control, adding a layer of trust to your deployment pipeline.
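To make the graceful-shutdown advice above concrete, here is a minimal sketch of a Pod spec fragment. The 5-second preStop sleep is an illustrative buffer, not a universal value; tune it to how long endpoint removal takes to propagate in your cluster:

```yaml
# Illustrative fragment for the my-app container: give in-flight requests
# time to drain before SIGTERM, and bound the total shutdown time.
spec:
  terminationGracePeriodSeconds: 30       # max time between SIGTERM and SIGKILL
  containers:
  - name: my-app
    image: nginx:1.22.0
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"] # let endpoint removal propagate first
```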
Troubleshooting
- Issue: Rollout Stuck (Pending/Waiting)
  Symptom: kubectl rollout status deployment/my-app shows the rollout is stuck; new Pods are not starting or remain in a Pending state.
  Solution:
  - Check Pod Events: Use kubectl describe pod <new-pod-name> to see events. Look for clues like FailedScheduling (insufficient resources), ImagePullBackOff (bad image name/tag or registry issues), or CrashLoopBackOff (application failing to start).
  - Check Logs: If Pods start but then crash, inspect their logs: kubectl logs <new-pod-name>.
  - Resource Constraints: Ensure your cluster has enough resources (CPU, memory, GPU if applicable) for new Pods, especially with maxSurge.
  - Image Pull Issues: Verify the image name and tag are correct and that your cluster can access the image registry (e.g., the correct image pull secret is configured).
- Issue: Application Downtime During Rollout
  Symptom: Users experience errors or service unavailability during the update, despite using rolling updates.
  Solution:
  - Readiness Probes: The most common cause. Ensure your readiness probe is correctly configured and accurately reflects when your application is truly ready to serve traffic; if it passes too eagerly, traffic may be routed to unready Pods.
  - Graceful Shutdown: Applications must handle SIGTERM signals and shut down gracefully within terminationGracePeriodSeconds; otherwise connections may be abruptly cut.
  - maxUnavailable Setting: If maxUnavailable is too high (e.g., 100%), it defeats the purpose of zero downtime. Review and adjust based on your service's tolerance.
  - Service Selector: Ensure your Service's selector matches the labels of both old and new Pods during the transition.
- Issue: Rollback Fails or Is Slow
  Symptom: kubectl rollout undo doesn't work, or the rollback itself gets stuck.
  Solution:
  - Check History: Verify the revision number you're rolling back to exists: kubectl rollout history deployment/my-app.
  - Resource Issues: Like a forward rollout, a rollback can be blocked by resource constraints if the target revision's Pods can't be scheduled.
  - Image Access: Ensure the image for the target revision is still available in the registry.
  - Application Health: If the previous version was itself unhealthy, rolling back to it won't solve the problem. Check the logs and events of the Pods from the target revision.
- Issue: High CPU/Memory Usage During Rollout
  Symptom: Cluster resources spike, or nodes become overloaded during a rolling update.
  Solution:
  - maxSurge Adjustment: A high maxSurge value means more Pods (old and new) run concurrently. Reduce maxSurge to limit the temporary resource increase.
  - Resource Requests/Limits: Ensure your containers have appropriate resource requests. If requests are too low, the scheduler may pack too many Pods onto a single node, causing resource contention.
  - Cluster Autoscaling: If your cluster supports it, ensure the Cluster Autoscaler can provision new nodes quickly enough to absorb the surge in demand.
- Issue: Old Pods Not Terminating
  Symptom: After a rollout, old-version Pods remain in a Terminating state indefinitely.
  Solution:
  - Pre-Stop Hooks: If a pre-stop hook takes too long or gets stuck, it can block Pod termination. Debug the hook's logic.
  - terminationGracePeriodSeconds: If your application needs more time to shut down gracefully, increase terminationGracePeriodSeconds in the Pod spec.
  - External Dependencies: Sometimes external dependencies (e.g., persistent connections that are never closed) hold up a Pod's termination.
  - Network Policy Issues: In rare cases, a restrictive Network Policy can prevent a terminating Pod from reaching services it needs to shut down cleanly.
FAQ Section
- What is the difference between a Rolling Update and Blue/Green or Canary deployments?
  Rolling Update: Kubernetes' default strategy, gradually replacing old Pods with new ones. It is built in and simple to use, offering zero downtime but exposing the new version directly to all users as it rolls out.
  Blue/Green Deployment: Requires two identical environments (blue for old, green for new). Traffic is switched instantly from blue to green once the green environment is fully tested. It offers immediate rollback but temporarily doubles resource consumption.
  Canary Deployment: A new version (the canary) is deployed to a small subset of users, often controlled by an Ingress controller or service mesh (like Istio Ambient Mesh or the Kubernetes Gateway API). If stable, traffic is gradually shifted to the new version. It minimizes risk but requires more sophisticated traffic management. Rolling updates are the foundation on which more advanced strategies like canary deployments can be built.
- Why are readiness and liveness probes so important for rolling updates?
  Readiness probes tell Kubernetes when a new Pod is truly ready to receive traffic. Without them, Kubernetes might route traffic to a Pod that hasn't finished initializing, leading to errors. Liveness probes ensure that even after a Pod starts, it remains healthy; if it becomes unhealthy, Kubernetes restarts it, preventing a broken Pod from lingering and affecting service availability. They are the guardians of zero downtime.
- Can I pause a rolling update?
  Yes, using kubectl rollout pause deployment/<deployment-name>. This is useful if you detect an issue midway through a rollout and want to halt it for investigation without immediately rolling back. To resume, use kubectl rollout resume deployment/<deployment-name>.
- What happens if a rolling update fails?
  If a rolling update fails (e.g., new Pods repeatedly crash, or readiness probes never succeed), Kubernetes stops the rollout. It keeps enough old Pods running to satisfy maxUnavailable, but it won't continue deploying the problematic new version. Investigate the cause of the failure (e.g., via kubectl describe pod and kubectl logs) and either fix the issue in a new deployment or run kubectl rollout undo to return to a stable version.
- How do I ensure my database migrations are handled safely during a rolling update?
  Database migrations are a complex topic that requires careful planning. For schema changes, a common strategy is to make them backward-compatible: deploy the new application version that works with both the old and new schema, then perform the database migration, and finally deploy another version that drops support for the old schema. Never run destructive database migrations in your application Pod's startup script during a rolling update; if the rollout fails or rolls back, you risk data inconsistency. Consider using Kubernetes Jobs for migrations or specialized database migration tools.
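As a hedged sketch of the Job-based migration approach (the image name and migrate command below are hypothetical placeholders, not part of this tutorial's app):

```yaml
# Run schema migrations once, before rolling out the app version that needs them.
# "my-app-migrations:2.0.0" and "./migrate" are hypothetical placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-app-migrate
spec:
  backoffLimit: 2            # retry transient failures, then surface the error
  template:
    spec:
      restartPolicy: Never   # never loop forever on a broken migration
      containers:
      - name: migrate
        image: my-app-migrations:2.0.0
        command: ["./migrate", "--backward-compatible"]
```

Running the Job to completion before applying the Deployment update keeps the migration decoupled from Pod startup, so a failed rollout never leaves the schema half-applied.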
Cleanup Commands
To remove the resources created during this tutorial:
kubectl delete -f deployment-v1.yaml
kubectl delete -f service.yaml
# Or simply delete the deployment and service by name
kubectl delete deployment my-app
kubectl delete service my-app-service
Next Steps / Further Reading
- Advanced Deployment Strategies: Explore Blue/Green and Canary deployments. These often leverage Ingress controllers or Service Meshes like Istio or Linkerd.
- Pod Disruption Budgets: Dive deeper into Pod Disruption Budgets for enhanced availability guarantees.
- Helm Charts: Learn how to package and deploy applications using Helm, which simplifies managing Kubernetes manifests and their upgrades.
- CI/CD Pipelines: Integrate rolling updates into your Continuous Integration/Continuous Deployment (CI/CD) pipeline for automated, reliable deployments.
- Kubernetes API Reference: Consult the official Kubernetes API documentation for Deployments to understand all available fields and their behaviors.
Conclusion
Kubernetes rolling updates are a cornerstone of modern, resilient application deployments. By understanding and effectively utilizing parameters like maxSurge, maxUnavailable, and crucially, readiness/liveness probes, you can achieve true zero-downtime deployments, ensuring your users always have access to your services. This capability, combined with the power of rollbacks, empowers developers to iterate quickly and confidently, knowing that their deployments are robust and reversible. Master these concepts, and you’ll unlock a new level of operational excellence in your Kubernetes environments, allowing you to focus on innovation rather than fearing downtimes.