Redis Cluster on Kubernetes: High Availability Setup
Deploying stateful applications like Redis in a highly available and scalable manner on Kubernetes can be a complex undertaking. While Kubernetes excels at orchestrating stateless workloads, managing distributed data stores requires careful consideration of persistence, network stability, and fault tolerance. A production-ready Redis Cluster, designed for sharding and replication, demands a robust deployment strategy that leverages Kubernetes’ strengths while mitigating its challenges.
This guide will walk you through the process of setting up a Redis Cluster on Kubernetes, focusing on achieving high availability and scalability. We’ll leverage Kubernetes StatefulSets for stable pod identities and persistent storage, and delve into the intricacies of configuring Redis Cluster to ensure data redundancy and automatic failover. By the end, you’ll have a resilient Redis Cluster capable of handling significant loads and recovering gracefully from node failures, all managed within your Kubernetes environment.
TL;DR: Redis Cluster on Kubernetes
Deploying a highly available Redis Cluster on Kubernetes involves StatefulSets for stable identities, PersistentVolumes for data, and careful Redis Cluster configuration. Here’s the quick rundown:
- Provision Storage: Define a StorageClass and PersistentVolumeClaims.
- Deploy Redis Headless Service: Enable stable network identities for StatefulSet members.
- Deploy Redis StatefulSet: Use a custom Redis image with cluster support and a startup script.
- Initialize Cluster: Run redis-cli --cluster create from a temporary pod.
- Verify: Check cluster status with redis-cli -c -p 6379 cluster info.
# 1. Apply StorageClass and Headless Service
kubectl apply -f storageclass.yaml
kubectl apply -f headless-service.yaml
# 2. Deploy the ConfigMap and Redis StatefulSet (the StatefulSet mounts the ConfigMap)
kubectl apply -f redis-cluster-config.yaml
kubectl apply -f redis-statefulset.yaml
# 3. Wait for all pods to be Ready
kubectl get pods -l app=redis-cluster
# 4. Initialize the Redis Cluster (adjust pod names if needed)
# Get pod IPs
REDIS_IPS=$(kubectl get pods -l app=redis-cluster -o jsonpath='{range .items[*]}{.status.podIP}:6379 {end}')
echo "Redis Pod IPs: $REDIS_IPS"
# Run cluster creation from a temporary pod. $REDIS_IPS expands on your host
# shell before kubectl runs, so the pod receives the literal IP list.
# (Note: redis:6.2-alpine ships sh, not bash.)
kubectl run -it --rm redis-cli --image=redis:6.2-alpine --restart=Never -- \
  redis-cli --cluster create $REDIS_IPS --cluster-replicas 1
# 5. Verify Cluster Status
kubectl exec -it redis-cluster-0 -- redis-cli -c -p 6379 cluster info
kubectl exec -it redis-cluster-0 -- redis-cli -c -p 6379 cluster nodes
Prerequisites
Before you begin, ensure you have the following:
- Kubernetes Cluster: A running Kubernetes cluster (v1.18+ recommended). This can be a local cluster like Minikube or Kind, or a cloud-managed service like GKE, EKS, or AKS.
- kubectl: The Kubernetes command-line tool, configured to connect to your cluster.
- Helm (Optional but Recommended): For easier deployment of certain components, though we’ll focus on raw Kubernetes manifests. You can find installation instructions on the official Helm website.
- Basic Kubernetes Knowledge: Familiarity with concepts like Pods, Deployments, Services, StatefulSets, and PersistentVolumes.
- Basic Redis Knowledge: Understanding of Redis Cluster concepts (sharding, master-replica architecture). Refer to the official Redis Cluster specification for details.
Step-by-Step Guide
1. Design Your Redis Cluster Topology
Before deploying, it’s crucial to decide on your cluster’s size and replication factor. A Redis Cluster requires at least 3 master nodes for fault tolerance. For high availability, each master should have at least one replica. A common setup is 3 masters and 3 replicas (one replica per master), totaling 6 nodes. We’ll use this 3 masters, 3 replicas configuration for our example.
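Sharding in Redis Cluster is deterministic: every key is hashed with CRC16 (the XMODEM variant) and taken modulo 16384 to select a hash slot, and each master owns a contiguous slot range. As a minimal illustration, here is a Python sketch of the slot calculation, including the {hash tag} rule that forces related keys into the same slot (function names are our own; the algorithm follows the Redis Cluster specification):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Map a key to one of the 16384 cluster slots, honoring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # non-empty tag between the first { and the next }
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# With the 3-master layout above, slots 0-5460, 5461-10922, and 10923-16383
# each map to one master.
print(key_hash_slot("foo"))  # falls into one of the three ranges
print(key_hash_slot("{user1000}.following") == key_hash_slot("{user1000}.followers"))  # True
```

The hash-tag behavior is what lets multi-key operations work in a cluster: keys sharing a tag land on the same master.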
2. Create a StorageClass
Redis, as a stateful application, requires persistent storage. We’ll define a StorageClass to dynamically provision PersistentVolumes (PVs) for our Redis pods. If your cluster already has a default StorageClass or you’re using a cloud provider with a pre-configured one (e.g., gp2 on AWS, standard on GCP), you might skip this step or adapt it. Here, we’ll create a simple HostPath StorageClass for local testing or a basic standard class for cloud environments. For production, always use a cloud-provider-specific or network-attached storage solution.
# storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: redis-sc
provisioner: kubernetes.io/no-provisioner # For HostPath or manual PVs
volumeBindingMode: WaitForFirstConsumer # Important for StatefulSets
reclaimPolicy: Retain # Retain data even if PVC is deleted (be careful!)
---
# For cloud environments, you'd use something like:
# apiVersion: storage.k8s.io/v1
# kind: StorageClass
# metadata:
# name: redis-sc
# provisioner: ebs.csi.aws.com # Example for AWS EBS
# volumeBindingMode: WaitForFirstConsumer
# parameters:
# type: gp2 # Or gp3, io1, etc.
# reclaimPolicy: Delete # Or Retain, depending on your needs
The WaitForFirstConsumer volume binding mode is crucial for StatefulSets. It ensures that the PersistentVolumeClaim (PVC) is not bound until a Pod using it is scheduled. This allows the scheduler to consider node affinity and resource requirements before provisioning the storage, leading to better resource utilization and avoiding issues with unavailable storage on the target node. For deeper dives into storage, consider exploring Kubernetes Persistent Volumes documentation.
Apply the StorageClass:
kubectl apply -f storageclass.yaml
Verify the StorageClass is created:
kubectl get sc redis-sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
redis-sc kubernetes.io/no-provisioner Retain WaitForFirstConsumer false 10s
3. Create a Headless Service for Redis Cluster
A Headless Service is essential for StatefulSets. Unlike a regular Service that load-balances traffic, a Headless Service doesn’t have a cluster IP. Instead, it directly exposes the IP addresses of its associated pods. This allows Redis nodes to discover each other by their stable network identities (e.g., redis-cluster-0.redis-cluster-svc.default.svc.cluster.local).
# headless-service.yaml
apiVersion: v1
kind: Service
metadata:
name: redis-cluster-svc
labels:
app: redis-cluster
spec:
ports:
- port: 6379
name: redis
- port: 16379 # Port for cluster bus communication
name: cluster-bus
clusterIP: None # This makes it a Headless Service
selector:
app: redis-cluster
The clusterIP: None line is what makes this a Headless Service. The cluster-bus port (16379, which by Redis convention is the client port plus 10000) is critical for Redis Cluster’s inter-node communication, including heartbeats, configuration updates, and failover coordination. Without it, the cluster cannot function correctly. For more on Kubernetes networking, including services and network policies, check out our Kubernetes Network Policies: Complete Security Hardening Guide.
Apply the Headless Service:
kubectl apply -f headless-service.yaml
Verify the Service:
kubectl get svc redis-cluster-svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-cluster-svc ClusterIP None <none> 6379/TCP,16379/TCP 15s
4. Deploy the Redis StatefulSet
The StatefulSet is the core of our Redis deployment. It provides stable, unique network identifiers, ordered deployment and scaling, and persistent storage for each Redis instance. Instead of a separate entrypoint.sh script, we use an inline startup command (via command and args) to ensure Redis starts in cluster mode with the correct configuration.
# redis-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-cluster
spec:
serviceName: redis-cluster-svc
replicas: 6 # 3 masters + 3 replicas
selector:
matchLabels:
app: redis-cluster
template:
metadata:
labels:
app: redis-cluster
spec:
containers:
- name: redis
image: redis:6.2-alpine # Using alpine for smaller image size
command: ["/bin/sh", "-c"]
args:
- |
REDIS_NODE_IP=$(hostname -i)
echo "Starting Redis with IP: $REDIS_NODE_IP"
# Ensure the /data directory exists for persistence
mkdir -p /data
# Start Redis in cluster mode
redis-server /conf/redis.conf --bind 0.0.0.0 --port 6379 --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes --dir /data --protected-mode no
ports:
- name: redis
containerPort: 6379
- name: cluster-bus
containerPort: 16379
volumeMounts:
- name: data
mountPath: /data
- name: redis-config
mountPath: /conf
env:
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
volumes:
- name: redis-config
configMap:
name: redis-cluster-config
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: redis-sc
resources:
requests:
storage: 1Gi # Adjust storage size as needed
---
# redis-cluster-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-cluster-config
data:
redis.conf: |
# Redis Cluster configuration
port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
dir /data
protected-mode no
# Important for Kubernetes: bind to all interfaces
bind 0.0.0.0
# Enable AOF persistence
# appendfsync everysec
# daemonize no
# loglevel notice
# pidfile /var/run/redis_6379.pid
Let’s break down key parts of the StatefulSet:
- replicas: 6: We’re deploying 6 Redis instances, which will form 3 masters and 3 replicas.
- serviceName: redis-cluster-svc: This links the StatefulSet to our Headless Service, enabling stable network identities.
- command and args: Instead of relying on the default Redis entrypoint, we provide a custom startup script. This script explicitly sets the bind IP to 0.0.0.0 (crucial in a containerized environment), enables cluster mode, sets the config file, and points to the persistent /data directory. The --protected-mode no is used for simplification in this guide; in production, consider stronger authentication.
- volumeMounts and volumeClaimTemplates: These ensure each Redis instance gets its own persistent storage volume, mounted at /data. The redis-config volume mounts our ConfigMap, providing the redis.conf file.
- ConfigMap: Stores our Redis configuration, allowing easy updates without rebuilding the image. The bind 0.0.0.0 is critical for Redis to listen on all network interfaces within the container, making it accessible from other pods.
For large-scale deployments, you might also consider Karpenter for cost optimization by dynamically provisioning nodes tailored to your Redis resource requirements.
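One production nicety the manifest above omits is health probes. Below is a hedged sketch of liveness/readiness checks using redis-cli ping, a common pattern for Redis pods (the timing values are illustrative; if you later enable requirepass, the probe command must supply the password too):

```yaml
# Hypothetical addition to the redis container in redis-statefulset.yaml
livenessProbe:
  exec:
    command: ["redis-cli", "-p", "6379", "ping"]
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  exec:
    command: ["redis-cli", "-p", "6379", "ping"]
  initialDelaySeconds: 5
  periodSeconds: 5
```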
Apply the ConfigMap and StatefulSet:
kubectl apply -f redis-cluster-config.yaml
kubectl apply -f redis-statefulset.yaml
Monitor the StatefulSet rollout:
kubectl get pods -w -l app=redis-cluster
NAME READY STATUS RESTARTS AGE
redis-cluster-0 1/1 Running 0 20s
redis-cluster-1 1/1 Running 0 18s
redis-cluster-2 1/1 Running 0 16s
redis-cluster-3 1/1 Running 0 14s
redis-cluster-4 1/1 Running 0 12s
redis-cluster-5 1/1 Running 0 10s
Ensure all 6 pods are in a Running and Ready state before proceeding.
5. Initialize the Redis Cluster
Once all Redis pods are running, they are still independent instances. We use redis-cli --cluster create to join them into a single cluster, assigning hash slots to the masters and pairing each master with a replica.
First, get the IP addresses of all Redis pods:
REDIS_IPS=$(kubectl get pods -l app=redis-cluster -o jsonpath='{range .items[*]}{.status.podIP}:6379 {end}')
echo "Redis Pod IPs: $REDIS_IPS"
Redis Pod IPs: 10.42.0.10:6379 10.42.0.11:6379 10.42.0.12:6379 10.42.0.13:6379 10.42.0.14:6379 10.42.0.15:6379
Next, run the cluster creation command from a temporary pod. We use --cluster-replicas 1 to ensure each master node gets one replica. Note that the host shell’s $REDIS_IPS variable is not set inside the pod, so paste the IP list printed above. Also note that redis:6.2-alpine ships sh, not bash.
kubectl run -it --rm redis-cli --image=redis:6.2-alpine -- sh
# Inside the temporary pod, run (substituting your actual pod IPs):
redis-cli --cluster create 10.42.0.10:6379 10.42.0.11:6379 10.42.0.12:6379 10.42.0.13:6379 10.42.0.14:6379 10.42.0.15:6379 --cluster-replicas 1
>>> Performing hash slots allocation on 6 nodes...
Master nodes:
10.42.0.10:6379
10.42.0.11:6379
10.42.0.12:6379
Adding replica 10.42.0.13:6379 to 10.42.0.10:6379
Adding replica 10.42.0.14:6379 to 10.42.0.11:6379
Adding replica 10.42.0.15:6379 to 10.42.0.12:6379
M: 4210c4f... 10.42.0.10:6379
slots:[0-5460] (5461 slots) master
M: 5f7d3a0... 10.42.0.11:6379
slots:[5461-10922] (5462 slots) master
M: 7a8b1c2... 10.42.0.12:6379
slots:[10923-16383] (5461 slots) master
S: 8d9e0f1... 10.42.0.13:6379
replicates 4210c4f...
S: 9c0d1e2... 10.42.0.14:6379
replicates 5f7d3a0...
S: a1b2c3d... 10.42.0.15:6379
replicates 7a8b1c2...
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to new nodes
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
...
>>> Performing Cluster Check (using node 10.42.0.10:6379)
M: 4210c4f... 10.42.0.10:6379
slots:[0-5460] (5461 slots) master
1 additional replica(s)
S: 8d9e0f1... 10.42.0.13:6379
slots: (0 slots) slave
replicates 4210c4f...
M: 5f7d3a0... 10.42.0.11:6379
slots:[5461-10922] (5462 slots) master
1 additional replica(s)
S: 9c0d1e2... 10.42.0.14:6379
slots: (0 slots) slave
replicates 5f7d3a0...
M: 7a8b1c2... 10.42.0.12:6379
slots:[10923-16383] (5461 slots) master
1 additional replica(s)
S: a1b2c3d... 10.42.0.15:6379
slots: (0 slots) slave
replicates 7a8b1c2...
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check contents of all slots against actual data...
>>> The cluster is fully covered: 16384 slots covered.
Type yes when prompted to accept the configuration. Once complete, type exit to leave the temporary pod.
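The allocation redis-cli just performed follows a simple pattern for a flat list of endpoints: the first N / (replicas + 1) endpoints become masters and the rest are handed out as replicas. A toy Python sketch of that idea (redis-cli’s real logic additionally tries to place masters and their replicas on different hosts, which this sketch ignores):

```python
def allocate(nodes, replicas_per_master=1):
    """Toy model of redis-cli --cluster create's slot-owner allocation.

    The first len(nodes) // (replicas + 1) nodes become masters; the
    remaining nodes are assigned round-robin as replicas.
    """
    n_masters = len(nodes) // (replicas_per_master + 1)
    masters = nodes[:n_masters]
    assignments = {m: [] for m in masters}
    for i, replica in enumerate(nodes[n_masters:]):
        assignments[masters[i % n_masters]].append(replica)
    return assignments

nodes = [f"10.42.0.{10 + i}:6379" for i in range(6)]
print(allocate(nodes))
# {'10.42.0.10:6379': ['10.42.0.13:6379'],
#  '10.42.0.11:6379': ['10.42.0.14:6379'],
#  '10.42.0.12:6379': ['10.42.0.15:6379']}
```

This matches the pairing shown in the transcript above: .13 replicates .10, .14 replicates .11, and .15 replicates .12.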
6. Verify Cluster Status
Now that the cluster is formed, let’s verify its health and configuration from any of the Redis pods.
kubectl exec -it redis-cluster-0 -- redis-cli -c -p 6379 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:3
cluster_stats_messages_ping_sent:170
cluster_stats_messages_pong_sent:182
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:353
cluster_stats_messages_ping_received:177
cluster_stats_messages_pong_received:171
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:353
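In automation (readiness gates, monitoring jobs), you rarely eyeball this output; you parse it. A minimal Python sketch of a health check over CLUSTER INFO’s key:value text (the helper names and expected-node count are our own):

```python
def parse_cluster_info(raw: str) -> dict:
    """Parse CLUSTER INFO's 'key:value' lines into a dict of strings."""
    info = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info

def cluster_healthy(raw: str, expected_nodes: int = 6) -> bool:
    """True when the state is ok, all slots are assigned, and node count matches."""
    info = parse_cluster_info(raw)
    return (info.get("cluster_state") == "ok"
            and info.get("cluster_slots_assigned") == "16384"
            and int(info.get("cluster_known_nodes", 0)) == expected_nodes)

sample = """cluster_state:ok
cluster_slots_assigned:16384
cluster_known_nodes:6"""
print(cluster_healthy(sample))  # True
```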
Look for cluster_state:ok, cluster_slots_assigned:16384 and cluster_known_nodes:6. This confirms the cluster is healthy and all slots are assigned. You can also check individual node roles:
kubectl exec -it redis-cluster-0 -- redis-cli -c -p 6379 cluster nodes
4210c4f... 10.42.0.10:6379@16379 master - 0 1678881234000 3 connected 0-5460
5f7d3a0... 10.42.0.11:6379@16379 master - 0 1678881234000 2 connected 5461-10922
7a8b1c2... 10.42.0.12:6379@16379 master - 0 1678881234000 1 connected 10923-16383
8d9e0f1... 10.42.0.13:6379@16379 slave 4210c4f... 0 1678881234000 3 connected
9c0d1e2... 10.42.0.14:6379@16379 slave 5f7d3a0... 0 1678881234000 2 connected
a1b2c3d... 10.42.0.15:6379@16379 slave 7a8b1c2... 0 1678881234000 1 connected
This output shows 3 masters and 3 slaves, each replicating a specific master, confirming our desired topology.
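The same applies to CLUSTER NODES: its whitespace-separated lines are easy to check programmatically. A small sketch that counts masters and replicas from that output (the third field holds comma-separated flags such as master, slave, or myself,master; the function name is ours):

```python
def count_roles(raw: str):
    """Count masters and replicas in CLUSTER NODES output."""
    masters = replicas = 0
    for line in raw.strip().splitlines():
        flags = line.split()[2].split(",")
        if "master" in flags:
            masters += 1
        elif "slave" in flags:
            replicas += 1
    return masters, replicas

sample = """\
4210c4f 10.42.0.10:6379@16379 master - 0 0 3 connected 0-5460
8d9e0f1 10.42.0.13:6379@16379 slave 4210c4f 0 0 3 connected
5f7d3a0 10.42.0.11:6379@16379 myself,master - 0 0 2 connected 5461-10922
"""
print(count_roles(sample))  # (2, 1)
```

A check like this, asserting (3, 3) for our topology, makes a handy post-deploy smoke test.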
7. Test Redis Cluster Functionality
Let’s perform some basic read/write operations to ensure data is being sharded and replicated correctly.
kubectl run -it --rm redis-test --image=redis:6.2-alpine -- sh
# Inside the temporary pod, connect to any Redis node (e.g., redis-cluster-0)
# Use -c for cluster mode
redis-cli -c -h redis-cluster-svc -p 6379
# Set a key (it will be sharded to the correct master)
SET mykey "Hello Kubezilla"
-> Redirected to slot [15729] located at 10.42.0.12:6379
OK
Notice the redirection! The cluster-mode client automatically follows the redirect to the node that owns the key’s hash slot. Now, retrieve the key (no second redirect occurs, because the connection has already moved to the owning node):
GET mykey
"Hello Kubezilla"
This confirms your Redis Cluster is fully operational and handling data sharding transparently.
Production Considerations
Deploying Redis Cluster in a production Kubernetes environment requires more than just functional setup. Here are key considerations:
- Persistent Storage:
- Cloud Provider CSI Drivers: Always use CSI (Container Storage Interface) drivers for your cloud provider (e.g., AWS EBS CSI, GCE Persistent Disk CSI, Azure Disk CSI) for robust, highly available, and performant storage. HostPath is only for local testing.
- Storage Class Configuration: Configure your StorageClass with appropriate parameters like disk type (SSD, IOPS-optimized), replication, and snapshot capabilities. Consider reclaimPolicy: Retain for critical data to prevent accidental data loss if a PVC is deleted, but manage PVs carefully.
- Backup and Restore: Implement a robust backup strategy. This could involve Redis’s built-in RDB/AOF persistence, Kubernetes volume snapshots, or dedicated backup solutions for stateful applications.
- Resource Management:
- Requests and Limits: Define appropriate CPU and memory requests and limits for your Redis pods. Redis is memory-intensive; ensure enough memory is allocated to prevent OOMKills.
- Node Affinity/Anti-affinity: Use pod anti-affinity to ensure master and replica nodes of the same shard are scheduled on different physical nodes for higher availability. This prevents a single node failure from taking down both a master and its replica.
- Horizontal Pod Autoscaler (HPA): While Redis Cluster scales horizontally by adding more shards, HPA isn’t directly applicable for scaling Redis nodes within a shard. However, it can be used for client applications connecting to Redis.
- Vertical Pod Autoscaler (VPA): VPA can help recommend optimal resource requests/limits, but be cautious with its auto-update mode for stateful workloads.
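The anti-affinity point above can be expressed directly in the StatefulSet’s pod template. A hedged sketch spreading redis-cluster pods across nodes (a true per-shard spread would require distinct shard labels, which our manifest does not define):

```yaml
# Hypothetical addition under spec.template.spec in redis-statefulset.yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: redis-cluster
          topologyKey: kubernetes.io/hostname
```

Use requiredDuringSchedulingIgnoredDuringExecution instead if you have at least as many nodes as Redis pods and want a hard guarantee.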
- Monitoring and Alerting:
- Redis Metrics: Monitor key Redis metrics like memory usage, hit/miss ratio, connected clients, replication status, and cluster state. Prometheus and Grafana are excellent choices.
- Kubernetes Metrics: Monitor pod health, node resource utilization, and persistent volume performance.
- Alerting: Set up alerts for critical conditions like cluster down, master without replica, high memory usage, or network issues. For advanced observability, explore eBPF Observability with Hubble, especially if using Cilium.
- Security:
- Network Policies: Implement Kubernetes Network Policies to restrict traffic to Redis pods to only authorized applications and other Redis cluster members on ports 6379 and 16379.
- Authentication: Enable Redis authentication (requirepass, masterauth) and use Kubernetes Secrets to manage credentials securely.
- TLS/SSL: For encrypted communication, consider using Stunnel or a service mesh like Istio Ambient Mesh to encrypt traffic between clients and Redis, and between Redis nodes.
- Least Privilege: Run Redis containers with a non-root user and restrict capabilities where possible.
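Authentication can be wired in without baking passwords into the image. A hedged sketch using a Secret (the Secret name and key are our own; requirepass and masterauth are standard Redis options):

```yaml
# Create the secret first, e.g.:
#   kubectl create secret generic redis-auth --from-literal=password='<strong-password>'
# Then, in the redis container spec of redis-statefulset.yaml:
env:
  - name: REDIS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: redis-auth
        key: password
# ...and pass it to redis-server in the startup args:
#   redis-server /conf/redis.conf --requirepass "$REDIS_PASSWORD" --masterauth "$REDIS_PASSWORD"
```

Remember that redis-cli commands (including health probes and --cluster create) then need -a "$REDIS_PASSWORD" as well.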
- Networking:
- Node IP vs. Pod IP: Ensure Redis binds to 0.0.0.0 and advertises its Pod IP correctly for cluster communication. In some complex network setups, especially with CNI plugins that perform NAT, you might need to explicitly configure cluster-announce-ip in Redis.
- MTU: Verify MTU settings across your Kubernetes network, especially if you encounter issues with large data transfers or cluster communication.
- Upgrades:
- Rolling Updates: StatefulSets support rolling updates, but always test Redis version upgrades thoroughly in a staging environment. Be aware of potential breaking changes between Redis versions.
- Helm Charts: For managing upgrades and configuration, using a well-maintained Helm chart for Redis Cluster can simplify the process significantly.
Troubleshooting
Here are common issues you might encounter and their solutions:
- Issue: Redis Pods are stuck in Pending or ContainerCreating.
  Solution:
  - Pending: Check events for scheduling issues: kubectl describe pod <pod-name>. Common causes are lack of resources (CPU/memory), no available nodes, or a PersistentVolumeClaim (PVC) not binding. Ensure your StorageClass is correctly configured and there are available PVs (or dynamic provisioning is working).
  - ContainerCreating: Check pod events with kubectl describe pod <pod-name> and, once the container starts, kubectl logs <pod-name>. Image pull errors, incorrect entrypoint commands, or missing ConfigMaps/Secrets are common culprits.
- Issue: Redis Cluster initialization fails with “Waiting for the cluster to join” or “Can’t meet myself”.
  Solution:
  - Network Reachability: Ensure all Redis pods can communicate with each other on both ports 6379 and 16379. Check firewall rules or Network Policies.
  - Pod IPs: Double-check that the REDIS_IPS variable contains the correct, current Pod IPs. Pod IPs can change if pods restart or are rescheduled.
  - Redis Configuration: Verify cluster-enabled yes, bind 0.0.0.0, and protected-mode no (or proper auth setup) in your redis.conf. The cluster-announce-ip setting might be needed if Pod IPs are not directly routable or if your CNI does complex NAT.
  - Hostnames: For StatefulSets, Redis typically relies on stable hostnames. Ensure your Headless Service is correctly configured.
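If announce settings turn out to be the culprit, they can be set explicitly from the Downward API. A hedged sketch extending the redis-server invocation from our StatefulSet (POD_IP is already injected there via fieldRef; the ports shown are this guide’s defaults):

```yaml
# Hypothetical change to the container args in redis-statefulset.yaml
args:
  - |
    redis-server /conf/redis.conf \
      --cluster-enabled yes \
      --cluster-announce-ip "$POD_IP" \
      --cluster-announce-port 6379 \
      --cluster-announce-bus-port 16379
```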
- Issue: redis-cli cluster info shows cluster_state:fail or cluster_slots_fail.
  Solution:
  - Missing Slots: If cluster_slots_assigned is not 16384, some slots weren’t assigned during creation. You might need to re-run the --cluster create command or manually add slots using redis-cli cluster addslots and redis-cli cluster setslot.
  - Node Disconnection: Check individual node logs for errors relating to peer discovery or communication. A node might be down or unreachable.
  - Quorum Loss: If more than half of the master nodes are down, the cluster will enter a failed state. Restore the failed nodes.
- Issue: Data loss after pod restarts or deletion.
  Solution:
  - Persistent Volumes: Ensure the volumeClaimTemplates in the StatefulSet is correctly configured and bound to a reliable StorageClass. Verify the PVs are healthy and data is being written to the mounted path (e.g., /data).
  - appendonly yes: Confirm AOF persistence is enabled in your redis.conf. While RDB snapshots are good for backups, AOF provides better durability.
  - Reclaim Policy: If your StorageClass has reclaimPolicy: Delete, deleting the PVC will delete the PV and its data. For critical data, use Retain, but manage PVs manually.
- Issue: High latency or slow performance.
  Solution:
  - Resource Constraints: Check CPU and memory utilization of Redis pods. Increase requests and limits if they are consistently being hit.
  - Network Latency: Investigate network latency between pods and between the client and Redis. Cilium WireGuard Encryption can add overhead, but it is usually minimal.
  - Storage Performance: Ensure your underlying persistent storage (e.g., EBS, Azure Disk) has sufficient IOPS and throughput for your workload.
  - Cluster Rebalancing: An uneven distribution of hash slots or data size across masters can lead to hot spots. Use redis-cli --cluster rebalance to redistribute slots.
- Issue: Client applications fail to connect or see “MOVED” errors repeatedly.
  Solution:
  - Cluster-Aware Client: Use a client library with Redis Cluster support (or redis-cli -c). Non-cluster clients surface MOVED replies as errors instead of following the redirect.
  - Stale Topology: If pods restarted and Pod IPs changed, clients may hold a stale slot map. Refresh the client’s cluster topology or restart the clients so discovery runs again.