Introduction
In today’s distributed world, applications rarely live in isolation on a single Kubernetes cluster. Organizations frequently operate multiple clusters across different regions, cloud providers, or even hybrid environments to achieve high availability, disaster recovery, data locality, or simply to manage organizational boundaries. Connecting these disparate clusters seamlessly, allowing services to communicate as if they were in the same network, presents a significant challenge. Traditional networking solutions often involve complex VPNs, load balancers, and intricate routing configurations that are difficult to manage and scale.
Enter Cilium Cluster Mesh. Cilium, a powerful CNCF graduated project, leverages the power of eBPF to provide high-performance networking, observability, and security for Kubernetes. Cluster Mesh extends Cilium’s capabilities, enabling secure, performant, and transparent communication between services running in different Kubernetes clusters. It simplifies multi-cluster connectivity, allowing pods in one cluster to directly address and consume services from another, without needing complex ingress/egress configurations or sidecar proxies. This guide will walk you through setting up a Cilium Cluster Mesh, demonstrating how to achieve true multi-cluster networking for your Kubernetes workloads.
TL;DR: Cilium Cluster Mesh in a Nutshell
Cilium Cluster Mesh enables seamless, secure, and high-performance communication between multiple Kubernetes clusters using eBPF. It allows services in different clusters to discover and communicate with each other as if they were in the same cluster, simplifying multi-cluster deployments. Key features include identity-aware security policies, transparent service discovery, and efficient data plane operation.
Key Commands:
# Install Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}
# Install Cilium on Cluster 1 (e.g., Kind cluster "cluster1")
kind create cluster --name cluster1
cilium install --context kind-cluster1 --set cluster.name=cluster1 --set cluster.id=1
# Install Cilium on Cluster 2 (e.g., Kind cluster "cluster2")
kind create cluster --name cluster2
cilium install --context kind-cluster2 --set cluster.name=cluster2 --set cluster.id=2
# Enable Cluster Mesh on both clusters
cilium clustermesh enable --context kind-cluster1 --service-type NodePort
cilium clustermesh enable --context kind-cluster2 --service-type NodePort
# Connect Cluster 2 to Cluster 1 (the connection is established in both directions)
cilium clustermesh connect --context kind-cluster2 --destination-context kind-cluster1
# Verify Cluster Mesh Status
cilium clustermesh status --context kind-cluster1 --wait
cilium clustermesh status --context kind-cluster2 --wait
# Deploy a service to Cluster 1 and expose it to Cluster Mesh
kubectl --context kind-cluster1 apply -f <(cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-service-c1
  labels:
    app: echo-service-c1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-service-c1
  template:
    metadata:
      labels:
        app: echo-service-c1
    spec:
      containers:
      - name: echo
        image: mendhak/http-https-echo:28
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echo-service-c1
  labels:
    app: echo-service-c1
  annotations:
    service.cilium.io/global: "true" # Mark for Cluster Mesh export
spec:
  selector:
    app: echo-service-c1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
)
# Define the same global Service in Cluster 2 (no local backends are required there)
kubectl --context kind-cluster2 apply -f <(cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: echo-service-c1
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: echo-service-c1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
)
# Access service from Cluster 2
kubectl --context kind-cluster2 run -it --rm --restart=Never curl --image=curlimages/curl -- curl http://echo-service-c1.default.svc.cluster.local:80
Prerequisites
Before diving into the setup, ensure you have the following:
- Two Kubernetes Clusters: For this guide, we’ll use Kind (Kubernetes in Docker) to create local clusters. You can adapt these steps for any cloud provider (AWS EKS, GCP GKE, Azure AKS) or on-premise clusters.
- Kind installed: Follow the Kind installation guide.
- kubectl installed and configured to interact with your clusters.
- Cilium CLI: The Cilium command-line interface simplifies installation and management.
- Basic Kubernetes Knowledge: Familiarity with Deployments, Services, and kubectl commands.
- Network Connectivity: The nodes of your clusters must be able to reach each other over the network, and the pod CIDRs of the clusters must not overlap. For Kind, node-to-node connectivity is handled by Docker’s networking. For cloud providers, ensure appropriate security groups and routing tables are configured.
Install Cilium CLI
The Cilium CLI is an essential tool for managing Cilium deployments and Cluster Mesh. Let’s install it.
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}
# Verify installation
cilium version
Verify
You should see output similar to this, indicating the installed Cilium CLI version:
cilium version
cilium-cli: v0.15.2
build-date: 2023-11-09T18:31:01Z
go-version: go1.21.3
kernel-version: 6.5.0-13-generic
Step-by-Step Guide: Setting Up Cilium Cluster Mesh
Step 1: Create Your Kubernetes Clusters
We’ll start by creating two Kind clusters. Each cluster will represent an independent Kubernetes environment that we want to connect. Keep in mind that Cluster Mesh requires non-overlapping pod CIDRs across clusters; Kind uses the same default pod subnet for every cluster, so see the configuration sketch at the end of this step if you need to override it.
kind create cluster --name cluster1
kind create cluster --name cluster2
# Verify clusters are running and kubectl contexts are set
kubectl config get-contexts
Verify
You should see two new contexts, kind-cluster1 and kind-cluster2, in your kubectl configuration:
kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
* kind-cluster1 kind-cluster1 kind-cluster1
kind-cluster2 kind-cluster2 kind-cluster2
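Note: Cilium Cluster Mesh requires non-overlapping pod CIDRs (and unique node IPs) across all connected clusters. A minimal sketch of per-cluster Kind configurations with disjoint pod subnets follows; the file names and CIDR ranges are illustrative, and disabling the default CNI simply leaves networking entirely to Cilium:
# cluster1-kind.yaml (illustrative)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true      # let Cilium provide the CNI
  podSubnet: "10.244.0.0/23"   # pods in cluster1: 10.244.0.0 - 10.244.1.255
# cluster2-kind.yaml (illustrative)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "10.244.2.0/23"   # pods in cluster2: 10.244.2.0 - 10.244.3.255
You would then create the clusters with kind create cluster --name cluster1 --config cluster1-kind.yaml (and likewise for cluster2).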
Step 2: Install Cilium on Both Clusters
Now, install Cilium on both cluster1 and cluster2. When installing Cilium for Cluster Mesh, two settings are essential:
- cluster.name: Assigns a unique, human-readable name to each cluster, used to identify it within the mesh.
- cluster.id: Assigns a unique numeric ID (1-255) to each cluster; every cluster in the mesh must use a different ID.
Recent versions of the Cilium CLI accept these as Helm values via --set (shown below); older releases expose them as --cluster-name and --cluster-id flags. The defaults for identity allocation (CRD mode) and the L7 proxy are suitable for this guide. For more on Cilium’s security features, including identity-based policies, explore our guide on Kubernetes Network Policies: Complete Security Hardening Guide.
# Install Cilium on cluster1
cilium install --context kind-cluster1 --set cluster.name=cluster1 --set cluster.id=1
# Install Cilium on cluster2
cilium install --context kind-cluster2 --set cluster.name=cluster2 --set cluster.id=2
Verify
Check the status of Cilium on both clusters. All Cilium pods should be running and healthy.
cilium status --context kind-cluster1 --wait
cilium status --context kind-cluster2 --wait
# Expected output for each cluster (may vary slightly)
/¯¯\
/¯¯\ | Cilium: OK
\__/ | Operator: OK
/¯¯\ | Hubble: disabled
\__/ | Clustermesh: disabled
\__/
DaemonSet cilium Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator-aws Desired: 0, Ready: 0/0, Available: 0/0
Deployment cilium-operator-azure Desired: 0, Ready: 0/0, Available: 0/0
Deployment cilium-operator-gcp Desired: 0, Ready: 0/0, Available: 0/0
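If you prefer installing Cilium with Helm rather than the CLI, roughly equivalent values might look like the sketch below (the values file name is illustrative, and remaining chart defaults are assumed to be acceptable):
# values-cluster1.yaml (sketch)
cluster:
  name: cluster1
  id: 1
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system --kube-context kind-cluster1 -f values-cluster1.yaml
# Repeat for cluster2 with cluster.name=cluster2 and cluster.id=2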
Step 3: Enable and Connect Cilium Cluster Mesh
Now that Cilium is installed, we can enable the Cluster Mesh feature. Enabling it deploys a clustermesh-apiserver in each cluster; the architecture is peer-to-peer, so there is no dedicated hub. We enable the mesh on both clusters and then connect them; a single connect command establishes the relationship in both directions.
The --service-type NodePort is a pragmatic choice for Kind clusters, as it exposes the Cluster Mesh control plane (the clustermesh-apiserver) on the node network where the other cluster can reach it. In a cloud environment, you would typically use LoadBalancer instead, or ClusterIP if the clusters share directly routable networks. For more advanced networking configurations and encryption across clusters, consider integrating with features like Cilium WireGuard Encryption.
# Enable Cluster Mesh on both clusters
cilium clustermesh enable --context kind-cluster1 --service-type NodePort
cilium clustermesh enable --context kind-cluster2 --service-type NodePort
# Connect cluster2 to cluster1 (the connection is established in both directions)
cilium clustermesh connect --context kind-cluster2 --destination-context kind-cluster1
Verify
Check the Cluster Mesh status on both clusters. You should see both clusters listed in the mesh.
cilium clustermesh status --context kind-cluster1 --wait
cilium clustermesh status --context kind-cluster2 --wait
# Expected output for cluster1
/¯¯\
/¯¯\ | Cilium: OK
\__/ | Operator: OK
/¯¯\ | Hubble: disabled
\__/ | Clustermesh: OK
\__/
Cluster Mesh: 2/2 clusters
cluster1 (local)
cluster2
# Expected output for cluster2
/¯¯\
/¯¯\ | Cilium: OK
\__/ | Operator: OK
/¯¯\ | Hubble: disabled
\__/ | Clustermesh: OK
\__/
Cluster Mesh: 2/2 clusters
cluster1
cluster2 (local)
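Optionally, the Cilium CLI’s built-in connectivity test can exercise cross-cluster paths end to end. It deploys its own test workloads and can take several minutes; a sketch:
# Run the built-in connectivity test across both clusters
cilium connectivity test --context kind-cluster1 --multi-cluster kind-cluster2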
Step 4: Deploy and Expose a Service in Cluster 1
Now let’s deploy a sample application (an echo service) to cluster1. To make this service discoverable and accessible from other clusters in the mesh, we annotate the Service with service.cilium.io/global: "true". A global service must be defined with the same name and namespace in every cluster that consumes it, so we also create a matching Service (with no local backends) in cluster2; Cilium then merges the backends from all clusters behind that service name.
kubectl --context kind-cluster1 apply -f <(cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-service-c1
  labels:
    app: echo-service-c1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-service-c1
  template:
    metadata:
      labels:
        app: echo-service-c1
    spec:
      containers:
      - name: echo
        image: mendhak/http-https-echo:28
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echo-service-c1
  labels:
    app: echo-service-c1
  annotations:
    service.cilium.io/global: "true" # This is the magic annotation for Cluster Mesh
spec:
  selector:
    app: echo-service-c1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
)
# Define the same global Service in cluster2 (no local backends are required there)
kubectl --context kind-cluster2 apply -f <(cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: echo-service-c1
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: echo-service-c1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
)
Verify
Confirm the deployment and service are running in cluster1, and that the Service carries the global annotation.
kubectl --context kind-cluster1 get deploy,svc,endpointslices
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/echo-service-c1   1/1     1            1           1m
NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/echo-service-c1   ClusterIP   10.96.123.45   <none>        80/TCP    1m
service/kubernetes        ClusterIP   10.96.0.1      <none>        443/TCP   5m
NAME                                                    ADDRESSTYPE   PORTS   ENDPOINTS     AGE
endpointslices.discovery.k8s.io/echo-service-c1-t9n6d   IPv4          8080    10.244.1.10   1m
The service.cilium.io/global: "true" annotation on the Service is what tells Cilium to export its backends to the Cluster Mesh.
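As an extra sanity check, confirm the annotation is actually present on the rendered object (a simple grep against the YAML output):
kubectl --context kind-cluster1 get svc echo-service-c1 -o yaml | grep "service.cilium.io/global"
# Expected output: service.cilium.io/global: "true"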
Step 5: Access the Service from Cluster 2
Now for the exciting part: accessing the service deployed in cluster1 directly from cluster2. Because the global Service echo-service-c1 is also defined in cluster2, pods there resolve echo-service-c1.default.svc.cluster.local through normal cluster DNS, and Cilium’s eBPF load balancer forwards the traffic to the backend pod running in cluster1.
kubectl --context kind-cluster2 run -it --rm --restart=Never curl --image=curlimages/curl -- curl http://echo-service-c1.default.svc.cluster.local:80
Verify
You should receive a successful HTTP response from the echo service, indicating that the request from cluster2 was successfully routed to the pod in cluster1.
kubectl --context kind-cluster2 run -it --rm --restart=Never curl --image=curlimages/curl -- curl http://echo-service-c1.default.svc.cluster.local:80
# Expected output (may vary slightly based on echo service version)
{
"path": "/",
"headers": {
"host": "echo-service-c1.default.svc.cluster.local",
"user-agent": "curl/8.4.0"
},
"method": "GET",
"body": "",
"fresh": false,
"hostname": "echo-service-c1.default.svc.cluster.local",
"ip": "10.244.1.10", # This IP belongs to the pod in cluster1!
"ips": [
"10.244.1.10"
],
"protocol": "http",
"query": {},
"xhr": false
}
pod "curl" deleted
Notice the "ip" field in the JSON response. It should be an IP address from cluster1‘s pod CIDR, confirming that the traffic traversed the Cluster Mesh.
Step 6: Deploy and Expose a Service in Cluster 2 (Optional, for bi-directional)
To demonstrate bi-directional communication, let’s deploy a similar echo service in cluster2, annotate it as global, and define the matching Service in cluster1.
kubectl --context kind-cluster2 apply -f <(cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-service-c2
  labels:
    app: echo-service-c2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-service-c2
  template:
    metadata:
      labels:
        app: echo-service-c2
    spec:
      containers:
      - name: echo
        image: mendhak/http-https-echo:28
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echo-service-c2
  labels:
    app: echo-service-c2
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: echo-service-c2
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
)
# Define the same global Service in cluster1 (no local backends are required there)
kubectl --context kind-cluster1 apply -f <(cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: echo-service-c2
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: echo-service-c2
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
EOF
)
Verify
kubectl --context kind-cluster2 get deploy,svc,endpointslices
You should see echo-service-c2 running and its Service carrying the service.cilium.io/global: "true" annotation.
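A related knob worth knowing: by default a global service also shares its local backends with the rest of the mesh. If you want a cluster to consume remote backends without exporting its own, add the shared annotation; a sketch of the relevant Service metadata:
metadata:
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "false"   # consume the global service, but do not export local backends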
Step 7: Access Service in Cluster 2 from Cluster 1
kubectl --context kind-cluster1 run -it --rm --restart=Never curl --image=curlimages/curl -- curl http://echo-service-c2.default.svc.cluster.local:80
Verify
You should get a response from echo-service-c2, confirming that traffic from cluster1 reached the pod in cluster2.
# Expected output
{
"path": "/",
"headers": {
"host": "echo-service-c2.default.svc.cluster.local",
"user-agent": "curl/8.4.0"
},
"method": "GET",
"body": "",
"fresh": false,
"hostname": "echo-service-c2.default.svc.cluster.local",
"ip": "10.244.2.X", # This IP belongs to the pod in cluster2!
"ips": [
"10.244.2.X"
],
"protocol": "http",
"query": {},
"xhr": false
}
pod "curl" deleted
Production Considerations
While Kind is excellent for local testing, deploying Cilium Cluster Mesh in production requires careful planning:
- Network Connectivity: Ensure robust, low-latency, and secure network connectivity between your clusters. This often involves VPNs, direct connects, or peering between VPCs/VNets. For encrypting cross-cluster traffic in transit, see the WireGuard sketch after this list.
- Cluster Mesh Service Type: For cloud environments, use LoadBalancer for the Cluster Mesh service so the control plane (the clustermesh-apiserver) is reachable from the other clusters. Ensure the load balancer is configured with appropriate security groups to restrict access to trusted sources only.
- Identity Management: Cilium’s identity-based security is a core feature. With Cluster Mesh, these identities are propagated across clusters, allowing for consistent security policies. Ensure your identity allocation mode (e.g., CRD) is configured consistently across all clusters.
- DNS Integration: While Cilium provides internal service discovery, consider integrating with a global DNS solution (like CoreDNS with external plugins or a dedicated multi-cluster DNS) for more flexible service naming and resolution across different teams or environments.
- Observability: Enable eBPF Observability with Hubble to gain deep insights into network flows, policy enforcement, and connectivity issues across your mesh. This is critical for debugging and monitoring.
- Security Policies: Define Cilium Network Policies that span clusters, allowing you to enforce granular access control between services regardless of which cluster they reside in. Remember that Cilium’s security policies are identity-aware, providing stronger guarantees than IP-based rules.
- Resource Management: Monitor the resource consumption of Cilium agents and operators, especially in large clusters or those with high traffic volumes. Consider node autoscaling solutions like Karpenter for Cost Optimization to efficiently manage your cluster resources.
- Upgrade Strategy: Plan a careful upgrade strategy for Cilium, ensuring minimal disruption to cross-cluster communication. Refer to the official Cilium upgrade documentation.
- Resilience: Design your applications to be resilient to inter-cluster network failures. While Cluster Mesh improves connectivity, transient issues can still occur. Implement retries, timeouts, and circuit breakers.
- Multi-Cloud/Hybrid Cloud: For complex environments spanning multiple cloud providers or on-premise data centers, ensure your underlying network infrastructure supports the necessary connectivity and bandwidth for Cluster Mesh to operate efficiently.
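As referenced in the Network Connectivity item above, cross-cluster traffic can be encrypted in transit with Cilium’s WireGuard support. A hedged sketch of the relevant install options (availability depends on your Cilium version and node kernels, and the same options must be applied to every cluster in the mesh):
cilium install --context kind-cluster1 \
  --set cluster.name=cluster1 --set cluster.id=1 \
  --set encryption.enabled=true \
  --set encryption.type=wireguard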
Troubleshooting
Here are some common issues you might encounter with Cilium Cluster Mesh and their solutions:
- Issue: cilium clustermesh status shows 0/X clusters or missing clusters.
Solution: This usually indicates a connectivity problem between the clusters or that the Cluster Mesh control plane isn’t reachable.
  - Check Cilium Pods: Ensure all cilium, cilium-operator, and clustermesh-apiserver pods are running in both clusters.
  - Verify Cluster Mesh Service Reachability:
  # Get the Cluster Mesh (clustermesh-apiserver) service from cluster1
  kubectl --context kind-cluster1 get svc -n kube-system clustermesh-apiserver
  # Then, from a pod in cluster2, try to reach it (node IP plus NodePort for Kind)
  kubectl --context kind-cluster2 run -it --rm --restart=Never nc-test --image=busybox -- nc -vz <CLUSTER1_NODE_IP> <NODEPORT>
  If using NodePort, ensure the node IP and the NodePort (32379 by default) are reachable from the other cluster. For cloud load balancers, check LB health and security groups.
  - Firewall/Security Groups: Ensure the clustermesh-apiserver port and the pod-to-pod datapath (e.g., tunnel ports and the health port 4240) are open between the clusters’ nodes.
  - Context Mismatch: Double-check that you are using the correct --context for each command.
- Issue: Services annotated with service.cilium.io/global: "true" are not discoverable from other clusters.
Solution:
  - Service Annotation: Ensure the annotation is set on the Service itself, and that a Service with the same name and namespace (also annotated as global) exists in the consuming cluster. Cilium matches global services by name and namespace.
  - Control Plane Logs: Check the logs of the cilium-operator and clustermesh-apiserver deployments in both clusters for synchronization errors:
  kubectl --context kind-cluster1 logs -n kube-system deploy/cilium-operator
  kubectl --context kind-cluster2 logs -n kube-system deploy/cilium-operator
  - Cilium Agent Logs: Check cilium-agent logs on the nodes where the service pods are running for any errors related to service export.
  - Service Table: Confirm the remote backends actually appear in the local eBPF service table, as shown below.
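A quick way to confirm that remote backends have been merged into the local service table is to query a Cilium agent directly (a sketch; assumes Cilium runs as the cilium DaemonSet in kube-system, and note that newer releases name the in-pod binary cilium-dbg):
kubectl --context kind-cluster2 -n kube-system exec ds/cilium -- cilium service list
# Backends from cluster1's pod CIDR should be listed behind the global service's ClusterIP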
- Issue: Cross-cluster traffic is blocked, even with Cilium Network Policies allowing it.
Solution:
  - Identity Mismatch: Cilium policies are identity-based. Ensure that the security identities of the source and destination pods are what you expect; running cilium endpoint list from a Cilium agent pod shows the identity and labels of each local endpoint.
  - Policy Scope: Verify that your Cilium Network Policies are correctly applied and cover the cross-cluster traffic. Policies match on labels, and the source cluster can be selected explicitly via the io.cilium.k8s.policy.cluster label.
  - Hubble Observability: If you have Hubble enabled (see our guide), use hubble observe or the Hubble UI to visualize traffic flows and identify where traffic is being dropped, as in the sketch below. This is often the most effective way to debug policy issues.
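A minimal sketch of inspecting dropped flows with Hubble (assumes Hubble was enabled, e.g. via cilium hubble enable, and that the hubble client is installed locally):
# Forward the Hubble relay API to your workstation
cilium hubble port-forward --context kind-cluster1 &
# Show only dropped flows, following new events as they arrive
hubble observe --verdict DROPPED --follow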
- Issue: DNS resolution for cross-cluster services fails (e.g., echo-service-c1.default.svc.cluster.local).
Solution: With global services, DNS is answered by the ordinary cluster DNS (CoreDNS/kube-dns) of the consuming cluster; Cluster Mesh does not need to intercept DNS.
  - Verify the Local Service: Ensure a Service with the same name and namespace, annotated with service.cilium.io/global: "true", exists in the consuming cluster (e.g., kubectl --context kind-cluster2 get svc echo-service-c1).
  - Cluster DNS Health: Check that the CoreDNS/kube-dns pods are running and that the pod’s /etc/resolv.conf points at the cluster DNS service IP (10.96.0.10 in a default Kind cluster).
  - Test resolution directly from a throwaway pod, as in the sketch below.
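A throwaway pod makes it easy to confirm whether the name resolves at all (a sketch):
kubectl --context kind-cluster2 run -it --rm --restart=Never dns-test \
  --image=busybox -- nslookup echo-service-c1.default.svc.cluster.local
# A successful lookup returns the ClusterIP of the local echo-service-c1 Service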
- Issue: High latency or poor performance for cross-cluster traffic.
Solution:
- Underlying Network: The performance of Cluster Mesh is highly dependent on the underlying network connectivity between clusters. Check network latency and bandwidth between the nodes of your clusters.
- Node Resources: Ensure that Cilium agents and Kubernetes nodes have sufficient CPU, memory, and network I/O resources.
- eBPF Program Overhead: While eBPF is highly performant, complex policies or a very large number of endpoints can introduce some overhead. Use Hubble to inspect eBPF metrics.
- Load-Balancing Path: Cilium performs cross-cluster load balancing in eBPF (its kube-proxy replacement and socket load balancer), not via IPVS. Check cilium status on the agents for datapath errors.
- MTU Issues: Mismatched MTU settings across the network path can lead to packet fragmentation and performance degradation.
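For the MTU point specifically, a quick path-MTU check between nodes can save a lot of guesswork (a sketch; 1472 assumes a 1500-byte MTU minus ICMP/IP headers, and tunnel encapsulation reduces the usable payload further):
# From a node (or a hostNetwork pod) in cluster1, ping a node in cluster2 with fragmentation disabled
ping -M do -s 1472 <REMOTE_NODE_IP>
# If this fails, reduce the size until it succeeds to find the effective path MTU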
FAQ Section
Q1: What is the primary benefit of Cilium Cluster Mesh over traditional multi-cluster solutions?
A1: The primary benefit is transparent, identity-aware networking and security. Unlike traditional methods that rely on complex VPNs, external load balancers, or service mesh sidecar proxies (like Istio in its classic sidecar mode), Cilium Cluster Mesh uses eBPF to extend the Kubernetes network seamlessly across clusters. This allows pods to communicate directly via their service names as if they were local, with security policies applied consistently based on workload identity, not just IP addresses. It eliminates the need for manual IP route management and complex ingress/egress configurations.
Q2: Can Cilium Cluster Mesh span different cloud providers or hybrid environments?
A2: Yes, Cilium Cluster Mesh is designed to be cloud-agnostic and can span different cloud providers (e.g., AWS EKS, GCP GKE, Azure AKS) or even hybrid environments (on-premise to cloud). The key requirement is that the nodes of the participating clusters have IP-level connectivity to each other. This often involves setting up VPNs, Direct Connects, or inter-VPC peering between the different network environments.
Q3: How does Cilium Cluster Mesh handle service discovery across clusters?
A3: Cilium Cluster Mesh extends Kubernetes’ native service discovery. A service is marked as global by annotating it with service.cilium.io/global: "true" and defining a Service with the same name and namespace in every participating cluster. Cilium then synchronizes the backends of that service across the mesh. When a pod resolves the service’s DNS name (e.g., myservice.mynamespace.svc.cluster.local), the local cluster DNS answers with the local ClusterIP as usual, and Cilium’s eBPF load balancer forwards connections to healthy backends in any cluster.
Q4: Is there any performance overhead with Cilium Cluster Mesh?
A4: Cilium, by leveraging eBPF, is known for its high-performance networking capabilities. Cluster Mesh maintains this advantage by using eBPF to handle cross-cluster routing and load balancing directly in the kernel, avoiding costly context switches. While any cross-network communication will incur some latency due to physical distance, Cilium minimizes software overhead. Performance is primarily dictated by the underlying network infrastructure (latency, bandwidth) between the clusters.