Introduction
In today’s distributed world, organizations often find themselves managing multiple Kubernetes clusters across different environments, cloud providers, or even geographical regions. While this multi-cluster approach offers benefits like increased resilience, improved isolation, and compliance with data residency requirements, it also introduces significant challenges, particularly around resource utilization and application deployment. How do you efficiently share spare capacity between clusters? How can an application running in one cluster leverage specialized hardware or services available only in another? The answer lies in intelligent multi-cluster resource sharing.
Liqo, an open-source project whose name evokes the idea of “liquid” computing, emerges as a powerful solution designed to address these very challenges. It transforms disparate Kubernetes clusters into a unified, fluid environment where resources can be seamlessly exchanged and consumed. Imagine a scenario where a spike in traffic on your primary cluster can automatically offload pods to a secondary cluster with available capacity, or where a specialized AI workload in a development cluster can temporarily burst into a production-grade GPU cluster without manual intervention. Liqo makes this vision a reality by establishing secure, transparent peering connections and enabling dynamic resource allocation across cluster boundaries. This guide will walk you through the process of setting up and utilizing Liqo for efficient multi-cluster resource sharing.
TL;DR: Multi-Cluster Resource Sharing with Liqo
Liqo enables seamless resource sharing between Kubernetes clusters by creating virtual nodes that represent remote clusters. This allows pods to be scheduled across cluster boundaries, improving resource utilization and application resilience. Here’s a quick rundown of the essential steps:
# Install the Liqo CLI (liqoctl)
curl -sL https://get.liqo.io/install.sh | bash
# Install Liqo on Cluster A (e.g., control-plane)
liqoctl install --cluster-name control-plane --kubeconfig ~/.kube/config-control
# Install Liqo on Cluster B (e.g., worker-cluster)
liqoctl install --cluster-name worker-cluster --kubeconfig ~/.kube/config-worker
# Peer the clusters (run from a machine that can reach both)
liqoctl peer --kubeconfig ~/.kube/config-control --remote-kubeconfig ~/.kube/config-worker
# Check peering status
kubectl get foreignclusters --kubeconfig ~/.kube/config-control
# Offload a namespace, then deploy into it to utilize shared resources
liqoctl offload namespace default --kubeconfig ~/.kube/config-control
kubectl apply --kubeconfig ~/.kube/config-control -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-offloaded
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-offloaded
  template:
    metadata:
      labels:
        app: nginx-offloaded
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
EOF
# Verify pod offloading
kubectl get pods -o wide --kubeconfig ~/.kube/config-control
Prerequisites
Before diving into Liqo, ensure you have the following:
- Two or more Kubernetes clusters: These can be local (e.g., Kind, Minikube) or cloud-based (AWS EKS, GCP GKE, Azure AKS). For this guide, we’ll assume two clusters.
- kubectl: The Kubernetes command-line tool, configured to access both clusters. You’ll need separate kubeconfig files or contexts, for example ~/.kube/config-control and ~/.kube/config-worker.
- helm: The Kubernetes package manager, used by Liqo for installation. You can find installation instructions on the official Helm website.
- Liqo CLI (liqoctl): The Liqo command-line interface, which simplifies installation and peering.
- Network Connectivity: Clusters must be able to communicate with each other over the network. For local clusters, this might involve exposing ports; for cloud clusters, appropriate security groups and routing are necessary. Liqo handles the overlay network for pod-to-pod communication, but initial peering requires basic reachability. For advanced networking and encryption, consider solutions like Cilium WireGuard Encryption.
- Basic Kubernetes Knowledge: Familiarity with Deployments, Services, Pods, and kubectl commands is assumed.
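Throughout this guide we pass explicit --kubeconfig flags. If you prefer, you can capture the two paths once as environment variables; this sketch assumes the file names used in this article (adjust them to your setup):

```shell
# Convenience variables for the two kubeconfig files used in this guide.
# The paths are assumptions from this article; adjust them to your setup.
export KUBECONFIG_CONTROL="$HOME/.kube/config-control"
export KUBECONFIG_WORKER="$HOME/.kube/config-worker"

# Later commands can then reference them, e.g.:
#   kubectl get nodes --kubeconfig "$KUBECONFIG_CONTROL"
echo "control kubeconfig: $KUBECONFIG_CONTROL"
echo "worker  kubeconfig: $KUBECONFIG_WORKER"
```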
Step-by-Step Guide: Setting Up Multi-Cluster Resource Sharing with Liqo
Step 1: Install the Liqo CLI
The Liqo CLI, liqoctl, is a convenient tool that simplifies the installation and management of Liqo. It’s the recommended way to get started.
The following command downloads an installation script and executes it, placing the liqoctl binary in your user’s local bin directory (typically /usr/local/bin or ~/.local/bin). Ensure this directory is in your system’s PATH. The CLI will be used for installing Liqo on your clusters and managing peering relationships.
curl -sL https://get.liqo.io/install.sh | bash
Verify the installation by checking the version:
liqoctl version
Expected Output:
Client version: vX.Y.Z
(The exact version number may vary; a server version is also printed once Liqo is installed in the cluster.)
Step 2: Install Liqo on Your Clusters
Now, we’ll install Liqo on both of your Kubernetes clusters. For this guide, let’s call them control-plane and worker-cluster. Replace the kubeconfig paths with your actual paths. liqoctl install supports provider-specific presets (e.g., liqoctl install kind, liqoctl install eks) that tailor the installation to your environment; consult liqoctl install --help for the providers available in your version. For production environments across different cloud providers, you may need to set cluster labels and Pod/Service CIDRs explicitly, and potentially expose the Liqo gateway on a public IP.
Liqo components will be installed in the liqo namespace, including the Liqo controller manager and the network fabric components (such as the gateway). Once a peering is established, a virtual kubelet is started for the remote cluster: it represents the remote cluster’s resources as a local (virtual) node, allowing the standard scheduler to offload pods. The network components set up a secure, encrypted tunnel (e.g., using WireGuard) between clusters, enabling seamless pod-to-pod communication across cluster boundaries. For more details on secure networking, refer to the Cilium WireGuard Encryption article.
# Install Liqo on the control-plane cluster
liqoctl install --cluster-name control-plane --kubeconfig ~/.kube/config-control
# Install Liqo on the worker-cluster
liqoctl install --cluster-name worker-cluster --kubeconfig ~/.kube/config-worker
Verify Liqo Installation:
Check the status of Liqo components in both clusters. Look for pods running in the liqo namespace.
# On control-plane cluster
kubectl get pods -n liqo --kubeconfig ~/.kube/config-control
# On worker-cluster
kubectl get pods -n liqo --kubeconfig ~/.kube/config-worker
Expected Output (similar for both clusters; the exact component list varies by Liqo version, and virtual-kubelet pods appear only after a peering is established):
NAME READY STATUS RESTARTS AGE
liqo-controller-manager-69d58498b-sx7h5 1/1 Running 0 2m
liqo-gateway-67f784d5c-p4m8b 1/1 Running 0 2m
liqo-network-manager-78c776fd8-9g6vj 1/1 Running 0 2m
Step 3: Peer the Clusters
Now that Liqo is installed, we need to establish a peering relationship between the clusters. This step tells Liqo which clusters can share resources. We’ll initiate the peering from the control-plane cluster, referencing the worker-cluster.
The liqoctl peer command needs access to both clusters: it exchanges authentication material and network configuration, and creates a ForeignCluster custom resource describing the remote cluster. Liqo then uses this information to set up a secure peering link. Once peered, the control-plane cluster will create a virtual node representing the resources of the worker-cluster, making them available for scheduling. Each cluster is identified by a unique cluster ID, which Liqo exchanges automatically during peering.
# Peer control-plane with worker-cluster (run from a machine that can reach both clusters)
liqoctl peer --kubeconfig ~/.kube/config-control --remote-kubeconfig ~/.kube/config-worker
Verify Peering Status:
Check the status of the peering from both clusters. You should see the foreign cluster listed as Established.
# On control-plane cluster
kubectl get foreignclusters --kubeconfig ~/.kube/config-control
# On worker-cluster
kubectl get foreignclusters --kubeconfig ~/.kube/config-worker
Expected Output (illustrative; exact columns and roles vary by Liqo version):
# Output on control-plane:
CLUSTER ID CLUSTER NAME AUTH TYPE NETWORK STATUS OVERALL STATUS
a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0 liqo-worker-cluster Local Established Established
# Output on worker-cluster:
CLUSTER ID CLUSTER NAME AUTH TYPE NETWORK STATUS OVERALL STATUS
u1v2w3x4y5z6a7b8c9d0e1f2g3h4i5j6k7l8m9n0 liqo-control-plane Local Established Established
Additionally, check for the virtual node created in the control-plane cluster:
kubectl get nodes --kubeconfig ~/.kube/config-control
Expected Output:
NAME STATUS ROLES AGE VERSION
control-plane-node-1 Ready control-plane 2d v1.27.3
liqo-worker-cluster-node Ready liqo 1m v1.27.3-liqo
Notice the liqo-worker-cluster-node – this is the virtual node representing the remote cluster!
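For reference, a virtual node is a regular Node object carrying Liqo-specific metadata. The sketch below is an abbreviated view: the liqo.io/type label and the virtual-node.liqo.io/not-allowed taint follow the Liqo documentation, but names can differ between versions, so verify with kubectl get node liqo-worker-cluster-node -o yaml.

```yaml
# Abbreviated view of a Liqo virtual node (illustrative, fields trimmed)
apiVersion: v1
kind: Node
metadata:
  name: liqo-worker-cluster-node
  labels:
    liqo.io/type: virtual-node   # marks the node as a remote-cluster proxy
spec:
  taints:
  - key: virtual-node.liqo.io/not-allowed   # keeps non-offloaded pods off the virtual node
    value: "true"
    effect: NoExecute
```

Because of that taint, only pods that tolerate it (e.g., pods in an offloaded namespace) can land on the virtual node.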
Step 4: Offload a Workload to the Remote Cluster
With peering established, you can now instruct Kubernetes to offload pods to the remote cluster. Liqo works at namespace granularity: once a namespace is offloaded with liqoctl offload namespace, pods created inside it become eligible to run on the virtual node. No custom scheduler is required — the default Kubernetes scheduler places the pods, and a Liqo webhook adds the toleration needed to land on virtual nodes. For this example, we’ll offload the default namespace (in real deployments, prefer a dedicated namespace).
When the scheduler assigns a pod to the virtual node (e.g., liqo-worker-cluster-node), Liqo creates a “shadow pod” in the remote cluster: the pod manifest is sent there, and its lifecycle is managed remotely. From the perspective of the originating cluster, the pod appears to be running on the virtual node, providing a seamless experience. This mechanism greatly improves resource utilization, as idle capacity in one cluster can be leveraged by another. This capability is a cornerstone of effective Karpenter Cost Optimization strategies, allowing clusters to dynamically scale and share resources.
# Offload the default namespace, then create a deployment in it
# Run this on the control-plane cluster
liqoctl offload namespace default --kubeconfig ~/.kube/config-control
kubectl apply --kubeconfig ~/.kube/config-control -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-offloaded
  labels:
    app: nginx-offloaded
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-offloaded
  template:
    metadata:
      labels:
        app: nginx-offloaded
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
EOF
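Namespace offloading can also be expressed declaratively through Liqo’s NamespaceOffloading custom resource instead of the CLI. The sketch below assumes the default namespace; the API version and field values follow the Liqo documentation and may differ between Liqo releases, so treat it as illustrative:

```yaml
# Declarative sketch of namespace offloading (verify apiVersion on your Liqo release)
apiVersion: offloading.liqo.io/v1alpha1
kind: NamespaceOffloading
metadata:
  name: offloading          # Liqo expects this fixed name
  namespace: default        # the namespace being offloaded
spec:
  namespaceMappingStrategy: DefaultName   # remote namespace name derives from the local one
  podOffloadingStrategy: LocalAndRemote   # pods may run on local or virtual nodes
```

Setting podOffloadingStrategy to Remote would force every pod in the namespace onto virtual nodes only.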
Verify Pod Offloading:
Check where the pods are running. With the default offloading policy, pods may land on either local or virtual nodes; in this example both replicas were placed on the liqo-worker-cluster-node.
kubectl get pods -o wide --kubeconfig ~/.kube/config-control
Expected Output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-offloaded-7d5c7c7f7-abcde 1/1 Running 0 1m 10.0.0.10 liqo-worker-cluster-node <none> <none>
nginx-offloaded-7d5c7c7f7-fghij 1/1 Running 0 1m 10.0.0.11 liqo-worker-cluster-node <none> <none>
You can also verify directly on the worker-cluster. Liqo replicates the offloaded namespace there, typically under a name with a cluster-specific suffix, so search across all namespaces:
kubectl get pods -A -o wide --kubeconfig ~/.kube/config-worker | grep nginx-offloaded
Expected Output (the namespace name and suffix will vary):
default-a1b2c3 nginx-offloaded-7d5c7c7f7-abcde 1/1 Running 0 1m 10.0.0.10 worker-cluster-node-1 <none> <none>
default-a1b2c3 nginx-offloaded-7d5c7c7f7-fghij 1/1 Running 0 1m 10.0.0.11 worker-cluster-node-1 <none> <none>
Notice that the pods are actually running on a node within the worker-cluster, but from the control-plane‘s perspective, they appear on the virtual node.
Step 5: Expose the Offloaded Service
To access the offloaded application, you’ll typically expose it via a Kubernetes Service. Liqo intelligently handles service discovery and routing between clusters.
When a Service is created in the control-plane cluster for pods offloaded to the worker-cluster, Liqo’s network manager ensures that traffic destined for this Service is correctly routed to the remote pods. This might involve creating a corresponding Service in the remote cluster or using IP address translation and tunneling. This network transparency is crucial for multi-cluster applications and is a feature often handled by sophisticated service meshes like Istio Ambient Mesh or advanced CNI plugins. Liqo provides this without requiring a full service mesh, making it lightweight and efficient.
# Create a service for the offloaded Nginx deployment
# Run this on the control-plane cluster
kubectl apply --kubeconfig ~/.kube/config-control -f - <<EOF
apiVersion: v1
kind: Service
metadata:
name: nginx-offloaded-service
spec:
selector:
app: nginx-offloaded
ports:
- protocol: TCP
port: 80
targetPort: 80
type: ClusterIP # Or NodePort/LoadBalancer if you need external access
EOF
Verify Service Connectivity:
Get the ClusterIP of the service and try to access it from a pod in the control-plane cluster.
# Get the service IP on control-plane
kubectl get svc nginx-offloaded-service --kubeconfig ~/.kube/config-control
# Create a temporary pod to test connectivity on control-plane
kubectl run -it --rm --kubeconfig ~/.kube/config-control curl-test --image=curlimages/curl -- sh
Inside the curl-test pod, execute:
curl nginx-offloaded-service
Expected Output:
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</html>
This confirms that your application, although running in a different Kubernetes cluster, is accessible via its service IP in the local cluster, demonstrating Liqo’s seamless networking capabilities.
Production Considerations
Deploying Liqo in a production environment requires careful planning beyond a simple tutorial. Here are key aspects to consider:
- Security:
- Authentication: Liqo supports different authentication methods for peering, including token-based (default) and certificate-based. For production, consider using more robust certificate-based authentication. Refer to the Liqo documentation on authentication.
- Network Encryption: Liqo uses WireGuard for inter-cluster pod-to-pod communication by default. Ensure this is sufficient for your security requirements. For highly sensitive data, consider additional layers of encryption or integrating with specialized solutions. For more on secure networking, see our guide on Cilium WireGuard Encryption.
- Network Policies: Implement strict Kubernetes Network Policies in both clusters to control traffic flow to and from Liqo components and offloaded pods.
- RBAC: Carefully review and restrict the RBAC permissions granted to Liqo components.
- Networking:
- CIDR Overlaps: Prevent IP address range overlaps between clusters. Liqo has mechanisms to handle this, but it’s best to design your network to avoid them from the start.
- Egress/Ingress: Plan how external traffic will reach offloaded services. This might involve setting up Kubernetes Gateway API controllers or Ingress controllers in the appropriate cluster and configuring DNS.
- Firewalls and Security Groups: Ensure that necessary ports (e.g., WireGuard UDP port 51820) are open between clusters for Liqo’s network tunnels.
- Resource Management:
- Resource Limits and Requests: Always define accurate resource requests and limits for your offloaded pods to prevent resource exhaustion in the remote cluster.
- Resource Quotas: Implement Resource Quotas in namespaces that are offloaded to control resource consumption on the remote cluster.
- Node Selectors/Taints/Tolerations: Use these Kubernetes features to control which types of nodes (local or virtual) your pods can be scheduled on.
- Observability and Monitoring:
- Logging: Centralize logs from all clusters, including Liqo components and offloaded applications.
- Metrics: Monitor resource utilization (CPU, memory, network) in both local and remote clusters. Liqo exposes Prometheus metrics. Consider integrating with tools like eBPF Observability with Hubble for deep network insights.
- Alerting: Set up alerts for peering status, offloaded pod failures, and resource thresholds.
- High Availability and Disaster Recovery:
- Liqo HA: Liqo components are typically deployed as Deployments, providing some level of HA. Ensure your underlying Kubernetes clusters are highly available.
- Backup and Restore: Include Liqo’s custom resources (e.g., ForeignCluster) in your cluster backup strategy.
- Cost Management:
- Offloading to different cloud providers or regions can have cost implications. Monitor costs closely, especially with dynamic scaling. Tools like Karpenter Cost Optimization can be used in conjunction with Liqo to manage node costs.
- Application Compatibility:
- Ensure your applications are designed for distributed environments. Stateful applications might require shared storage solutions (e.g., object storage, distributed file systems) accessible from both clusters, or careful consideration of data locality.
- Applications requiring specific hardware (e.g., GPUs for LLM GPU Scheduling) must ensure the remote cluster provides those resources.
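The resource-management advice above can be combined in a single manifest. This sketch pins a pod to remote virtual nodes only, tolerates the Liqo virtual-node taint, and bounds its resource usage; the label and taint keys are taken from the Liqo documentation and should be verified against your version:

```yaml
# Sketch: a pod restricted to Liqo virtual nodes, with explicit resource bounds
apiVersion: v1
kind: Pod
metadata:
  name: remote-only-worker
spec:
  nodeSelector:
    liqo.io/type: virtual-node          # schedule only onto virtual (remote) nodes
  tolerations:
  - key: virtual-node.liqo.io/not-allowed   # taint Liqo sets on virtual nodes
    operator: Exists
    effect: NoExecute
  containers:
  - name: app
    image: nginx:latest
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```

Omitting the nodeSelector (while keeping requests and limits) lets the scheduler choose freely between local and remote capacity.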
Troubleshooting
1. Liqo CLI installation fails or liqoctl command not found.
Issue: The liqoctl command is not recognized after running the install script.
Solution:
The installation script typically places the liqoctl binary in /usr/local/bin or ~/.local/bin. Ensure that this directory is included in your system’s PATH environment variable. You might need to restart your terminal or manually add it.
# Check your PATH
echo $PATH
# If ~/.local/bin is not in PATH, add it (for current session)
export PATH=$PATH:~/.local/bin
# Or add permanently to your shell config (e.g., ~/.bashrc, ~/.zshrc)
echo 'export PATH=$PATH:~/.local/bin' >> ~/.bashrc
source ~/.bashrc # Apply changes
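The PATH check can be scripted so you only edit your shell configuration when needed. This is plain POSIX shell with no Liqo-specific assumptions:

```shell
# Print whether ~/.local/bin is already on PATH; if not, suggest the export line.
dir="$HOME/.local/bin"
case ":$PATH:" in
  *":$dir:"*) echo "already in PATH: $dir" ;;
  *)          echo "missing; run: export PATH=\$PATH:$dir" ;;
esac
```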
2. Liqo pods are not running or stuck in Pending/CrashLoopBackOff.
Issue: After liqoctl install, some pods in the liqo namespace are not in a Running state.
Solution:
First, check the events and logs for the problematic pods to diagnose the specific issue. Common causes include resource constraints, incorrect cluster configuration, or network issues.
# Get pod status
kubectl get pods -n liqo --kubeconfig ~/.kube/config-control
# Describe a problematic pod (replace with actual pod name)
kubectl describe pod liqo-controller-manager-xxxx -n liqo --kubeconfig ~/.kube/config-control
# View logs for a problematic pod
kubectl logs liqo-controller-manager-xxxx -n liqo --kubeconfig ~/.kube/config-control
Look for messages indicating missing permissions (RBAC), image pull errors, or network configuration problems.
3. Peering status is stuck in “Pending” or “Error”.
Issue: After running liqoctl peer, kubectl get foreignclusters shows the status as Pending or Error.
Solution:
This usually indicates a network connectivity problem between the clusters or a misconfigured peering command.
- Verify Network Connectivity: Ensure that the clusters can reach each other over the network. If using cloud providers, check security groups, network ACLs, and routing tables. For local setups, ensure ports are exposed correctly.
- Correct Kubeconfigs: Double-check that the --kubeconfig and --remote-kubeconfig paths passed to liqoctl peer point at the intended clusters.
- Check Liqo Gateway Pods: Ensure the liqo-gateway pods are running correctly in both clusters.
- Check Gateway Logs: Examine logs of the liqo-gateway pods for peering-related errors.
# Get gateway logs on control-plane
kubectl logs -f -n liqo deploy/liqo-gateway --kubeconfig ~/.kube/config-control
# Get gateway logs on worker-cluster
kubectl logs -f -n liqo deploy/liqo-gateway --kubeconfig ~/.kube/config-worker
4. Pods in an offloaded namespace are stuck in “Pending” or never offloaded.
Issue: Pods intended for offloading are not scheduled and remain in a Pending state.
Solution:
This means the scheduler could not place the pod on a suitable node, local or virtual.
- Verify Peering: Ensure the peering status is Established in both directions.
- Check Namespace Offloading: Confirm that the namespace was actually offloaded (e.g., liqoctl offload namespace completed successfully) and that its pod offloading strategy allows remote placement.
- Check Virtual Node: Confirm that the virtual node representing the remote cluster exists and is in a Ready state (e.g., liqo-worker-cluster-node).
- Resource Availability: Does the remote cluster have enough resources (CPU, memory) to accommodate the pod? Check the virtual node’s capacity and the remote cluster’s actual node capacity.
- Taints/Tolerations: Are there any taints on the remote cluster’s nodes that the offloaded pods don’t tolerate? Liqo automatically adds some tolerations, but custom taints can interfere.
- Controller Logs: Examine the logs of the liqo-controller-manager for offloading decisions and errors.
# Describe the pending pod for events
kubectl describe pod nginx-offloaded-xxxxx --kubeconfig ~/.kube/config-control
# Check virtual node status
kubectl get nodes liqo-worker-cluster-node -o yaml --kubeconfig ~/.kube/config-control
# Check liqo-controller-manager logs
kubectl logs -f -n liqo deploy/liqo-controller-manager --kubeconfig ~/.kube/config-control
5. Offloaded pods cannot communicate with local services (or vice-versa).
Issue: After offloading, applications cannot connect to services running in the original cluster, or local applications cannot reach offloaded services.
Solution:
This is a networking issue, typically related to Liqo’s network fabric or Kubernetes service discovery.
- Verify Network Status: Check that the NETWORK STATUS for the foreign cluster is Established.
- Check Liqo Network Pods: Ensure the liqo-network-manager and liqo-gateway pods are healthy in both clusters.
- Pod IP Ranges: Ensure there are no overlapping CIDRs between the clusters. Liqo can remap overlapping ranges by default, but manual misconfigurations can cause issues.
- Service Export/Import (if applicable): If you’re using Liqo’s service reflection features, ensure the services are correctly exported and imported.
- Network Policy Interference: If you have Kubernetes Network Policies, ensure they allow traffic between the Liqo network bridge and your pods, and between clusters.
- DNS Resolution: Verify that DNS resolution works correctly for services across clusters.
# Check foreign cluster network status
kubectl get foreignclusters --kubeconfig ~/.kube/config-control
# Check network manager logs
kubectl logs -f -n liqo deploy/liqo-network-manager --kubeconfig ~/.kube/config-control
6. Resource offloading is not working as expected (e.g., specific resources like GPUs).
Issue: Pods requiring specialized resources (e.g., GPUs, custom devices) are not offloaded or fail when offloaded.
Solution:
Liqo reflects standard Kubernetes resources. For specialized resources:
- Device Plugins: Ensure the necessary Kubernetes device plugins are installed and correctly configured on the nodes of the remote cluster that provides the specialized hardware.
- Resource Naming: Verify that the resource names (e.g., nvidia.com/gpu) are consistent between the pod request and the remote cluster’s advertised capacity. Refer to the Kubernetes Device Plugin documentation.
- Liqo Configuration: Liqo typically discovers these resources automatically, but if issues persist, check Liqo’s configuration for any resource reflection filters. For detailed GPU scheduling best practices, see LLM GPU Scheduling Guide.
FAQ Section
Q1: What is Liqo and why would I use it?
A1: Liqo is an open-source project that enables dynamic and transparent resource sharing between multiple Kubernetes clusters; its name evokes the idea of “liquid” computing. You would use it to improve resource utilization, handle traffic spikes by bursting workloads to other clusters, enable multi-cloud or hybrid-cloud deployments, and facilitate disaster recovery scenarios by providing a unified resource pool across disparate clusters. It abstracts away the complexity of cross-cluster networking and scheduling.
Q2: How does Liqo handle networking between clusters?
A2: Liqo establishes a secure, encrypted overlay network between peered clusters, typically using WireGuard. It manages IP address translation and routing, ensuring that pods offloaded to a remote cluster can seamlessly communicate with services and other pods in the original cluster, and vice-versa. It also handles service discovery across cluster boundaries, making offloaded applications appear as if they are running locally. For deeper insights into secure networking, consider exploring Cilium WireGuard Encryption.
Q3: Can Liqo offload stateful applications?
A3: While Liqo can technically offload stateful applications, it does not inherently provide distributed persistent storage. For stateful applications, you need to ensure that the persistent data is accessible from the remote cluster. This often involves using shared storage solutions like object storage (e.g., AWS S3, GCP Cloud Storage), distributed file systems (e.g., Ceph, GlusterFS), or database-as-a-service offerings that are accessible from both clusters. Designing stateful applications for multi-cluster environments requires careful consideration of data locality and consistency.
Q4: What’s the difference between Liqo and a service mesh like Istio?
A4: Liqo and service meshes like Istio serve different primary purposes, though they can complement each other. Liqo focuses on multi-cluster resource sharing and scheduling, allowing pods to run on remote clusters and providing basic cross-cluster networking. Istio (or Istio Ambient Mesh) focuses on traffic management, security, and observability for services within and across clusters, but it doesn’t directly handle resource offloading or scheduling across cluster boundaries. You can use Liqo to offload pods to a remote cluster, and then use Istio to manage traffic to those offloaded services and provide advanced policies.
Q5: How does Liqo integrate with Kubernetes schedulers?
A5: Liqo does not replace the default Kubernetes scheduler. Instead, it exposes each peered remote cluster as a virtual node backed by a virtual kubelet. When a namespace is offloaded, Liqo makes its pods eligible to land on virtual nodes (by adding the required toleration), and the standard scheduler decides placement based on resources and constraints. When a pod is assigned to a virtual node, Liqo creates a “shadow pod” in the remote cluster and manages its lifecycle, making the remote resources appear as an extension of the local cluster.
Cleanup Commands
To remove Liqo and all its associated resources from your clusters, follow these steps. It’s crucial to remove offloaded applications and unoffload namespaces first. (Exact command names may vary slightly between Liqo versions.)
# 1. Delete any offloaded applications from the control-plane cluster
kubectl delete deployment nginx-offloaded --kubeconfig ~/.kube/config-control
kubectl delete service nginx-offloaded-service --kubeconfig ~/.kube/config-control
# 2. Unoffload any offloaded namespaces
liqoctl unoffload namespace default --kubeconfig ~/.kube/config-control
# 3. Remove the peering between the clusters
liqoctl unpeer --kubeconfig ~/.kube/config-control --remote-kubeconfig ~/.kube/config-worker
# 4. Uninstall Liqo from both clusters
liqoctl uninstall --kubeconfig ~/.kube/config-control
liqoctl uninstall --kubeconfig ~/.kube/config-worker