The burgeoning field of Artificial Intelligence, particularly the rapid adoption of Large Language Models (LLMs), has introduced new complexities into application architecture. Deploying and managing LLMs in production environments requires robust infrastructure capable of handling high-throughput, low-latency requests, often with varying computational demands. Traditional load balancing strategies, while effective for stateless microservices, may fall short when dealing with the stateful nature, resource intensity, and dynamic scaling needs of LLM inference endpoints.
This guide delves into the critical patterns for building an “AI Gateway” on Kubernetes, specifically focusing on advanced load balancing techniques for LLM requests. We’ll explore how to leverage Kubernetes-native constructs and service mesh capabilities to create an intelligent routing layer that optimizes performance, cost, and reliability for your AI workloads. From basic HTTP routing to sophisticated content-based and GPU-aware load balancing, you’ll learn to design an AI Gateway that ensures your LLM applications are not only highly available but also efficiently utilize underlying resources, especially crucial for managing expensive GPU infrastructure. For more insights on optimizing these resources, refer to our LLM GPU Scheduling Guide.
By the end of this tutorial, you’ll have a comprehensive understanding of how to implement an AI Gateway that intelligently distributes LLM inference requests, handles retries, manages circuit breaking, and provides critical observability into your AI infrastructure. We’ll walk through practical examples using popular tools like NGINX Ingress, Gateway API, and Istio, demonstrating how to build a resilient and performant AI serving layer.
TL;DR
- Problem: Efficiently load balance and manage LLM inference requests on Kubernetes, addressing resource intensity, dynamic scaling, and varying computational demands.
- Solution: Implement an AI Gateway using Kubernetes Ingress controllers (like NGINX) or advanced service mesh solutions (like Istio/Gateway API) for intelligent routing, retry policies, and observability.
- Key Concepts: HTTP Load Balancing, Content-Based Routing, GPU-Aware Routing, Rate Limiting, Circuit Breaking, Request Retries.
- Core Tools: NGINX Ingress Controller, Kubernetes Gateway API, Istio Service Mesh, Custom Admission Controllers.
Key Commands:
# Deploy NGINX Ingress Controller
helm upgrade --install ingress-nginx ingress-nginx \
--repo https://kubernetes.github.io/ingress-nginx \
--namespace ingress-nginx --create-namespace
# Deploy an LLM service example
kubectl apply -f https://raw.githubusercontent.com/kubezilla/ai-gateway-examples/main/llm-service.yaml
# Configure Ingress for basic routing
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "llm_session"
    nginx.ingress.kubernetes.io/session-cookie-expires: "1h"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "1h"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /v1/chat/completions
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 80
EOF
# Deploy Istio for advanced routing (if applicable)
# See Istio installation docs: https://istio.io/latest/docs/setup/install/
# Example Istio Gateway/VirtualService
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  selector:
    istio: ingressgateway # use the default Istio ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "llm.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-virtualservice
spec:
  hosts:
  - "llm.example.com"
  gateways:
  - llm-gateway
  http:
  - match:
    - uri:
        prefix: /v1/chat/completions
    route:
    - destination:
        host: llm-service
        port:
          number: 80
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream
EOF
Prerequisites
- Kubernetes Cluster: A running Kubernetes cluster (v1.20+). You can use Minikube, Kind, or a managed service like EKS, GKE, or AKS.
- kubectl: The Kubernetes command-line tool, configured to connect to your cluster.
- Helm: The Kubernetes package manager, used for deploying Ingress controllers and service meshes. Install from Helm’s official documentation.
- Basic Kubernetes Knowledge: Familiarity with Deployments, Services, Ingress, and YAML manifests.
- Optional (for advanced patterns):
- Istio: For service mesh capabilities. Follow the Istio installation guide.
- Kubernetes Gateway API: For modern ingress management. Install the CRDs as per the Gateway API documentation. For a deeper dive into Gateway API, check out our Kubernetes Gateway API Migration Guide.
- LLM Inference Endpoint: A simulated or actual LLM inference service running in your cluster. We’ll use a simple NGINX echo server for demonstration, but the principles apply to any LLM endpoint.
Step-by-Step Guide
1. Deploy a Sample LLM Inference Service
To demonstrate load balancing, we first need a backend service that mimics an LLM inference endpoint. We’ll deploy a simple NGINX server that echoes request headers and body, allowing us to observe how requests are routed. We’ll deploy two instances of this service to showcase load balancing across multiple pods.
This setup simulates a scenario where you might have different versions of an LLM, or simply multiple replicas of the same LLM, serving requests. Each pod responds with its hostname, letting us verify that load balancing is indeed distributing traffic across them. This is a foundational step for any AI workload, since efficient GPU scheduling and resource distribution are key to performance; see our LLM GPU Scheduling Guide for details.
# llm-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
  labels:
    app: llm-service
spec:
  replicas: 2 # Deploy two instances to observe load balancing
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      containers:
      - name: llm-container
        image: nginxdemos/hello:plain-text # A simple echo server
        ports:
        - containerPort: 80
        resources: # Simulate resource requirements for an LLM
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
  labels:
    app: llm-service
spec:
  selector:
    app: llm-service
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
Apply the manifest:
kubectl apply -f llm-service.yaml
Verify:
kubectl get pods -l app=llm-service
NAME READY STATUS RESTARTS AGE
llm-service-6789xxxx-abcde 1/1 Running 0 1m
llm-service-6789xxxx-fghij 1/1 Running 0 1m
kubectl get svc llm-service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
llm-service ClusterIP 10.xx.xx.xx <none> 80/TCP 1m
2. Basic HTTP Load Balancing with NGINX Ingress
The simplest way to expose and load balance your LLM service is by using an Ingress controller. NGINX Ingress Controller is a popular choice, providing robust HTTP routing capabilities. It acts as the entry point for external traffic into your Kubernetes cluster, directing it to the appropriate backend services. For more details on ingress, consider our Kubernetes Gateway API Migration Guide, which discusses modern alternatives.
We’ll first deploy the NGINX Ingress controller and then configure an Ingress resource to route traffic from a specific hostname to our llm-service. NGINX Ingress supports various load balancing algorithms by default, such as round-robin, which distributes requests evenly across all available backend pods. This is a good starting point for many LLM deployments, especially if all your LLM instances are identical and can handle any request.
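If uniform round-robin is not ideal — for example, when some replicas respond more slowly than others — ingress-nginx also exposes a per-Ingress annotation to switch the balancing algorithm. A minimal sketch (added to the Ingress metadata; `ewma` selects the backend with the lowest recent latency):

```yaml
metadata:
  annotations:
    # Peak EWMA: route to the backend with the lowest recent latency,
    # useful when LLM replicas respond at uneven speeds.
    nginx.ingress.kubernetes.io/load-balance: "ewma"
```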
# Deploy NGINX Ingress Controller
# For a production setup, consider customizing the values.yaml
helm upgrade --install ingress-nginx ingress-nginx \
--repo https://kubernetes.github.io/ingress-nginx \
--namespace ingress-nginx --create-namespace
Verify NGINX Ingress Controller deployment:
kubectl get pods -n ingress-nginx
NAME READY STATUS RESTARTS AGE
ingress-nginx-controller-7b4c6d9f8-xxxxx 1/1 Running 0 2m
Now, create an Ingress resource for the LLM service:
# llm-ingress-basic.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress-basic
  annotations:
    # Optional: enable session affinity (sticky sessions) if your LLM requires it.
    # This ensures requests from the same client go to the same backend pod.
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "llm_session"
    nginx.ingress.kubernetes.io/session-cookie-expires: "1h"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "1h"
spec:
  ingressClassName: nginx # Specify the IngressClass
  rules:
  - host: llm.example.com # Replace with your desired hostname
    http:
      paths:
      - path: /v1/chat/completions # Example API path
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 80
Apply the Ingress:
kubectl apply -f llm-ingress-basic.yaml
Verify:
kubectl get ingress llm-ingress-basic
NAME CLASS HOSTS ADDRESS PORTS AGE
llm-ingress-basic nginx llm.example.com <EXTERNAL-IP> 80 1m
You’ll need to get the external IP address of your Ingress controller (<EXTERNAL-IP> above) and map llm.example.com to it in your /etc/hosts file or DNS. Then, test with curl:
# Replace <EXTERNAL-IP> with the actual IP from 'kubectl get ingress'
# Add "llm.example.com <EXTERNAL-IP>" to your /etc/hosts file OR
# Use --resolve "llm.example.com:80:<EXTERNAL-IP>" with curl
curl -H "Host: llm.example.com" http://<EXTERNAL-IP>/v1/chat/completions
You should see responses from different pods (indicated by the hostname in the response body) as you send multiple requests, demonstrating round-robin load balancing. If you enabled session affinity, requests from the same client (with the same cookie) should consistently hit the same backend.
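To make the distribution visible at a glance, the loop below fires ten requests and tallies which backend served each one; a roughly even split confirms round-robin. It assumes `EXTERNAL_IP` is exported with your Ingress controller's address and that the `nginxdemos/hello` backend includes a `Server name:` line in its response body.

```shell
# Tally which backend pod served each request.
count_backends() {
  sort | uniq -c | sort -rn
}

# Only hit the cluster if EXTERNAL_IP is set.
if [ -n "${EXTERNAL_IP:-}" ]; then
  for i in $(seq 1 10); do
    curl -s -H "Host: llm.example.com" \
      "http://${EXTERNAL_IP}/v1/chat/completions" | grep 'Server name'
  done | count_backends
fi
```

With two healthy replicas you would expect each pod to appear about five times in the tally.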
3. Advanced Routing with Istio Service Mesh
For more sophisticated AI gateway patterns, such as content-based routing, request retries, circuit breaking, and traffic shifting for A/B testing or canary deployments, a service mesh like Istio is invaluable. Istio provides a powerful control plane and data plane (Envoy proxies) that can intercept and manage all network traffic within your cluster.
Before proceeding, ensure Istio is installed in your cluster. Refer to the official Istio installation guide. For a production-ready setup, consider our Istio Ambient Mesh Production Guide. Once installed, you will need to label the namespace where your LLM service is deployed for Istio’s sidecar injection or configure Ambient Mesh.
# Assuming default namespace or your LLM service namespace
kubectl label namespace default istio-injection=enabled --overwrite
# Or if using Istio Ambient Mesh:
# kubectl label namespace default istio.io/dataplane-mode=ambient --overwrite
# Restart LLM service pods for sidecar injection or ambient mode to take effect
kubectl rollout restart deployment llm-service
Now, let’s configure an Istio Gateway and VirtualService to route traffic to our LLM service. This allows us to define more granular traffic management rules.
# llm-istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  selector:
    istio: ingressgateway # use the default Istio ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "llm.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-virtualservice
spec:
  hosts:
  - "llm.example.com"
  gateways:
  - llm-gateway
  http:
  - match:
    - uri:
        prefix: /v1/chat/completions
    route:
    - destination:
        host: llm-service
        port:
          number: 80
    # Advanced policies for LLM requests
    retries:
      attempts: 3
      perTryTimeout: 2s # Retry if no response is received within 2 seconds
      retryOn: gateway-error,connect-failure,refused-stream,retriable-4xx,5xx # Conditions that trigger a retry
    timeout: 10s # Overall request timeout
    # Fault injection can also be configured here. Circuit breaking, however,
    # is configured on a DestinationRule's trafficPolicy (connectionPool and
    # outlierDetection), not on the VirtualService.
Apply the Istio configuration:
kubectl apply -f llm-istio-gateway.yaml
Verify:
kubectl get gateway,virtualservice
NAME AGE
gateway.networking.istio.io/llm-gateway 1m
NAME AGE
virtualservice.networking.istio.io/llm-virtualservice 1m
Get the external IP of the Istio Ingress Gateway:
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
export GATEWAY_URL=$INGRESS_HOST:$INGRESS_PORT
echo "$GATEWAY_URL"
Then, test with curl, again updating your /etc/hosts or using --resolve:
curl -H "Host: llm.example.com" http://$GATEWAY_URL/v1/chat/completions
You’ll observe similar load balancing behavior, but now with Istio’s advanced features like retries active. If a backend temporarily fails, Istio will automatically retry the request up to 3 times with a 2-second timeout per attempt.
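Circuit breaking itself lives on a DestinationRule rather than the VirtualService. A hedged sketch (the thresholds below are illustrative defaults, not tuned values):

```yaml
# llm-destinationrule.yaml (illustrative thresholds)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-service-circuit-breaker
spec:
  host: llm-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 10 # Queue depth before requests are rejected
        http2MaxRequests: 100       # Max concurrent requests to the backend
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5 # Eject a pod after 5 consecutive 5xx responses
      interval: 30s           # How often hosts are analyzed
      baseEjectionTime: 60s   # How long an ejected pod stays out of rotation
      maxEjectionPercent: 50  # Never eject more than half the pods
```

Outlier detection temporarily removes misbehaving pods from the load-balancing pool, which pairs well with the retry policy above: a retry lands on a healthy pod instead of the one that just failed.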
4. Content-Based Routing for LLMs
Often, different LLMs (or different versions of the same LLM) are optimized for specific tasks or prompt lengths. An AI Gateway can route requests based on content in the request body or headers. For example, short prompts might go to a smaller, faster model, while complex, long prompts go to a larger, more capable model, or even to a specific GPU-accelerated endpoint. This also keeps costs in check (see our Karpenter Cost Optimization guide) by ensuring expensive resources are used only when necessary.
While NGINX Ingress can do some header-based routing, Istio excels here with its ability to inspect request bodies (via Envoy filters, though this can be complex) or more commonly, route based on custom headers or URL paths. We’ll simulate this by deploying two different LLM services (llm-small-model and llm-large-model) and routing based on a custom header.
# llm-models.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-small-model
  labels:
    app: llm-small-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-small-model
  template:
    metadata:
      labels:
        app: llm-small-model
    spec:
      containers:
      - name: llm-small-container
        image: nginxdemos/hello:plain-text
        env:
        - name: MODEL_TYPE
          value: "small"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-small-model
  labels:
    app: llm-small-model
spec:
  selector:
    app: llm-small-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-large-model
  labels:
    app: llm-large-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-large-model
  template:
    metadata:
      labels:
        app: llm-large-model
    spec:
      containers:
      - name: llm-large-container
        image: nginxdemos/hello:plain-text
        env:
        - name: MODEL_TYPE
          value: "large"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
            # Example GPU request; requires the GPU operator/device plugin
            # nvidia.com/gpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
            # nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-large-model
  labels:
    app: llm-large-model
spec:
  selector:
    app: llm-large-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
Apply the manifests and restart the deployments so Istio can inject its sidecars:
kubectl apply -f llm-models.yaml
kubectl rollout restart deployment llm-small-model llm-large-model
Now, update the Istio VirtualService to route based on a custom header x-llm-model-type:
# llm-istio-content-routing.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-virtualservice
spec:
  hosts:
  - "llm.example.com"
  gateways:
  - llm-gateway
  http:
  - match: # Route to the large model when the header is "large"
    - uri:
        prefix: /v1/chat/completions
      headers:
        x-llm-model-type:
          exact: "large"
    route:
    - destination:
        host: llm-large-model
        port:
          number: 80
    retries:
      attempts: 3
      perTryTimeout: 5s # Larger timeout for the potentially slower large model
      retryOn: gateway-error,connect-failure,5xx
  - match: # Route to the small model when the header is "small", or by default
    - uri:
        prefix: /v1/chat/completions
      headers:
        x-llm-model-type:
          exact: "small"
    - uri: # Default route when no matching header is present
        prefix: /v1/chat/completions
    route:
    - destination:
        host: llm-small-model
        port:
          number: 80
    retries:
      attempts: 2
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure
Apply the updated VirtualService:
kubectl apply -f llm-istio-content-routing.yaml
Verify:
# Test with default routing (should go to small model)
curl -H "Host: llm.example.com" http://$GATEWAY_URL/v1/chat/completions
# Expected: "Hello from llm-small-model-xxxx-xxxxx"
# Test with large model header
curl -H "Host: llm.example.com" -H "x-llm-model-type: large" http://$GATEWAY_URL/v1/chat/completions
# Expected: "Hello from llm-large-model-xxxx-xxxxx"
This demonstrates how you can route requests to different backend LLM services based on client-provided metadata. In a real-world scenario, a client application or an intermediate API gateway would add these headers based on prompt characteristics or user preferences.
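One lightweight client-side heuristic is to derive the header from prompt length. The sketch below is an illustration, not a production policy: the 200-character threshold is arbitrary, and `GATEWAY_URL` is assumed to be set as in the earlier steps.

```shell
# Choose a model tier from prompt length; the 200-character threshold
# is an arbitrary illustration, not a tuned value.
choose_model_type() {
  if [ "${#1}" -gt 200 ]; then
    echo "large"
  else
    echo "small"
  fi
}

PROMPT="What is the capital of France?"
MODEL_TYPE=$(choose_model_type "$PROMPT")

# Only hit the gateway if GATEWAY_URL is set.
if [ -n "${GATEWAY_URL:-}" ]; then
  curl -s -H "Host: llm.example.com" \
    -H "x-llm-model-type: ${MODEL_TYPE}" \
    "http://${GATEWAY_URL}/v1/chat/completions"
fi
```

A production system would more likely use token counts, task type, or per-tenant configuration than raw character length, but the routing mechanics are identical.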
5. Rate Limiting and Security Policies
Protecting your LLM endpoints from abuse, managing traffic, and ensuring fair usage often requires rate limiting. Both NGINX Ingress and Istio offer rate-limiting capabilities. Additionally, applying network policies can enhance security by restricting communication paths. For a deeper dive into securing your cluster, review our Network Policies Security Guide.
NGINX Ingress Rate Limiting:
# llm-ingress-rate-limit.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress-rate-limit
  annotations:
    nginx.ingress.kubernetes.io/limit-rpm: "60" # Limit to 60 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rps: "1" # Limit to 1 request per second per client IP
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "5" # Allow bursts of up to 5x the limit
spec:
  ingressClassName: nginx
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /v1/chat/completions
        pathType: Prefix
        backend:
          service:
            name: llm-small-model # Apply the rate limit to the small model
            port:
              number: 80
Apply the Ingress:
kubectl apply -f llm-ingress-rate-limit.yaml
Test Rate Limiting: Rapidly send requests to llm.example.com. You should start seeing 429 Too Many Requests after exceeding the limit.
for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" -H "Host: llm.example.com" http://<EXTERNAL-IP>/v1/chat/completions; done
Istio Rate Limiting (requires Istio’s rate limit service):
Istio’s rate limiting is more advanced, allowing granular policies based on various request attributes. It typically involves deploying a rate limit service and wiring it in with an EnvoyFilter (global limits are usually backed by an external store such as Redis). This is a more complex setup, but it provides enterprise-grade control. Refer to the Istio Rate Limiting documentation for detailed steps.
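As a lighter-weight alternative to a global rate limit service, Envoy’s local (per-proxy) rate limiter can be attached at the ingress gateway with an EnvoyFilter. The sketch below follows the pattern from Istio’s local rate limiting documentation; the token-bucket values are illustrative (roughly 60 requests per minute per gateway instance), and the resource name `llm-local-ratelimit` is our own:

```yaml
# llm-local-ratelimit.yaml (illustrative token-bucket values)
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: llm-local-ratelimit
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 60      # Bucket capacity
              tokens_per_fill: 60 # Tokens restored per fill...
              fill_interval: 60s  # ...once per minute (~60 rpm)
            filter_enabled:
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED
```

Note that local limits are enforced independently by each gateway replica, so the effective cluster-wide limit scales with the replica count; use the external rate limit service when you need a single global budget.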
Example Istio RequestAuthentication (for JWT validation):
Securing access to your LLM endpoints is paramount. Istio can enforce authentication policies, for example, by validating JWT tokens before requests reach your LLM services. This is a crucial aspect of securing sensitive AI models.
# llm-istio-auth.yaml