
AI Gateway: Load Balance LLM Requests

The burgeoning field of Artificial Intelligence, particularly the rapid adoption of Large Language Models (LLMs), has introduced new complexities into application architecture. Deploying and managing LLMs in production environments requires robust infrastructure capable of handling high-throughput, low-latency requests, often with varying computational demands. Traditional load balancing strategies, while effective for stateless microservices, may fall short when dealing with the stateful nature, resource intensity, and dynamic scaling needs of LLM inference endpoints.

This guide delves into the critical patterns for building an “AI Gateway” on Kubernetes, specifically focusing on advanced load balancing techniques for LLM requests. We’ll explore how to leverage Kubernetes-native constructs and service mesh capabilities to create an intelligent routing layer that optimizes performance, cost, and reliability for your AI workloads. From basic HTTP routing to sophisticated content-based and GPU-aware load balancing, you’ll learn to design an AI Gateway that ensures your LLM applications are not only highly available but also efficiently utilize underlying resources, especially crucial for managing expensive GPU infrastructure. For more insights on optimizing these resources, refer to our LLM GPU Scheduling Guide.

By the end of this tutorial, you’ll have a comprehensive understanding of how to implement an AI Gateway that intelligently distributes LLM inference requests, handles retries, manages circuit breaking, and provides critical observability into your AI infrastructure. We’ll walk through practical examples using popular tools like NGINX Ingress, Gateway API, and Istio, demonstrating how to build a resilient and performant AI serving layer.

TL;DR

  • Problem: Efficiently load balance and manage LLM inference requests on Kubernetes, addressing resource intensity, dynamic scaling, and varying computational demands.
  • Solution: Implement an AI Gateway using Kubernetes Ingress controllers (like NGINX) or advanced service mesh solutions (like Istio/Gateway API) for intelligent routing, retry policies, and observability.
  • Key Concepts: HTTP Load Balancing, Content-Based Routing, GPU-Aware Routing, Rate Limiting, Circuit Breaking, Request Retries.
  • Core Tools: NGINX Ingress Controller, Kubernetes Gateway API, Istio Service Mesh, Custom Admission Controllers.

Key Commands:

# Deploy NGINX Ingress Controller
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Deploy an LLM service example
kubectl apply -f https://raw.githubusercontent.com/kubezilla/ai-gateway-examples/main/llm-service.yaml

# Configure Ingress for basic routing
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "llm_session"
    nginx.ingress.kubernetes.io/session-cookie-expires: "1h"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "1h"
spec:
  ingressClassName: nginx
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /v1/chat/completions
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 80
EOF

# Deploy Istio for advanced routing (if applicable)
# See Istio installation docs: https://istio.io/latest/docs/setup/install/
# Example Istio Gateway/VirtualService
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  selector:
    istio: ingressgateway # use Istio default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "llm.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-virtualservice
spec:
  hosts:
  - "llm.example.com"
  gateways:
  - llm-gateway
  http:
  - match:
    - uri:
        prefix: /v1/chat/completions
    route:
    - destination:
        host: llm-service
        port:
          number: 80
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream
EOF

Prerequisites

  • Kubernetes Cluster: A running Kubernetes cluster (v1.20+). You can use Minikube, Kind, or a managed service like EKS, GKE, or AKS.
  • kubectl: The Kubernetes command-line tool, configured to connect to your cluster.
  • Helm: The Kubernetes package manager, used for deploying Ingress controllers and service meshes. Install from Helm’s official documentation.
  • Basic Kubernetes Knowledge: Familiarity with Deployments, Services, Ingress, and YAML manifests.
  • Optional (for advanced patterns): Istio service mesh installed in your cluster (covered in Step 3).
  • LLM Inference Endpoint: A simulated or actual LLM inference service running in your cluster. We’ll use a simple NGINX echo server for demonstration, but the principles apply to any LLM endpoint.

Step-by-Step Guide

1. Deploy a Sample LLM Inference Service

To demonstrate load balancing, we first need a backend service that mimics an LLM inference endpoint. We’ll deploy a simple NGINX server that echoes request headers and body, allowing us to observe how requests are routed. We’ll deploy two instances of this service to showcase load balancing across multiple pods.

This setup simulates a scenario where you might have different versions of an LLM, or simply multiple replicas of the same LLM, serving requests. Each pod responds with its hostname, helping us verify that load balancing is indeed distributing traffic across them. This is a foundational step for any AI workload, as efficient GPU scheduling and resource distribution are key to performance; see our LLM GPU Scheduling Guide for more.

# llm-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-service
  labels:
    app: llm-service
spec:
  replicas: 2 # Deploy two instances to observe load balancing
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      containers:
      - name: llm-container
        image: nginxdemos/hello:plain-text # A simple echo server
        ports:
        - containerPort: 80
        resources: # Simulate resource requirements for an LLM
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
  labels:
    app: llm-service
spec:
  selector:
    app: llm-service
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP

Apply the manifest:

kubectl apply -f llm-service.yaml

Verify:

kubectl get pods -l app=llm-service
NAME                           READY   STATUS    RESTARTS   AGE
llm-service-6789xxxx-abcde   1/1     Running   0          1m
llm-service-6789xxxx-fghij   1/1     Running   0          1m
kubectl get svc llm-service
NAME          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
llm-service   ClusterIP   10.xx.xx.xx   <none>        80/TCP    1m

2. Basic HTTP Load Balancing with NGINX Ingress

The simplest way to expose and load balance your LLM service is by using an Ingress controller. NGINX Ingress Controller is a popular choice, providing robust HTTP routing capabilities. It acts as the entry point for external traffic into your Kubernetes cluster, directing it to the appropriate backend services. For more details on ingress, consider our Kubernetes Gateway API Migration Guide, which discusses modern alternatives.

We’ll first deploy the NGINX Ingress controller and then configure an Ingress resource to route traffic from a specific hostname to our llm-service. NGINX Ingress supports various load balancing algorithms by default, such as round-robin, which distributes requests evenly across all available backend pods. This is a good starting point for many LLM deployments, especially if all your LLM instances are identical and can handle any request.

# Deploy NGINX Ingress Controller
# For a production setup, consider customizing the values.yaml
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace

Verify NGINX Ingress Controller deployment:

kubectl get pods -n ingress-nginx
NAME                                       READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-7b4c6d9f8-xxxxx   1/1     Running   0          2m

Now, create an Ingress resource for the LLM service:

# llm-ingress-basic.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress-basic
  annotations:
    # Optional: Enable session affinity (sticky sessions) if your LLM requires it
    # This ensures requests from the same client go to the same backend pod.
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "llm_session"
    nginx.ingress.kubernetes.io/session-cookie-expires: "1h"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "1h"
spec:
  ingressClassName: nginx # Specify the IngressClass
  rules:
  - host: llm.example.com # Replace with your desired hostname
    http:
      paths:
      - path: /v1/chat/completions # Example API path
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 80

Apply the Ingress:

kubectl apply -f llm-ingress-basic.yaml

Verify:

kubectl get ingress llm-ingress-basic
NAME                CLASS    HOSTS            ADDRESS         PORTS   AGE
llm-ingress-basic   nginx    llm.example.com  <EXTERNAL-IP>   80      1m

You’ll need to get the external IP address of your Ingress controller (<EXTERNAL-IP> above) and map llm.example.com to it in your /etc/hosts file or DNS. Then, test with curl:

# Replace <EXTERNAL-IP> with the actual IP from 'kubectl get ingress'
# Add "llm.example.com <EXTERNAL-IP>" to your /etc/hosts file OR
# Use --resolve "llm.example.com:80:<EXTERNAL-IP>" with curl
curl -H "Host: llm.example.com" http://<EXTERNAL-IP>/v1/chat/completions

You should see responses from different pods (indicated by the hostname in the response body) as you send multiple requests, demonstrating round-robin load balancing. If you enabled session affinity, requests from the same client (with the same cookie) should consistently hit the same backend.
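The same basic routing can also be expressed with the Kubernetes Gateway API, the modern successor to Ingress mentioned in the TL;DR. The following is a sketch only: it assumes the Gateway API CRDs and a conformant implementation are installed in your cluster, and the `Gateway` name (`llm-gw`) and `gatewayClassName` are illustrative placeholders for your environment.

```yaml
# llm-httproute.yaml (sketch; assumes Gateway API CRDs and an implementation are installed)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gw
spec:
  gatewayClassName: nginx   # illustrative; use your implementation's GatewayClass
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    hostname: "llm.example.com"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: llm-gw            # attach this route to the Gateway above
  hostnames:
  - "llm.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - name: llm-service     # same backend Service as the Ingress example
      port: 80
```

The Gateway API separates infrastructure concerns (`Gateway`) from routing concerns (`HTTPRoute`), which maps well to an AI Gateway where platform and application teams manage different layers.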

3. Advanced Routing with Istio Service Mesh

For more sophisticated AI gateway patterns, such as content-based routing, request retries, circuit breaking, and traffic shifting for A/B testing or canary deployments, a service mesh like Istio is invaluable. Istio provides a powerful control plane and data plane (Envoy proxies) that can intercept and manage all network traffic within your cluster.

Before proceeding, ensure Istio is installed in your cluster. Refer to the official Istio installation guide. For a production-ready setup, consider our Istio Ambient Mesh Production Guide. Once installed, you will need to label the namespace where your LLM service is deployed for Istio’s sidecar injection or configure Ambient Mesh.

# Assuming default namespace or your LLM service namespace
kubectl label namespace default istio-injection=enabled --overwrite
# Or if using Istio Ambient Mesh:
# kubectl label namespace default istio.io/dataplane-mode=ambient --overwrite

# Restart existing LLM service pods so sidecars are injected
# (ambient mode attaches to pods without requiring a restart)
kubectl rollout restart deployment llm-service

Now, let’s configure an Istio Gateway and VirtualService to route traffic to our LLM service. This allows us to define more granular traffic management rules.

# llm-istio-gateway.yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  selector:
    istio: ingressgateway # use Istio default ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "llm.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-virtualservice
spec:
  hosts:
  - "llm.example.com"
  gateways:
  - llm-gateway
  http:
  - match:
    - uri:
        prefix: /v1/chat/completions
    route:
    - destination:
        host: llm-service
        port:
          number: 80
    # Add advanced policies for LLM requests
    retries:
      attempts: 3
      perTryTimeout: 2s # Retry if response not received within 2 seconds
      retryOn: gateway-error,connect-failure,refused-stream,retriable-4xx,5xx # Define conditions for retry
    timeout: 10s # Overall request timeout
    # You can also add fault injection, traffic shifting, etc. here.
    # Note: circuit-breaking settings (connectionPool, outlierDetection)
    # are configured on a DestinationRule for llm-service,
    # not in the VirtualService itself.

Apply the Istio configuration:

kubectl apply -f llm-istio-gateway.yaml

Verify:

kubectl get gateway,virtualservice
NAME                              AGE
gateway.networking.istio.io/llm-gateway   1m

NAME                                          AGE
virtualservice.networking.istio.io/llm-virtualservice   1m

Get the external IP of the Istio Ingress Gateway:

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
export GATEWAY_URL=$INGRESS_HOST:$INGRESS_PORT

echo "$GATEWAY_URL"

Then, test with curl, again updating your /etc/hosts or using --resolve:

curl -H "Host: llm.example.com" http://$GATEWAY_URL/v1/chat/completions

You’ll observe similar load balancing behavior, but now with Istio’s advanced features like retries active. If a backend temporarily fails, Istio will automatically retry the request up to 3 times with a 2-second timeout per attempt.
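The circuit breaking mentioned alongside the VirtualService lives on a separate DestinationRule resource. The following is a minimal sketch; the numeric limits are illustrative and should be tuned to your model's concurrency budget.

```yaml
# llm-destinationrule.yaml (sketch; limits are illustrative)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-service
spec:
  host: llm-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 10  # queue depth before requests are rejected
        http2MaxRequests: 100        # max concurrent requests to the backend
        maxRetries: 3
    outlierDetection:                # eject persistently failing pods from the pool
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```

Outlier detection is particularly useful for LLM serving: a pod whose GPU is saturated or whose model has crashed will start returning 5xx responses and gets temporarily ejected, so traffic flows only to healthy replicas.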

4. Content-Based Routing for LLMs

Often, different LLMs (or different versions of the same LLM) are optimized for specific tasks or prompt lengths. An AI Gateway can route requests based on content in the request body or headers. For example, short prompts might go to a smaller, faster model, while complex, long prompts go to a larger, more capable model, or even to a specific GPU-accelerated endpoint. This is crucial for cost optimization (see our Karpenter Cost Optimization guide), ensuring expensive resources are used only when necessary.

While NGINX Ingress can do some header-based routing, Istio excels here with its ability to inspect request bodies (via Envoy filters, though this can be complex) or more commonly, route based on custom headers or URL paths. We’ll simulate this by deploying two different LLM services (llm-small-model and llm-large-model) and routing based on a custom header.

# llm-models.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-small-model
  labels:
    app: llm-small-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-small-model
  template:
    metadata:
      labels:
        app: llm-small-model
    spec:
      containers:
      - name: llm-small-container
        image: nginxdemos/hello:plain-text
        env:
        - name: MODEL_TYPE
          value: "small"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-small-model
  labels:
    app: llm-small-model
spec:
  selector:
    app: llm-small-model
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-large-model
  labels:
    app: llm-large-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-large-model
  template:
    metadata:
      labels:
        app: llm-large-model
    spec:
      containers:
      - name: llm-large-container
        image: nginxdemos/hello:plain-text
        env:
        - name: MODEL_TYPE
          value: "large"
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
            # Example for GPU resource; requires proper GPU operator/device plugin
            # nvidia.com/gpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
            # nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-large-model
  labels:
    app: llm-large-model
spec:
  selector:
    app: llm-large-model
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP

Apply the manifests. Because the namespace is already labeled for Istio injection, the new pods receive their sidecars at creation time, so no restart is needed:

kubectl apply -f llm-models.yaml

Now, update the Istio VirtualService to route based on a custom header x-llm-model-type:

# llm-istio-content-routing.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-virtualservice
spec:
  hosts:
  - "llm.example.com"
  gateways:
  - llm-gateway
  http:
  - match: # Route to large model if header is present and "large"
    - uri:
        prefix: /v1/chat/completions
      headers:
        x-llm-model-type:
          exact: "large"
    route:
    - destination:
        host: llm-large-model
        port:
          number: 80
    retries:
      attempts: 3
      perTryTimeout: 5s # Larger timeout for potentially slower large model
      retryOn: gateway-error,connect-failure,5xx
  - match: # Route to small model by default or if header is "small"
    - uri:
        prefix: /v1/chat/completions
      headers:
        x-llm-model-type:
          exact: "small"
    - uri: # Default route if no specific header
        prefix: /v1/chat/completions
    route:
    - destination:
        host: llm-small-model
        port:
          number: 80
    retries:
      attempts: 2
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure

Apply the updated VirtualService:

kubectl apply -f llm-istio-content-routing.yaml

Verify:

# Test with default routing (should go to the small model)
curl -H "Host: llm.example.com" http://$GATEWAY_URL/v1/chat/completions
# Expected: a response whose "Server name" line contains "llm-small-model"
# Test with the large-model header
curl -H "Host: llm.example.com" -H "x-llm-model-type: large" http://$GATEWAY_URL/v1/chat/completions
# Expected: a response whose "Server name" line contains "llm-large-model"

This demonstrates how you can route requests to different backend LLM services based on client-provided metadata. In a real-world scenario, a client application or an intermediate API gateway would add these headers based on prompt characteristics or user preferences.
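On the client side, that header choice can be as simple as a prompt-length threshold. The following shell sketch illustrates the idea; the 200-character threshold and the `choose_model` helper are hypothetical (pick a cutoff that matches your models), and the header name follows this tutorial's `x-llm-model-type` convention.

```shell
#!/bin/sh
# Hypothetical client-side routing: pick a model tier from prompt length.
# The 200-character threshold is illustrative.
choose_model() {
  if [ "${#1}" -gt 200 ]; then
    echo "large"
  else
    echo "small"
  fi
}

MODEL=$(choose_model "Summarize this paragraph in one sentence.")
echo "$MODEL"   # prints: small

# Then send the request with the chosen header, e.g.:
# curl -H "Host: llm.example.com" -H "x-llm-model-type: $MODEL" \
#   http://$GATEWAY_URL/v1/chat/completions
```

In production this logic usually lives in the client SDK or an intermediate API layer, where it can also consider token counts or user tier rather than raw character length.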

5. Rate Limiting and Security Policies

Protecting your LLM endpoints from abuse, managing traffic, and ensuring fair usage often requires rate limiting. Both NGINX Ingress and Istio offer rate-limiting capabilities. Additionally, applying network policies can enhance security by restricting communication paths. For a deeper dive into securing your cluster, review our Network Policies Security Guide.
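As a sketch of the network-policy side, the manifest below restricts ingress to the LLM pods so that only the ingress-nginx namespace can reach them. The namespace selector relies on the automatic `kubernetes.io/metadata.name` label (present on Kubernetes v1.21+); adjust the selector if your traffic enters via Istio's ingress gateway instead.

```yaml
# llm-networkpolicy.yaml (sketch; namespace selector is an assumption for ingress-nginx)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-allow-ingress-only
spec:
  podSelector:
    matchLabels:
      app: llm-service     # applies to the LLM pods
  policyTypes:
  - Ingress                # anything not explicitly allowed below is denied
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 80
```

Note that NetworkPolicy enforcement requires a CNI plugin that supports it (Calico, Cilium, etc.).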

NGINX Ingress Rate Limiting:

# llm-ingress-rate-limit.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress-rate-limit
  annotations:
    nginx.ingress.kubernetes.io/limit-rpm: "60" # Limit to 60 requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-rps: "1"  # Limit to 1 request per second per client IP
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "5" # Allow bursts of 5x the rps limit
spec:
  ingressClassName: nginx
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /v1/chat/completions
        pathType: Prefix
        backend:
          service:
            name: llm-small-model # Apply rate limit to small model
            port:
              number: 80

Apply the Ingress:

kubectl apply -f llm-ingress-rate-limit.yaml

Test Rate Limiting: Rapidly send requests to llm.example.com. Once you exceed the limit, requests are rejected; ingress-nginx returns 503 by default (set limit-req-status-code: "429" in the controller ConfigMap if you prefer 429 Too Many Requests).

for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" -H "Host: llm.example.com" http://<EXTERNAL-IP>/v1/chat/completions; done

Istio Rate Limiting (requires Istio’s rate limit service):

Istio’s rate limiting is more advanced, allowing granular policies based on various request attributes. It typically involves deploying an external rate limit service (commonly backed by Redis) and wiring it into Envoy with an EnvoyFilter, or using Envoy’s built-in local rate limiting. This is a more complex setup, but provides enterprise-grade control. Refer to the Istio Rate Limiting documentation for detailed steps.
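If you don’t want to run an external rate limit service, Envoy’s built-in local rate limiting can be enabled per workload with an EnvoyFilter. The sketch below follows the pattern from Istio’s local rate limiting docs; the token-bucket numbers (10 requests per minute) are illustrative.

```yaml
# llm-local-ratelimit.yaml (sketch; token-bucket values are illustrative)
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: llm-local-ratelimit
spec:
  workloadSelector:
    labels:
      app: llm-service
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:            # 10 requests, refilled every 60s
              max_tokens: 10
              tokens_per_fill: 10
              fill_interval: 60s
            filter_enabled:          # evaluate the filter for 100% of requests
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:         # enforce (not just observe) for 100% of requests
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED
```

Local rate limiting is applied independently by each Envoy proxy, so the effective cluster-wide limit scales with the number of replicas; use the external rate limit service when you need a single global budget.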

Example Istio RequestAuthentication (for JWT validation):

Securing access to your LLM endpoints is paramount. Istio can enforce authentication policies, for example, by validating JWT tokens before requests reach your LLM services. This is a crucial aspect of securing sensitive AI models.

A sketch (the issuer and JWKS URI are placeholders for your identity provider):

# llm-istio-auth.yaml (sketch; issuer and jwksUri are placeholders)
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: llm-jwt-auth
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
  - issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: llm-require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  rules:
  - from:
    - source:
        requestPrincipals: ["*"] # require a valid JWT principal on every request
