Run LLMs on Kubernetes: vLLM Deployment

Deploying Large Language Models (LLMs) in production can be a complex endeavor, especially when striving for high throughput and low latency. Traditional serving frameworks often struggle with the unique demands of LLMs, such as large model sizes, intricate memory access patterns, and the need for efficient GPU utilization. This is where specialized solutions like vLLM shine. vLLM is an open-source library designed for fast and efficient LLM inference, leveraging techniques like PagedAttention to dramatically improve throughput compared to naive implementations.

Kubernetes, with its robust orchestration capabilities, provides an ideal platform for deploying and scaling vLLM. By combining vLLM’s inference optimizations with Kubernetes’ resource management, auto-scaling, and high-availability features, you can build a resilient and performant LLM serving infrastructure. This guide will walk you through the process of deploying vLLM on Kubernetes, from setting up your environment to configuring GPU resources and exposing your LLM endpoint, ensuring you can serve your models efficiently and reliably. For a broader understanding of optimizing LLM deployments on Kubernetes, check out our comprehensive LLM GPU Scheduling Guide.

TL;DR: Deploying vLLM on Kubernetes

This guide helps you deploy vLLM for efficient LLM inference on Kubernetes with GPU support. Key steps involve installing the NVIDIA Device Plugin, creating a vLLM Deployment backed by a model-cache PVC, and exposing it via a Service.


# 1. Install NVIDIA Device Plugin (if not already done)
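# The static manifest lives in the NVIDIA/k8s-device-plugin repository; v0.14.1 is one known release tag, substitute a current one
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml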

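# 2. Deploy vLLM (example using Llama 2 7B); the cache PVC must exist before the Deployment can start
kubectl apply -f vllm-huggingface-cache-pvc.yaml
kubectl apply -f vllm-deployment.yaml
kubectl apply -f vllm-service.yaml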

# 3. Verify deployment
kubectl get pods -l app=vllm
kubectl get svc vllm-service

# 4. Test the endpoint (replace <EXTERNAL-IP> with your service's external IP)
curl -X POST "http://<EXTERNAL-IP>:8000/v1/completions" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Llama-2-7b-chat-hf",
           "prompt": "Hello, my name is",
           "max_tokens": 16,
           "temperature": 0.7
         }'
    

Prerequisites

  • Kubernetes Cluster: A running Kubernetes cluster (v1.20+ recommended). For cloud providers, ensure your nodes are configured with GPUs.
  • GPU-enabled Nodes: Your Kubernetes cluster nodes must have NVIDIA GPUs.
  • NVIDIA Drivers: Appropriate NVIDIA GPU drivers installed on your cluster nodes.
  • NVIDIA Container Toolkit: Installed on your nodes to enable Docker/containerd to access GPUs. Refer to the NVIDIA Container Toolkit Installation Guide.
  • NVIDIA Device Plugin for Kubernetes: This plugin is essential for Kubernetes to recognize and schedule workloads on GPUs.
  • kubectl: Command-line tool for interacting with your Kubernetes cluster.
  • Helm (Optional): Useful for managing complex deployments, though we’ll use raw YAML for simplicity here.
  • Basic understanding of Kubernetes concepts: Pods, Deployments, Services, and Resource Requests/Limits.

Step-by-Step Guide

1. Install NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes allows your cluster to expose GPU resources to pods. Without it, Kubernetes cannot see or schedule workloads onto your GPUs. This plugin runs as a DaemonSet, ensuring that it’s deployed on every GPU-enabled node in your cluster. It watches for GPU availability and registers them as schedulable resources (e.g., nvidia.com/gpu: 1) within Kubernetes.

Before proceeding, ensure your nodes have the correct NVIDIA drivers and the NVIDIA Container Toolkit installed. The device plugin relies on these underlying components to function correctly. You can find detailed installation instructions on the NVIDIA Device Plugin GitHub repository.


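kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Verify: After applying, check that the device plugin pods are running and healthy. The static manifest from the NVIDIA/k8s-device-plugin repository deploys its DaemonSet into the kube-system namespace (a Helm install uses whichever namespace you chose), and you should see a pod with a Running status on each GPU-enabled node. The version tag in the URL above is one known release; substitute a current one from the repository's releases page.


kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds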

Expected Output:


NAME                                          READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-daemonset-abcde          1/1     Running   0          2m
nvidia-device-plugin-daemonset-fghij          1/1     Running   0          2m
# ... one pod per GPU-enabled node
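
Running plugin pods do not by themselves prove that nodes advertise GPUs to the scheduler. As a minimal sketch, the Python helper below reads the JSON printed by kubectl get nodes -o json and reports each node's allocatable nvidia.com/gpu count. The function name is our own and the script is not part of any NVIDIA or Kubernetes tooling.

```python
import json


def gpu_allocatable(nodes_json):
    """Map node name -> allocatable nvidia.com/gpu count (0 if not advertised).

    nodes_json is the text printed by: kubectl get nodes -o json
    """
    nodes = json.loads(nodes_json)
    result = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "1".
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result
```

Save the node list with kubectl get nodes -o json > nodes.json and pass the file contents to gpu_allocatable. A GPU node that shows 0 usually points back to the driver, container toolkit, or device plugin setup.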

2. Create a vLLM Deployment

Now that Kubernetes can recognize your GPUs, we can create a Deployment for vLLM. This Deployment will define the vLLM server, specifying the Docker image, resource requests (including GPU), and command-line arguments for serving a specific LLM. We’ll use a public vLLM image and configure it to serve a popular model like Llama 2 7B.

The model argument specifies the Hugging Face model ID. The tensor-parallel-size argument is crucial for distributing the model across multiple GPUs if your model is too large for a single GPU or if you want to optimize for throughput across available GPUs. Remember that efficient GPU scheduling for LLMs is key to performance and cost-effectiveness.


# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm-server
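        image: vllm/vllm-openai:latest # Prefer pinning a specific release tag (e.g. v0.3.3) for reproducible rollouts
        imagePullPolicy: IfNotPresent
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] # python3 is the safer interpreter name inside the image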
        args:
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8000"
          - "--model"
          - "meta-llama/Llama-2-7b-chat-hf" # Example model
          - "--tensor-parallel-size"
          - "1" # Number of GPUs to use for tensor parallelism (set to 1 for a single GPU)
          - "--gpu-memory-utilization"
          - "0.9" # Fraction of GPU memory to use. Adjust based on model size.
        ports:
        - containerPort: 8000
          name: http
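        resources:
          limits:
            nvidia.com/gpu: 1 # GPU limit; for extended resources the request must equal the limit
          requests:
            nvidia.com/gpu: 1 # Request 1 GPU
            memory: "24Gi" # Adjust based on your model's memory requirements
            cpu: "8"
        env:
          # meta-llama/Llama-2-7b-chat-hf is a gated model, so the download needs a Hugging Face token.
          # The Secret name and key below are examples; create the Secret before applying, e.g.:
          #   kubectl create secret generic hf-token-secret --from-literal=token=<your-hf-token>
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token-secret
                key: token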
        volumeMounts:
          - name: huggingface-cache
            mountPath: /root/.cache/huggingface
      volumes:
        - name: huggingface-cache
          persistentVolumeClaim:
            claimName: vllm-huggingface-cache-pvc # PVC for model caching
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      restartPolicy: Always
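
To get a feel for what a --gpu-memory-utilization value of 0.9 leaves for the KV cache, the arithmetic can be sketched in Python. The figures are assumptions for illustration, not measurements: a 24 GiB GPU, a nominal 7 billion parameters stored in 16-bit precision, and Llama 2 7B's published shape (32 layers, 32 KV heads, head dimension 128). The sketch ignores activation memory and CUDA context overhead, so treat the result as an optimistic upper bound.

```python
GIB = 1024 ** 3


def kv_cache_budget(gpu_mem_gib, utilization, n_params, bytes_per_param,
                    n_layers, n_kv_heads, head_dim, kv_bytes=2):
    """Estimate how many tokens of KV cache fit after the weights are loaded."""
    # Memory vLLM is allowed to claim on the GPU.
    budget = gpu_mem_gib * GIB * utilization
    # Model weights, e.g. 2 bytes per parameter for fp16/bf16.
    weights = n_params * bytes_per_param
    # One key and one value vector per layer and KV head for every token.
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    free_for_kv = budget - weights
    tokens = max(0, int(free_for_kv // kv_per_token))
    return {
        "budget_bytes": budget,
        "weight_bytes": weights,
        "kv_bytes_per_token": kv_per_token,
        "kv_cache_tokens": tokens,
    }


# Assumed example: Llama 2 7B in fp16 on a 24 GiB GPU at 0.9 utilization.
estimate = kv_cache_budget(24, 0.9, 7e9, 2, n_layers=32, n_kv_heads=32, head_dim=128)
print(estimate)
```

Under these assumptions each token costs 0.5 MiB of KV cache and the budget leaves room for roughly 17,000 tokens, shared across all concurrent requests. If that is too little for your traffic, a larger GPU or a shorter maximum sequence length helps more than nudging the utilization fraction.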

Before applying the Deployment, you need a PersistentVolumeClaim (PVC) for caching Hugging Face models. This prevents re-downloading the model every time the pod restarts, saving bandwidth and startup time.


# vllm-huggingface-cache-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-huggingface-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce # Or ReadWriteMany if your storage class supports it and you have multiple replicas
  resources:
    requests:
      storage: 100Gi # Allocate enough storage for your models

kubectl apply -f vllm-huggingface-cache-pvc.yaml
kubectl apply -f vllm-deployment.yaml

Verify: Check the status of your vLLM pod. It might take some time to download the model initially.


kubectl get pods -l app=vllm

Expected Output:


NAME                               READY   STATUS    RESTARTS   AGE
vllm-llama2-7b-abcdefg-hijkl       1/1     Running   0          5m # Status should eventually be Running
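
A Running pod can still be busy downloading and loading the model. As a hedged sketch, the Python snippet below polls the server's /health route until it answers. It assumes the OpenAI-compatible server in your vLLM version exposes /health (recent releases do, but check your image) and that you have forwarded the port locally, for example with kubectl port-forward deployment/vllm-llama2-7b 8000:8000. The function names are our own.

```python
import time
import urllib.request


def is_healthy(base_url, timeout=5.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_ready(check, attempts=60, delay_s=10.0, sleep=time.sleep):
    """Call check() until it returns True; return the attempt number, or None on timeout."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None


# Example use against a port-forwarded pod (not executed here):
# wait_until_ready(lambda: is_healthy("http://127.0.0.1:8000"))
```

The same /health route is a natural target for a readinessProbe on the container, so that the Service only routes traffic once the model has loaded.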

3. Expose vLLM with a Kubernetes Service

To access your vLLM server from outside the cluster, you need to expose it using a Kubernetes Service. For demonstration purposes, we’ll use a LoadBalancer type Service, which provisions an external IP address in most cloud environments. For production, you might consider an Ingress controller or, even better, the Kubernetes Gateway API for more advanced traffic management, routing, and TLS termination.


# vllm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
      name: http
  type: LoadBalancer # Use NodePort or ClusterIP for internal access, or Ingress/Gateway API for external

kubectl apply -f vllm-service.yaml

Verify: Get the external IP address of your service.


kubectl get svc vllm-service

Expected Output:


NAME           TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
vllm-service   LoadBalancer   10.96.12.34    A.B.C.D        8000:3xxxx/TCP   1m

Note down the EXTERNAL-IP. It might take a few minutes for the cloud provider to provision it.

4. Test the vLLM Endpoint

Once the service has an external IP, you can test the vLLM API using curl or any HTTP client. vLLM provides an OpenAI-compatible API, making it easy to integrate with existing tools and libraries.


# Replace A.B.C.D with your actual EXTERNAL-IP
EXTERNAL_IP="A.B.C.D"

curl -X POST "http://${EXTERNAL_IP}:8000/v1/completions" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Llama-2-7b-chat-hf",
           "prompt": "Hello, my name is",
           "max_tokens": 16,
           "temperature": 0.7
         }'

Expected Output:


{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1678886400,
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "text": " John. I am a software engineer and I love to code.",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 16,
    "total_tokens": 20
  }
}
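
The same request can be issued from application code. The sketch below uses only the Python standard library and mirrors the curl call above; the function names are our own, and the base URL is whatever EXTERNAL-IP your Service received.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Extract the generated text of the first choice from a completions response."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the generated text (performs a network call)."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return first_completion_text(response.read())


# Example use (not executed here):
# print(complete("http://A.B.C.D:8000", "meta-llama/Llama-2-7b-chat-hf", "Hello, my name is"))
```

Because the API is OpenAI-compatible, an existing OpenAI client library pointed at this base URL should also work; the standard-library version is shown only to keep the example free of dependencies.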

Congratulations! You have successfully deployed vLLM on Kubernetes and served your first LLM inference request.

Production Considerations

  • Resource Management: Accurately size GPU memory (--gpu-memory-utilization) and CPU/memory requests and limits. Limits set too low lead to OOMKills, while overly generous ones waste resources. Monitor GPU utilization closely.
  • Auto-scaling: Implement Horizontal Pod Autoscaler (HPA) based on custom metrics like GPU utilization or request latency. For node-level auto-scaling, consider tools like Karpenter to efficiently provision GPU-enabled nodes on demand, optimizing costs.
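  • High Availability: Run multiple vLLM replicas behind a Service. If a node or pod fails, Kubernetes will reschedule the workload. Consider anti-affinity rules to spread pods across different nodes. Note that replicas on different nodes cannot share the ReadWriteOnce cache PVC used in this guide, so use ReadWriteMany storage or a separate cache per replica.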
  • Model Caching: Use PersistentVolumes (as shown in this guide) for caching Hugging Face models to speed up pod restarts and reduce egress costs.
  • Security:
    • Network Policies: Restrict inbound and outbound traffic to your vLLM pods using Kubernetes Network Policies.
    • Image Security: Use trusted, signed container images. Integrate with tools like Sigstore and Kyverno for supply chain security.
    • Secrets Management: If your LLM requires API keys or credentials, use Kubernetes Secrets and inject them securely into your pods.
  • Observability:
    • Logging: Centralize vLLM container logs using a logging stack (e.g., Fluentd, Loki, ELK).
    • Monitoring: Collect metrics (GPU utilization, latency, throughput) using Prometheus and visualize with Grafana. Explore advanced eBPF-based observability with tools like Hubble for Cilium.
    • Alerting: Set up alerts for high error rates, low throughput, or resource exhaustion.
  • Networking: For advanced traffic management, A/B testing, and canary deployments, consider a Service Mesh like Istio Ambient Mesh or a robust Ingress/Gateway solution.
  • Model Versioning and Updates: Implement strategies for rolling out new model versions with minimal downtime, potentially using blue/green or canary deployments.
  • Cost Optimization: Leverage spot instances for GPU nodes where appropriate, combined with intelligent scheduling and auto-scaling.
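
For the auto-scaling item above, it helps to know how the HPA turns a metric into a replica count. The Kubernetes documentation gives the rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with a default tolerance of 10% around the target. The Python sketch below reproduces only that core rule; the real controller adds stabilization windows and scaling policies, and the metric used in the example (waiting requests per pod) is an assumption about what you choose to export.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Assumed example: 2 replicas, 12 waiting requests per pod on average, target of 5.
print(desired_replicas(2, current_metric=12, target_metric=5))
```

With these numbers the ratio is 2.4, so the controller would ask for 5 replicas. Since each replica needs its own GPU, the node autoscaler must be able to supply that many GPU nodes, and max_replicas doubles as a cost ceiling.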

Troubleshooting

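  1. Issue: Pod stuck in Pending state with a scheduling event such as “0/3 nodes are available: 3 Insufficient nvidia.com/gpu”.

    Solution: This usually means the NVIDIA Device Plugin is not properly installed or running, or all GPUs on the GPU-enabled nodes are already allocated. Verify the device plugin pods are running and healthy in the namespace where they were installed (kube-system for the static manifest, for example kubectl get pods -n kube-system | grep nvidia-device-plugin). Ensure your nodes have GPUs and the necessary drivers/toolkit installed. If GPU Feature Discovery or the GPU Operator is installed, you can also check node labels for nvidia.com/gpu.present=true; the device plugin alone does not set this label.

    
    kubectl describe pod <pod-name>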
    kubectl get nodes -o yaml | grep -A5 "nvidia.com/gpu" # Check if GPUs are reported
    
  2. Issue: vLLM pod crashes with “CUDA out of memory” or similar GPU memory errors.

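    Solution: The model weights plus the KV cache do not fit in the GPU memory vLLM is allowed to use. --gpu-memory-utilization is the fraction of GPU memory vLLM may claim, so lower it only when other processes share the GPU and need headroom; lowering it does not make a model fit. If the model itself is too large, reduce --max-model-len to shrink the KV cache, use a smaller or quantized model, move to a node with more GPU memory, or shard the model across several GPUs with --tensor-parallel-size (with a matching nvidia.com/gpu request). Also, check for other processes consuming GPU memory on the node.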

  3. Issue: Service EXTERNAL-IP remains <pending>.

    Solution: The LoadBalancer service type requires a cloud provider integration (e.g., AWS ELB, GCP Load Balancer) to provision an external IP. If you’re running on-premises or a bare-metal cluster, you might need to use a different service type (NodePort, ClusterIP) or install a load balancer solution like MetalLB. Check your cloud provider’s quota for load balancers.

  4. Issue: API requests to vLLM return connection refused or timeout.

    Solution:

    1. Verify the vLLM pod is Running and healthy (kubectl get pods -l app=vllm).
    2. Check the pod logs for any errors (kubectl logs <pod-name>).
    3. Ensure the Service is correctly pointing to the pod (selector matches pod labels).
    4. Confirm network connectivity from your client to the EXTERNAL-IP and port. Check any firewalls or Network Policies that might be blocking traffic.
  5. Issue: Slow initial model loading time.

    Solution: This is often due to re-downloading the model. Ensure your PersistentVolumeClaim (PVC) is correctly mounted and persistent. Verify that the cache directory (/root/.cache/huggingface by default for vLLM) within the container is indeed writing to the PVC. Check PVC status and events (kubectl describe pvc vllm-huggingface-cache-pvc).

  6. Issue: vLLM pod restarting frequently.

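    Solution: Check the logs of the previous container instance (kubectl logs <pod-name> -p). Common causes include:

    • OOMKills (Out Of Memory): The container exceeded its host memory (RAM) limit, which is separate from GPU memory. Increase the memory requests/limits; --gpu-memory-utilization only governs GPU memory and does not prevent OOMKills.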
    • Application errors: vLLM server crashing due to misconfiguration or model issues.
    • Liveness/Readiness probe failures: If custom probes are configured, they might be too aggressive or misconfigured.
  7. Issue: Cannot pull vLLM image.

    Solution: Ensure your cluster has network access to Docker Hub or your specified image registry. If using a private registry, configure ImagePullSecrets in your deployment. Check for typos in the image name or tag.
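
For the restart-loop case above, the reason Kubernetes recorded for the last termination is often quicker to read than the logs. As a small sketch, the Python helper below summarizes it from the JSON printed by kubectl get pod <pod-name> -o json. The function name is our own; the fields it reads (containerStatuses, restartCount, lastState.terminated) are part of the standard Pod status.

```python
import json


def restart_reasons(pod_json):
    """Summarize restart count and last termination reason for each container.

    pod_json is the text printed by: kubectl get pod <pod-name> -o json
    """
    pod = json.loads(pod_json)
    report = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty for containers that have never restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        report.append({
            "container": status["name"],
            "restarts": status.get("restartCount", 0),
            "last_reason": terminated.get("reason"),
            "last_exit_code": terminated.get("exitCode"),
        })
    return report
```

A last_reason of OOMKilled (exit code 137) points at the container's host memory limit, while Error with another exit code usually means the vLLM process itself failed and the previous logs will say why.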

FAQ Section

  1. What is vLLM and why should I use it for LLM inference?

    vLLM is an open-source library for high-throughput and low-latency LLM serving. Its core technique is PagedAttention, which stores the attention key-value (KV) cache in fixed-size blocks, much like virtual-memory pages, so GPU memory is not lost to fragmentation or over-reservation. Combined with continuous batching, this lets vLLM keep more requests in flight per GPU than servers that reserve one contiguous KV cache per request, which matters most for long sequences and high concurrency. For more details, refer to the official vLLM documentation.

  2. How do I choose the right GPU for my LLM?

    The choice of GPU depends primarily on the model size (number of parameters), its memory requirements, and your desired throughput/latency. Larger models require more GPU memory. For example, a 7B parameter model might fit on a single 24GB GPU, while a 70B model often requires multiple GPUs (e.g., A100 80GB) with tensor parallelism. Always check the model’s specific requirements. Our LLM GPU Scheduling Guide provides further insights.

  3. Can I run multiple LLMs on a single GPU with vLLM?

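    With caveats. A single vLLM server process serves one base model (optionally with several LoRA adapters on top of it), so hosting different models means running multiple vLLM instances. They can share one GPU only if there is enough memory and each instance's --gpu-memory-utilization is lowered so the fractions together stay below 1. Kubernetes also allocates nvidia.com/gpu in whole units, so sharing one physical GPU between pods requires a mechanism such as MIG or time-slicing. Check the latest vLLM documentation for current multi-model features, and be mindful of memory contention and performance degradation if the GPU becomes oversaturated.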

  4. How can I achieve auto-scaling for my vLLM deployments on Kubernetes?

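    You can use Kubernetes’ Horizontal Pod Autoscaler (HPA). Metrics Server only reports CPU and memory, so for GPU workloads you need a custom metrics pipeline: for example, NVIDIA's DCGM exporter for GPU utilization, scraped by Prometheus and surfaced to the HPA through an adapter such as Prometheus Adapter or KEDA. Alternatively, scale on request queue depth or latency, which vLLM exposes on its Prometheus /metrics endpoint and which is often a more direct signal of load than GPU utilization. For auto-scaling the underlying GPU nodes, Karpenter is an excellent choice for cloud environments.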

  5. What are the security best practices for deploying vLLM on Kubernetes?

    Beyond standard Kubernetes security, for LLMs, focus on:

    • Network isolation: Use Network Policies to restrict traffic.
    • Container image security: Use trusted images, scan for vulnerabilities, and consider image signing with Sigstore.
    • Resource quotas: Prevent resource exhaustion.
    • Role-Based Access Control (RBAC): Limit who can deploy and manage LLM services.
    • Data privacy: Ensure sensitive data isn’t exposed or logged unnecessarily.
    • Secrets management: Securely handle any API keys or credentials needed for model access or external services.
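
For the GPU-sizing question above, a first estimate is whether the weights alone fit. The Python sketch below searches for the smallest tensor-parallel size that works under stated assumptions: 16-bit weights, a 0.9 memory fraction, nominal parameter counts, and the rule that the number of attention heads must be divisible by the tensor-parallel size. The function name and the example figures are ours. Because KV cache and activations need memory on top of the weights, read the result as a lower bound.

```python
GIB = 1024 ** 3


def smallest_tensor_parallel_size(n_params, num_attention_heads, gpu_mem_gib,
                                  bytes_per_param=2, utilization=0.9, max_gpus=8):
    """Smallest GPU count whose per-GPU weight shard fits; None if none does."""
    usable_per_gpu = gpu_mem_gib * GIB * utilization
    weight_bytes = n_params * bytes_per_param
    for tp in range(1, max_gpus + 1):
        # Attention heads are split evenly across the GPUs.
        if num_attention_heads % tp != 0:
            continue
        if weight_bytes / tp <= usable_per_gpu:
            return tp
    return None


# Assumed examples: nominal 7B (32 heads) and 70B (64 heads) models in fp16.
print(smallest_tensor_parallel_size(7e9, 32, gpu_mem_gib=24))
print(smallest_tensor_parallel_size(70e9, 64, gpu_mem_gib=80))
print(smallest_tensor_parallel_size(70e9, 64, gpu_mem_gib=24))
```

A 70B model on two 80 GiB GPUs leaves only a few GiB per GPU for the KV cache, which is why such models are commonly spread over four or more GPUs in practice.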

Cleanup Commands

To remove the resources created in this guide, execute the following commands:


# Delete the vLLM service
kubectl delete -f vllm-service.yaml

# Delete the vLLM deployment
kubectl delete -f vllm-deployment.yaml

# Delete the PersistentVolumeClaim
kubectl delete -f vllm-huggingface-cache-pvc.yaml

# (Optional) If you want to remove the NVIDIA Device Plugin
# Be cautious, this affects all GPU workloads on your cluster
# Use the same manifest URL you applied during installation, for example:
# kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Next Steps / Further Reading

  • Explore advanced vLLM features: Learn about continuous batching, speculative decoding, and other optimizations on the vLLM documentation site.
  • Implement auto-scaling: Set up HPA for your vLLM deployments based on GPU metrics.
  • Integrate with a service mesh: Consider Istio Ambient Mesh for advanced traffic management, observability, and security features.
  • Optimize GPU scheduling: Dive deeper into Kubernetes GPU scheduling with our LLM GPU Scheduling Best Practices.
  • Enhance observability: Explore eBPF Observability with Hubble for network and application insights.
  • Cost optimization: Learn how to reduce Kubernetes costs for GPU workloads using Karpenter.
  • Explore different LLM serving frameworks: Investigate alternatives like Hugging Face TGI, NVIDIA Triton Inference Server, or others to find the best fit for your specific needs.

Conclusion

Deploying Large Language Models on Kubernetes using vLLM offers a powerful combination for achieving high-performance and scalable inference. By leveraging vLLM’s intelligent GPU memory management and Kubernetes’ robust orchestration capabilities, you can build a resilient and efficient LLM serving platform. This guide has provided a foundational understanding and practical steps to get your vLLM deployment up and running. Remember that optimizing for production involves careful consideration of resource management, auto-scaling, security, and observability, ensuring your LLMs are served reliably and cost-effectively.
