
Deploying Production RAG Systems on Kubernetes: The Complete 2025 Guide

Learn how to deploy scalable RAG (Retrieval-Augmented Generation) systems on Kubernetes. Complete guide covering architecture, autoscaling, cost optimization, and production best practices for enterprise AI deployment.

Introduction: Why Kubernetes for RAG Systems?

Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications, powering everything from intelligent chatbots to document search systems. However, deploying RAG systems at scale presents unique challenges that traditional infrastructure struggles to handle.

Enter Kubernetes, the cloud-native orchestration platform that’s revolutionizing how enterprises deploy and manage RAG applications. According to recent industry data, more than 80% of enterprises implementing Generative AI are now augmenting LLMs with frameworks like RAG.

But why Kubernetes specifically? The answer lies in RAG’s unique requirements: unpredictable workloads, multi-component architectures, and the need for both GPU and CPU resources at scale.

This comprehensive guide walks you through everything you need to know about deploying production-ready RAG systems on Kubernetes, from architecture design to cost optimization. If you’re new to Kubernetes, check out Kubelabs for hands-on tutorials from beginner to advanced levels, or get started with Kubernetes on Docker Desktop in 2 minutes.

What is RAG and Why Does It Need Kubernetes?

Understanding RAG Architecture

RAG combines two powerful AI concepts:

  1. Retrieval: Finding relevant information from a knowledge base using vector similarity search
  2. Generation: Using LLMs to create contextually accurate responses based on retrieved data

Unlike vanilla LLMs, RAG grounds the model in relevant information from a specific knowledge domain at query time, which is often more efficient than fine-tuning and helps prevent hallucinations.
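The two phases above can be sketched end to end in a few lines of Python. Everything here is illustrative: the character-frequency `embed()` is a toy stand-in for a real embedding model, and the prompt would normally be sent to an LLM rather than returned.

```python
# Illustrative end-to-end sketch of both RAG phases. The character-frequency
# embed() is a toy stand-in for a real embedding model, and build_prompt()
# produces the text you would send to an LLM for the generation phase.
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

knowledge_base = [
    "Kubernetes schedules containers across a cluster of nodes",
    "Vector databases store high-dimensional embeddings",
    "FastAPI is a Python web framework",
]
index = [(doc, embed(doc)) for doc in knowledge_base]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # Retrieval phase: vector similarity search over the knowledge base.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def build_prompt(query: str, context: list[str]) -> str:
    # Generation phase input: retrieved context plus the user's question.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

docs = retrieve("How does Kubernetes schedule containers?")
prompt = build_prompt("How does Kubernetes schedule containers?", docs)
```

In production, `embed()` is an embedding service call, the in-memory `index` is a vector database, and `build_prompt()`'s output goes to the LLM inference service.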

The Challenge of Scaling RAG

RAG systems are inherently complex, involving multiple components:

  • Document Processing Pipeline: Ingesting and chunking documents
  • Embedding Generation: Converting text to vector representations
  • Vector Database: Storing and searching high-dimensional vectors
  • LLM Inference: Generating responses based on retrieved context
  • Orchestration Layer: Managing the flow between components

Each component has different resource requirements, scaling patterns, and performance characteristics.

Why Traditional Infrastructure Falls Short

RAG workloads can be unpredictable and bursty—demand can spike suddenly during peak usage periods and drop to near-zero during off-peak hours.

Traditional infrastructure forces a painful choice:

  • Overprovision: Waste 80% of GPU resources on idle capacity
  • Underprovision: Risk service degradation and poor user experience

Kubernetes solves this with dynamic resource allocation, automated scaling, and efficient orchestration. New to Kubernetes? Start with the Kubernetes Beginners Track to understand core concepts.

Kubernetes RAG Architecture: Core Components

High-Level Architecture Overview

A production RAG system on Kubernetes typically consists of:

graph TB
    subgraph k8s["Kubernetes Cluster"]
        frontend["Frontend Service<br/>(FastAPI)"]
        embedding["Embedding Service<br/>(Workers AI)"]
        vectordb["Vector Database<br/>(Qdrant)"]
        llm["LLM NIM<br/>(Inference)<br/>(GPU Pods)"]
        rerank["Reranking Service"]
        
        frontend --> embedding
        embedding --> vectordb
        frontend --> llm
        frontend -.-> rerank
    end
    
    style k8s fill:#f0f8ff,stroke:#4169e1,stroke-width:3px
    style frontend fill:#90ee90,stroke:#228b22,stroke-width:2px
    style embedding fill:#ffb6c1,stroke:#ff1493,stroke-width:2px
    style vectordb fill:#ffd700,stroke:#ff8c00,stroke-width:2px
    style llm fill:#ff69b4,stroke:#c71585,stroke-width:2px
    style rerank fill:#87ceeb,stroke:#4682b4,stroke-width:2px

Query Flow

flowchart TD
    Start([User Query]) --> Receive[Frontend Service Receives Query]
    Receive --> Embed[Generate Query Embedding<br/>Embedding Service]
    Embed --> Vector[Vector Similarity Search<br/>Qdrant Database]
    Vector --> Retrieve[Retrieve Top-K Documents<br/>k=5 by default]
    Retrieve --> CheckRerank{Reranking<br/>Enabled?}
    
    CheckRerank -->|Yes| Rerank[Rerank Retrieved Documents<br/>Reranking Service]
    CheckRerank -->|No| PrepareContext[Prepare Context from Documents]
    Rerank --> PrepareContext
    
    PrepareContext --> BuildPrompt[Build Prompt:<br/>Query + Context + Instructions]
    BuildPrompt --> LLMInference[LLM Inference<br/>GPU-accelerated NIM]
    LLMInference --> CheckQuality{Response<br/>Quality OK?}
    
    CheckQuality -->|No| Fallback[Fallback Response or Retry]
    CheckQuality -->|Yes| Format[Format Response]
    
    Fallback --> Format
    Format --> Cache{Cache<br/>Response?}
    Cache -->|Yes| StoreCache[Store in Redis Cache]
    Cache -->|No| Return
    StoreCache --> Return([Return to User])
    
    style Start fill:#90ee90,stroke:#228b22,stroke-width:3px
    style Embed fill:#ffb6c1,stroke:#ff1493,stroke-width:2px
    style Vector fill:#ffd700,stroke:#ff8c00,stroke-width:2px
    style Rerank fill:#87ceeb,stroke:#4682b4,stroke-width:2px
    style LLMInference fill:#ff69b4,stroke:#c71585,stroke-width:2px
    style Return fill:#90ee90,stroke:#228b22,stroke-width:3px
    style CheckRerank fill:#fff4e6,stroke:#ff8c00,stroke-width:2px
    style CheckQuality fill:#fff4e6,stroke:#ff8c00,stroke-width:2px
    style Cache fill:#fff4e6,stroke:#ff8c00,stroke-width:2px

Request Sequence

sequenceDiagram
    actor User
    participant Frontend as Frontend Service<br/>FastAPI
    participant Cache as Redis Cache
    participant Embed as Embedding Service<br/>Workers AI
    participant Vector as Vector Database<br/>Qdrant
    participant Rerank as Reranking Service
    participant LLM as LLM NIM<br/>GPU Inference
    participant Metrics as Prometheus Metrics

    User->>Frontend: POST /query<br/>{query: "How to deploy RAG?"}
    activate Frontend
    
    Note over Frontend,Metrics: Check Cache First
    Frontend->>Cache: Check cached response
    Cache-->>Frontend: Cache miss
    
    Note over Frontend,Metrics: Embedding Phase
    Frontend->>Embed: Generate embedding<br/>{text: query}
    activate Embed
    Embed->>Embed: Load model<br/>bge-large-en-v1.5
    Embed->>Embed: Generate vector<br/>[768 dimensions]
    Embed-->>Frontend: Embedding vector
    deactivate Embed
    
    Note over Frontend,Metrics: Retrieval Phase
    Frontend->>Vector: Search similar vectors<br/>{vector, top_k: 5}
    activate Vector
    Vector->>Vector: Cosine similarity<br/>search
    Vector->>Vector: Retrieve documents<br/>with scores
    Vector-->>Frontend: Top 5 documents<br/>[doc1, doc2, ...]
    deactivate Vector
    
    Note over Frontend,Metrics: Reranking Phase Optional
    Frontend->>Rerank: Rerank documents<br/>{query, documents}
    activate Rerank
    Rerank->>Rerank: Calculate relevance<br/>scores
    Rerank-->>Frontend: Reranked documents
    deactivate Rerank
    
    Note over Frontend,Metrics: Generation Phase
    Frontend->>Frontend: Build prompt<br/>{query + context}
    Frontend->>LLM: Generate response<br/>{prompt, max_tokens: 512}
    activate LLM
    LLM->>LLM: Load model weights<br/>Llama 3.1 70B
    LLM->>LLM: Generate tokens<br/>streaming
    LLM-->>Frontend: Generated response
    deactivate LLM
    
    Note over Frontend,Metrics: Post-processing
    Frontend->>Frontend: Validate response<br/>quality
    Frontend->>Cache: Store response<br/>{query: response}
    Frontend->>Metrics: Record metrics<br/>latency, tokens
    
    Frontend-->>User: 200 OK<br/>{answer, sources, metadata}
    deactivate Frontend
    
    Note over User,Metrics: Total Latency: ~2-5 seconds

Core Kubernetes Components for RAG

1. Deployments for Stateless Services

Your API servers, embedding generators, and LLM inference endpoints run as Kubernetes Deployments. For more on Docker containerization best practices, explore 500+ Docker tutorials.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
      - name: api
        image: your-registry/rag-api:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"

2. StatefulSets for Vector Databases

Vector databases like Qdrant, Weaviate, or Milvus require persistent storage and stable network identities:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant-cluster
spec:
  serviceName: qdrant
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:latest
        ports:
        - containerPort: 6333
        volumeMounts:
        - name: qdrant-storage
          mountPath: /qdrant/storage
  volumeClaimTemplates:
  - metadata:
      name: qdrant-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

3. Jobs for Batch Document Processing

For ingesting large document sets:

apiVersion: batch/v1
kind: Job
metadata:
  name: document-ingestion
spec:
  parallelism: 5
  completions: 10
  template:
    spec:
      containers:
      - name: ingest
        image: your-registry/document-processor:latest
        env:
        - name: BATCH_SIZE
          value: "1000"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
      restartPolicy: OnFailure

4. Services for Load Balancing

Expose your RAG components with stable endpoints:

apiVersion: v1
kind: Service
metadata:
  name: rag-api-service
spec:
  selector:
    app: rag-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer

Implementing Horizontal Pod Autoscaling (HPA)

The Critical Need for Autoscaling

Without autoscaling, organizations face a painful tradeoff: either overprovision compute resources to handle worst-case scenarios (wasting capital on idle GPUs 80% of the time) or underprovision and risk service degradation.

Standard Metrics-Based Autoscaling

Basic HPA configuration using CPU/memory metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Custom Metrics for RAG Workloads

For production RAG systems, CPU/memory alone isn’t enough. You need RAG-specific metrics:

Key Custom Metrics:

  • Concurrency (num_requests_running)
  • GPU KV Cache Usage (gpu_cache_usage_perc)
  • Time to First Token (TTFT) at 90th percentile
  • Query Complexity Score
  • Retrieval Latency

Example custom metrics HPA using Prometheus:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-nim-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: time_to_first_token_p90
      target:
        type: AverageValue
        averageValue: "2000m"  # 2000m = 2, i.e. 2 seconds if the metric is reported in seconds
  - type: Pods
    pods:
      metric:
        name: num_requests_running
      target:
        type: AverageValue
        averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60

NVIDIA NIM Autoscaling Best Practices

For latency-sensitive workloads such as customer service chatbots (input/output sequence lengths of roughly 256/256 tokens), autoscaling LLM NIM on TTFT p90 and concurrency helps maintain SLAs (TTFT <2s, end-to-end latency <20s).

Recommended Metrics by Component:

| Component | Primary Metric  | Secondary Metric | Target               |
|-----------|-----------------|------------------|----------------------|
| LLM NIM   | TTFT P90        | Concurrency      | <2s, <100 concurrent |
| Embedding | GPU Utilization | Request Queue    | 70-80%, <50 queued   |
| Reranking | GPU Utilization | Throughput       | 70-80%, >100 req/s   |
| Vector DB | Query Latency   | CPU Usage        | <100ms, <75%         |
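The TTFT p90 targets in the table come from raw per-request latencies. A minimal nearest-rank percentile in plain Python shows the arithmetic behind such an SLA check:

```python
# Nearest-rank percentile over raw TTFT samples (in seconds), stdlib only.
def percentile(values: list[float], pct: float) -> float:
    ordered = sorted(values)
    rank = round(pct / 100 * len(ordered)) - 1
    rank = max(0, min(len(ordered) - 1, rank))  # clamp to valid indices
    return ordered[rank]

ttft_samples = [0.4, 0.6, 0.8, 1.1, 1.3, 1.5, 1.8, 2.2, 2.9, 3.5]
p90 = percentile(ttft_samples, 90)   # 2.9
meets_sla = p90 < 2.0                # False: this window violates TTFT <2s
```

In production this computation happens inside Prometheus (via `histogram_quantile`), but the underlying idea is the same: the autoscaler reacts when the tail, not the average, drifts past the target.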

Vector Database Deployment on Kubernetes

Choosing the Right Vector Database

When building RAG systems, vector databases are the foundation of semantic search capabilities. For a deeper understanding of embeddings and vector databases, read Understanding Embeddings: The Math Behind AI Vector Databases.

Popular Options:

  1. Qdrant – Open source, Rust-based, excellent performance
  2. Weaviate – GraphQL API, built-in hybrid search
  3. Milvus – High throughput, enterprise features
  4. Pinecone – Managed service, can run self-hosted
  5. pgvector – PostgreSQL extension, familiar SQL

Deploying Qdrant with Helm

Qdrant provides users with a Helm Chart that can be used to deploy Qdrant on Kubernetes. For a complete guide on vector embeddings, see Vector Embeddings with Sentence Transformers and Docker.

Step 1: Create Configuration

# qdrant-config.yaml
replicaCount: 3

resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "8Gi"
    cpu: "2000m"

persistence:
  size: 100Gi
  storageClass: fast-ssd

service:
  type: ClusterIP
  port: 6333
  grpcPort: 6334

config:
  cluster:
    enabled: true
    p2p:
      port: 6335

Step 2: Deploy with Helm

# Add Qdrant repository
helm repo add qdrant https://qdrant.to/helm
helm repo update

# Install Qdrant cluster
helm install qdrant qdrant/qdrant \
  -n rag-system \
  -f qdrant-config.yaml \
  --create-namespace \
  --wait

Step 3: Verify Deployment

# Check pods
kubectl get pods -n rag-system

# Port forward for debugging
kubectl port-forward service/qdrant -n rag-system 6333:6333

# Access dashboard at http://localhost:6333/dashboard
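With the port-forward running, a quick smoke test confirms the cluster answers its HTTP API. This sketch uses only the standard library and assumes the default port 6333 from the steps above:

```python
# Stdlib-only smoke test for the port-forwarded Qdrant endpoint from Step 3.
# Assumes http://localhost:6333; returns False when the service is unreachable.
import json
import urllib.request

def qdrant_healthy(base_url: str = "http://localhost:6333") -> bool:
    try:
        with urllib.request.urlopen(f"{base_url}/collections", timeout=5) as resp:
            body = json.load(resp)
            return resp.status == 200 and "result" in body
    except (OSError, ValueError):
        return False

print("Qdrant reachable:", qdrant_healthy())
```

The same check works in-cluster against `http://qdrant.rag-system:6333`, which makes it a reasonable basis for a readiness probe in dependent services.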

Scaling Vector Databases

Horizontal Scaling Strategy:

# qdrant-scaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qdrant-hpa
  namespace: rag-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: qdrant-cluster
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: qdrant_query_latency_p95
      target:
        type: AverageValue
        averageValue: "100m"  # 100m = 0.1, i.e. 100ms if the metric is reported in seconds

Multi-Region Vector Database Deployment

For global applications with low-latency requirements:

# Deploy vector DB in multiple regions
apiVersion: v1
kind: Service
metadata:
  name: qdrant-global
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    app: qdrant
  ports:
  - port: 6333
    targetPort: 6333

---
# Configure regional clusters with cross-region replication
apiVersion: v1
kind: ConfigMap
metadata:
  name: qdrant-replication
data:
  config.yaml: |
    cluster:
      enabled: true
      p2p:
        port: 6335
      consensus:
        tick_period_ms: 100
    replication:
      enabled: true
      factor: 2
      regions:
        - us-east
        - us-west
        - eu-central

GPU Resource Management for RAG Workloads

GPU Node Pools Configuration

GKE GPU Node Pool:

gcloud container node-pools create gpu-pool \
  --cluster=rag-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=2 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10

AWS EKS GPU Node Group:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: rag-cluster
  region: us-east-1
nodeGroups:
  - name: gpu-workers
    instanceType: g4dn.xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 10
    volumeSize: 100
    labels:
      workload: gpu-inference
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule

GPU Pod Scheduling

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
  - name: nim-llm
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 16Gi
      requests:
        nvidia.com/gpu: 1
        memory: 12Gi
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

GPU Time Slicing (Multiple Workloads per GPU)

For smaller models or cost optimization:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-time-slicing
  namespace: gpu-operator
data:
  time-slicing-config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        replicas: 4  # Share GPU between 4 pods
        failRequestsGreaterThanOne: false

Document Processing Pipelines

Batch Processing with Kubernetes Jobs

apiVersion: batch/v1
kind: CronJob
metadata:
  name: document-ingestion-daily
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      parallelism: 10
      completions: 100
      backoffLimit: 3
      template:
        spec:
          containers:
          - name: document-processor
            image: your-registry/doc-processor:latest
            env:
            - name: CHUNK_SIZE
              value: "512"
            - name: OVERLAP
              value: "50"
            - name: EMBEDDING_MODEL
              value: "bge-large-en-v1.5"
            resources:
              requests:
                memory: "4Gi"
                cpu: "2000m"
          restartPolicy: OnFailure
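The `CHUNK_SIZE` and `OVERLAP` settings above drive the chunking step inside the container. A minimal word-based version (real pipelines usually count model tokens rather than whitespace words) looks like this:

```python
# Word-based chunking with overlap, mirroring CHUNK_SIZE=512 / OVERLAP=50.
# Overlapping tails keep context that would otherwise be lost at boundaries.
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step forward, keeping a shared tail
    return chunks

# A 1000-word document yields 3 chunks covering words 0-511, 462-973, 924-999.
chunks = chunk_document("word " * 1000)
```

Each chunk then goes to the embedding model named in `EMBEDDING_MODEL` before being upserted into the vector database.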

Stream Processing with Apache Kafka

For real-time document ingestion:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-document-consumer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: doc-consumer
  template:
    metadata:
      labels:
        app: doc-consumer
    spec:
      containers:
      - name: consumer
        image: your-registry/kafka-consumer:latest
        env:
        - name: KAFKA_BROKERS
          value: "kafka-0.kafka:9092,kafka-1.kafka:9092"
        - name: CONSUMER_GROUP
          value: "document-processors"
        - name: TOPIC
          value: "documents-to-index"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
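Inside the consumer container, a loop reads messages and hands each document to the indexing pipeline. Here is a sketch using the kafka-python client, with the message-handling logic separated so it can run without a broker; the message fields (`id`, `text`, `source`) are an assumed schema, not something the broker enforces:

```python
# Consumer-loop sketch matching the Deployment's environment variables.
# Kafka plumbing is isolated in run_consumer() so handle_document() can be
# exercised (and tested) without a broker.
import json

def handle_document(raw: bytes) -> dict:
    # Parse one message into the fields the indexing pipeline needs
    # (assumed schema: id and text required, source optional).
    doc = json.loads(raw)
    return {
        "id": doc["id"],
        "text": doc["text"],
        "source": doc.get("source", "unknown"),
    }

def run_consumer() -> None:
    # Requires the kafka-python package and a reachable broker.
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "documents-to-index",
        bootstrap_servers=["kafka-0.kafka:9092", "kafka-1.kafka:9092"],
        group_id="document-processors",
    )
    for message in consumer:
        record = handle_document(message.value)
        # ...chunk, embed, and upsert record["text"] into the vector DB...
```

Because all replicas share the `document-processors` consumer group, Kafka partitions the topic across the three pods automatically, giving horizontal scaling for free.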

Production RAG Stack: Complete Example

Here’s a complete production-ready RAG deployment using popular open-source tools. For more AI/ML containerization examples, check out Docker AI/ML case studies and learn about Agentic AI Workflows with Docker.

Complete Deployment Manifest

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: rag-production

---
# qdrant-vectordb.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: rag-production
spec:
  serviceName: qdrant
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:v1.7.4
        ports:
        - containerPort: 6333
        - containerPort: 6334
        resources:
          requests:
            memory: 4Gi
            cpu: 2000m
          limits:
            memory: 8Gi
            cpu: 4000m
        volumeMounts:
        - name: storage
          mountPath: /qdrant/storage
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi

---
# embedding-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
  namespace: rag-production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: embedding
  template:
    metadata:
      labels:
        app: embedding
    spec:
      containers:
      - name: embedding
        image: your-registry/embedding-service:latest
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_NAME
          value: "BAAI/bge-large-en-v1.5"
        - name: BATCH_SIZE
          value: "32"
        resources:
          requests:
            memory: 4Gi
            cpu: 2000m
            nvidia.com/gpu: 1
          limits:
            memory: 8Gi
            cpu: 4000m
            nvidia.com/gpu: 1

---
# llm-inference.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: rag-production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          value: "/models/llama-3.1-70b"
        - name: MAX_BATCH_SIZE
          value: "64"
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: 80Gi
          requests:
            nvidia.com/gpu: 2
            memory: 64Gi

---
# rag-api.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
  namespace: rag-production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
      - name: api
        image: your-registry/rag-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: QDRANT_URL
          value: "http://qdrant:6333"
        - name: EMBEDDING_URL
          value: "http://embedding-service:8080"
        - name: LLM_URL
          value: "http://llm-inference:8000"
        - name: TOP_K
          value: "5"
        resources:
          requests:
            memory: 2Gi
            cpu: 1000m
          limits:
            memory: 4Gi
            cpu: 2000m

---
# hpa-rag-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
  namespace: rag-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: request_latency_p95
      target:
        type: AverageValue
        averageValue: "500m"  # 500m = 0.5, i.e. 500ms if the metric is reported in seconds

---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rag-ingress
  namespace: rag-production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - rag.yourdomain.com
    secretName: rag-tls
  rules:
  - host: rag.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rag-api-service
            port:
              number: 80

Deployment Steps

# 1. Create namespace
kubectl apply -f namespace.yaml

# 2. Deploy vector database
kubectl apply -f qdrant-vectordb.yaml

# Wait for Qdrant to be ready
kubectl wait --for=condition=ready pod -l app=qdrant -n rag-production --timeout=300s

# 3. Deploy embedding service
kubectl apply -f embedding-service.yaml

# 4. Deploy LLM inference
kubectl apply -f llm-inference.yaml

# 5. Deploy RAG API
kubectl apply -f rag-api.yaml

# 6. Configure autoscaling
kubectl apply -f hpa-rag-api.yaml

# 7. Set up ingress
kubectl apply -f ingress.yaml

# 8. Verify deployment
kubectl get all -n rag-production

Monitoring and Observability

Prometheus Metrics for RAG

Key Metrics to Track:

# ServiceMonitor for RAG API
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rag-api-metrics
  namespace: rag-production
spec:
  selector:
    matchLabels:
      app: rag-api
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics

Custom Metrics to Expose:

# Python example using prometheus_client
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
rag_requests_total = Counter(
    'rag_requests_total',
    'Total RAG requests',
    ['endpoint', 'status']
)

# Latency metrics
retrieval_latency = Histogram(
    'rag_retrieval_latency_seconds',
    'Time spent on vector retrieval',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0]
)

generation_latency = Histogram(
    'rag_generation_latency_seconds',
    'Time spent on LLM generation',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0]
)

# System metrics
active_queries = Gauge(
    'rag_active_queries',
    'Number of queries currently being processed'
)

vector_db_size = Gauge(
    'rag_vector_db_documents',
    'Number of documents in vector database'
)

Grafana Dashboard

Example dashboard queries:

# Average retrieval latency
rate(rag_retrieval_latency_seconds_sum[5m]) / 
rate(rag_retrieval_latency_seconds_count[5m])

# P95 generation latency
histogram_quantile(0.95, 
  rate(rag_generation_latency_seconds_bucket[5m]))

# Request rate
rate(rag_requests_total[1m])

# Error rate
rate(rag_requests_total{status="error"}[5m]) / 
rate(rag_requests_total[5m])

# GPU utilization
DCGM_FI_DEV_GPU_UTIL{pod=~"llm-inference.*"}

Distributed Tracing with OpenTelemetry

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent.monitoring",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Instrument RAG pipeline
@tracer.start_as_current_span("rag_query")
def rag_query(query: str):
    with tracer.start_as_current_span("embedding"):
        embedding = generate_embedding(query)
    
    with tracer.start_as_current_span("retrieval"):
        docs = vector_db.search(embedding, top_k=5)
    
    with tracer.start_as_current_span("generation"):
        response = llm.generate(query, docs)
    
    return response

Cost Optimization Strategies

Right-Sizing Resources

Resource Request/Limit Best Practices:

| Component  | CPU Request | CPU Limit | Memory Request | Memory Limit | GPU |
|------------|-------------|-----------|----------------|--------------|-----|
| API Server | 500m        | 2000m     | 1Gi            | 4Gi          | 0   |
| Embedding  | 1000m       | 4000m     | 4Gi            | 8Gi          | 1   |
| LLM (70B)  | 2000m       | 8000m     | 64Gi           | 80Gi         | 2   |
| Vector DB  | 2000m       | 4000m     | 4Gi            | 8Gi          | 0   |
| Reranking  | 1000m       | 2000m     | 2Gi            | 4Gi          | 1   |

Spot/Preemptible Instances for Non-Critical Workloads

# GKE spot node pool for batch processing
apiVersion: v1
kind: Pod
metadata:
  name: batch-indexing
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: indexer
    image: your-registry/document-indexer:latest
    resources:
      requests:
        memory: 4Gi
        cpu: 2000m

Model Quantization for Cost Savings

Deploy quantized models to reduce GPU requirements:

apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  model.yaml: |
    model:
      name: "llama-3.1-70b-instruct"
      quantization: "int8"  # Or "int4" for even more savings
      max_batch_size: 64
      tensor_parallel: 2

Cost Impact:

  • INT8 quantization: 50% memory reduction, 10-15% speed increase
  • INT4 quantization: 75% memory reduction, 20-30% speed increase
  • Trade-off: 1-3% accuracy degradation
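The memory numbers follow directly from bytes per parameter. A quick back-of-envelope check for a 70B-parameter model (weights only; KV cache and activations add more on top):

```python
# Weight memory = parameters x bits-per-parameter / 8, reported in decimal GB.
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

fp16 = weight_memory_gb(70, 16)  # 140.0 GB (baseline)
int8 = weight_memory_gb(70, 8)   #  70.0 GB -> 50% reduction
int4 = weight_memory_gb(70, 4)   #  35.0 GB -> 75% reduction
```

This is why the INT8 configuration above fits the 70B model's weights on two 80GB GPUs with headroom, while FP16 would leave almost none.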

Storage Optimization

# Use tiered storage for vector database
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-hot-storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd-retained
  resources:
    requests:
      storage: 50Gi

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-cold-storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 500Gi

Security Best Practices

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rag-network-policy
  namespace: rag-production
spec:
  podSelector:
    matchLabels:
      app: rag-api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: qdrant
    ports:
    - protocol: TCP
      port: 6333
  - to:
    - podSelector:
        matchLabels:
          app: llm
    ports:
    - protocol: TCP
      port: 8000

Secret Management with External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rag-secrets
  namespace: rag-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: rag-api-secrets
    creationPolicy: Owner
  data:
  - secretKey: openai_api_key
    remoteRef:
      key: production/rag/openai
  - secretKey: vector_db_password
    remoteRef:
      key: production/rag/qdrant

Pod Security Standards

apiVersion: v1
kind: Pod
metadata:
  name: rag-api
  namespace: rag-production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: api
    image: your-registry/rag-api:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true

Common Production Challenges and Solutions

Challenge 1: Slow Retrieval Performance

Symptoms:

  • P95 retrieval latency >500ms
  • Vector search taking >200ms
  • High CPU usage on vector database

Solutions:

Distributed RAG deployments need a structured approach to minimizing latency and preventing performance bottlenecks. Keeping frequently accessed data close to the processing nodes is one of the most effective levers, and a Redis caching layer in front of the vector database is a common pattern:

# Implement caching layer
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis   # must match the selector above
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        resources:
          requests:
            memory: 4Gi
            cpu: 1000m
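On the application side, the cache-aside pattern is what makes this layer pay off: look up the embedding for a query first, and only call the embedding model on a miss. A sketch of the logic (assuming the Redis service above; `FakeRedis` is an in-memory stand-in so the pattern can be exercised without a live cluster):

```python
import hashlib
import json

def cached_embedding(client, text, embed_fn, ttl=3600):
    """Cache-aside lookup: return the cached embedding for `text` if
    present, otherwise compute it, store it with a TTL, and return it."""
    key = "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    vector = embed_fn(text)
    client.set(key, json.dumps(vector), ex=ttl)
    return vector

class FakeRedis:
    """In-memory stand-in for redis.Redis(host="redis-cache", port=6379)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, ex=None):
        self.store[key] = value
```

In production you would pass a real `redis.Redis(host="redis-cache", port=6379)` client instead of `FakeRedis`; the function itself does not change.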

Challenge 2: GPU Resource Contention

Symptoms:

  • LLM pods stuck in Pending state
  • OOM errors on GPU nodes
  • Inconsistent inference latency

Solutions:

# Implement GPU resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: rag-production
spec:
  hard:
    # Extended resources like nvidia.com/gpu only support the requests. prefix in ResourceQuota
    requests.nvidia.com/gpu: "10"

---
# Use priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-rag
value: 1000
globalDefault: false
description: "High priority for user-facing RAG queries"
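For the priority class to have any effect, user-facing pods must reference it by name. A minimal sketch (the pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rag-query-pod
  namespace: rag-production
spec:
  priorityClassName: high-priority-rag
  containers:
  - name: api
    image: your-registry/rag-api:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```

Under GPU contention, the scheduler can then preempt lower-priority batch pods to make room for user-facing queries.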

Challenge 3: Unpredictable Costs

Symptoms:

  • Monthly cloud bills fluctuating 50%+
  • Idle GPU resources during off-peak
  • Over-provisioned infrastructure

Solutions:

# Implement Cluster Autoscaler with node lifecycle hooks
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
data:
  config.yaml: |
    scale-down-enabled: true
    scale-down-delay-after-add: 10m
    scale-down-unneeded-time: 10m
    skip-nodes-with-local-storage: false
    balancing-ignore-label: node.kubernetes.io/exclude-from-external-load-balancers
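Complementing the autoscaler, a simple CronJob can scale batch deployments to zero during off-peak hours. A sketch (the schedule, names, and ServiceAccount are illustrative; the ServiceAccount needs RBAC permission to patch deployments):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-offpeak
  namespace: rag-production
spec:
  schedule: "0 22 * * *"   # 22:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - scale
            - deployment/document-processor
            - --replicas=0
```

A mirror-image CronJob scheduled for the morning restores the replica count.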

Challenge 4: Data Consistency Across Components

Symptoms:

  • Stale embeddings after document updates
  • Version mismatches between vector DB and source
  • Inconsistent search results

Solutions:

# Implement event-driven updates with Kafka
from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def update_document(doc_id: str, content: str):
    # Publish update event
    producer.send('document-updates', {
        'doc_id': doc_id,
        'content': content,
        'timestamp': time.time(),
        'operation': 'update'
    })
    
    # Trigger reindexing
    producer.send('reindex-queue', {
        'doc_id': doc_id,
        'priority': 'high'
    })
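On the consuming side, a worker reads these events and triggers reindexing. Keeping the handler logic free of Kafka specifics makes it easy to test; a sketch (`reindex_fn` is a placeholder for your actual reindexing call):

```python
import json

def handle_update_event(raw_message, reindex_fn):
    """Decode a document-update event and trigger reindexing for
    'update' and 'delete' operations; ignore anything else."""
    event = json.loads(raw_message)
    if event.get("operation") in ("update", "delete"):
        reindex_fn(event["doc_id"])
        return True
    return False

# Wiring it to Kafka (requires kafka-python and a running broker):
# from kafka import KafkaConsumer
# consumer = KafkaConsumer("document-updates", bootstrap_servers=["kafka:9092"])
# for msg in consumer:
#     handle_update_event(msg.value, reindex_fn=my_reindex)
```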

Real-World Case Studies

Case Study 1: Educational Platform RAG Deployment

Scenario: A customized production RAG system helping college students find class information, deployed for around 3 months, answering more than 250 questions across over 60 sessions.

Architecture:

  • GKE cluster with 5 nodes
  • Postgres with pgvector extension
  • OpenAI GPT-3.5 for generation
  • FastAPI + HTMX for UI

Cost: Approximately $150/month (primarily cloud hosting)

Key Learnings:

  • Kubernetes local testing with k3d saved deployment debugging time
  • Nightly database backups essential for data integrity
  • Optional Ollama support enabled API-key-free local testing

Case Study 2: Enterprise Document Search

Scenario: 50-person startup with 10,000+ internal documents across Notion, Google Docs, Confluence

Architecture:

  • 3-node Kubernetes cluster
  • Qdrant vector database (3 replicas)
  • Custom embedding service with bge-large
  • NVIDIA NIM for LLM inference

Results:

  • 30 minutes/day saved per employee
  • <500ms average query latency
  • 95% user satisfaction rate
  • $200/month total infrastructure cost

Kubernetes vs. Alternatives for RAG

Kubernetes vs. Serverless (AWS Lambda, Cloud Functions)

Aspect           | Kubernetes                | Serverless
GPU Support      | ✅ Native                 | ❌ Limited/None
Cold Start       | Minimal (pod reuse)       | 1-5 seconds
Cost at Scale    | Lower (resource sharing)  | Higher (per-invocation)
State Management | StatefulSets              | External services required
Control          | Full                      | Limited

Verdict: Kubernetes wins for GPU-intensive RAG workloads

Kubernetes vs. Docker Compose

Aspect            | Kubernetes         | Docker Compose
Scalability       | ✅ Automatic (HPA) | ❌ Manual
High Availability | ✅ Built-in        | ❌ Single host
Load Balancing    | ✅ Native          | ❌ Requires nginx
Production-Ready  | ✅ Yes             | ❌ Dev/Test only

Verdict: Docker Compose for local dev, Kubernetes for production

Kubernetes vs. Managed AI Platforms (AWS Bedrock, Azure OpenAI)

Aspect        | Kubernetes + OSS  | Managed Platforms
Cost          | $200-500/month    | $1000-5000/month
Data Privacy  | ✅ Full control   | ⚠️ Third-party
Customization | ✅ Unlimited      | ❌ Limited
Maintenance   | ⚠️ Self-managed   | ✅ Fully managed

Verdict: Depends on priorities (cost/control vs. convenience)

Future Trends: RAG on Kubernetes in 2025

Trend 1: Ray Serve Integration

Ray is emerging as a de facto standard for distributed AI workloads. For modern AI infrastructure, also explore Docker Model Runner for embedding models and Docker MCP Gateway for agentic AI.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rag-service
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: rag-app
        route_prefix: /
        import_path: rag_app:deployment
        runtime_env:
          pip:
            - langchain
            - qdrant-client
        deployments:
          - name: embedding
            num_replicas: 3
            ray_actor_options:
              num_gpus: 0.5
          - name: generation
            num_replicas: 2
            ray_actor_options:
              num_gpus: 1

Trend 2: KEDA Event-Driven Autoscaling

Scale RAG workloads based on queue depth or other custom metrics:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rag-queue-scaler
spec:
  scaleTargetRef:
    name: document-processor
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: doc-processors
      topic: documents-to-index
      lagThreshold: "100"

Trend 3: Multi-Cloud RAG Deployments

Organizations are increasingly deploying across AWS, GCP, and Azure for redundancy:

# Kubernetes Federation v2 (KubeFed)
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: rag-api-federated
  namespace: rag-production
spec:
  template:
    spec:
      # Same deployment spec across all clusters
  placement:
    clusters:
    - name: aws-us-east
    - name: gcp-us-central
    - name: azure-westus
  overrides:
  - clusterName: aws-us-east
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5

Conclusion: Building Production RAG Systems on Kubernetes

Deploying RAG systems on Kubernetes provides the scalability, reliability, and cost-efficiency that enterprise AI applications demand. The key takeaways:

Architecture Best Practices:

  • Use StatefulSets for vector databases, Deployments for stateless services
  • Implement HPA with custom RAG-specific metrics (TTFT, concurrency, retrieval latency)
  • Deploy multi-component systems with proper service mesh and load balancing
  • For hands-on practice with Kubernetes deployments, check out Kubernetes Hands-on Labs

Cost Optimization:

  • Right-size resources based on actual usage patterns
  • Use spot/preemptible instances for batch workloads
  • Implement model quantization (INT8/INT4) to reduce GPU requirements
  • Scale to zero during off-peak hours

Production Readiness:

  • Comprehensive monitoring with Prometheus + Grafana
  • Distributed tracing with OpenTelemetry
  • Network policies and secret management
  • Automated backup and disaster recovery

Scaling Strategies:

  • Horizontal autoscaling based on query load and latency
  • GPU resource quotas and priority classes
  • Event-driven processing with Kafka/KEDA
  • Multi-region deployments for global applications

The combination of Kubernetes orchestration, modern vector databases, and optimized LLM deployment creates a powerful foundation for production RAG systems that can scale from hundreds to millions of queries while maintaining sub-second latency.
