Introduction: Why Kubernetes for RAG Systems?
Retrieval-Augmented Generation (RAG) has become the cornerstone of enterprise AI applications, powering everything from intelligent chatbots to document search systems. However, deploying RAG systems at scale presents unique challenges that traditional infrastructure struggles to handle.
Enter Kubernetes, the cloud-native orchestration platform that’s revolutionizing how enterprises deploy and manage RAG applications. According to recent industry data, more than 80% of enterprises implementing Generative AI are now augmenting LLMs with frameworks like RAG.
But why Kubernetes specifically? The answer lies in RAG’s unique requirements: unpredictable workloads, multi-component architectures, and the need for both GPU and CPU resources at scale.
This comprehensive guide walks you through everything you need to know about deploying production-ready RAG systems on Kubernetes, from architecture design to cost optimization. If you’re new to Kubernetes, check out Kubelabs for hands-on tutorials from beginner to advanced levels, or get started with Kubernetes on Docker Desktop in 2 minutes.
What is RAG and Why Does It Need Kubernetes?
Understanding RAG Architecture
RAG combines two powerful AI concepts:
- Retrieval: Finding relevant information from a knowledge base using vector similarity search
- Generation: Using LLMs to create contextually accurate responses based on retrieved data
Unlike a vanilla LLM, a RAG system supplies the model with relevant information from a specific knowledge domain at query time. This is often more efficient than fine-tuning and helps prevent hallucinations.
The Challenge of Scaling RAG
RAG systems are inherently complex, involving multiple components:
- Document Processing Pipeline: Ingesting and chunking documents
- Embedding Generation: Converting text to vector representations
- Vector Database: Storing and searching high-dimensional vectors
- LLM Inference: Generating responses based on retrieved context
- Orchestration Layer: Managing the flow between components
Each component has different resource requirements, scaling patterns, and performance characteristics.
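The orchestration layer ties these components together into a single query path. A minimal sketch of that contract in Python, where each callable stands in for a separately deployed service (the names and toy stand-ins here are illustrative, not a real client library):

```python
from typing import Callable, List

def rag_pipeline(
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[str]],
    generate: Callable[[str], str],
    query: str,
    top_k: int = 5,
) -> str:
    """Orchestration layer: each callable stands in for a separately
    deployed service (embedding, vector DB, LLM inference)."""
    vector = embed(query)                 # Embedding Generation
    context = search(vector, top_k)       # Vector Database lookup
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return generate(prompt)               # LLM Inference

# Toy stand-ins, just to show how the pieces plug together
answer = rag_pipeline(
    embed=lambda q: [float(len(q))],
    search=lambda v, k: ["doc-a", "doc-b"][:k],
    generate=lambda p: "stub answer for: " + p.splitlines()[-1],
    query="What is RAG?",
)
print(answer)
```

Because each argument is an independent service boundary, each can be scaled, upgraded, or swapped without touching the others — which is exactly what Kubernetes Deployments and Services give you.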
Why Traditional Infrastructure Falls Short
RAG workloads can be unpredictable and bursty—demand can spike suddenly during peak usage periods and drop to near-zero during off-peak hours.
Traditional infrastructure forces a painful choice:
- Overprovision: Waste 80% of GPU resources on idle capacity
- Underprovision: Risk service degradation and poor user experience
Kubernetes solves this with dynamic resource allocation, automated scaling, and efficient orchestration. New to Kubernetes? Start with the Kubernetes Beginners Track to understand core concepts.
Kubernetes RAG Architecture: Core Components
High-Level Architecture Overview
A production RAG system on Kubernetes typically consists of:
graph TB
subgraph k8s["Kubernetes Cluster"]
frontend["Frontend Service<br/>(FastAPI)"]
embedding["Embedding Service<br/>(Workers AI)"]
vectordb["Vector Database<br/>(Qdrant)"]
llm["LLM NIM<br/>(Inference)<br/>(GPU Pods)"]
rerank["Reranking Service"]
frontend --> embedding
embedding --> vectordb
frontend --> llm
frontend -.-> rerank
end
style k8s fill:#f0f8ff,stroke:#4169e1,stroke-width:3px
style frontend fill:#90ee90,stroke:#228b22,stroke-width:2px
style embedding fill:#ffb6c1,stroke:#ff1493,stroke-width:2px
style vectordb fill:#ffd700,stroke:#ff8c00,stroke-width:2px
style llm fill:#ff69b4,stroke:#c71585,stroke-width:2px
style rerank fill:#87ceeb,stroke:#4682b4,stroke-width:2px
Query processing flow:
flowchart TD
Start([User Query]) --> Receive[Frontend Service Receives Query]
Receive --> Embed[Generate Query Embedding<br/>Embedding Service]
Embed --> Vector[Vector Similarity Search<br/>Qdrant Database]
Vector --> Retrieve[Retrieve Top-K Documents<br/>k=5 by default]
Retrieve --> CheckRerank{Reranking<br/>Enabled?}
CheckRerank -->|Yes| Rerank[Rerank Retrieved Documents<br/>Reranking Service]
CheckRerank -->|No| PrepareContext[Prepare Context from Documents]
Rerank --> PrepareContext
PrepareContext --> BuildPrompt[Build Prompt:<br/>Query + Context + Instructions]
BuildPrompt --> LLMInference[LLM Inference<br/>GPU-accelerated NIM]
LLMInference --> CheckQuality{Response<br/>Quality OK?}
CheckQuality -->|No| Fallback[Fallback Response or Retry]
CheckQuality -->|Yes| Format[Format Response]
Fallback --> Format
Format --> Cache{Cache<br/>Response?}
Cache -->|Yes| StoreCache[Store in Redis Cache]
Cache -->|No| Return
StoreCache --> Return([Return to User])
style Start fill:#90ee90,stroke:#228b22,stroke-width:3px
style Embed fill:#ffb6c1,stroke:#ff1493,stroke-width:2px
style Vector fill:#ffd700,stroke:#ff8c00,stroke-width:2px
style Rerank fill:#87ceeb,stroke:#4682b4,stroke-width:2px
style LLMInference fill:#ff69b4,stroke:#c71585,stroke-width:2px
style Return fill:#90ee90,stroke:#228b22,stroke-width:3px
style CheckRerank fill:#fff4e6,stroke:#ff8c00,stroke-width:2px
style CheckQuality fill:#fff4e6,stroke:#ff8c00,stroke-width:2px
style Cache fill:#fff4e6,stroke:#ff8c00,stroke-width:2px
Sequence of events:

sequenceDiagram
actor User
participant Frontend as Frontend Service<br/>FastAPI
participant Cache as Redis Cache
participant Embed as Embedding Service<br/>Workers AI
participant Vector as Vector Database<br/>Qdrant
participant Rerank as Reranking Service
participant LLM as LLM NIM<br/>GPU Inference
participant Metrics as Prometheus Metrics
User->>Frontend: POST /query<br/>{query: "How to deploy RAG?"}
activate Frontend
Note over Frontend,Metrics: Check Cache First
Frontend->>Cache: Check cached response
Cache-->>Frontend: Cache miss
Note over Frontend,Metrics: Embedding Phase
Frontend->>Embed: Generate embedding<br/>{text: query}
activate Embed
Embed->>Embed: Load model<br/>bge-large-en-v1.5
Embed->>Embed: Generate vector<br/>[768 dimensions]
Embed-->>Frontend: Embedding vector
deactivate Embed
Note over Frontend,Metrics: Retrieval Phase
Frontend->>Vector: Search similar vectors<br/>{vector, top_k: 5}
activate Vector
Vector->>Vector: Cosine similarity<br/>search
Vector->>Vector: Retrieve documents<br/>with scores
Vector-->>Frontend: Top 5 documents<br/>[doc1, doc2, ...]
deactivate Vector
Note over Frontend,Metrics: Reranking Phase Optional
Frontend->>Rerank: Rerank documents<br/>{query, documents}
activate Rerank
Rerank->>Rerank: Calculate relevance<br/>scores
Rerank-->>Frontend: Reranked documents
deactivate Rerank
Note over Frontend,Metrics: Generation Phase
Frontend->>Frontend: Build prompt<br/>{query + context}
Frontend->>LLM: Generate response<br/>{prompt, max_tokens: 512}
activate LLM
LLM->>LLM: Load model weights<br/>Llama 3.1 70B
LLM->>LLM: Generate tokens<br/>streaming
LLM-->>Frontend: Generated response
deactivate LLM
Note over Frontend,Metrics: Post-processing
Frontend->>Frontend: Validate response<br/>quality
Frontend->>Cache: Store response<br/>{query: response}
Frontend->>Metrics: Record metrics<br/>latency, tokens
Frontend-->>User: 200 OK<br/>{answer, sources, metadata}
deactivate Frontend
Note over User,Metrics: Total Latency: ~2-5 seconds
Core Kubernetes Components for RAG
1. Deployments for Stateless Services
Your API servers, embedding generators, and LLM inference endpoints run as Kubernetes Deployments. For more on Docker containerization best practices, explore 500+ Docker tutorials.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
      - name: api
        image: your-registry/rag-api:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
2. StatefulSets for Vector Databases
Vector databases like Qdrant, Weaviate, or Milvus require persistent storage and stable network identities:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant-cluster
spec:
  serviceName: qdrant
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:latest
        ports:
        - containerPort: 6333
        volumeMounts:
        - name: qdrant-storage
          mountPath: /qdrant/storage
  volumeClaimTemplates:
  - metadata:
      name: qdrant-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
3. Jobs for Batch Document Processing
For ingesting large document sets:
apiVersion: batch/v1
kind: Job
metadata:
  name: document-ingestion
spec:
  parallelism: 5
  completions: 10
  template:
    spec:
      containers:
      - name: ingest
        image: your-registry/document-processor:latest
        env:
        - name: BATCH_SIZE
          value: "1000"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
      restartPolicy: OnFailure # Jobs require an explicit restart policy
4. Services for Load Balancing
Expose your RAG components with stable endpoints:
apiVersion: v1
kind: Service
metadata:
  name: rag-api-service
spec:
  selector:
    app: rag-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
Implementing Horizontal Pod Autoscaling (HPA)
The Critical Need for Autoscaling
Without autoscaling, organizations face a painful tradeoff: either overprovision compute resources to handle worst-case scenarios (wasting capital on idle GPUs 80% of the time) or underprovision and risk service degradation.
Standard Metrics-Based Autoscaling
Basic HPA configuration using CPU/memory metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
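Under the hood, the HPA controller computes the desired replica count as `ceil(currentReplicas * currentMetricValue / targetMetricValue)`, clamped to the min/max bounds. A small sketch of that rule:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Core HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# With the manifest above: 3 pods averaging 90% CPU against a 70% target
print(desired_replicas(3, 90, 70, 2, 10))  # -> 4
```

This is why targets like 70% leave headroom: scaling only triggers once observed utilization drifts meaningfully away from the target ratio.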
Custom Metrics for RAG Workloads
For production RAG systems, CPU/memory alone isn’t enough. You need RAG-specific metrics:
Key Custom Metrics:
- Concurrency (num_requests_running)
- GPU KV Cache Usage (gpu_cache_usage_perc)
- Time to First Token (TTFT) at 90th percentile
- Query Complexity Score
- Retrieval Latency
Example custom metrics HPA using Prometheus:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-nim-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: time_to_first_token_p90
      target:
        type: AverageValue
        averageValue: "2000m" # 2 seconds (quantity 2000m = 2.0, metric reported in seconds)
  - type: Pods
    pods:
      metric:
        name: num_requests_running
      target:
        type: AverageValue
        averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
NVIDIA NIM Autoscaling Best Practices
For latency-sensitive workloads like customer service chatbots (input/output sequence lengths of 256/256 tokens), autoscaling LLM NIM on TTFT p90 and concurrency helps maintain SLAs (TTFT <2s, end-to-end latency <20s).
Recommended Metrics by Component:
| Component | Primary Metric | Secondary Metric | Target |
|---|---|---|---|
| LLM NIM | TTFT P90 | Concurrency | <2s, <100 concurrent |
| Embedding | GPU Utilization | Request Queue | 70-80%, <50 queued |
| Reranking | GPU Utilization | Throughput | 70-80%, >100 req/s |
| Vector DB | Query Latency | CPU Usage | <100ms, <75% |
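Percentile targets like TTFT p90 come from latency samples. A minimal nearest-rank percentile sketch (Prometheus's `histogram_quantile` interpolates over buckets instead, so its values differ slightly; the sample data is hypothetical):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical TTFT samples (ms) scraped from one NIM pod
ttft_ms = [350, 420, 500, 610, 700, 820, 950, 1100, 1600, 2400]
p90 = percentile(ttft_ms, 90)
print(p90, p90 < 2000)  # 1600 ms -> within the 2 s TTFT SLA
```

Note that a single 2400 ms outlier does not breach a p90 target; that is exactly why tail percentiles, not maxima, make stable autoscaling signals.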
Vector Database Deployment on Kubernetes
Choosing the Right Vector Database
When building RAG systems, vector databases are the foundation of semantic search capabilities. For a deeper understanding of embeddings and vector databases, read Understanding Embeddings: The Math Behind AI Vector Databases.
Popular Options:
- Qdrant – Open source, Rust-based, excellent performance
- Weaviate – GraphQL API, built-in hybrid search
- Milvus – High throughput, enterprise features
- Pinecone – Managed service, can run self-hosted
- pgvector – PostgreSQL extension, familiar SQL
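Whichever database you choose, the core operation is the same: rank stored vectors by similarity to a query vector. A pure-Python sketch of cosine-similarity top-k (real databases use approximate indexes like HNSW to avoid this linear scan; the document IDs here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, docs, k=5):
    """docs: list of (doc_id, vector); returns the k most similar ids."""
    scored = sorted(docs, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = [("intro", [1.0, 0.0]), ("pricing", [0.0, 1.0]), ("deploy", [0.8, 0.6])]
print(top_k([1.0, 0.1], docs, k=2))  # -> ['intro', 'deploy']
```

The brute-force scan is O(n) per query, which is why dedicated vector databases and their ANN indexes matter once collections reach millions of vectors.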
Deploying Qdrant with Helm
Qdrant provides users with a Helm Chart that can be used to deploy Qdrant on Kubernetes. For a complete guide on vector embeddings, see Vector Embeddings with Sentence Transformers and Docker.
Step 1: Create Configuration
# qdrant-config.yaml
replicaCount: 3
resources:
  requests:
    memory: "4Gi"
    cpu: "1000m"
  limits:
    memory: "8Gi"
    cpu: "2000m"
persistence:
  size: 100Gi
  storageClass: fast-ssd
service:
  type: ClusterIP
  port: 6333
  grpcPort: 6334
config:
  cluster:
    enabled: true
    p2p:
      port: 6335
Step 2: Deploy with Helm
# Add Qdrant repository
helm repo add qdrant https://qdrant.to/helm
helm repo update
# Install Qdrant cluster
helm install qdrant qdrant/qdrant \
-n rag-system \
-f qdrant-config.yaml \
--create-namespace \
--wait
Step 3: Verify Deployment
# Check pods
kubectl get pods -n rag-system
# Port forward for debugging
kubectl port-forward service/qdrant -n rag-system 6333:6333
# Access dashboard at http://localhost:6333/dashboard
Scaling Vector Databases
Horizontal Scaling Strategy:
# qdrant-scaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qdrant-hpa
  namespace: rag-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: qdrant-cluster
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: qdrant_query_latency_p95
      target:
        type: AverageValue
        averageValue: "100m" # 100ms (quantity 100m = 0.1, metric reported in seconds)
Multi-Region Vector Database Deployment
For global applications with low-latency requirements:
# Deploy the vector DB behind a global load balancer
apiVersion: v1
kind: Service
metadata:
  name: qdrant-global
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: LoadBalancer
  selector:
    app: qdrant
  ports:
  - port: 6333
    targetPort: 6333
---
# Configure regional clusters with cross-region replication
apiVersion: v1
kind: ConfigMap
metadata:
  name: qdrant-replication
data:
  config.yaml: |
    cluster:
      enabled: true
      p2p:
        port: 6335
      consensus:
        tick_period_ms: 100
    replication:
      enabled: true
      factor: 2
    regions:
    - us-east
    - us-west
    - eu-central
GPU Resource Management for RAG Workloads
GPU Node Pools Configuration
GKE GPU Node Pool:
gcloud container node-pools create gpu-pool \
--cluster=rag-cluster \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--num-nodes=2 \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10
AWS EKS GPU Node Group:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: rag-cluster
  region: us-east-1
nodeGroups:
- name: gpu-workers
  instanceType: g4dn.xlarge
  desiredCapacity: 2
  minSize: 1
  maxSize: 10
  volumeSize: 100
  labels:
    workload: gpu-inference
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
GPU Pod Scheduling
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
  - name: nim-llm
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 16Gi
      requests:
        nvidia.com/gpu: 1
        memory: 12Gi
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
GPU Time Slicing (Multiple Workloads per GPU)
For smaller models or cost optimization:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-time-slicing
  namespace: gpu-operator
data:
  time-slicing-config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        replicas: 4 # Share each GPU between 4 pods
        failRequestsGreaterThanOne: false
Document Processing Pipelines
Batch Processing with Kubernetes Jobs
apiVersion: batch/v1
kind: CronJob
metadata:
  name: document-ingestion-daily
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      parallelism: 10
      completions: 100
      backoffLimit: 3
      template:
        spec:
          containers:
          - name: document-processor
            image: your-registry/doc-processor:latest
            env:
            - name: CHUNK_SIZE
              value: "512"
            - name: OVERLAP
              value: "50"
            - name: EMBEDDING_MODEL
              value: "bge-large-en-v1.5"
            resources:
              requests:
                memory: "4Gi"
                cpu: "2000m"
          restartPolicy: OnFailure
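The `CHUNK_SIZE` and `OVERLAP` settings map to a simple sliding-window split. A minimal sketch of what the processor might do (counting characters here for clarity; production pipelines usually count tokens):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50):
    """Split text into overlapping chunks; consecutive chunks share
    `overlap` characters so sentences spanning a boundary are not lost."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(doc)
print(len(chunks), [len(c) for c in chunks])  # 3 chunks, each <= 512 chars
```

The overlap trades a little index size for retrieval quality: context that straddles a chunk boundary is still fully contained in at least one chunk.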
Stream Processing with Apache Kafka
For real-time document ingestion:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-document-consumer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: doc-consumer
  template:
    metadata:
      labels:
        app: doc-consumer
    spec:
      containers:
      - name: consumer
        image: your-registry/kafka-consumer:latest
        env:
        - name: KAFKA_BROKERS
          value: "kafka-0.kafka:9092,kafka-1.kafka:9092"
        - name: CONSUMER_GROUP
          value: "document-processors"
        - name: TOPIC
          value: "documents-to-index"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
Production RAG Stack: Complete Example
Here’s a complete production-ready RAG deployment using popular open-source tools. For more AI/ML containerization examples, check out Docker AI/ML case studies and learn about Agentic AI Workflows with Docker.
Complete Deployment Manifest
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: rag-production
---
# qdrant-vectordb.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: rag-production
spec:
  serviceName: qdrant
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:v1.7.4
        ports:
        - containerPort: 6333
        - containerPort: 6334
        resources:
          requests:
            memory: 4Gi
            cpu: 2000m
          limits:
            memory: 8Gi
            cpu: 4000m
        volumeMounts:
        - name: storage
          mountPath: /qdrant/storage
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
---
# embedding-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
  namespace: rag-production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: embedding
  template:
    metadata:
      labels:
        app: embedding
    spec:
      containers:
      - name: embedding
        image: your-registry/embedding-service:latest
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_NAME
          value: "BAAI/bge-large-en-v1.5"
        - name: BATCH_SIZE
          value: "32"
        resources:
          requests:
            memory: 4Gi
            cpu: 2000m
            nvidia.com/gpu: 1
          limits:
            memory: 8Gi
            cpu: 4000m
            nvidia.com/gpu: 1
---
# llm-inference.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: rag-production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          value: "/models/llama-3.1-70b"
        - name: MAX_BATCH_SIZE
          value: "64"
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: 80Gi
          requests:
            nvidia.com/gpu: 2
            memory: 64Gi
---
# rag-api.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
  namespace: rag-production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
      - name: api
        image: your-registry/rag-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: QDRANT_URL
          value: "http://qdrant:6333"
        - name: EMBEDDING_URL
          value: "http://embedding-service:8080"
        - name: LLM_URL
          value: "http://llm-inference:8000"
        - name: TOP_K
          value: "5"
        resources:
          requests:
            memory: 2Gi
            cpu: 1000m
          limits:
            memory: 4Gi
            cpu: 2000m
---
# hpa-rag-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-api-hpa
  namespace: rag-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: request_latency_p95
      target:
        type: AverageValue
        averageValue: "500m" # 500ms (quantity 500m = 0.5, metric reported in seconds)
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rag-ingress
  namespace: rag-production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - rag.yourdomain.com
    secretName: rag-tls
  rules:
  - host: rag.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rag-api-service
            port:
              number: 80
Deployment Steps
# 1. Create namespace
kubectl apply -f namespace.yaml
# 2. Deploy vector database
kubectl apply -f qdrant-vectordb.yaml
# Wait for Qdrant to be ready
kubectl wait --for=condition=ready pod -l app=qdrant -n rag-production --timeout=300s
# 3. Deploy embedding service
kubectl apply -f embedding-service.yaml
# 4. Deploy LLM inference
kubectl apply -f llm-inference.yaml
# 5. Deploy RAG API
kubectl apply -f rag-api.yaml
# 6. Configure autoscaling
kubectl apply -f hpa-rag-api.yaml
# 7. Set up ingress
kubectl apply -f ingress.yaml
# 8. Verify deployment
kubectl get all -n rag-production
Monitoring and Observability
Prometheus Metrics for RAG
Key Metrics to Track:
# ServiceMonitor for the RAG API
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rag-api-metrics
  namespace: rag-production
spec:
  selector:
    matchLabels:
      app: rag-api
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
Custom Metrics to Expose:
# Python example using prometheus_client
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
rag_requests_total = Counter(
    'rag_requests_total',
    'Total RAG requests',
    ['endpoint', 'status']
)

# Latency metrics
retrieval_latency = Histogram(
    'rag_retrieval_latency_seconds',
    'Time spent on vector retrieval',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0]
)
generation_latency = Histogram(
    'rag_generation_latency_seconds',
    'Time spent on LLM generation',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0]
)

# System metrics
active_queries = Gauge(
    'rag_active_queries',
    'Number of queries currently being processed'
)
vector_db_size = Gauge(
    'rag_vector_db_documents',
    'Number of documents in the vector database'
)
Grafana Dashboard
Example dashboard queries:
# Average retrieval latency
rate(rag_retrieval_latency_seconds_sum[5m]) /
rate(rag_retrieval_latency_seconds_count[5m])
# P95 generation latency
histogram_quantile(0.95,
rate(rag_generation_latency_seconds_bucket[5m]))
# Request rate
rate(rag_requests_total[1m])
# Error rate
rate(rag_requests_total{status="error"}[5m]) /
rate(rag_requests_total[5m])
# GPU utilization
DCGM_FI_DEV_GPU_UTIL{pod=~"llm-inference.*"}
Distributed Tracing with OpenTelemetry
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure the tracer
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent.monitoring",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)

# Instrument the RAG pipeline
@tracer.start_as_current_span("rag_query")
def rag_query(query: str):
    with tracer.start_as_current_span("embedding"):
        embedding = generate_embedding(query)
    with tracer.start_as_current_span("retrieval"):
        docs = vector_db.search(embedding, top_k=5)
    with tracer.start_as_current_span("generation"):
        response = llm.generate(query, docs)
    return response
Cost Optimization Strategies
Right-Sizing Resources
Resource Request/Limit Best Practices:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | GPU |
|---|---|---|---|---|---|
| API Server | 500m | 2000m | 1Gi | 4Gi | 0 |
| Embedding | 1000m | 4000m | 4Gi | 8Gi | 1 |
| LLM (70B) | 2000m | 8000m | 64Gi | 80Gi | 2 |
| Vector DB | 2000m | 4000m | 4Gi | 8Gi | 0 |
| Reranking | 1000m | 2000m | 2Gi | 4Gi | 1 |
Spot/Preemptible Instances for Non-Critical Workloads
# Pod scheduled onto GKE Spot nodes for batch processing
apiVersion: v1
kind: Pod
metadata:
  name: batch-indexing
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: indexer
    image: your-registry/document-indexer:latest
    resources:
      requests:
        memory: 4Gi
        cpu: 2000m
Model Quantization for Cost Savings
Deploy quantized models to reduce GPU requirements:
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
data:
  model.yaml: |
    model:
      name: "llama-3.1-70b-instruct"
      quantization: "int8" # Or "int4" for even more savings
      max_batch_size: 64
      tensor_parallel: 2
Cost Impact:
- INT8 quantization: 50% memory reduction, 10-15% speed increase
- INT4 quantization: 75% memory reduction, 20-30% speed increase
- Trade-off: 1-3% accuracy degradation
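The memory savings follow directly from bytes-per-parameter. A back-of-the-envelope sketch for weight memory (weights only; KV cache and activations add substantially on top, so treat these as lower bounds):

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Rough GiB needed to hold model weights alone."""
    return n_params * bits_per_param / 8 / 2**30

# A 70B-parameter model at different precisions
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"70B {name}: ~{weight_memory_gib(70e9, bits):.0f} GiB")
# fp16 ~130 GiB, int8 ~65 GiB (50% less), int4 ~33 GiB (75% less)
```

This is why the int8 deployment above fits in 2 GPUs' worth of memory where fp16 would not, matching the 50%/75% reduction figures quoted.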
Storage Optimization
# Use tiered storage for the vector database
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-hot-storage
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ssd-retained
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-cold-storage
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 500Gi
Security Best Practices
Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rag-network-policy
  namespace: rag-production
spec:
  podSelector:
    matchLabels:
      app: rag-api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: qdrant
    ports:
    - protocol: TCP
      port: 6333
  - to:
    - podSelector:
        matchLabels:
          app: llm
    ports:
    - protocol: TCP
      port: 8000
Secret Management with External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rag-secrets
  namespace: rag-production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: rag-api-secrets
    creationPolicy: Owner
  data:
  - secretKey: openai_api_key
    remoteRef:
      key: production/rag/openai
  - secretKey: vector_db_password
    remoteRef:
      key: production/rag/qdrant
Pod Security Standards
apiVersion: v1
kind: Pod
metadata:
  name: rag-api
  namespace: rag-production
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: api
    image: your-registry/rag-api:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
Common Production Challenges and Solutions
Challenge 1: Slow Retrieval Performance
Symptoms:
- P95 retrieval latency >500ms
- Vector search taking >200ms
- High CPU usage on vector database
Solutions:
Implementing distributed systems in RAG applications requires a structured approach to minimize latency and prevent performance bottlenecks. Keeping frequently accessed data close to the processing nodes significantly reduces latency.
# Implement a caching layer
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis # Required so the selector above matches these pods
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        resources:
          requests:
            memory: 4Gi
            cpu: 1000m
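On the application side, the cache lookup keys on a hash of the normalized query. A sketch of the pattern (an in-process dict stands in for Redis here; in production you would swap `self._store` for a redis-py client and let Redis handle TTL expiry):

```python
import hashlib
import time

class ResponseCache:
    """RAG response cache keyed by a hash of the normalized query."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._store = {}          # stand-in for a Redis connection
        self.ttl = ttl_seconds

    def _key(self, query: str) -> str:
        # Normalize so trivially different phrasings share a key
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        value, expires = entry
        return value if time.monotonic() <= expires else None

    def set(self, query: str, response: str):
        self._store[self._key(query)] = (response, time.monotonic() + self.ttl)
```

Even a modest hit rate pays for itself: a cache hit skips the embedding, retrieval, and GPU generation steps entirely.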
Challenge 2: GPU Resource Contention
Symptoms:
- LLM pods stuck in Pending state
- OOM errors on GPU nodes
- Inconsistent inference latency
Solutions:
# Implement GPU resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: rag-production
spec:
  hard:
    requests.nvidia.com/gpu: "10"
    limits.nvidia.com/gpu: "10"
---
# Use priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-rag
value: 1000
globalDefault: false
description: "High priority for user-facing RAG queries"
Challenge 3: Unpredictable Costs
Symptoms:
- Monthly cloud bills fluctuating 50%+
- Idle GPU resources during off-peak
- Over-provisioned infrastructure
Solutions:
# Tune the Cluster Autoscaler for aggressive scale-down
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
data:
  config.yaml: |
    scale-down-enabled: true
    scale-down-delay-after-add: 10m
    scale-down-unneeded-time: 10m
    skip-nodes-with-local-storage: false
    balancing-ignore-label: node.kubernetes.io/exclude-from-external-load-balancers
Challenge 4: Data Consistency Across Components
Symptoms:
- Stale embeddings after document updates
- Version mismatches between vector DB and source
- Inconsistent search results
Solutions:
# Implement event-driven updates with Kafka
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def update_document(doc_id: str, content: str):
    # Publish the update event
    producer.send('document-updates', {
        'doc_id': doc_id,
        'content': content,
        'timestamp': time.time(),
        'operation': 'update'
    })
    # Trigger reindexing
    producer.send('reindex-queue', {
        'doc_id': doc_id,
        'priority': 'high'
    })
Real-World Case Studies
Case Study 1: Educational Platform RAG Deployment
Scenario: A customized production RAG system helped college students find class information. Deployed for around three months, it answered 250+ questions across 60+ student sessions.
Architecture:
- GKE cluster with 5 nodes
- Postgres with pgvector extension
- OpenAI GPT-3.5 for generation
- FastAPI + HTMX for UI
Cost: Approximately $150/month (primarily cloud hosting)
Key Learnings:
- Kubernetes local testing with k3d saved deployment debugging time
- Nightly database backups essential for data integrity
- Optional Ollama support enabled API-key-free local testing
Case Study 2: Enterprise Document Search
Scenario: 50-person startup with 10,000+ internal documents across Notion, Google Docs, Confluence
Architecture:
- 3-node Kubernetes cluster
- Qdrant vector database (3 replicas)
- Custom embedding service with bge-large
- NVIDIA NIM for LLM inference
Results:
- 30 minutes/day saved per employee
- <500ms average query latency
- 95% user satisfaction rate
- $200/month total infrastructure cost
Kubernetes vs. Alternatives for RAG
Kubernetes vs. Serverless (AWS Lambda, Cloud Functions)
| Aspect | Kubernetes | Serverless |
|---|---|---|
| GPU Support | ✅ Native | ❌ Limited/None |
| Cold Start | Minimal (pod reuse) | 1-5 seconds |
| Cost at Scale | Lower (resource sharing) | Higher (per-invocation) |
| State Management | StatefulSets | External services required |
| Control | Full | Limited |
Verdict: Kubernetes wins for GPU-intensive RAG workloads
Kubernetes vs. Docker Compose
| Aspect | Kubernetes | Docker Compose |
|---|---|---|
| Scalability | ✅ Automatic HPA | ❌ Manual |
| High Availability | ✅ Built-in | ❌ Single host |
| Load Balancing | ✅ Native | ❌ Requires nginx |
| Production-Ready | ✅ Yes | ❌ Dev/Test only |
Verdict: Docker Compose for local dev, Kubernetes for production
Kubernetes vs. Managed AI Platforms (AWS Bedrock, Azure OpenAI)
| Aspect | Kubernetes + OSS | Managed Platforms |
|---|---|---|
| Cost | $200-500/month | $1000-5000/month |
| Data Privacy | ✅ Full control | ⚠️ Third-party |
| Customization | ✅ Unlimited | ❌ Limited |
| Maintenance | ⚠️ Self-managed | ✅ Fully managed |
Verdict: Depends on priorities (cost/control vs. convenience)
Future Trends: RAG on Kubernetes in 2025
Trend 1: Ray Serve Integration
Ray is becoming the standard for distributed AI workloads. For modern AI infrastructure, also explore Docker Model Runner for embedding models and Docker MCP Gateway for agentic AI.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rag-service
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
    - name: rag-app
      route_prefix: /
      import_path: rag_app:deployment
      runtime_env:
        pip:
        - langchain
        - qdrant-client
      deployments:
      - name: embedding
        num_replicas: 3
        ray_actor_options:
          num_gpus: 0.5
      - name: generation
        num_replicas: 2
        ray_actor_options:
          num_gpus: 1
Trend 2: KEDA Event-Driven Autoscaling
Scale RAG workloads based on queue depth, custom metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rag-queue-scaler
spec:
  scaleTargetRef:
    name: document-processor
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: doc-processors
      topic: documents-to-index
      lagThreshold: "100"
Trend 3: Multi-Cloud RAG Deployments
Organizations deploying across AWS, GCP, Azure for redundancy:
# Kubernetes Federation v2 (KubeFed)
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: rag-api-federated
  namespace: rag-production
spec:
  template:
    spec: {} # Same deployment spec across all clusters
  placement:
    clusters:
    - name: aws-us-east
    - name: gcp-us-central
    - name: azure-westus
  overrides:
  - clusterName: aws-us-east
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
Conclusion: Building Production RAG Systems on Kubernetes
Deploying RAG systems on Kubernetes provides the scalability, reliability, and cost-efficiency that enterprise AI applications demand. The key takeaways:
Architecture Best Practices:
- Use StatefulSets for vector databases, Deployments for stateless services
- Implement HPA with custom RAG-specific metrics (TTFT, concurrency, retrieval latency)
- Deploy multi-component systems with proper service mesh and load balancing
- For hands-on practice with Kubernetes deployments, check out Kubernetes Hands-on Labs
Cost Optimization:
- Right-size resources based on actual usage patterns
- Use spot/preemptible instances for batch workloads
- Implement model quantization (INT8/INT4) to reduce GPU requirements
- Scale to zero during off-peak hours
Production Readiness:
- Comprehensive monitoring with Prometheus + Grafana
- Distributed tracing with OpenTelemetry
- Network policies and secret management
- Automated backup and disaster recovery
Scaling Strategies:
- Horizontal autoscaling based on query load and latency
- GPU resource quotas and priority classes
- Event-driven processing with Kafka/KEDA
- Multi-region deployments for global applications
The combination of Kubernetes orchestration, modern vector databases, and optimized LLM deployment creates a powerful foundation for production RAG systems that can scale from hundreds to millions of queries while maintaining sub-second latency.