The AI landscape has shifted dramatically. While cloud APIs dominated the early days of LLM adoption, organizations are increasingly asking: How can we run powerful language models on our own infrastructure? The answer lies at the intersection of two complementary technologies: Ollama and Kubernetes.
Why Ollama + Kubernetes? The Perfect Marriage
With the rapid adoption of Large Language Models in enterprise applications, running models locally has become crucial for three compelling reasons:
🔒 Data Privacy: Keep sensitive data within your infrastructure. Every inference runs directly on your own hardware: no external API calls, no data leaving your network.
💰 Cost Efficiency: Eliminate per-token API costs for high-volume applications. One Ollama deployment can serve thousands of requests without recurring charges.
⚡ Low Latency: Local inference means predictable, sub-second response times without the variability of internet round-trips.
Kubernetes amplifies these benefits with auto-scaling for variable workloads and efficient GPU/CPU allocation across your cluster.
1. “How do I deploy Ollama on Kubernetes?”
The fastest path to production uses the community Helm chart:
# Add the Helm repository
helm repo add otwld https://helm.otwld.com/
helm repo update
# Deploy Ollama with a single command
helm install ollama otwld/ollama \
  --namespace ollama \
  --create-namespace
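The chart's options can also be captured in a values file instead of --set flags. A sketch using the GPU and model pre-pull keys referenced later in this article; the persistence key name is an assumption, so verify it against the chart's own values.yaml:

```yaml
# values.yaml for otwld/ollama
ollama:
  gpu:
    enabled: true        # schedule onto GPU nodes
  models:
    pull:                # models downloaded at startup
      - llama3.1
      - codellama
persistentVolume:        # key name assumed; verify against the chart
  enabled: true
  size: 50Gi
```

Apply it with helm install ollama otwld/ollama --namespace ollama --create-namespace -f values.yaml.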
For more control, here’s a production-ready deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "16Gi"
              cpu: "8000m"
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
Pro tip: Always use a PersistentVolumeClaim. Downloaded models persist across pod restarts, saving precious time and bandwidth.
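The Deployment above references a PVC named ollama-pvc without defining it; a minimal sketch (the storage size and access mode are placeholders to adjust for your models and storage class):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama-system
spec:
  accessModes:
    - ReadWriteOnce   # use ReadWriteMany with NFS if multiple pods share models
  resources:
    requests:
      storage: 50Gi   # size for the models you plan to pull
```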
2. “Ollama vs vLLM: Which should I choose for Kubernetes?”
This is perhaps the hottest debate in the LLM serving community. Here’s a comparison drawn from published community benchmarks; absolute numbers vary with hardware and model, but the gap is consistent:
| Aspect | Ollama | vLLM |
|---|---|---|
| Best For | Local dev, prototyping, single-user apps | Production, high-concurrency workloads |
| Peak Throughput | ~41 TPS | ~793 TPS |
| P99 Latency | 673ms at peak | 80ms at peak |
| Setup Complexity | 5 minutes | 30+ minutes |
| Model Support | Curated library, easy downloads | Hugging Face ecosystem |
The Bottom Line:
- Use Ollama when you’re getting started, building prototypes, or running internal AI assistants with moderate traffic
- Use vLLM when you’re serving production workloads with hundreds of concurrent users
Many teams use both: Ollama for development and vLLM for production. The OpenAI-compatible APIs make switching straightforward.
3. “How do I enable GPU support for Ollama in Kubernetes?”
GPU acceleration transforms Ollama performance. Here’s the configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-gpu
  template:
    metadata:
      labels:
        app: ollama-gpu   # must match the selector, or the API server rejects the Deployment
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
Prerequisites:
- Install the NVIDIA GPU Operator in your cluster
- Ensure GPU nodes have proper drivers installed
- Use node selectors to target GPU-enabled nodes
4. “How do I auto-scale Ollama with HPA?”
Horizontal Pod Autoscaler (HPA) configuration for Ollama requires careful tuning because LLM workloads have unique characteristics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ollama-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
Why the 300-second stabilization window? Model loading takes 5-30 seconds depending on size. Aggressive scale-down causes constant model reloading, killing performance.
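For the same reason, freshly scaled-up replicas shouldn't receive traffic until their model is loaded. A sketch of container-level probes for the Ollama Deployment, assuming the default port and Ollama's root endpoint (which returns 200 once the server is up); the thresholds are placeholders:

```yaml
# Probes to add under the ollama container spec
startupProbe:
  httpGet:
    path: /          # Ollama answers "Ollama is running" on its root path
    port: 11434
  failureThreshold: 30   # allow up to 5 minutes for large model loads
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /
    port: 11434
  periodSeconds: 15
```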
5. “Can I scale Ollama to zero to save costs?”
Yes! KEDA (Kubernetes Event-Driven Autoscaling) together with its HTTP Add-on enables scale-to-zero; the add-on's interceptor proxies incoming requests and wakes the deployment on the first call:
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: ollama
  namespace: ollama
spec:
  hosts:
    - ollama.your-domain.com
  scaleTargetRef:
    name: ollama
    kind: Deployment
    apiVersion: apps/v1
    service: ollama-service   # the Service the interceptor routes traffic to
    port: 11434
  replicas:
    min: 0
    max: 5
  scaledownPeriod: 3600   # scale down after 1 hour of inactivity
  scalingMetric:
    requestRate:
      targetValue: 20
This configuration is ideal for GPU workloads: scaling to zero when idle can save thousands of dollars monthly on expensive GPU nodes.
6. “How do I deploy Open WebUI with Ollama on Kubernetes?”
Open WebUI provides a ChatGPT-like interface for Ollama. Here’s the complete stack:
# Ollama Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
# Open WebUI Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama-service:11434"
          volumeMounts:
            - name: webui-data
              mountPath: /app/backend/data
      volumes:
        - name: webui-data
          persistentVolumeClaim:
            claimName: webui-pvc
Key configuration: Set OLLAMA_BASE_URL to point to your Ollama Kubernetes Service.
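The OLLAMA_BASE_URL value above references a Service named ollama-service that the manifests don't define; a minimal sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama        # matches the Ollama Deployment's pod labels
  ports:
    - port: 11434
      targetPort: 11434
```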
7. “How do I expose Ollama externally with Ingress?”
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama-service
                port:
                  number: 11434
Critical: Increase proxy timeouts! LLM inference can take 30+ seconds for long responses.
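An externally exposed LLM API should also terminate TLS. Assuming cert-manager is installed with a ClusterIssuer named letsencrypt-prod (both placeholders for your own setup), TLS is a small addition to the Ingress above:

```yaml
# Additions to the Ingress manifest
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # assumes cert-manager is installed
spec:
  tls:
    - hosts:
        - ollama.yourdomain.com
      secretName: ollama-tls   # cert-manager creates and renews this Secret
```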
8. “How do I monitor Ollama in production?”
Ollama doesn't ship a Prometheus endpoint out of the box, so pair it with a metrics exporter sidecar, then point a Prometheus Operator ServiceMonitor at the exporter's port:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-monitor
  namespace: ollama-system
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
    - port: metrics   # named port on the Service exposing the exporter
      interval: 30s
      path: /metrics
Key metrics to track:
- ollama_request_duration_seconds – inference latency
- ollama_active_requests – current load
- ollama_model_load_duration_seconds – model startup time
- ollama_gpu_utilization_percent – GPU efficiency
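Once those metrics are scraped, a PrometheusRule can alert on latency regressions. A sketch reusing the metric names above, assuming the duration metric is a histogram; the 2-second threshold is a placeholder:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts
  namespace: ollama-system
spec:
  groups:
    - name: ollama
      rules:
        - alert: OllamaHighLatency
          expr: histogram_quantile(0.99, rate(ollama_request_duration_seconds_bucket[5m])) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "P99 inference latency above 2s for 10 minutes"
```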
Production Architecture: The Complete Picture
Here’s a battle-tested architecture for enterprise Ollama deployments:
┌────────────────────────────────────────┐
│                Ingress                 │
│  (NGINX/Traefik + TLS termination)     │
└───────────────────┬────────────────────┘
                    │
┌───────────────────▼────────────────────┐
│              OAuth2 Proxy              │
│            (Authentication)            │
└───────────────────┬────────────────────┘
                    │
┌───────────────────▼────────────────────┐
│          LoadBalancer Service          │
└───────────────────┬────────────────────┘
                    │
      ┌─────────────┼─────────────┐
      │             │             │
┌─────▼──────┐ ┌────▼───────┐ ┌───▼────────┐
│ Ollama Pod │ │ Ollama Pod │ │ Ollama Pod │
│  (GPU 0)   │ │  (GPU 1)   │ │  (GPU 2)   │
└─────┬──────┘ └────┬───────┘ └───┬────────┘
      │             │             │
      └─────────────┼─────────────┘
                    │
┌───────────────────▼────────────────────┐
│          Shared NFS Storage            │
│     (Models persist across pods)       │
└────────────────────────────────────────┘
Common Pitfalls and Solutions
❌ Pitfall: Models Re-download on Every Pod Restart
Solution: Use a ReadWriteMany PVC backed by NFS or other shared storage so all pods share the same model cache.
❌ Pitfall: HPA Thrashing (Constant Scale Up/Down)
Solution: Set stabilizationWindowSeconds: 300 for scale-down. Model loading is expensive.
❌ Pitfall: OOM Kills During Large Model Loads
Solution: Set memory limits to roughly 2x the model's in-memory size. Llama 3 70B needs ~140GB at full FP16 precision, but only ~40GB with the 4-bit quantization Ollama uses by default.
❌ Pitfall: Slow First Response After Scale-Up
Solution: Use init containers to pre-pull models, or configure ollama.models.pull in Helm values.
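One way to sketch the init-container approach: share the model PVC with a throwaway Ollama server that pulls the model before the main container starts. The model name and wait time are placeholders; adjust for your models:

```yaml
# Added under the pod spec of the Ollama Deployment
initContainers:
  - name: model-prepull
    image: ollama/ollama:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        # start a temporary server in the background, wait, then pull the model
        ollama serve &
        sleep 5
        ollama pull llama3.1
    volumeMounts:
      - name: ollama-data     # same PVC-backed volume the main container mounts
        mountPath: /root/.ollama
```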
The Future: Ollama in the Multi-Agent Era
As AI evolves toward multi-agent architectures, Ollama on Kubernetes becomes even more powerful. Imagine:
- Agent pools: Different Ollama instances running specialized models (code, analysis, creative)
- Model routing: Intelligent traffic direction based on query type
- Federated inference: Workloads distributed across edge and cloud clusters
The containerized, Kubernetes-native approach positions teams perfectly for this agentic future.
Quick Start Commands
# Create namespace
kubectl create namespace ollama-system
# Deploy with Helm
helm install ollama otwld/ollama \
  --namespace ollama-system \
  --set ollama.gpu.enabled=true \
  --set 'ollama.models.pull={llama3.1,codellama}'   # quote to prevent shell brace expansion
# Verify deployment
kubectl get pods -n ollama-system
# Port-forward for local testing
kubectl port-forward svc/ollama 11434:11434 -n ollama-system
# Test the API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "Hello, Kubernetes!", "stream": false}'
Conclusion
Running Ollama on Kubernetes isn’t just about infrastructure: it’s about taking control of your AI destiny. You gain:
- ✅ Data sovereignty: Your data never leaves your infrastructure
- ✅ Cost predictability: No surprise API bills
- ✅ Performance control: Tune latency and throughput to your needs
- ✅ Scale flexibility: From development laptop to enterprise GPU cluster
Whether you’re building internal AI assistants, powering customer-facing applications, or experimenting with the latest open-source models, the Ollama + Kubernetes combination provides a robust foundation for the AI-driven future.