This comprehensive tutorial walks you through deploying and running Ollama, an open-source Large Language Model (LLM) runtime, on a local Kubernetes cluster. While we’ll focus on local deployment using Minikube, the same principles apply to production clusters on EKS, AKS, GKE, or on-premises infrastructure.
Why Run Ollama on Kubernetes?
Privacy and Control
Unlike SaaS-based AI tools like ChatGPT, Google Gemini, or Microsoft Copilot, Ollama runs entirely on your infrastructure. This means:
- Complete data privacy – Your prompts and data never leave your network
- Model flexibility – Choose from dozens of open-source models or train your own
- Cost control – No per-token pricing or API rate limits
- Compliance – Meet strict data residency and security requirements
Why Kubernetes?
Kubernetes offers unique advantages for LLM workloads:
- Resource orchestration – Efficiently manage CPU, memory, and GPU allocation
- Scalability – Easily scale to multiple replicas as demand grows
- High availability – Automatic pod restarts and health monitoring
- Portability – Deploy anywhere from local dev to production cloud clusters
Prerequisites
Before starting, ensure you have the following installed:
- kubectl – Kubernetes command-line tool (installation guide)
- Minikube – Local Kubernetes cluster (installation guide)
- Code editor – VS Code, vim, or your preferred editor
- System resources – At least 8GB RAM and 4 CPU cores available
Optional but recommended:
- Docker Scout or Trivy for container image scanning
- k9s for easier Kubernetes resource management
Understanding Ollama Architecture
Before we dive in, it’s important to understand how Ollama works:
- Ollama Server – The runtime that manages model loading and serving
- Model Files – The actual LLM weights and configuration (downloaded separately)
- API Endpoint – REST API on port 11434 for programmatic access
- CLI Interface – Interactive terminal for testing and debugging
When you deploy Ollama to Kubernetes, you’re deploying the server. Models must be pulled and loaded separately.
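That separation matters for clients too: once deployed, everything happens over the REST API on port 11434. As a minimal sketch of a Python client (assuming the API is reachable at localhost:11434, e.g. via a port-forward as shown later, and that a model has already been pulled), it might look like:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumes `kubectl port-forward` or in-cluster DNS

def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def generate(model: str, prompt: str) -> str:
    """Send the request and return the model's full response text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With a port-forward running, `generate("llama3.2", "What is Kubernetes?")` returns the model's answer as a string.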
Step 1: Start Your Kubernetes Cluster
LLM workloads are resource-intensive. We’ll create a 3-node Minikube cluster with sufficient resources:
```
minikube start --nodes 3 --cpus 4 --memory 8192
```
What this does:
- Creates 3 worker nodes for workload distribution
- Allocates 4 CPUs per node
- Provides 8GB RAM per node
Verify the cluster:
```
kubectl get nodes
```
Expected output:
```
NAME           STATUS   ROLES           AGE   VERSION
minikube       Ready    control-plane   1m    v1.28.3
minikube-m02   Ready    <none>          1m    v1.28.3
minikube-m03   Ready    <none>          1m    v1.28.3
```
Step 2: Create the Ollama Namespace
Namespaces provide logical isolation for your workloads:
```
kubectl create namespace ollama
```
Verify:
```
kubectl get namespaces
```
Step 3: Deploy Ollama with Persistent Storage
We’ll create a complete deployment with persistent storage to ensure models aren’t lost when pods restart.
Create a file named ollama-deployment.yaml:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 15
            periodSeconds: 5
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-storage
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
```
Key components explained:
- PersistentVolumeClaim – 10GB storage for model files that persists across pod restarts
- Resource limits – Prevents Ollama from consuming all cluster resources
- Health probes – Kubernetes automatically restarts unhealthy pods
- Service – Provides a stable network endpoint for accessing Ollama
Deploy:
```
kubectl apply -f ollama-deployment.yaml
```
Verify deployment:
```
kubectl get pods -n ollama -w
```
Wait until the pod shows STATUS: Running and READY: 1/1.
Step 4: Access the Ollama Pod
Get the exact pod name:
```
kubectl get pods -n ollama
```
Example output:
```
NAME                      READY   STATUS    RESTARTS   AGE
ollama-7d9f8c5b6d-k8xjm   1/1     Running   0          2m
```
Access the pod:
```
kubectl -n ollama exec -it ollama-7d9f8c5b6d-k8xjm -- /bin/bash
```
Replace ollama-7d9f8c5b6d-k8xjm with your actual pod name.
Step 5: Pull and Run an LLM Model
Now you’re inside the Ollama container. Let’s verify the installation and pull a model.
Check Ollama version:
```
ollama --version
```
Pull a model:
```
ollama pull llama3.2
```
This downloads the Llama 3.2 model files (~2GB) to /root/.ollama (which is persisted via our PVC).
Understanding pull vs run
- ollama pull llama3.2 – Downloads model files to disk (like downloading a game)
- ollama run llama3.2 – Loads the model into memory AND starts an interactive chat (like launching the game)
Start interactive mode:
```
ollama run llama3.2
```
You'll see a prompt like:
```
>>>
```
Test with a question:
```
>>> What is Kubernetes?
```
The model will respond with an explanation. Type /bye to exit.
Step 6: Access Ollama via API (Production Method)
In production, you won’t exec into pods. Instead, use the API endpoint.
From another terminal, forward the service port:
```
kubectl port-forward -n ollama svc/ollama 11434:11434
```
Test with curl:
```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is Kubernetes useful for ML workloads?",
  "stream": false
}'
```
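With "stream": false the API returns a single JSON object; omit it and the API instead streams newline-delimited JSON chunks, each carrying a response fragment, with the last one marked "done": true. A small Python helper (a sketch of that streaming format) can reassemble them:

```python
import json
from typing import Iterable

def assemble_stream(lines: Iterable[bytes]) -> str:
    """Reassemble the text of a streamed /api/generate reply.

    Each line of the stream is a standalone JSON object carrying a
    'response' fragment; the final one has "done": true and carries
    timing stats instead of more text.
    """
    parts = []
    for line in lines:
        event = json.loads(line)
        parts.append(event.get("response", ""))
        if event.get("done"):
            break
    return "".join(parts)
```

In practice you would feed it the response body line by line as chunks arrive, which is what makes streaming useful for chat UIs.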
Step 7: Expose Ollama (Optional)
For external access, change the Service type to LoadBalancer:
```
kubectl patch svc ollama -n ollama -p '{"spec":{"type":"LoadBalancer"}}'
```
Get the external URL (Minikube):
```
minikube service ollama -n ollama --url
```
Production Considerations
Security Best Practices
- Image scanning:
```
docker scout quickview ollama/ollama:latest
```
- Use specific image tags (not latest):
```
image: ollama/ollama:0.1.44
```
- Apply Network Policies to restrict pod communication
- Use RBAC to limit pod permissions
Resource Management
For production workloads, adjust resources based on your model size:
Small models (1-3B parameters):
```
resources:
  requests:
    memory: "4Gi"
    cpu: "2"
```
Medium models (7-13B parameters):
```
resources:
  requests:
    memory: "16Gi"
    cpu: "8"
```
Large models (30B+ parameters):
```
resources:
  requests:
    memory: "32Gi"
    cpu: "16"
  limits:
    nvidia.com/gpu: "1" # If a GPU is available; GPUs must be requested under limits
```
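The tiers above can be sanity-checked with a rough rule of thumb: 4-bit quantized weights (a common Ollama default) take about half a byte per parameter, fp16 about two bytes, plus overhead for the KV cache and runtime buffers. A hedged sketch (the constants are assumptions, not Ollama-published figures):

```python
def approx_model_memory_gib(params_billions: float,
                            bytes_per_param: float = 0.5,
                            overhead: float = 1.2) -> float:
    """Back-of-the-envelope RAM estimate for serving a model.

    bytes_per_param: ~0.5 for 4-bit quantized weights, ~2.0 for fp16.
    overhead: fudge factor for KV cache and runtime buffers.
    This is a rough heuristic, not a sizing formula.
    """
    return params_billions * 1e9 * bytes_per_param * overhead / 2**30
```

For example, a quantized 7B model lands around 4 GiB, while the same model in fp16 needs roughly four times that, which is why the memory tiers grow so quickly with parameter count.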
High Availability
For production, increase replicas:
```
spec:
  replicas: 3
```
And add pod anti-affinity to distribute across nodes.
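A minimal sketch of that anti-affinity stanza, added under the Deployment's pod template spec (the app: ollama label matches the manifest above; the "preferred" form keeps pods schedulable even when fewer nodes than replicas are available):

```
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: ollama
          topologyKey: kubernetes.io/hostname
```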
Monitoring
Add Prometheus annotations:
```
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "11434"
    prometheus.io/path: "/metrics"
```
Note that Ollama does not currently expose a Prometheus /metrics endpoint out of the box, so pair these annotations with an exporter sidecar or proxy that does.
Cleanup
When you’re done experimenting:
```
# Delete the Ollama deployment
kubectl delete -f ollama-deployment.yaml

# Delete the namespace
kubectl delete namespace ollama

# Stop Minikube
minikube stop

# Delete the cluster (if needed)
minikube delete
```
Next Steps
Now that you have Ollama running on Kubernetes, consider:
- Building a RAG application using your own documents
- Fine-tuning models with domain-specific data
- Creating a chat UI using the Ollama API
- Deploying to cloud Kubernetes (EKS, AKS, GKE)
- Implementing autoscaling based on request volume
- Adding GPU support for faster inference
Conclusion
You’ve successfully deployed Ollama on Kubernetes, gaining complete control over your LLM infrastructure while maintaining privacy and flexibility. This foundation can scale from local experimentation to production workloads serving thousands of requests.
The combination of Kubernetes orchestration and Ollama’s efficient runtime creates a powerful platform for running AI workloads anywhere.