
Top Ollama + Kubernetes Questions Answered


The AI landscape has shifted dramatically. While cloud APIs dominated the early days of LLM adoption, organizations are increasingly asking: how can we run powerful language models on our own infrastructure? The answer lies at the intersection of two complementary technologies: Ollama and Kubernetes.


Why Ollama + Kubernetes? The Perfect Marriage

With the rapid adoption of Large Language Models in enterprise applications, running models locally has become crucial for three compelling reasons:

🔒 Data Privacy: Keep sensitive data within your infrastructure. Every inference runs directly on your own hardware: no external API calls, no data leaving your network.

💰 Cost Efficiency: Eliminate per-token API costs for high-volume applications. One Ollama deployment can serve thousands of requests without recurring charges.

⚡ Low Latency: Local inference means predictable, sub-second response times without the variability of internet round-trips.

Kubernetes amplifies these benefits with auto-scaling for variable workloads and efficient GPU/CPU allocation across your cluster.


1. “How do I deploy Ollama on Kubernetes?”

The fastest path to production uses the community Helm chart:

# Add the Helm repository
helm repo add otwld https://helm.otwld.com/
helm repo update

# Deploy Ollama with a single command
helm install ollama otwld/ollama \
  --namespace ollama \
  --create-namespace

For more control, here’s a production-ready deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "16Gi"
            cpu: "8000m"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

Pro tip: Always use a PersistentVolumeClaim. Downloaded models persist across pod restarts, saving precious time and bandwidth.
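The manifest above references an ollama-pvc claim that must exist before the Deployment schedules. A minimal sketch (the storageClassName and size are assumptions to adjust for your cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama-system
spec:
  accessModes:
    - ReadWriteOnce      # use ReadWriteMany if several pods share one model cache
  storageClassName: standard   # assumption: replace with your cluster's storage class
  resources:
    requests:
      storage: 50Gi      # large models (e.g. 70B) need considerably more
```

Size the claim for the sum of all models you plan to pull, not just the largest one.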


2. “Ollama vs vLLM: Which should I choose for Kubernetes?”

This is perhaps the hottest debate in the LLM serving community. Here’s how the two compare in representative community benchmarks:

| Aspect | Ollama | vLLM |
|---|---|---|
| Best For | Local dev, prototyping, single-user apps | Production, high-concurrency workloads |
| Peak Throughput | ~41 TPS | ~793 TPS |
| P99 Latency | 673 ms at peak | 80 ms at peak |
| Setup Complexity | 5 minutes | 30+ minutes |
| Model Support | Curated library, easy downloads | Hugging Face ecosystem |

The Bottom Line:

  • Use Ollama when you’re getting started, building prototypes, or running internal AI assistants with moderate traffic
  • Use vLLM when you’re serving production workloads with hundreds of concurrent users

Many teams use both: Ollama for development and vLLM for production. The OpenAI-compatible APIs make switching straightforward.
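To illustrate how interchangeable the two are: both expose an OpenAI-style /v1/chat/completions endpoint, so only the base URL and model name change between environments (ports and model names below are typical defaults, not requirements):

```shell
# Same request shape works against Ollama (default port 11434)
# or a vLLM server (commonly port 8000) — swap the base URL.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```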


3. “How do I enable GPU support for Ollama in Kubernetes?”

GPU acceleration transforms Ollama performance. Here’s the configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-gpu
  template:
    metadata:
      labels:
        app: ollama-gpu
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-t4
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Prerequisites:

  1. Install the NVIDIA GPU Operator in your cluster
  2. Ensure GPU nodes have proper drivers installed
  3. Use node selectors to target GPU-enabled nodes
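If the GPU Operator isn’t already in place, the official NVIDIA Helm chart installs the drivers, container toolkit, and device plugin together (the namespace name here is conventional, not required):

```shell
# Install the NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace

# Verify GPUs are schedulable on your nodes
kubectl describe nodes | grep -A 2 'nvidia.com/gpu'
```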

4. “How do I auto-scale Ollama with HPA?”

Horizontal Pod Autoscaler (HPA) configuration for Ollama requires careful tuning because LLM workloads have unique characteristics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ollama-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

Why the 300-second stabilization window? Model loading takes 5-30 seconds depending on size. Aggressive scale-down causes constant model reloading, killing performance.


5. “Can I scale Ollama to zero to save costs?”

Yes! KEDA (Kubernetes Event-Driven Autoscaling) enables scale-to-zero:

apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: ollama
  namespace: ollama
spec:
  hosts:
    - ollama.your-domain.com
  scaleTargetRef:
    name: ollama
    kind: Deployment
    apiVersion: apps/v1
  replicas:
    min: 0
    max: 5
  scaledownPeriod: 3600  # Scale down after 1 hour of inactivity
  scalingMetric:
    requestRate:
      targetValue: 20

This configuration is ideal for GPU workloads: scaling to zero when idle can save thousands of dollars monthly on expensive GPU nodes.
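The HTTPScaledObject resource comes from the KEDA HTTP add-on, which must be installed alongside KEDA itself, for example via the official Helm charts:

```shell
# Install KEDA and its HTTP add-on
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
helm install http-add-on kedacore/keda-add-ons-http --namespace keda
```

Note that scale-from-zero only works for traffic routed through the add-on’s interceptor proxy; requests that hit the Ollama Service directly bypass the scaler.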


6. “How do I deploy Open WebUI with Ollama on Kubernetes?”

Open WebUI provides a ChatGPT-like interface for Ollama. Here’s the complete stack:

# Ollama Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
# Open WebUI Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama-service:11434"
        volumeMounts:
        - name: webui-data
          mountPath: /app/backend/data
      volumes:
      - name: webui-data
        persistentVolumeClaim:
          claimName: webui-pvc

Key configuration: Set OLLAMA_BASE_URL to point to your Ollama Kubernetes Service so Open WebUI can reach the Ollama API inside the cluster.
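The ollama-service name referenced in OLLAMA_BASE_URL has to exist as a Service selecting the Ollama pods; a minimal sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama          # must match the Ollama Deployment's pod labels
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
```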


7. “How do I expose Ollama externally with Ingress?”

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
  - host: ollama.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434

Critical: Increase proxy timeouts! LLM inference can take 30+ seconds for long responses.


8. “How do I monitor Ollama in production?”

Ollama doesn’t expose a Prometheus /metrics endpoint out of the box, so comprehensive observability typically means running a metrics exporter sidecar and wiring it into the Prometheus Operator with a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-monitor
  namespace: ollama-system
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Key metrics to track (exact names depend on the exporter you use):

  • ollama_request_duration_seconds – Inference latency
  • ollama_active_requests – Current load
  • ollama_model_load_duration_seconds – Model startup time
  • ollama_gpu_utilization_percent – GPU efficiency

Production Architecture: The Complete Picture

Here’s a battle-tested architecture for enterprise Ollama deployments:

                ┌───────────────────────────────────┐
                │              Ingress              │
                │ (NGINX/Traefik + TLS termination) │
                └─────────────────┬─────────────────┘
                                  │
                ┌─────────────────▼─────────────────┐
                │           OAuth2 Proxy            │
                │          (Authentication)         │
                └─────────────────┬─────────────────┘
                                  │
                ┌─────────────────▼─────────────────┐
                │       LoadBalancer Service        │
                └─────────────────┬─────────────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        │                         │                         │
┌───────▼───────┐        ┌────────▼────────┐       ┌────────▼────────┐
│  Ollama Pod   │        │   Ollama Pod    │       │   Ollama Pod    │
│    (GPU 0)    │        │     (GPU 1)     │       │     (GPU 2)     │
└───────┬───────┘        └────────┬────────┘       └────────┬────────┘
        │                         │                         │
        └─────────────────────────┴─────────────────────────┘
                                  │
                ┌─────────────────▼─────────────────┐
                │        Shared NFS Storage         │
                │   (Models persist across pods)    │
                └───────────────────────────────────┘

Common Pitfalls and Solutions

❌ Pitfall: Models Re-download on Every Pod Restart

Solution: Use a ReadWriteMany PVC backed by NFS or other shared storage so that all pods share the same model cache.

❌ Pitfall: HPA Thrashing (Constant Scale Up/Down)

Solution: Set stabilizationWindowSeconds: 300 for scale-down. Model loading is expensive.

❌ Pitfall: OOM Kills During Large Model Loads

Solution: Set memory limits to roughly 2x the model’s on-disk size. Ollama’s default 4-bit quantization of Llama 3 70B weighs in around 40GB; unquantized fp16 weights need ~140GB.

❌ Pitfall: Slow First Response After Scale-Up

Solution: Use init containers to pre-pull models, or configure ollama.models.pull in Helm values.
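One way to sketch the init-container approach: start a temporary Ollama server inside the init container, pull the model into the shared volume, then exit (the sleep duration and model name are assumptions to tune):

```yaml
      initContainers:
      - name: model-pull
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            # Start a temporary server, wait for it, pull the model, then exit
            ollama serve &
            sleep 5
            ollama pull llama3.1
        volumeMounts:
        - name: ollama-data        # same PVC the main container mounts
          mountPath: /root/.ollama
```

Because the model lands on the shared volume, the main container starts with the weights already cached.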


The Future: Ollama in the Multi-Agent Era

As AI evolves toward multi-agent architectures, Ollama on Kubernetes becomes even more powerful. Imagine:

  • Agent pools: Different Ollama instances running specialized models (code, analysis, creative)
  • Model routing: Intelligent traffic direction based on query type
  • Federated inference: Workloads distributed across edge and cloud clusters

The containerized, Kubernetes-native approach positions teams perfectly for this agentic future.


Quick Start Commands

# Create namespace
kubectl create namespace ollama-system

# Deploy with Helm
helm install ollama otwld/ollama \
  --namespace ollama-system \
  --set ollama.gpu.enabled=true \
  --set 'ollama.models.pull={llama3.1,codellama}'

# Verify deployment
kubectl get pods -n ollama-system

# Port-forward for local testing
kubectl port-forward svc/ollama 11434:11434 -n ollama-system

# Test the API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "Hello, Kubernetes!"}'

Conclusion

Running Ollama on Kubernetes isn’t just about infrastructure; it’s about taking control of your AI destiny. You gain:

✅ Data sovereignty: Your data never leaves your infrastructure
✅ Cost predictability: No surprise API bills
✅ Performance control: Tune latency and throughput to your needs
✅ Scale flexibility: From development laptop to enterprise GPU cluster

Whether you’re building internal AI assistants, powering customer-facing applications, or experimenting with the latest open-source models, the Ollama + Kubernetes combination provides a robust foundation for the AI-driven future.
