
Running Ollama LLM on Kubernetes: A Complete Guide

This comprehensive tutorial walks you through deploying and running Ollama, an open-source Large Language Model (LLM) runtime, on a local Kubernetes cluster. While we’ll focus on local deployment using Minikube, the same principles apply to production clusters on EKS, AKS, GKE, or on-premises infrastructure.

Why Run Ollama on Kubernetes?

Privacy and Control

Unlike SaaS-based AI tools like ChatGPT, Google Gemini, or Microsoft Copilot, Ollama runs entirely on your infrastructure. This means:

  • Complete data privacy – Your prompts and data never leave your network
  • Model flexibility – Choose from dozens of open-source models or train your own
  • Cost control – No per-token pricing or API rate limits
  • Compliance – Meet strict data residency and security requirements

Why Kubernetes?

Kubernetes offers unique advantages for LLM workloads:

  • Resource orchestration – Efficiently manage CPU, memory, and GPU allocation
  • Scalability – Easily scale to multiple replicas as demand grows
  • High availability – Automatic pod restarts and health monitoring
  • Portability – Deploy anywhere from local dev to production cloud clusters

Prerequisites

Before starting, ensure you have the following installed:

  1. kubectl – Kubernetes command-line tool (installation guide)
  2. Minikube – Local Kubernetes cluster (installation guide)
  3. Code editor – VS Code, vim, or your preferred editor
  4. System resources – At least 24GB RAM and 12 CPU cores free (the 3-node cluster below allocates 8GB RAM and 4 CPUs per node)

Optional but recommended:

  • Docker Scout or Trivy for container image scanning
  • k9s for easier Kubernetes resource management

Understanding Ollama Architecture

Before we dive in, it’s important to understand how Ollama works:

  1. Ollama Server – The runtime that manages model loading and serving
  2. Model Files – The actual LLM weights and configuration (downloaded separately)
  3. API Endpoint – REST API on port 11434 for programmatic access
  4. CLI Interface – Interactive terminal for testing and debugging

When you deploy Ollama to Kubernetes, you’re deploying the server. Models must be pulled and loaded separately.

Step 1: Start Your Kubernetes Cluster

LLM workloads are resource-intensive. We’ll create a 3-node Minikube cluster with sufficient resources:

minikube start --nodes 3 --cpus 4 --memory 8192

What this does:

  • Creates a 3-node cluster – one control-plane node (which also schedules workloads in Minikube) plus two workers
  • Allocates 4 CPUs per node
  • Provides 8GB RAM per node

Verify the cluster:

kubectl get nodes

Expected output:

NAME           STATUS   ROLES           AGE   VERSION
minikube       Ready    control-plane   1m    v1.28.3
minikube-m02   Ready    <none>          1m    v1.28.3
minikube-m03   Ready    <none>          1m    v1.28.3

Step 2: Create the Ollama Namespace

Namespaces provide logical isolation for your workloads:

kubectl create namespace ollama

Verify:

kubectl get namespaces

Step 3: Deploy Ollama with Persistent Storage

We’ll create a complete deployment with persistent storage to ensure models aren’t lost when pods restart.

Create a file named ollama-deployment.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 15
          periodSeconds: 5
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-storage
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434

Key components explained:

  • PersistentVolumeClaim – 10GB storage for model files that persists across pod restarts
  • Resource limits – Prevents Ollama from consuming all cluster resources
  • Health probes – Kubernetes automatically restarts unhealthy pods
  • Service – Provides a stable network endpoint for accessing Ollama

Deploy:

kubectl apply -f ollama-deployment.yaml

Verify deployment:

kubectl get pods -n ollama -w

Wait until the pod shows STATUS: Running and READY: 1/1.

Step 4: Access the Ollama Pod

Get the exact pod name:

kubectl get pods -n ollama

Output example:

NAME                      READY   STATUS    RESTARTS   AGE
ollama-7d9f8c5b6d-k8xjm   1/1     Running   0          2m

Access the pod:

kubectl -n ollama exec -it ollama-7d9f8c5b6d-k8xjm -- /bin/bash

Replace ollama-7d9f8c5b6d-k8xjm with your actual pod name.

Step 5: Pull and Run an LLM Model

Now you’re inside the Ollama container. Let’s verify the installation and pull a model.

Check Ollama version:

ollama --version

Pull a model:

ollama pull llama3.2

This downloads the Llama 3.2 model files (~2GB) to /root/.ollama (which is persisted via our PVC).

Understanding pull vs run

  • ollama pull llama3.2 – Downloads model files to disk (like downloading a game)
  • ollama run llama3.2 – Loads model into memory AND starts interactive chat (like launching the game)

Start interactive mode:

ollama run llama3.2

You’ll see a prompt like:

>>> 

Test with a question:

>>> What is Kubernetes?

The model will respond with an explanation. Type /bye to exit.

Step 6: Access Ollama via API (Production Method)

In production, you won’t exec into pods. Instead, use the API endpoint.

From another terminal, forward the service port:

kubectl port-forward -n ollama svc/ollama 11434:11434

Test with curl:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is Kubernetes useful for ML workloads?",
  "stream": false
}'
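
The same request can be made from code instead of curl. Here is a minimal sketch using only Python's standard library, assuming the port-forward above is active and llama3.2 has already been pulled; `build_generate_request` and `generate` are illustrative helper names, not part of any official Ollama SDK:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumes kubectl port-forward is running

def build_generate_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize a JSON payload for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(model: str, prompt: str) -> str:
    """Send one non-streaming generation request and return the response text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("llama3.2", "Why is Kubernetes useful for ML workloads?"))
```

With `"stream": false` the server returns a single JSON object whose `response` field holds the full completion; set it to true to receive newline-delimited JSON chunks instead.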

Step 7: Expose Ollama (Optional)

For external access, change the Service type to LoadBalancer:

kubectl patch svc ollama -n ollama -p '{"spec":{"type":"LoadBalancer"}}'

Get external IP (Minikube):

minikube service ollama -n ollama --url

Production Considerations

Security Best Practices

  1. Image scanning:

docker scout quickview ollama/ollama:latest

  2. Use specific image tags (not latest):

image: ollama/ollama:0.1.44

  3. Apply Network Policies to restrict pod communication
  4. Use RBAC to limit pod permissions

Resource Management

For production workloads, adjust resources based on your model size:

Small models (1-3B parameters):

resources:
  requests:
    memory: "4Gi"
    cpu: "2"

Medium models (7-13B parameters):

resources:
  requests:
    memory: "16Gi"
    cpu: "8"

Large models (30B+ parameters):

resources:
  requests:
    memory: "32Gi"
    cpu: "16"
  limits:
    nvidia.com/gpu: "1"  # GPUs are an extended resource and must be set in limits, if available
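
These sizing tiers follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus headroom for the KV cache and runtime. A back-of-the-envelope estimator (the 0.5 bytes/parameter figure assumes 4-bit quantization, common for Ollama models; the 1.5× headroom factor is an assumption, not an official formula):

```python
def estimate_memory_gib(params_billion: float,
                        bytes_per_param: float = 0.5,
                        overhead: float = 1.5) -> float:
    """Rough RAM estimate: weights (params × bytes/param) scaled by runtime/KV-cache headroom."""
    weights_gib = params_billion * 1e9 * bytes_per_param / 2**30
    return weights_gib * overhead

# A 7B model at 4-bit quantization works out to roughly 5 GiB:
print(round(estimate_memory_gib(7), 1))
```

Actual usage varies with context length and quantization level, so treat the output as a starting point and validate with real metrics before setting requests.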

High Availability

For production, increase replicas:

spec:
  replicas: 3

And add pod anti-affinity to distribute across nodes.
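
One way to express that is a podAntiAffinity rule keyed on the same app: ollama label the Deployment already uses. A sketch to merge into the pod template (preferred rather than required scheduling keeps pods schedulable even on a small cluster; switch to requiredDuringSchedulingIgnoredDuringExecution for a hard guarantee):

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: ollama
              topologyKey: kubernetes.io/hostname
```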

Monitoring

Add Prometheus scrape annotations (note that Ollama doesn’t expose a native /metrics endpoint, so pair these with a metrics exporter sidecar):

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "11434"
    prometheus.io/path: "/metrics"

Cleanup

When you’re done experimenting:

# Delete Ollama deployment
kubectl delete -f ollama-deployment.yaml

# Delete namespace
kubectl delete namespace ollama

# Stop Minikube
minikube stop

# Delete cluster (if needed)
minikube delete

Next Steps

Now that you have Ollama running on Kubernetes, consider:

  1. Building a RAG application using your own documents
  2. Fine-tuning models with domain-specific data
  3. Creating a chat UI using the Ollama API
  4. Deploying to cloud Kubernetes (EKS, AKS, GKE)
  5. Implementing autoscaling based on request volume
  6. Adding GPU support for faster inference

Conclusion

You’ve successfully deployed Ollama on Kubernetes, gaining complete control over your LLM infrastructure while maintaining privacy and flexibility. This foundation can scale from local experimentation to production workloads serving thousands of requests.

The combination of Kubernetes orchestration and Ollama’s efficient runtime creates a powerful platform for running AI workloads anywhere.
