
Building Production AI Workflows: Kubernetes + n8n + Ollama

The Challenge: Bridging AI Capabilities and Production Reality

Organizations adopting large language models face a critical gap: running an LLM locally is straightforward, but orchestrating AI workflows at scale requires production infrastructure. This guide shows how Kubernetes, n8n, and Ollama converge to solve this challenge, creating an enterprise-ready platform that scales from prototype to production.

What you’ll build: A production-ready AI automation platform that processes documents, enriches data, and triggers intelligent workflows – all self-hosted, scalable, and cost-effective.

Architecture Overview

Let’s visualize how these components work together:

graph TB
    subgraph "Kubernetes Cluster"
        subgraph "Workflow Layer"
            N8N[n8n Workflow Engine<br/>- Orchestration<br/>- Triggers & Webhooks]
        end
        
        subgraph "AI Inference Layer"
            OLLAMA[Ollama Service<br/>- LLM Runtime<br/>- Model Management]
            GPU[GPU Node Pool<br/>- Dedicated Resources]
        end
        
        subgraph "Storage Layer"
            PVC[Persistent Volumes<br/>- Model Storage<br/>- Workflow State]
        end
        
        subgraph "Ingress & Networking"
            INGRESS[Ingress Controller<br/>- API Gateway]
        end
    end
    
    EXT[External Triggers<br/>Webhooks, Cron, API] --> INGRESS
    INGRESS --> N8N
    N8N --> OLLAMA
    OLLAMA --> GPU
    OLLAMA --> PVC
    N8N --> PVC

Key Design Principles:

  • Separation of concerns: Workflow orchestration separated from AI inference
  • Horizontal scalability: Each component scales independently
  • Resource efficiency: GPU resources allocated only to inference workloads
  • State persistence: Workflows and models survive pod restarts

Component Breakdown

Ollama: Your Self-Hosted LLM Runtime

Ollama provides the inference engine, making models like Llama, Mistral, or Phi accessible via a simple API. In production, we deploy it as a StatefulSet for stable network identity and persistent storage.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ai-platform
spec:
  serviceName: ollama
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        workload-type: gpu
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: api
        resources:
          requests:
            memory: "8Gi"
            nvidia.com/gpu: 1
          limits:
            memory: "16Gi"
            nvidia.com/gpu: 1
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
        env:
        - name: OLLAMA_NUM_PARALLEL
          value: "4"
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "2"
  volumeClaimTemplates:
  - metadata:
      name: models
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi

Production considerations:

  • OLLAMA_NUM_PARALLEL: Controls concurrent requests per instance
  • OLLAMA_MAX_LOADED_MODELS: Limits memory consumption
  • Node selector ensures pods schedule onto GPU-enabled nodes
  • Persistent volumes prevent model re-downloads on restarts
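Once the StatefulSet is up, you can exercise Ollama's /api/generate endpoint directly. The sketch below uses only the standard library and assumes a ClusterIP Service named ollama-service in front of the StatefulSet (a Service manifest is not shown above) or a local port-forward; adjust the base URL and model name to your deployment.

```python
import json
import urllib.request

# Assumed endpoint: in-cluster DNS, or http://localhost:11434 via port-forward
OLLAMA_URL = "http://ollama-service.ai-platform:11434"

def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    # stream=False returns one complete JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """POST a prompt to Ollama and return the generated text."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        # Non-streaming responses carry the text in the "response" field
        return json.loads(resp.read())["response"]

payload = build_generate_request("llama3.2", "Summarize: Kubernetes schedules pods.")
print(json.dumps(payload))
```

Calling `generate("llama3.2", "...")` against a live instance returns the model's completion; keep the timeout generous, since the first request after a restart also pays the model-load cost.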

n8n: Workflow Orchestration at Scale

n8n acts as your AI automation hub, connecting triggers to AI processing and actions. Deploy it with queue mode for distributed processing.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-worker
  namespace: ai-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: n8n
      component: worker
  template:
    metadata:
      labels:
        app: n8n
        component: worker
    spec:
      containers:
      - name: n8n
        image: n8nio/n8n:latest
        env:
        - name: EXECUTIONS_MODE
          value: "queue"
        - name: QUEUE_BULL_REDIS_HOST
          value: "redis-service"
        - name: N8N_ENCRYPTION_KEY
          valueFrom:
            secretKeyRef:
              name: n8n-secrets
              key: encryption-key
        - name: DB_TYPE
          value: "postgresdb"
        - name: DB_POSTGRESDB_HOST
          value: "postgres-service"
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: n8n-service
  namespace: ai-platform
spec:
  type: ClusterIP
  ports:
  - port: 5678
    targetPort: 5678
    protocol: TCP
  selector:
    app: n8n
    component: worker

Architecture notes:

  • Queue mode enables horizontal scaling of workflow execution
  • Redis handles job distribution across workers
  • PostgreSQL ensures workflow state persistence
  • Secrets management for sensitive credentials
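Queue mode presumes Redis and PostgreSQL are already running; their manifests are not shown above. A minimal Redis pairing that matches the redis-service hostname used by the workers might look like this (no persistence or auth, illustration only):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: ai-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis-service
  namespace: ai-platform
spec:
  selector:
    app: redis
  ports:
  - port: 6379
    targetPort: 6379
```

For production you would add persistence, authentication, and likely a managed or HA Redis instead of a single replica.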

Production Workflow Pattern

Here’s how a production AI workflow executes across the platform:

sequenceDiagram
    participant Client
    participant Ingress
    participant n8n
    participant Redis
    participant Worker
    participant Ollama
    participant Storage

    Client->>Ingress: POST /webhook/process-document
    Ingress->>n8n: Route request
    n8n->>Redis: Queue workflow job
    Redis->>Worker: Assign to available worker
    
    Worker->>Storage: Fetch document
    Worker->>Ollama: POST /api/generate<br/>{prompt: analyze_document}
    Ollama->>Ollama: Load model (if not cached)
    Ollama->>Worker: Return analysis
    
    Worker->>Storage: Save enriched data
    Worker->>Redis: Mark job complete
    Redis->>n8n: Update workflow status
    n8n->>Client: Webhook response
ClientIngressn8nRedisWorkerOllamaStoragePOST /webhook/process-documentRoute requestQueue workflow jobAssign to available workerFetch documentPOST /api/generate{prompt: analyze_document}Load model (if not cached)Return analysisSave enriched dataMark job completeUpdate workflow statusWebhook responseClientIngressn8nRedisWorkerOllamaStorage

Real-World Workflow Example

Let’s build an intelligent document processing pipeline that beginners can deploy and pros can extend:

# n8n workflow configuration
{
  "nodes": [
    {
      "name": "Webhook Trigger",
      "type": "n8n-nodes-base.webhook",
      "parameters": {
        "path": "process-document",
        "httpMethod": "POST"
      }
    },
    {
      "name": "Extract Text",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "http://tika-service:9998/tika",
        "method": "PUT"
      }
    },
    {
      "name": "AI Analysis",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "http://ollama-service:11434/api/generate",
        "method": "POST",
        "bodyParameters": {
          "model": "llama3.2",
          "prompt": "Analyze this document and extract: key topics, sentiment, action items, and summary. Document: {{$json.text}}",
          "stream": false
        }
      }
    },
    {
      "name": "Store Results",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "insert",
        "table": "document_analysis"
      }
    }
  ]
}
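The AI Analysis node interpolates {{$json.text}} into the prompt at run time. It can be useful to reproduce that substitution outside n8n when iterating on prompts; `render_prompt` below is a hypothetical helper that mimics n8n's expression syntax for flat fields, not part of n8n itself.

```python
import re

# Prompt template copied from the AI Analysis node above
PROMPT_TEMPLATE = (
    "Analyze this document and extract: key topics, sentiment, "
    "action items, and summary. Document: {{$json.text}}"
)

def render_prompt(template: str, item: dict) -> str:
    """Mimic n8n's {{$json.field}} interpolation for flat fields."""
    def sub(match: re.Match) -> str:
        return str(item.get(match.group(1), ""))
    return re.sub(r"\{\{\$json\.(\w+)\}\}", sub, template)

rendered = render_prompt(PROMPT_TEMPLATE, {"text": "Q3 revenue grew 12%."})
print(rendered)
```

This lets you test prompt wording against Ollama directly before wiring it into the workflow.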

Scaling Strategy

As your workload grows, scale each component independently:

graph LR
    subgraph "Development"
        D1[1 n8n pod<br/>1 Ollama pod<br/>Single node]
    end
    
    subgraph "Production"
        P1[3 n8n workers<br/>2 Ollama instances<br/>GPU node pool]
    end
    
    subgraph "Enterprise"
        E1[10+ n8n workers<br/>5+ Ollama instances<br/>Multi-region<br/>Auto-scaling]
    end
    
    D1 -->|Add resources| P1
    P1 -->|Add automation| E1

Scaling commands:

# Scale n8n workers for more throughput
kubectl scale deployment n8n-worker --replicas=5 -n ai-platform

# Scale Ollama for more concurrent AI requests
kubectl scale statefulset ollama --replicas=3 -n ai-platform

# Enable horizontal pod autoscaling
kubectl autoscale deployment n8n-worker \
  --cpu-percent=70 \
  --min=3 \
  --max=10 \
  -n ai-platform
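The kubectl autoscale command above can also be captured declaratively, which is preferable for GitOps-style deployments. A roughly equivalent autoscaling/v2 HorizontalPodAutoscaler manifest (same 70% CPU target, 3-10 replicas) would be:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Note that CPU is a rough proxy for n8n worker load; scaling on queue depth via a custom metric (or KEDA) tracks real demand more closely.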

Resource Management

Production LLM deployments require careful resource allocation:

# Resource quota for AI namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-platform-quota
  namespace: ai-platform
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    requests.nvidia.com/gpu: "4"
    persistentvolumeclaims: "10"
---
# Limit ranges for safety
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-platform-limits
  namespace: ai-platform
spec:
  limits:
  - max:
      memory: 32Gi
      cpu: "8"
    min:
      memory: 512Mi
      cpu: 100m
    type: Container

Monitoring and Observability

Track AI workflow health with these metrics:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'ollama'
      static_configs:
      - targets: ['ollama-service.ai-platform:11434']
      metrics_path: '/metrics'
    
    - job_name: 'n8n'
      static_configs:
      - targets: ['n8n-service.ai-platform:5678']
      metrics_path: '/metrics'

Note that n8n exposes /metrics only when N8N_METRICS=true is set, and Ollama does not ship a Prometheus endpoint out of the box, so plan for an exporter sidecar or proxy in front of it. Key metrics to monitor:

  • Model load time and memory usage (Ollama)
  • Request latency and queue depth (n8n)
  • GPU utilization and throttling
  • Storage I/O and capacity

Security Hardening

# Network policy: Restrict Ollama access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-network-policy
  namespace: ai-platform
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: n8n
    ports:
    - protocol: TCP
      port: 11434

Security checklist:

  • Network policies isolate AI services
  • Secrets management for API keys and credentials
  • RBAC controls for namespace access
  • Pod security standards enforce container best practices

Cost Optimization

Running this stack efficiently:

| Component    | Development          | Production           | Enterprise           |
|--------------|----------------------|----------------------|----------------------|
| n8n          | 1 pod (0.5 CPU, 1GB) | 3 pods (1.5 CPU, 3GB)| 10 pods + auto-scale |
| Ollama       | 1 pod (1 GPU, 8GB)   | 2 pods (2 GPU, 16GB) | 5 pods (5 GPU, 40GB) |
| Storage      | 50GB SSD             | 200GB SSD            | 1TB+ NVMe            |
| Monthly cost | ~$100                | ~$500                | ~$2000               |

Compare this to cloud AI APIs: at $0.002-0.06 per 1K tokens, heavy usage quickly costs more than the self-hosted stack.
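A quick back-of-the-envelope check makes that comparison concrete. The figures below are the illustrative prices from the table and API range above, not a quote:

```python
def breakeven_tokens(monthly_cost_usd: float, api_price_per_1k: float) -> float:
    """Monthly token volume at which a cloud API bill matches a fixed self-hosted cost."""
    return monthly_cost_usd / api_price_per_1k * 1000

# Production tier (~$500/month) vs a mid-range API price of $0.01 per 1K tokens
tokens = breakeven_tokens(500, 0.01)
print(f"Break-even at {tokens / 1e6:.0f}M tokens/month")  # → Break-even at 50M tokens/month
```

Above roughly 50M tokens per month at that price point, the fixed self-hosted cost wins; below it, pay-per-token APIs may still be cheaper, before counting data-privacy and latency considerations.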

Getting Started

Deploy the complete stack:

# Create namespace
kubectl create namespace ai-platform

# Deploy Ollama
kubectl apply -f ollama-statefulset.yaml

# Wait for Ollama to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n ai-platform --timeout=300s

# Load your first model
kubectl exec -it ollama-0 -n ai-platform -- ollama pull llama3.2

# Deploy n8n with dependencies
kubectl apply -f n8n-deployment.yaml

# Expose n8n UI
kubectl port-forward svc/n8n-service 5678:5678 -n ai-platform

Access n8n at http://localhost:5678 and start building workflows.
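With the port-forward running, you can smoke-test the document pipeline end to end. This sketch assumes the "process-document" webhook workflow from the example above has been created and activated in n8n:

```python
import json
import urllib.request

# Assumes: kubectl port-forward svc/n8n-service 5678:5678 -n ai-platform is running,
# and the "process-document" webhook workflow is active in n8n.
WEBHOOK_URL = "http://localhost:5678/webhook/process-document"

def build_payload(text: str) -> bytes:
    """Serialize a document body as JSON, the shape the example workflow expects."""
    return json.dumps({"text": text}).encode()

def post_document(text: str, url: str = WEBHOOK_URL) -> dict:
    """POST a document to the n8n webhook and return the workflow's JSON reply."""
    req = urllib.request.Request(
        url,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

# Requires the live stack:
# post_document("Quarterly report: revenue up 12%, churn down 2%.")
print(WEBHOOK_URL)
```

A successful call exercises the whole chain: ingress, the n8n queue, a worker, Ollama inference, and the Postgres insert.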

What’s Next?

This architecture provides the foundation for:

  • Multi-model serving: Run multiple LLMs for different tasks
  • Advanced RAG pipelines: Integrate vector databases for context-aware AI
  • Compliance and governance: Audit trails for AI decision-making
  • Edge deployment: Extend to edge Kubernetes clusters for low-latency AI

Conclusion

This production blueprint transforms Kubernetes from container orchestration into an AI automation platform. You’ve learned to deploy self-hosted LLMs at scale, orchestrate complex AI workflows, and operate this infrastructure efficiently.

Build your AI monster and share your experience.
