The Challenge: Bridging AI Capabilities and Production Reality
Organizations adopting large language models face a critical gap: running an LLM locally is straightforward, but orchestrating AI workflows at scale requires production infrastructure. This guide shows how Kubernetes, n8n, and Ollama converge to solve this challenge, creating an enterprise-ready platform that scales from prototype to production.
What you’ll build: A production-ready AI automation platform that processes documents, enriches data, and triggers intelligent workflows – all self-hosted, scalable, and cost-effective.
Architecture Overview
Let’s visualize how these components work together:
```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        subgraph "Workflow Layer"
            N8N[n8n Workflow Engine<br/>- Orchestration<br/>- Triggers & Webhooks]
        end
        subgraph "AI Inference Layer"
            OLLAMA[Ollama Service<br/>- LLM Runtime<br/>- Model Management]
            GPU[GPU Node Pool<br/>- Dedicated Resources]
        end
        subgraph "Storage Layer"
            PVC[Persistent Volumes<br/>- Model Storage<br/>- Workflow State]
        end
        subgraph "Ingress & Networking"
            INGRESS[Ingress Controller<br/>- API Gateway]
        end
    end

    EXT[External Triggers<br/>Webhooks, Cron, API] --> INGRESS
    INGRESS --> N8N
    N8N --> OLLAMA
    OLLAMA --> GPU
    OLLAMA --> PVC
    N8N --> PVC
```
Key Design Principles:
- Separation of concerns: Workflow orchestration separated from AI inference
- Horizontal scalability: Each component scales independently
- Resource efficiency: GPU resources allocated only to inference workloads
- State persistence: Workflows and models survive pod restarts
Component Breakdown
Ollama: Your Self-Hosted LLM Runtime
Ollama provides the inference engine, making models like Llama, Mistral, or Phi accessible via a simple API. In production, we deploy it as a StatefulSet for stable network identity and persistent storage.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ai-platform
spec:
  serviceName: ollama
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        workload-type: gpu
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
              name: api
          resources:
            requests:
              memory: "8Gi"
              nvidia.com/gpu: 1
            limits:
              memory: "16Gi"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          env:
            - name: OLLAMA_NUM_PARALLEL
              value: "4"
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "2"
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
```
Production considerations:
- `OLLAMA_NUM_PARALLEL` controls concurrent requests per instance
- `OLLAMA_MAX_LOADED_MODELS` limits memory consumption
- The node selector ensures pods land only on GPU-enabled nodes
- Persistent volumes prevent model re-downloads on restarts
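One gap worth closing: the workflow examples later in this guide call `http://ollama-service:11434`, but the StatefulSet alone does not create that endpoint. A minimal sketch of the missing Service, assuming the name and labels above (note the StatefulSet's `serviceName: ollama` additionally expects a headless Service named `ollama` for stable pod DNS):

```yaml
# ClusterIP Service fronting the Ollama pods. The name matches the
# "ollama-service" URL used by the n8n workflow nodes below; adjust
# if you name it differently.
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-platform
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - name: api
      port: 11434
      targetPort: 11434
      protocol: TCP
```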
n8n: Workflow Orchestration at Scale
n8n acts as your AI automation hub, connecting triggers to AI processing and actions. Deploy it with queue mode for distributed processing.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-worker
  namespace: ai-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: n8n
      component: worker
  template:
    metadata:
      labels:
        app: n8n
        component: worker
    spec:
      containers:
        - name: n8n
          image: n8nio/n8n:latest
          env:
            - name: EXECUTIONS_MODE
              value: "queue"
            - name: QUEUE_BULL_REDIS_HOST
              value: "redis-service"
            - name: N8N_ENCRYPTION_KEY
              valueFrom:
                secretKeyRef:
                  name: n8n-secrets
                  key: encryption-key
            - name: DB_TYPE
              value: "postgresdb"
            - name: DB_POSTGRESDB_HOST
              value: "postgres-service"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: n8n-service
  namespace: ai-platform
spec:
  type: ClusterIP
  ports:
    - port: 5678
      targetPort: 5678
      protocol: TCP
  selector:
    app: n8n
    component: worker
```
Architecture notes:
- Queue mode enables horizontal scaling of workflow execution
- Redis handles job distribution across workers
- PostgreSQL ensures workflow state persistence
- Secrets management for sensitive credentials
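The Deployment reads `N8N_ENCRYPTION_KEY` from a Secret named `n8n-secrets` that the manifests above don't create. One way to generate a suitable key is sketched below; the 24-byte length is an assumption for illustration, not an n8n requirement (n8n accepts any sufficiently long random string):

```python
# Generate a random encryption key for the n8n-secrets Secret.
import secrets

def make_encryption_key(nbytes: int = 24) -> str:
    """Return a hex-encoded random key (48 hex chars for 24 bytes)."""
    return secrets.token_hex(nbytes)

if __name__ == "__main__":
    key = make_encryption_key()
    print(key)
    # Then create the Secret the Deployment references:
    #   kubectl create secret generic n8n-secrets -n ai-platform \
    #     --from-literal=encryption-key=<key>
```

Keep this key backed up: n8n uses it to decrypt stored credentials, so losing it means re-entering every credential.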
Production Workflow Pattern
Here’s how a production AI workflow executes across the platform:
```mermaid
sequenceDiagram
    participant Client
    participant Ingress
    participant n8n
    participant Redis
    participant Worker
    participant Ollama
    participant Storage

    Client->>Ingress: POST /webhook/process-document
    Ingress->>n8n: Route request
    n8n->>Redis: Queue workflow job
    Redis->>Worker: Assign to available worker
    Worker->>Storage: Fetch document
    Worker->>Ollama: POST /api/generate<br/>{prompt: analyze_document}
    Ollama->>Ollama: Load model (if not cached)
    Ollama->>Worker: Return analysis
    Worker->>Storage: Save enriched data
    Worker->>Redis: Mark job complete
    Redis->>n8n: Update workflow status
    n8n->>Client: Webhook response
```
Real-World Workflow Example
Let’s build an intelligent document processing pipeline that beginners can deploy and pros can extend:
A simplified n8n workflow configuration:

```json
{
  "nodes": [
    {
      "name": "Webhook Trigger",
      "type": "n8n-nodes-base.webhook",
      "parameters": {
        "path": "process-document",
        "httpMethod": "POST"
      }
    },
    {
      "name": "Extract Text",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "http://tika-service:9998/tika",
        "method": "PUT"
      }
    },
    {
      "name": "AI Analysis",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "http://ollama-service:11434/api/generate",
        "method": "POST",
        "bodyParameters": {
          "model": "llama3.2",
          "prompt": "Analyze this document and extract: key topics, sentiment, action items, and summary. Document: {{$json.text}}",
          "stream": false
        }
      }
    },
    {
      "name": "Store Results",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "insert",
        "table": "document_analysis"
      }
    }
  ]
}
```
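The "AI Analysis" node is just an HTTP POST to Ollama's `/api/generate` endpoint, so you can exercise the service outside n8n too. A minimal stdlib-only sketch of the same call (the `localhost` URL assumes a `kubectl port-forward` to the Ollama service, or swap in the in-cluster DNS name):

```python
# Build and send the same /api/generate request the "AI Analysis"
# n8n node issues. Uses only the standard library.
import json
import urllib.request

def build_generate_payload(model: str, document_text: str) -> dict:
    """Mirror the request body of the n8n 'AI Analysis' node."""
    return {
        "model": model,
        "prompt": (
            "Analyze this document and extract: key topics, sentiment, "
            f"action items, and summary. Document: {document_text}"
        ),
        "stream": False,  # one JSON response instead of a token stream
    }

def analyze(document_text: str,
            base_url: str = "http://localhost:11434") -> str:
    payload = build_generate_payload("llama3.2", document_text)
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=false, Ollama returns the full completion
        # under the "response" key of a single JSON object.
        return json.loads(resp.read())["response"]
```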
Scaling Strategy
As your workload grows, scale each component independently:
```mermaid
graph LR
    subgraph "Development"
        D1[1 n8n pod<br/>1 Ollama pod<br/>Single node]
    end
    subgraph "Production"
        P1[3 n8n workers<br/>2 Ollama instances<br/>GPU node pool]
    end
    subgraph "Enterprise"
        E1[10+ n8n workers<br/>5+ Ollama instances<br/>Multi-region<br/>Auto-scaling]
    end

    D1 -->|Add resources| P1
    P1 -->|Add automation| E1
```
Scaling commands:
```bash
# Scale n8n workers for more throughput
kubectl scale deployment n8n-worker --replicas=5 -n ai-platform

# Scale Ollama for more concurrent AI requests
kubectl scale statefulset ollama --replicas=3 -n ai-platform

# Enable horizontal pod autoscaling
kubectl autoscale deployment n8n-worker \
  --cpu-percent=70 \
  --min=3 \
  --max=10 \
  -n ai-platform
```
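If you manage the cluster through GitOps, the `kubectl autoscale` command can be expressed declaratively instead, so the autoscaler survives reconciliation. A sketch using the `autoscaling/v2` API with the same thresholds:

```yaml
# Declarative equivalent of the kubectl autoscale command above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```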
Resource Management
Production LLM deployments require careful resource allocation:
```yaml
# Resource quota for the AI namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-platform-quota
  namespace: ai-platform
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    requests.nvidia.com/gpu: "4"
    persistentvolumeclaims: "10"
---
# Limit ranges for safety
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-platform-limits
  namespace: ai-platform
spec:
  limits:
    - max:
        memory: 32Gi
        cpu: "8"
      min:
        memory: 512Mi
        cpu: 100m
      type: Container
```
Monitoring and Observability
Track AI workflow health with these metrics:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: 'ollama'
        static_configs:
          - targets: ['ollama-service.ai-platform:11434']
        metrics_path: '/metrics'
      - job_name: 'n8n'
        static_configs:
          - targets: ['n8n-service.ai-platform:5678']
        metrics_path: '/metrics'
```
Note that n8n only serves `/metrics` when `N8N_METRICS=true` is set, and Ollama does not expose a Prometheus endpoint natively, so the `ollama` job may need a sidecar exporter in front of the API.
Key metrics to monitor:
- Model load time and memory usage (Ollama)
- Request latency and queue depth (n8n)
- GPU utilization and throttling
- Storage I/O and capacity
Security Hardening
```yaml
# Network policy: restrict Ollama access to n8n pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-network-policy
  namespace: ai-platform
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: n8n
      ports:
        - protocol: TCP
          port: 11434
```
Security checklist:
- Network policies isolate AI services
- Secrets management for API keys and credentials
- RBAC controls for namespace access
- Pod security standards enforce container best practices
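For the RBAC item on the checklist, a hedged sketch of a namespace-scoped Role and RoleBinding granting operators read-only access to the workloads (the `ai-operators` group name is illustrative, not something the stack defines):

```yaml
# Read-only access to workloads in the ai-platform namespace.
# The "ai-operators" group is an example; bind to your own
# users or groups.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-platform-viewer
  namespace: ai-platform
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-platform-viewer-binding
  namespace: ai-platform
subjects:
  - kind: Group
    name: ai-operators
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ai-platform-viewer
  apiGroup: rbac.authorization.k8s.io
```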
Cost Optimization
Running this stack efficiently:
| Component | Development | Production | Enterprise |
|---|---|---|---|
| n8n | 1 pod (0.5 CPU, 1GB) | 3 pods (1.5 CPU, 3GB) | 10 pods + auto-scale |
| Ollama | 1 pod (1 GPU, 8GB) | 2 pods (2 GPU, 16GB) | 5 pods (5 GPU, 40GB) |
| Storage | 50GB SSD | 200GB SSD | 1TB+ NVMe |
| Monthly cost | ~$100 | ~$500 | ~$2000 |
Compare this to cloud AI APIs: at $0.002–$0.06 per 1K tokens, heavy usage quickly exceeds the fixed cost of self-hosting.
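The break-even point is simple arithmetic: divide the monthly self-hosted cost by the per-token API price. A quick sketch using the $500/month production tier from the table and the cheapest API rate quoted above:

```python
# Break-even token volume: the monthly token count at which a
# pay-per-token API bill matches a fixed self-hosted cluster cost.
def break_even_tokens(monthly_cost_usd: float,
                      price_per_1k_tokens_usd: float) -> float:
    """Tokens per month where the API bill equals the cluster cost."""
    return monthly_cost_usd / price_per_1k_tokens_usd * 1000

# $500/month production tier vs. a $0.002-per-1K-token API:
tokens = break_even_tokens(500, 0.002)
print(f"{tokens:,.0f} tokens/month")  # prints 250,000,000 tokens/month
```

Above roughly 250M tokens a month at that rate, the self-hosted production tier is already cheaper; at the $0.06 rate the break-even drops to about 8.3M tokens.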
Getting Started
Deploy the complete stack:
```bash
# Create namespace
kubectl create namespace ai-platform

# Deploy Ollama
kubectl apply -f ollama-statefulset.yaml

# Wait for Ollama to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n ai-platform --timeout=300s

# Load your first model
kubectl exec -it ollama-0 -n ai-platform -- ollama pull llama3.2

# Deploy n8n with dependencies
kubectl apply -f n8n-deployment.yaml

# Expose n8n UI
kubectl port-forward svc/n8n-service 5678:5678 -n ai-platform
```
Access n8n at http://localhost:5678 and start building workflows.
What’s Next?
This architecture provides the foundation for:
- Multi-model serving: Run multiple LLMs for different tasks
- Advanced RAG pipelines: Integrate vector databases for context-aware AI
- Compliance and governance: Audit trails for AI decision-making
- Edge deployment: Extend to edge Kubernetes clusters for low-latency AI
Conclusion
This production blueprint transforms Kubernetes from container orchestration into an AI automation platform. You’ve learned to deploy self-hosted LLMs at scale, orchestrate complex AI workflows, and operate this infrastructure efficiently.
Build your AI monster and share your experience.