Last month, I was at a local Meetup preparing for my “Agents are the New Microservices” talk, and someone asked me a brutally honest question: “Have you actually deployed these AI agents on Kubernetes, or is this all just theory?”
Fair question. And honestly? Six months ago, it would’ve been mostly theory. But after spending way too many late nights debugging why my AI agents kept eating through my API credits and crashing my Redis instances, I’ve got some stories to tell.
The Problem Nobody Talks About
Here’s the thing about AI agents that the vendor slides don’t mention: they’re CHATTY. Like, really chatty. A single user request can trigger 5-10 LLM calls, each one hitting your Claude API, and suddenly you’re burning through tokens faster than I go through filter coffee at Rameshwaram Cafe.
And unlike traditional microservices where you can predict load patterns, AI agents are unpredictable. One agent might use three tools and finish in 2 seconds. Another might go down a rabbit hole, call 15 different APIs, and take 45 seconds. Good luck autoscaling that.
So when I started building our multi-agent system for Docker, I knew Kubernetes was the answer. I just didn’t know how painful the journey would be.
Building the Agent (Take Three)
My first two attempts at containerizing an AI agent were disasters. The first one worked on my MacBook but crashed in production because I forgot about Redis connection pooling. The second one was over-engineered – I had so many abstractions that debugging felt like archaeology.
The third time, I kept it simple. Here’s what actually works:
# agent/main.py
import json
import os

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from anthropic import AsyncAnthropic
import redis.asyncio as redis

app = FastAPI()
anthropic = AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
redis_client = None

@app.on_event("startup")
async def startup():
    global redis_client
    # This took me 3 hours to debug because I kept using redis:// instead of the actual service name
    # Note: from_url is synchronous -- it just builds the client, so no await here
    redis_client = redis.from_url(
        os.getenv("REDIS_URL", "redis://redis-service:6379"),
        decode_responses=True,
    )
    await redis_client.ping()  # fail fast if Redis is unreachable
    print("Connected to Redis - finally!")

@app.get("/health")
async def health_check():
    # Pro tip: Actually CHECK if Redis is alive
    try:
        await redis_client.ping()
        return {"status": "healthy"}
    except Exception as e:
        # This saved me during a 2am production incident
        # (FastAPI ignores Flask-style tuple returns -- use JSONResponse for the 503)
        return JSONResponse(status_code=503,
                            content={"status": "unhealthy", "error": str(e)})

@app.post("/agent/execute")
async def execute_agent(task: str, agent_id: str = "default"):
    # Get conversation history - agents need memory!
    history_key = f"agent:{agent_id}:history"
    history = await redis_client.lrange(history_key, 0, -1)
    messages = [json.loads(msg) for msg in history] if history else []
    messages.append({"role": "user", "content": task})

    # The magic happens here
    response = await anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        tools=[
            {
                "name": "search_github",
                "description": "Search GitHub repos and issues",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"],
                },
            }
        ],
        messages=messages,
    )

    # Handle tool use (this is where it gets interesting)
    if response.stop_reason == "tool_use":
        for block in response.content:
            if block.type == "tool_use":
                # Call your actual tool here
                # I'm skipping the implementation because that's another blog post
                tool_result = {"results": "some data"}
                # Feed it back to Claude
                messages.append({"role": "assistant", "content": response.content})
                messages.append({
                    "role": "user",
                    "content": [{
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(tool_result),
                    }],
                })
        # Get the final answer
        final_response = await anthropic.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=messages,
        )
        result = final_response.content[0].text
    else:
        result = response.content[0].text

    # Save to Redis (keep last 10 messages). Push only the NEW turns -- re-pushing
    # the loaded history duplicates it, and raw content blocks aren't JSON-serializable
    await redis_client.rpush(history_key,
                             json.dumps({"role": "user", "content": task}),
                             json.dumps({"role": "assistant", "content": result}))
    await redis_client.ltrim(history_key, -10, -1)
    await redis_client.expire(history_key, 3600)
    return {"result": result}
Look, I know this isn’t perfect. There’s no error handling for when Claude times out. The tool execution is stubbed out. But you know what? It WORKS. And after two failed attempts, “works” felt pretty damn good.
The Dockerfile That Finally Worked
FROM python:3.11-slim

WORKDIR /app

# I learned the hard way: always clean up apt cache
RUN apt-get update && apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent/ /app/agent/

# Health checks are NOT optional - trust me on this
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

CMD ["uvicorn", "agent.main:app", "--host", "0.0.0.0", "--port", "8000"]
The health check? That’s there because I once deployed 20 broken pods to production and Kubernetes happily sent traffic to all of them. Learn from my mistakes.
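For completeness, the requirements.txt that the Dockerfile copies in. This is my assumption of the minimum set the agent code above needs; pin exact versions in your own repo:

```
fastapi
uvicorn[standard]
anthropic
redis
prometheus-client
```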
Kubernetes Configuration: The Parts That Matter
Forget the 200-line YAML files you see in tutorials. Here’s what you actually need:
Redis first (because agents without memory are just expensive API calls):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: agentic-ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
          volumeMounts:
            - name: redis-data
              mountPath: /data
      volumes:
        - name: redis-data
          persistentVolumeClaim:
            claimName: redis-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: redis-service
  namespace: agentic-ai
spec:
  selector:
    app: redis
  ports:
    - port: 6379
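The `claimName: redis-pvc` above assumes a PVC that the manifest doesn't show. A minimal sketch -- the 1Gi size is a guess, and you may need a `storageClassName` depending on your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-pvc
  namespace: agentic-ai
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```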
The agent deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
  namespace: agentic-ai
spec:
  replicas: 3  # Start small, scale later
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
        - name: agent
          image: your-registry/ai-agent:v1.0.0
          ports:
            - containerPort: 8000
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-secrets
                  key: anthropic-api-key
            - name: REDIS_URL
              value: "redis://redis-service:6379"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-service
  namespace: agentic-ai
spec:
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
The Autoscaling Drama
Oh boy. HPA (Horizontal Pod Autoscaler) and AI agents are a special kind of hell. Traditional CPU/memory metrics don’t work because:
- An idle agent uses almost no CPU
- An agent making an API call uses… still almost no CPU
- The bottleneck is Claude’s API, not your container
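Since CPU tells you nothing, the signal worth exporting is how many requests each pod is actually juggling. Here's a dependency-free sketch of the tracker; in production I expose the count as a prometheus_client Gauge so the autoscaler can see it:

```python
import asyncio
import contextlib

# Tracks in-flight requests per pod -- the queue-depth signal CPU metrics miss.
# In production, mirror self.current into a prometheus_client Gauge.
class InFlightTracker:
    def __init__(self):
        self.current = 0  # requests being processed right now
        self.peak = 0     # high-water mark, handy for capacity planning

    @contextlib.asynccontextmanager
    async def track(self):
        self.current += 1
        self.peak = max(self.peak, self.current)
        try:
            yield
        finally:
            self.current -= 1
```

Wrap the body of each `/agent/execute` handler in `async with tracker.track():` and scrape `tracker.current` from a `/metrics` endpoint.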
After burning through my monthly API quota in a weekend, I learned to scale based on request queue depth instead:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: agentic-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # Queue depth exported by the agents, surfaced to the HPA via prometheus-adapter
    - type: Pods
      pods:
        metric:
          name: agent_requests_in_flight
        target:
          type: AverageValue
          averageValue: "5"  # scale out when pods average >5 concurrent requests
  # Scale down slowly, scale up quickly
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # No waiting
      policies:
        - type: Percent
          value: 100  # Double immediately if needed
          periodSeconds: 30
The scaleDown stabilization window saved me from the death spiral where pods scale up, then immediately scale down, then back up again.
Multi-Agent Orchestration: Where It Gets Wild
Running one agent is manageable. Running a TEAM of agents that need to coordinate? That’s where things get interesting.
I built an orchestrator that can run agents in three patterns:
Sequential (agent A → agent B → agent C):
async def run_sequential_workflow(task: str, agents: list):
    context = {}
    for agent_name in agents:
        result = await call_agent(agent_name, task, context)
        context[agent_name] = result  # Pass to next agent
    return context
Parallel (all agents at once):
async def run_parallel_workflow(task: str, agents: list):
    tasks = [call_agent(agent, task, {}) for agent in agents]
    results = await asyncio.gather(*tasks)
    return dict(zip(agents, results))
Hierarchical (manager delegates to workers):
async def run_hierarchical_workflow(task: str):
    # Manager agent decides what to do
    plan = await call_agent("manager", f"Break down: {task}")
    # Workers execute subtasks
    subtasks = parse_plan(plan)  # Your parsing logic here
    results = await asyncio.gather(*[
        call_agent(st["agent"], st["task"], {})
        for st in subtasks
    ])
    return results
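That `parse_plan` helper is doing a lot of quiet work. Here's the minimal version I'd start with -- it assumes you prompted the manager agent to answer with a JSON array of `{"agent": ..., "task": ...}` objects, a convention I'm inventing for this sketch:

```python
import json

# Hypothetical parse_plan for the hierarchical workflow. Assumes the manager
# was prompted to reply with a JSON array like:
#   [{"agent": "researcher", "task": "find related issues"}, ...]
def parse_plan(plan: str) -> list[dict]:
    start = plan.find("[")
    end = plan.rfind("]")
    if start != -1 and end > start:
        try:
            return json.loads(plan[start:end + 1])
        except json.JSONDecodeError:
            pass  # manager went off-script; fall through to the fallback
    # Fallback: hand the whole plan to a single generic worker
    return [{"agent": "worker", "task": plan}]
```

The bracket-slicing is deliberate: LLMs love to wrap JSON in chatty preamble, so grab the first `[` through the last `]` before parsing.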
What I Wish Someone Had Told Me
1. Redis will be your best friend and worst enemy. I spent two days debugging why agents kept getting confused. Turns out I was storing conversation history wrong. Use keys like agent:{id}:history not {id}:agent:history. Future you will thank present you.
2. Claude API rate limits are real. With 10 agents running in parallel, you WILL hit rate limits. Implement exponential backoff or prepare for cryptic 429 errors at 3am.
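A minimal sketch of that backoff. With the real SDK you'd catch `anthropic.RateLimitError`; I've made the exception type a parameter so the helper stands alone:

```python
import asyncio
import random

# Retry an async callable with exponential backoff plus jitter.
# retry_on is a parameter so this sketch stays dependency-free; in the agent,
# pass (anthropic.RateLimitError,) to retry only on 429s.
async def with_backoff(fn, retry_on=(Exception,), max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries -- let the caller see the error
            # 1s, 2s, 4s, ... plus jitter so parallel agents don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

The jitter matters more than the exponent: ten agents that all got a 429 at the same instant will otherwise all retry at the same instant too.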
3. Logs are gold. Structure your logs. I use:
logger.info("agent_call",
            agent_id=agent_id,
            tokens=response.usage.input_tokens,
            duration=time.time() - start_time)
Now I can actually debug production issues instead of guessing.
4. Start with StatefulSets if you need sticky sessions. Some agents benefit from affinity to specific pods. I learned this after users complained their conversations kept “resetting.”
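Before reaching for a full StatefulSet, the lighter option I'd try first is client-IP session affinity on the Service. A sketch (the timeout mirroring the Redis history TTL is my own convention, not a requirement):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-service
  namespace: agentic-ai
spec:
  selector:
    app: ai-agent
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # match the 1-hour Redis history TTL
  ports:
    - port: 80
      targetPort: 8000
```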
5. Cost monitoring is NOT optional. I built a simple Prometheus metric:
from prometheus_client import Counter

TOKEN_USAGE = Counter('claude_tokens_total',
                      'Total tokens used',
                      ['model', 'agent_type'])
This saved my budget. Literally.
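To turn that counter into an actual page, here's a hypothetical PrometheusRule (assumes the prometheus-operator; the 50M-tokens/day threshold is a placeholder -- derive yours from your own budget):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: claude-cost-alerts
  namespace: agentic-ai
spec:
  groups:
    - name: ai-cost
      rules:
        - alert: TokenBurnTooHigh
          # Extrapolate the last hour's burn rate to a full day
          expr: sum(rate(claude_tokens_total[1h])) * 86400 > 50000000
          for: 15m
          labels:
            severity: page
          annotations:
            summary: "Projected daily Claude token usage is over budget"
```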
What’s Actually Working in Production
After three months of running this in production for Docker’s AI initiatives:
- 3 replicas minimum, 20 maximum – gives us headroom without breaking the bank
- Redis with persistent volumes – agents losing their memory is a terrible user experience
- 30-second health check intervals – catches issues before users do
- Separate namespaces for dev/staging/prod – because I deployed to production by mistake. Once.
- Service mesh (Istio) – for gradual rollouts. Deploy v2, send 10% of traffic, pray, send 100%
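That Istio step, sketched as a VirtualService doing the 10% canary (the `v1`/`v2` subset names are hypothetical and would need a matching DestinationRule):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ai-agent
  namespace: agentic-ai
spec:
  hosts:
    - ai-agent-service
  http:
    - route:
        - destination:
            host: ai-agent-service
            subset: v1
          weight: 90
        - destination:
            host: ai-agent-service
            subset: v2   # the "pray" phase
          weight: 10
```

Bump `v2` to 100 once the error rate and token spend look sane.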
The Real Talk
Is this perfect? Hell no. I still get paged when Redis runs out of memory. The cost per request is higher than I’d like. And don’t get me started on the time I accidentally created an infinite agent loop that cost me $47 in 15 minutes.
But you know what? It’s running. Real users are using it. AI agents are handling customer queries, writing code, and doing actual useful work. On Kubernetes. At scale.
And honestly? That’s pretty cool.
If you’re thinking about deploying AI agents on Kubernetes, my advice: start small, monitor everything, and for the love of all that’s holy, SET UP COST ALERTS.