
Kubernetes and Agentic AI: What I Learned Deploying AI Agents at Scale

Last month, I was at a local Meetup preparing for my “Agents are the New Microservices” talk, and someone asked me a brutally honest question: “Have you actually deployed these AI agents on Kubernetes, or is this all just theory?”

Fair question. And honestly? Six months ago, it would’ve been mostly theory. But after spending way too many late nights debugging why my AI agents kept eating through my API credits and crashing my Redis instances, I’ve got some stories to tell.

The Problem Nobody Talks About

Here’s the thing about AI agents that the vendor slides don’t mention: they’re CHATTY. Like, really chatty. A single user request can trigger 5-10 LLM calls, each one hitting your Claude API, and suddenly you’re burning through tokens faster than I go through filter coffee at Rameshwaram Cafe.

And unlike traditional microservices where you can predict load patterns, AI agents are unpredictable. One agent might use three tools and finish in 2 seconds. Another might go down a rabbit hole, call 15 different APIs, and take 45 seconds. Good luck autoscaling that.

So when I started building our multi-agent system for Docker, I knew Kubernetes was the answer. I just didn’t know how painful the journey would be.

Building the Agent (Take Three)

My first two attempts at containerizing an AI agent were disasters. The first one worked on my MacBook but crashed in production because I forgot about Redis connection pooling. The second one was over-engineered – I had so many abstractions that debugging felt like archaeology.

The third time, I kept it simple. Here’s what actually works:

# agent/main.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from anthropic import AsyncAnthropic
import redis.asyncio as redis
import json
import os

app = FastAPI()
# Use the async client - the sync one blocks the event loop on every call
anthropic = AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
redis_client = None

@app.on_event("startup")
async def startup():
    global redis_client
    # This took me 3 hours to debug because I kept using localhost instead of the actual service name
    redis_client = await redis.from_url(
        os.getenv("REDIS_URL", "redis://redis-service:6379"),
        decode_responses=True
    )
    print("Connected to Redis - finally!")

@app.get("/health")
async def health_check():
    # Pro tip: Actually CHECK if Redis is alive
    try:
        await redis_client.ping()
        return {"status": "healthy"}
    except Exception as e:
        # This saved me during a 2am production incident.
        # Returning a bare tuple doesn't set the status code in FastAPI -
        # you need an explicit response object for the 503.
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "error": str(e)}
        )

@app.post("/agent/execute")
async def execute_agent(task: str, agent_id: str = "default"):
    # Get conversation history - agents need memory!
    history_key = f"agent:{agent_id}:history"
    history = await redis_client.lrange(history_key, 0, -1)
    messages = [json.loads(msg) for msg in history] if history else []
    
    messages.append({"role": "user", "content": task})
    
    # The magic happens here
    response = await anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        tools=[
            {
                "name": "search_github",
                "description": "Search GitHub repos and issues",
                "input_schema": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"]
                }
            }
        ],
        messages=messages
    )
    
    result = ""
    # Handle tool use (this is where it gets interesting)
    if response.stop_reason == "tool_use":
        for block in response.content:
            if block.type == "tool_use":
                # Call your actual tool here
                # I'm skipping the implementation because that's another blog post
                tool_result = {"results": "some data"}
                
                # Feed it back to Claude - model_dump() so the content
                # blocks survive json.dumps when we save history below
                messages.append({"role": "assistant", "content": [b.model_dump() for b in response.content]})
                messages.append({
                    "role": "user",
                    "content": [{
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(tool_result)
                    }]
                })
                
                # Get the final answer
                final_response = await anthropic.messages.create(
                    model="claude-sonnet-4-20250514",
                    max_tokens=4096,
                    messages=messages
                )
                
                result = final_response.content[0].text
    else:
        result = response.content[0].text
    
    # Save to Redis (keep last 10 messages). Replace the list instead of
    # appending - otherwise every request duplicates the old history.
    messages.append({"role": "assistant", "content": result})
    await redis_client.delete(history_key)
    for msg in messages[-10:]:
        await redis_client.rpush(history_key, json.dumps(msg))
    await redis_client.expire(history_key, 3600)
    
    return {"result": result}

Look, I know this isn’t perfect. There’s no error handling for when Claude times out. The tool execution is stubbed out. But you know what? It WORKS. And after two failed attempts, “works” felt pretty damn good.

The Dockerfile That Finally Worked

FROM python:3.11-slim

WORKDIR /app

# I learned the hard way: always clean up apt cache
RUN apt-get update && apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent/ /app/agent/

# Health checks are NOT optional - trust me on this
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s \
  CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

CMD ["uvicorn", "agent.main:app", "--host", "0.0.0.0", "--port", "8000"]

The health check? That’s there because I once deployed 20 broken pods to production and Kubernetes happily sent traffic to all of them. One caveat I learned later: Kubernetes ignores the Dockerfile HEALTHCHECK entirely – it only matters under plain Docker. Inside the cluster, it’s the probes in your Deployment that actually hit /health. Learn from my mistakes.

Kubernetes Configuration: The Parts That Matter

Forget the 200-line YAML files you see in tutorials. Here’s what you actually need:

Redis first (because agents without memory are just expensive API calls):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: agentic-ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
        volumeMounts:
        - name: redis-data
          mountPath: /data
      volumes:
      - name: redis-data
        persistentVolumeClaim:
          claimName: redis-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: redis-service
  namespace: agentic-ai
spec:
  selector:
    app: redis
  ports:
  - port: 6379

The agent deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
  namespace: agentic-ai
spec:
  replicas: 3  # Start small, scale later
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
      - name: agent
        image: your-registry/ai-agent:v1.0.0
        ports:
        - containerPort: 8000
        env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: anthropic-api-key
        - name: REDIS_URL
          value: "redis://redis-service:6379"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        # The readiness probe is what actually keeps traffic away from broken pods
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-service
  namespace: agentic-ai
spec:
  selector:
    app: ai-agent
  ports:
  - port: 80
    targetPort: 8000

The Autoscaling Drama

Oh boy. HPA (Horizontal Pod Autoscaler) and AI agents are a special kind of hell. Traditional CPU/memory metrics don’t work because:

  1. An idle agent uses almost no CPU
  2. An agent making an API call uses… still almost no CPU
  3. The bottleneck is Claude’s API, not your container

After burning through my monthly API quota in a weekend, I learned to scale on request queue depth (exposed as a custom metric) instead of CPU alone. The manifest below keeps a plain CPU target as a fallback – the part that really earns its keep is the scaling behavior:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: agentic-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale slowly down, quickly up
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # No waiting
      policies:
      - type: Percent
        value: 100  # Double immediately if needed
        periodSeconds: 30

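For the record, “queue depth” here just means in-flight requests per pod. A minimal sketch of tracking it (in production you’d mirror the count into a prometheus_client Gauge and feed it to the HPA through a custom metrics adapter – that wiring is left out here):

```python
import asyncio
import contextlib

class InFlightTracker:
    """Counts concurrent /agent/execute requests - the 'queue depth' signal."""

    def __init__(self):
        self.count = 0

    @contextlib.asynccontextmanager
    async def track(self):
        # Increment on entry, always decrement on exit - even on errors
        self.count += 1
        try:
            yield self.count
        finally:
            self.count -= 1
```

In the FastAPI handler you’d wrap the body in `async with tracker.track():` so the metric reflects live work, not completed requests.
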
The scaleDown stabilization window saved me from the death spiral where pods scale up, then immediately scale down, then back up again.

Multi-Agent Orchestration: Where It Gets Wild

Running one agent is manageable. Running a TEAM of agents that need to coordinate? That’s where things get interesting.

I built an orchestrator that can run agents in three patterns:

Sequential (agent A → agent B → agent C):

async def run_sequential_workflow(task: str, agents: list):
    context = {}
    
    for agent_name in agents:
        result = await call_agent(agent_name, task, context)
        context[agent_name] = result  # Pass to next agent
    
    return context

Parallel (all agents at once):

import asyncio

async def run_parallel_workflow(task: str, agents: list):
    tasks = [call_agent(agent, task, {}) for agent in agents]
    results = await asyncio.gather(*tasks)
    return dict(zip(agents, results))

Hierarchical (manager delegates to workers):

async def run_hierarchical_workflow(task: str):
    # Manager agent decides what to do
    plan = await call_agent("manager", f"Break down: {task}")
    
    # Workers execute subtasks
    subtasks = parse_plan(plan)  # Your parsing logic here
    results = await asyncio.gather(*[
        call_agent(st["agent"], st["task"], {}) 
        for st in subtasks
    ])
    
    return results

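The hierarchical pattern leans on parse_plan, which I hand-waved above. A minimal sketch, assuming you prompt the manager agent to reply with a JSON array of {"agent": ..., "task": ...} objects – the fence-stripping handles the LLM habit of wrapping JSON in markdown:

```python
import json

FENCE = "`" * 3  # markdown code fence

def parse_plan(plan_text: str) -> list[dict]:
    """Turn the manager agent's reply into subtask dicts for the workers."""
    text = plan_text.strip()
    # LLMs love wrapping JSON in markdown fences - strip them if present
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1].rsplit(FENCE, 1)[0]
    subtasks = json.loads(text)
    # Drop malformed entries rather than crashing the whole workflow
    return [st for st in subtasks if "agent" in st and "task" in st]
```

Prompting the manager for strict JSON (and validating it here) is far more reliable than trying to parse free-form prose plans.
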
What I Wish Someone Had Told Me

1. Redis will be your best friend and worst enemy. I spent two days debugging why agents kept getting confused. Turns out I was storing conversation history wrong. Use keys like agent:{id}:history not {id}:agent:history. Future you will thank present you.

2. Claude API rate limits are real. With 10 agents running in parallel, you WILL hit rate limits. Implement exponential backoff or prepare for cryptic 429 errors at 3am.

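The backoff itself is only a few lines. A minimal sketch – in production you’d catch anthropic.RateLimitError specifically; bare Exception is used here only to keep the sketch self-contained:

```python
import asyncio
import random

async def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry an async callable with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries - let the caller deal with it
            # 1s, 2s, 4s, ... plus jitter proportional to the base delay,
            # so parallel agents don't all retry in lockstep
            await asyncio.sleep(base_delay * 2 ** attempt
                                + random.uniform(0, base_delay))
```

Then wrap each call site: `await with_backoff(lambda: client.messages.create(...))`.
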
3. Logs are gold. Structure your logs. I use structlog, since the stdlib logger won’t take keyword fields like this:

import time
import structlog

logger = structlog.get_logger()

logger.info("agent_call",
            agent_id=agent_id,
            tokens=response.usage.input_tokens,
            duration=time.time() - start_time)

Now I can actually debug production issues instead of guessing.

4. Start with StatefulSets if you need sticky sessions. Some agents benefit from affinity to specific pods. I learned this after users complained their conversations kept “resetting.”

5. Cost monitoring is NOT optional. I built a simple Prometheus metric:

from prometheus_client import Counter

TOKEN_USAGE = Counter('claude_tokens_total',
                      'Total tokens used',
                      ['model', 'agent_type'])

# Bump it after every Claude call
TOKEN_USAGE.labels(model="claude-sonnet-4-20250514",
                   agent_type=agent_id).inc(
    response.usage.input_tokens + response.usage.output_tokens)

This saved my budget. Literally.

What’s Actually Working in Production

After three months of running this in production for Docker’s AI initiatives:

  • 3 replicas minimum, 20 maximum – gives us headroom without breaking the bank
  • Redis with persistent volumes – agents losing their memory is a terrible user experience
  • 30-second health check intervals – catches issues before users do
  • Separate namespaces for dev/staging/prod – because I deployed to production by mistake. Once.
  • Service mesh (Istio) – for gradual rollouts. Deploy v2, send 10% of traffic, pray, send 100%

The Real Talk

Is this perfect? Hell no. I still get paged when Redis runs out of memory. The cost per request is higher than I’d like. And don’t get me started on the time I accidentally created an infinite agent loop that cost me $47 in 15 minutes.

But you know what? It’s running. Real users are using it. AI agents are handling customer queries, writing code, and doing actual useful work. On Kubernetes. At scale.

And honestly? That’s pretty cool.

If you’re thinking about deploying AI agents on Kubernetes, my advice: start small, monitor everything, and for the love of all that’s holy, SET UP COST ALERTS.
