
Kagent: Build Production-Ready AI Agents in Kubernetes – Complete Guide with Code Examples

Learn how to deploy AI agents natively in Kubernetes with Kagent. Step-by-step guide with code examples for DevOps automation, MCP integration, and multi-agent orchestration.

Kagent is an open-source Kubernetes-native framework for building, deploying, and managing AI agents directly in your cluster. Created by Solo.io and now a CNCF Sandbox Project, Kagent transforms how DevOps teams leverage AI for infrastructure automation.

Key Features:

  • βœ… Kubernetes-Native CRDs – Agents, ModelConfigs, and ToolServers as K8s resources
  • πŸ”Œ MCP Protocol Support – Connect to Model Context Protocol servers
  • πŸ“Š OpenTelemetry Tracing – Full observability for agent operations
  • 🎯 Multi-Provider – OpenAI, Anthropic, Azure, Vertex AI, Ollama
  • πŸ› οΈ Pre-Built Tools – Kubernetes, Helm, Istio, Prometheus, Grafana, Argo

Architecture Overview

Kagent runs a controller inside your cluster that reconciles Agents, ModelConfigs, and ToolServers as custom resources. Agents call a configured LLM provider and reach cluster tooling through MCP tool servers, with OpenTelemetry tracing the whole path.
Quick Start Installation

Prerequisites Setup

# Install required tools
# kind (Kubernetes in Docker)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Create local Kubernetes cluster
kind create cluster --name kagent-demo

# Verify cluster
kubectl cluster-info --context kind-kagent-demo

Install Kagent

# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"

# Download and install Kagent CLI
curl https://raw.githubusercontent.com/kagent-dev/kagent/refs/heads/main/scripts/get-kagent | bash

# Install Kagent to cluster
kagent install

Launch Dashboard

# Start Kagent UI (automatically port-forwards)
kagent ui

# Access at http://localhost:8080

Core Concepts with Code Examples

1. Creating Your First Agent

# helm-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: helm-agent
  namespace: kagent
spec:
  systemPrompt: |
    You are a Helm expert assistant. Help users manage Helm releases,
    charts, and deployments. Provide clear explanations and best practices.
  
  modelConfig:
    name: openai-gpt4
  
  tools:
  - toolServer:
      name: helm-mcp
      namespace: kagent
  - toolServer:
      name: kubernetes-mcp
      namespace: kagent
  
  maxIterations: 10
  temperature: 0.7

Apply the agent:

kubectl apply -f helm-agent.yaml

# Verify agent is running
kubectl get agents -n kagent
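Before applying an Agent manifest, it can help to sanity-check the fields client-side. The sketch below mirrors the field names from the helm-agent.yaml example above; it is an illustrative helper, not an official Kagent validator.

```python
# Minimal client-side sanity check for an Agent manifest before `kubectl apply`.
# Field names mirror the helm-agent.yaml example above; this is not an official
# Kagent validator.

REQUIRED_TOP_LEVEL = {"apiVersion", "kind", "metadata", "spec"}

def validate_agent(manifest: dict) -> list:
    """Return a list of problems found (an empty list means the manifest looks OK)."""
    problems = [f"missing top-level field: {f}"
                for f in sorted(REQUIRED_TOP_LEVEL - manifest.keys())]
    if manifest.get("kind") != "Agent":
        problems.append(f"kind should be 'Agent', got {manifest.get('kind')!r}")
    spec = manifest.get("spec", {})
    if not spec.get("systemPrompt"):
        problems.append("spec.systemPrompt is empty")
    if "name" not in spec.get("modelConfig", {}):
        problems.append("spec.modelConfig.name is required")
    return problems

agent = {
    "apiVersion": "kagent.io/v1alpha1",
    "kind": "Agent",
    "metadata": {"name": "helm-agent", "namespace": "kagent"},
    "spec": {
        "systemPrompt": "You are a Helm expert assistant.",
        "modelConfig": {"name": "openai-gpt4"},
        "tools": [{"toolServer": {"name": "helm-mcp", "namespace": "kagent"}}],
    },
}

print(validate_agent(agent))  # prints []
```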

2. Configure LLM Provider

# openai-model.yaml
apiVersion: kagent.io/v1alpha1
kind: ModelConfig
metadata:
  name: openai-gpt4
  namespace: kagent
spec:
  provider: openai
  model: gpt-4
  apiKeySecretRef:
    name: openai-secret
    key: api-key
  parameters:
    temperature: 0.7
    maxTokens: 4096
    topP: 1.0

Create the API key secret:

kubectl create secret generic openai-secret \
  --from-literal=api-key=$OPENAI_API_KEY \
  -n kagent
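Kubernetes stores Secret values base64-encoded. The snippet below shows the manifest that the `kubectl create secret` command above effectively produces, and how the value round-trips; the key name matches the `apiKeySecretRef` in the ModelConfig example.

```python
import base64

# What `kubectl create secret generic openai-secret --from-literal=api-key=...`
# produces under .data: the value is base64-encoded, not encrypted.
api_key = "sk-your-api-key-here"
encoded = base64.b64encode(api_key.encode()).decode()

secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "openai-secret", "namespace": "kagent"},
    "type": "Opaque",
    "data": {"api-key": encoded},
}

# Decoding recovers the original value, as a consumer of the secret would.
assert base64.b64decode(secret["data"]["api-key"]).decode() == api_key
```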

3. Set Up an MCP Tool Server

# kubernetes-mcp-toolserver.yaml
apiVersion: kagent.io/v1alpha1
kind: ToolServer
metadata:
  name: kubernetes-mcp
  namespace: kagent
spec:
  type: mcp
  image: ghcr.io/kagent-dev/mcp-kubernetes:latest
  
  # MCP server configuration
  command:
  - npx
  - -y
  - "@kagent/mcp-kubernetes"
  
  # Tools provided by this MCP server
  tools:
  - name: get_pods
    description: List pods in a namespace
  - name: get_deployments
    description: List deployments
  - name: get_services
    description: List services
  - name: describe_pod
    description: Get detailed pod information
  - name: get_logs
    description: Retrieve pod logs
  
  # Service account for K8s access
  serviceAccountName: kagent-kubernetes-sa

Advanced Use Cases with Examples

DevOps Automation Agent

# devops-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: devops-automation-agent
  namespace: kagent
spec:
  systemPrompt: |
    You are a DevOps automation expert. You help with:
    - Diagnosing deployment issues
    - Analyzing application logs
    - Troubleshooting network connectivity
    - Optimizing resource usage
    - Implementing best practices
    
    Always provide step-by-step reasoning and actionable solutions.
  
  modelConfig:
    name: anthropic-claude
  
  tools:
  - toolServer:
      name: kubernetes-mcp
  - toolServer:
      name: helm-mcp
  - toolServer:
      name: prometheus-mcp
  - toolServer:
      name: grafana-mcp
  
  maxIterations: 15
  temperature: 0.5

Multi-Agent Observability Setup

# observability-agents.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: metrics-analyzer
  namespace: kagent
spec:
  systemPrompt: |
    Analyze Prometheus metrics and identify performance issues.
  modelConfig:
    name: openai-gpt4
  tools:
  - toolServer:
      name: prometheus-mcp
---
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: log-analyzer
  namespace: kagent
spec:
  systemPrompt: |
    Analyze application logs and identify error patterns.
  modelConfig:
    name: anthropic-claude
  tools:
  - toolServer:
      name: kubernetes-mcp
  - toolServer:
      name: grafana-mcp
---
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: incident-orchestrator
  namespace: kagent
spec:
  systemPrompt: |
    Coordinate between metrics-analyzer and log-analyzer to
    diagnose and resolve incidents. Provide comprehensive reports.
  modelConfig:
    name: openai-gpt4
  agents:
  - name: metrics-analyzer
  - name: log-analyzer
  tools:
  - toolServer:
      name: kubernetes-mcp

Interact with Agents via CLI

REPL Mode

# Start interactive REPL
kagent repl

# List available agents
kagent> agents

# Output:
+---+-------------------------+----+----------------------------+
| # | NAME                    | ID | CREATED                    |
+---+-------------------------+----+----------------------------+
| 0 | helm-agent              | 2  | 2025-03-13T19:08:14.527935 |
| 1 | devops-automation-agent | 3  | 2025-03-13T19:08:14.348957 |
| 2 | metrics-analyzer        | 4  | 2025-03-13T19:08:13.794848 |
+---+-------------------------+----+----------------------------+

# Start conversation with agent
kagent> chat helm-agent

# Ask questions
You: List all helm releases in the cluster
Agent: I'll check the Helm releases across all namespaces...

Event Type: ToolCall(s)
Source: helm_agent
+---+--------------------+-----------------------------------------+
| # | NAME               | ARGUMENTS                               |
+---+--------------------+-----------------------------------------+
| 0 | helm_list_releases | {"all_namespaces":true,"deployed":true} |
+---+--------------------+-----------------------------------------+

Agent: I found 3 Helm releases:
- **kagent** (kagent namespace) - v0.7.4 - Deployed
- **prometheus** (monitoring namespace) - v2.45.0 - Deployed  
- **grafana** (monitoring namespace) - v9.5.3 - Deployed

Programmatic API Access

# Create conversation via API
curl -X POST http://localhost:8080/api/v1/conversations \
  -H "Content-Type: application/json" \
  -d '{
    "agentId": "helm-agent",
    "message": "Upgrade the kagent release to latest version"
  }'

# Get conversation history
curl http://localhost:8080/api/v1/conversations/conv-123

# Send follow-up message
curl -X POST http://localhost:8080/api/v1/conversations/conv-123/messages \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What are the changes in the new version?"
  }'
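The same calls are easy to script. The helper below builds the request URL and JSON body for each endpoint used in the curl examples above; the paths come from those examples, and this is an illustrative sketch rather than an official SDK.

```python
import json

# Builds (url, body) pairs for the Kagent conversations API, mirroring the
# curl examples above. Send with urllib.request or any HTTP client once the
# UI port-forward is running.
BASE_URL = "http://localhost:8080/api/v1"

def new_conversation(agent_id, message):
    url = f"{BASE_URL}/conversations"
    body = json.dumps({"agentId": agent_id, "message": message}).encode()
    return url, body

def follow_up(conversation_id, message):
    url = f"{BASE_URL}/conversations/{conversation_id}/messages"
    body = json.dumps({"message": message}).encode()
    return url, body

url, body = new_conversation("helm-agent",
                             "Upgrade the kagent release to latest version")
print(url)  # prints http://localhost:8080/api/v1/conversations
```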

Building Custom MCP Tools

Create Custom Tool Server

// custom-tools-mcp.ts
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  {
    name: "custom-devops-tools",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Advertise the custom tool
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "check_deployment_health",
      description: "Check health of a Kubernetes deployment",
      inputSchema: {
        type: "object",
        properties: {
          deployment: { type: "string" },
          namespace: { type: "string" },
        },
        required: ["deployment", "namespace"],
      },
    },
  ],
}));

// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name !== "check_deployment_health") {
    throw new Error(`Unknown tool: ${request.params.name}`);
  }
  const { deployment, namespace } = request.params.arguments as {
    deployment: string;
    namespace: string;
  };

  // Run the health check
  const result = await checkDeploymentHealth(deployment, namespace);

  return {
    content: [
      {
        type: "text",
        text: JSON.stringify(result, null, 2),
      },
    ],
  };
});

async function checkDeploymentHealth(deployment: string, namespace: string) {
  // Implementation details (e.g., shell out to kubectl or call the K8s API)
  return {
    healthy: true,
    replicas: { desired: 3, ready: 3 },
    conditions: [{ type: "Available", status: "True" }],
  };
}

// Start server
const transport = new StdioServerTransport();
await server.connect(transport);

Deploy Custom MCP Server

# custom-mcp-deployment.yaml
apiVersion: kagent.io/v1alpha1
kind: ToolServer
metadata:
  name: custom-devops-mcp
  namespace: kagent
spec:
  type: mcp
  image: myregistry/custom-devops-mcp:v1.0.0
  command:
  - node
  - dist/custom-tools-mcp.js
  
  serviceAccountName: custom-tools-sa
  
  env:
  - name: LOG_LEVEL
    value: debug
  
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "256Mi"
      cpu: "200m"

Observability and Monitoring

Enable OpenTelemetry Tracing

# kagent-otel-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kagent-otel-config
  namespace: kagent
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    
    processors:
      batch:
        timeout: 10s
    
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      
      prometheus:
        endpoint: "0.0.0.0:8889"
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
        
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]

Query Agent Traces

# View traces in Jaeger
kubectl port-forward svc/jaeger-query 16686:16686 -n observability

# Access Jaeger UI at http://localhost:16686
# Filter by service: kagent-controller
# Search for agent conversations and tool executions

Prometheus Metrics

# prometheus-scrape-config.yaml
scrape_configs:
- job_name: 'kagent'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - kagent
  
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: keep
    regex: kagent
  
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'kagent_(agent|tool|conversation).*'
    action: keep
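The `metric_relabel_configs` above keep only Kagent's own series. The same filter expressed in Python, using the regex from the scrape config (Prometheus anchors relabel regexes, hence `fullmatch`); the metric names here are illustrative:

```python
import re

# The keep-filter from the scrape config above. Prometheus anchors relabel
# regexes, so fullmatch is the correct Python equivalent.
KEEP = re.compile(r"kagent_(agent|tool|conversation).*")

scraped = [
    "kagent_agent_iterations_total",        # kept
    "kagent_tool_call_duration_seconds",    # kept
    "kagent_conversation_tokens_total",     # kept
    "go_goroutines",                        # dropped: runtime noise
    "process_cpu_seconds_total",            # dropped
]

kept = [m for m in scraped if KEEP.fullmatch(m)]
print(kept)
```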

Production Deployment Best Practices

High Availability Configuration

# kagent-ha-values.yaml
controller:
  replicas: 3
  
  resources:
    requests:
      memory: 512Mi
      cpu: 500m
    limits:
      memory: 1Gi
      cpu: 1000m
  
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
  
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: kagent-controller
          topologyKey: kubernetes.io/hostname

ui:
  replicas: 2
  
  ingress:
    enabled: true
    className: nginx
    hosts:
    - host: kagent.example.com
      paths:
      - path: /
        pathType: Prefix
    
    tls:
    - secretName: kagent-tls
      hosts:
      - kagent.example.com

database:
  type: postgres
  host: postgres.kagent.svc.cluster.local
  name: kagent
  existingSecret: kagent-db-secret

Install with HA:

helm repo add kagent https://kagent-dev.github.io/kagent
helm repo update

helm install kagent kagent/kagent \
  --namespace kagent \
  --create-namespace \
  -f kagent-ha-values.yaml

Security Hardening

# kagent-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kagent-restricted
  namespace: kagent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kagent-restricted-role
  namespace: kagent
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kagent-restricted-binding
  namespace: kagent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kagent-restricted-role
subjects:
- kind: ServiceAccount
  name: kagent-restricted
  namespace: kagent

Network Policies

# kagent-netpol.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kagent-controller-policy
  namespace: kagent
spec:
  podSelector:
    matchLabels:
      app: kagent-controller
  
  policyTypes:
  - Ingress
  - Egress
  
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: kagent-ui
    ports:
    - protocol: TCP
      port: 8080
  
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443  # LLM API calls

Real-World Examples

Istio Service Mesh Troubleshooting

# Chat with istio-agent
kagent repl

You: I'm seeing 503 errors in my productpage service. Can you help diagnose?

Agent: I'll investigate the 503 errors in the productpage service. Let me check 
several things:

1. First, let me check the Envoy proxy status...

[ToolCall: istio_check_proxy_status]

2. Now let me examine the virtual service configurations...

[ToolCall: istio_list_virtualservices]

3. Let me check the destination rules...

[ToolCall: istio_get_destinationrule]

Agent: I found the issue! Your destination rule has a connection pool setting 
that's too restrictive:

spec:
  host: productpage
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10  # Too low!
      http:
        http1MaxPendingRequests: 5
        maxRequestsPerConnection: 1

The maxConnections is set to 10, but you’re receiving 50+ concurrent requests during peak hours. This causes connection queueing and 503 errors.

Recommendation: Increase to at least 100 connections:

spec:
  host: productpage
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50

Would you like me to apply this change?


Automated Incident Response

# incident-response-workflow.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: incident-responder
  namespace: kagent
spec:
  systemPrompt: |
    You are an incident response agent. When alerted:
    1. Gather information from monitoring systems
    2. Analyze logs and metrics
    3. Identify root cause
    4. Execute remediation steps
    5. Create incident report
    6. Notify on-call team
  
  modelConfig:
    name: openai-gpt4-turbo
  
  tools:
  - toolServer:
      name: kubernetes-mcp
  - toolServer:
      name: prometheus-mcp
  - toolServer:
      name: grafana-mcp
  - toolServer:
      name: pagerduty-mcp
  
  # Multi-step reasoning
  maxIterations: 20
  temperature: 0.3  # Lower temperature for deterministic responses
  
  # Webhook for Prometheus Alertmanager
  webhooks:
  - name: alertmanager
    endpoint: /api/v1/incidents
    authentication:
      type: bearer
      secretRef:
        name: webhook-token

Performance and Scaling

Agent Resource Optimization

# efficient-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: optimized-agent
  namespace: kagent
spec:
  systemPrompt: "You are a Kubernetes troubleshooting expert."
  
  modelConfig:
    name: anthropic-claude-haiku  # Faster, cheaper model
  
  # Limit iterations to control costs
  maxIterations: 8
  
  # Control token usage
  maxTokens: 2048
  
  # Cache system prompts (Anthropic feature)
  caching:
    enabled: true
    ttl: 3600
  
  # Retry configuration
  retry:
    maxAttempts: 3
    backoff: exponential
    initialDelay: 1s
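The retry block above (`maxAttempts: 3`, exponential backoff, 1s initial delay) corresponds to delays computed like this; jitter, which production clients usually add on top, is omitted for clarity:

```python
# Delays implied by an exponential-backoff retry policy: with maxAttempts
# attempts total, there are maxAttempts - 1 retries, each waiting
# initial_delay * factor**n seconds.
def backoff_delays(max_attempts, initial_delay=1.0, factor=2.0):
    return [initial_delay * factor ** n for n in range(max_attempts - 1)]

print(backoff_delays(3))  # prints [1.0, 2.0]
```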

Batch Processing

# batch-agent-requests.py
import asyncio
import aiohttp

async def process_agent_request(session, conversation_id, message):
    async with session.post(
        f'http://kagent-api:8080/api/v1/conversations/{conversation_id}/messages',
        json={'message': message}
    ) as resp:
        return await resp.json()

async def batch_process(requests):
    async with aiohttp.ClientSession() as session:
        tasks = [
            process_agent_request(session, req['conv_id'], req['message'])
            for req in requests
        ]
        return await asyncio.gather(*tasks)

# Process 100 requests concurrently
requests = [
    {'conv_id': f'conv-{i}', 'message': f'Query {i}'}
    for i in range(100)
]

results = asyncio.run(batch_process(requests))
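Firing all 100 requests at once, as above, can overwhelm the API and your LLM provider's rate limits. A bounded variant using `asyncio.Semaphore` caps in-flight requests; `process_one` here is a stand-in for the aiohttp call in the script above:

```python
import asyncio

# Bounded-concurrency batch processing: at most `limit` requests in flight.
async def process_one(req):
    await asyncio.sleep(0)  # placeholder for the real HTTP round-trip
    return {"conv_id": req["conv_id"], "ok": True}

async def batch_process_bounded(requests, limit=10):
    sem = asyncio.Semaphore(limit)

    async def guarded(req):
        async with sem:  # blocks while `limit` requests are already running
            return await process_one(req)

    return await asyncio.gather(*(guarded(r) for r in requests))

requests = [{"conv_id": f"conv-{i}", "message": f"Query {i}"} for i in range(100)]
results = asyncio.run(batch_process_bounded(requests, limit=10))
print(len(results))  # prints 100
```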

Testing and Debugging

Unit Test Agents

# test_agent.py
import pytest
from kagent_sdk import Agent, ToolServer, ModelConfig

@pytest.fixture
def mock_kubernetes_mcp():
    return ToolServer(
        name="kubernetes-mcp",
        type="mcp",
        mock_responses={
            "get_pods": {"items": [{"name": "test-pod"}]},
            "get_logs": {"logs": "application started"}
        }
    )

@pytest.mark.asyncio
async def test_agent_troubleshooting(mock_kubernetes_mcp):
    agent = Agent(
        name="test-agent",
        system_prompt="You help troubleshoot pods",
        model_config=ModelConfig(provider="mock"),
        tools=[mock_kubernetes_mcp]
    )
    
    response = await agent.send_message(
        "Why is my test-pod crashing?"
    )
    
    assert "test-pod" in response.content
    assert mock_kubernetes_mcp.called_tools == ["get_pods", "get_logs"]

Debug Mode

# Enable debug logging
kubectl set env deployment/kagent-controller \
  LOG_LEVEL=debug \
  -n kagent

# View detailed logs
kubectl logs -f deployment/kagent-controller -n kagent

# Debug specific conversation
kagent debug conversation conv-123

# Output includes:
# - All LLM requests/responses
# - Tool execution details
# - Timing information
# - Token usage

Migration and Integration

Migrate from LangChain

# Before (LangChain)
from langchain.agents import AgentExecutor
from langchain.tools import Tool

def get_pods():
    # kubectl logic
    pass

tools = [Tool(name="get_pods", func=get_pods)]
agent = AgentExecutor.from_agent_and_tools(tools=tools)

# After (Kagent)
# kagent-migrated-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: migrated-langchain-agent
spec:
  systemPrompt: "Your existing prompt"
  modelConfig:
    name: openai-gpt4
  tools:
  - toolServer:
      name: kubernetes-mcp  # Built-in equivalent

Integrate with CI/CD

# .github/workflows/deploy-agent.yaml
name: Deploy Kagent Agent

on:
  push:
    branches: [main]
    paths:
    - 'agents/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
    
    - name: Deploy agents
      run: |
        kubectl apply -f agents/ -n kagent
    
    - name: Verify deployment
      run: |
        kubectl wait --for=condition=Ready agent --all -n kagent --timeout=120s
    
    - name: Run agent tests
      run: |
        ./scripts/test-agents.sh

Community and Resources

Getting Help

# Get help via CLI
kagent help

# Check version
kagent version

# Validate agent configuration
kagent validate agent.yaml

# Generate agent template
kagent generate agent --name my-agent --tools kubernetes,helm

Conclusion

Kagent represents a paradigm shift in how we deploy AI agents for cloud-native operations. By leveraging Kubernetes-native patterns, the Model Context Protocol, and production-grade observability, Kagent enables DevOps teams to:

βœ… Automate complex troubleshooting workflows
βœ… Scale AI agents with Kubernetes reliability
βœ… Integrate with existing cloud-native tooling
βœ… Maintain full visibility through OpenTelemetry
βœ… Extend easily with custom MCP servers

Next Steps:

  1. Clone the Kagent repository
  2. Follow the Quick Start Guide
  3. Join the Discord community
  4. Attend the KubeCon Atlanta Community Party
