Kagent is an open-source Kubernetes-native framework for building, deploying, and managing AI agents directly in your cluster. Created by Solo.io and now a CNCF Sandbox Project, Kagent transforms how DevOps teams leverage AI for infrastructure automation.
Key Features:
- Kubernetes-Native CRDs – Agents, ModelConfigs, and ToolServers as K8s resources
- MCP Protocol Support – Connect to Model Context Protocol servers
- OpenTelemetry Tracing – Full observability for agent operations
- Multi-Provider – OpenAI, Anthropic, Azure, Vertex AI, Ollama
- Pre-Built Tools – Kubernetes, Helm, Istio, Prometheus, Grafana, Argo
Architecture Overview

Quick Start Installation
Prerequisites Setup
# Install required tools
# kind (Kubernetes in Docker)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Create local Kubernetes cluster
kind create cluster --name kagent-demo
# Verify cluster
kubectl cluster-info --context kind-kagent-demo
Install Kagent
# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"
# Download and install Kagent CLI
curl https://raw.githubusercontent.com/kagent-dev/kagent/refs/heads/main/scripts/get-kagent | bash
# Install Kagent to cluster
kagent install
Launch Dashboard
# Start Kagent UI (automatically port-forwards)
kagent ui
# Access at http://localhost:8080
Core Concepts with Code Examples
1. Creating Your First Agent
# helm-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: helm-agent
  namespace: kagent
spec:
  systemPrompt: |
    You are a Helm expert assistant. Help users manage Helm releases,
    charts, and deployments. Provide clear explanations and best practices.
  modelConfig:
    name: openai-gpt4
  tools:
    - toolServer:
        name: helm-mcp
        namespace: kagent
    - toolServer:
        name: kubernetes-mcp
        namespace: kagent
  maxIterations: 10
  temperature: 0.7
Apply the agent:
kubectl apply -f helm-agent.yaml
# Verify agent is running
kubectl get agents -n kagent
2. Configure LLM Provider
# openai-model.yaml
apiVersion: kagent.io/v1alpha1
kind: ModelConfig
metadata:
  name: openai-gpt4
  namespace: kagent
spec:
  provider: openai
  model: gpt-4
  apiKeySecretRef:
    name: openai-secret
    key: api-key
  parameters:
    temperature: 0.7
    maxTokens: 4096
    topP: 1.0
Create the API key secret:
kubectl create secret generic openai-secret \
  --from-literal=api-key=$OPENAI_API_KEY \
  -n kagent
3. Setup MCP Tool Server
# kubernetes-mcp-toolserver.yaml
apiVersion: kagent.io/v1alpha1
kind: ToolServer
metadata:
  name: kubernetes-mcp
  namespace: kagent
spec:
  type: mcp
  image: ghcr.io/kagent-dev/mcp-kubernetes:latest
  # MCP server configuration
  command:
    - npx
    - -y
    - "@kagent/mcp-kubernetes"
  # Tools provided by this MCP server
  tools:
    - name: get_pods
      description: List pods in a namespace
    - name: get_deployments
      description: List deployments
    - name: get_services
      description: List services
    - name: describe_pod
      description: Get detailed pod information
    - name: get_logs
      description: Retrieve pod logs
  # Service account for K8s access
  serviceAccountName: kagent-kubernetes-sa
Advanced Use Cases with Examples
DevOps Automation Agent
# devops-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: devops-automation-agent
  namespace: kagent
spec:
  systemPrompt: |
    You are a DevOps automation expert. You help with:
    - Diagnosing deployment issues
    - Analyzing application logs
    - Troubleshooting network connectivity
    - Optimizing resource usage
    - Implementing best practices
    Always provide step-by-step reasoning and actionable solutions.
  modelConfig:
    name: anthropic-claude
  tools:
    - toolServer:
        name: kubernetes-mcp
    - toolServer:
        name: helm-mcp
    - toolServer:
        name: prometheus-mcp
    - toolServer:
        name: grafana-mcp
  maxIterations: 15
  temperature: 0.5
Multi-Agent Observability Setup
# observability-agents.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: metrics-analyzer
  namespace: kagent
spec:
  systemPrompt: |
    Analyze Prometheus metrics and identify performance issues.
  modelConfig:
    name: openai-gpt4
  tools:
    - toolServer:
        name: prometheus-mcp
---
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: log-analyzer
  namespace: kagent
spec:
  systemPrompt: |
    Analyze application logs and identify error patterns.
  modelConfig:
    name: anthropic-claude
  tools:
    - toolServer:
        name: kubernetes-mcp
    - toolServer:
        name: grafana-mcp
---
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: incident-orchestrator
  namespace: kagent
spec:
  systemPrompt: |
    Coordinate between metrics-analyzer and log-analyzer to
    diagnose and resolve incidents. Provide comprehensive reports.
  modelConfig:
    name: openai-gpt4
  agents:
    - name: metrics-analyzer
    - name: log-analyzer
  tools:
    - toolServer:
        name: kubernetes-mcp
Interact with Agents via CLI
REPL Mode
# Start interactive REPL
kagent repl
# List available agents
kagent> agents
# Output:
+---+-------------------------+----+----------------------------+
| # | NAME | ID | CREATED |
+---+-------------------------+----+----------------------------+
| 0 | helm-agent | 2 | 2025-03-13T19:08:14.527935 |
| 1 | devops-automation-agent | 3 | 2025-03-13T19:08:14.348957 |
| 2 | metrics-analyzer | 4 | 2025-03-13T19:08:13.794848 |
+---+-------------------------+----+----------------------------+
# Start conversation with agent
kagent> chat helm-agent
# Ask questions
You: List all helm releases in the cluster
Agent: I'll check the Helm releases across all namespaces...
Event Type: ToolCall(s)
Source: helm_agent
+---+--------------------+-----------------------------------------+
| # | NAME | ARGUMENTS |
+---+--------------------+-----------------------------------------+
| 0 | helm_list_releases | {"all_namespaces":true,"deployed":true} |
+---+--------------------+-----------------------------------------+
Agent: I found 3 Helm releases:
- **kagent** (kagent namespace) - v0.7.4 - Deployed
- **prometheus** (monitoring namespace) - v2.45.0 - Deployed
- **grafana** (monitoring namespace) - v9.5.3 - Deployed
Programmatic API Access
# Create conversation via API
curl -X POST http://localhost:8080/api/v1/conversations \
  -H "Content-Type: application/json" \
  -d '{
    "agentId": "helm-agent",
    "message": "Upgrade the kagent release to latest version"
  }'
# Get conversation history
curl http://localhost:8080/api/v1/conversations/conv-123
# Send follow-up message
curl -X POST http://localhost:8080/api/v1/conversations/conv-123/messages \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What are the changes in the new version?"
  }'
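The same API can be driven from Python. This is a minimal sketch: the endpoint paths and JSON payloads are taken from the curl examples above and are assumptions about kagent's REST API, not a documented SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes `kagent ui` port-forward is active

def conversation_url(conversation_id=None):
    """Build a conversations endpoint URL (paths mirror the curl examples above)."""
    if conversation_id is None:
        return f"{BASE_URL}/api/v1/conversations"
    return f"{BASE_URL}/api/v1/conversations/{conversation_id}/messages"

def send_message(conversation_id, message):
    """POST a follow-up message to an existing conversation."""
    req = urllib.request.Request(
        conversation_url(conversation_id),
        data=json.dumps({"message": message}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running kagent API):
# send_message("conv-123", "What are the changes in the new version?")
```

Only the standard library is used, so the snippet works anywhere `kubectl port-forward` or `kagent ui` exposes the API.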
Building Custom MCP Tools
Create Custom Tool Server
// custom-tools-mcp.ts
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const server = new Server(
  {
    name: "custom-devops-tools",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  }
);

// Register custom tool
server.setRequestHandler("tools/list", async () => ({
  tools: [
    {
      name: "check_deployment_health",
      description: "Check health of a Kubernetes deployment",
      inputSchema: {
        type: "object",
        properties: {
          deployment: { type: "string" },
          namespace: { type: "string" },
        },
        required: ["deployment", "namespace"],
      },
    },
  ],
}));

// Handle tool execution
server.setRequestHandler("tools/call", async (request) => {
  if (request.params.name === "check_deployment_health") {
    const { deployment, namespace } = request.params.arguments;
    // Execute kubectl command
    const result = await checkDeploymentHealth(deployment, namespace);
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify(result, null, 2),
        },
      ],
    };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

async function checkDeploymentHealth(deployment: string, namespace: string) {
  // Implementation details
  return {
    healthy: true,
    replicas: { desired: 3, ready: 3 },
    conditions: [], // condition details elided
  };
}

// Start server
const transport = new StdioServerTransport();
await server.connect(transport);
Deploy Custom MCP Server
# custom-mcp-deployment.yaml
apiVersion: kagent.io/v1alpha1
kind: ToolServer
metadata:
  name: custom-devops-mcp
  namespace: kagent
spec:
  type: mcp
  image: myregistry/custom-devops-mcp:v1.0.0
  command:
    - node
    - dist/custom-tools-mcp.js
  serviceAccountName: custom-tools-sa
  env:
    - name: LOG_LEVEL
      value: debug
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "256Mi"
      cpu: "200m"
Observability and Monitoring
Enable OpenTelemetry Tracing
# kagent-otel-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kagent-otel-config
  namespace: kagent
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 10s
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
Query Agent Traces
# View traces in Jaeger
kubectl port-forward svc/jaeger-query 16686:16686 -n observability
# Access Jaeger UI at http://localhost:16686
# Filter by service: kagent-controller
# Search for agent conversations and tool executions
Prometheus Metrics
# prometheus-scrape-config.yaml
scrape_configs:
  - job_name: 'kagent'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - kagent
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: kagent
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'kagent_(agent|tool|conversation).*'
        action: keep
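Once scraped, these metrics can be queried over the Prometheus HTTP API. A short sketch follows; the metric name used in the example is hypothetical (any series matching the `kagent_(agent|tool|conversation).*` pattern kept by the relabel rule above would work), and `PROM_URL` assumes a local port-forward to the Prometheus service.

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # e.g. after port-forwarding the Prometheus service

def build_query_url(promql):
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def query(promql):
    """Run an instant query against the Prometheus HTTP API."""
    with urllib.request.urlopen(build_query_url(promql)) as resp:
        return json.load(resp)["data"]["result"]

# Example (requires a reachable Prometheus; metric name is illustrative):
# for series in query('sum by (pod) (rate(kagent_tool_calls_total[5m]))'):
#     print(series["metric"].get("pod"), series["value"][1])
```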
Production Deployment Best Practices
High Availability Configuration
# kagent-ha-values.yaml
controller:
  replicas: 3
  resources:
    requests:
      memory: 512Mi
      cpu: 500m
    limits:
      memory: 1Gi
      cpu: 1000m
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: kagent-controller
            topologyKey: kubernetes.io/hostname
ui:
  replicas: 2
  ingress:
    enabled: true
    className: nginx
    hosts:
      - host: kagent.example.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: kagent-tls
        hosts:
          - kagent.example.com
database:
  type: postgres
  host: postgres.kagent.svc.cluster.local
  name: kagent
  existingSecret: kagent-db-secret
Install with HA:
helm repo add kagent https://kagent-dev.github.io/kagent
helm repo update
helm install kagent kagent/kagent \
--namespace kagent \
--create-namespace \
-f kagent-ha-values.yaml
Security Hardening
# kagent-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kagent-restricted
  namespace: kagent
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kagent-restricted-role
  namespace: kagent
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kagent-restricted-binding
  namespace: kagent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kagent-restricted-role
subjects:
  - kind: ServiceAccount
    name: kagent-restricted
    namespace: kagent
Network Policies
# kagent-netpol.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kagent-controller-policy
  namespace: kagent
spec:
  podSelector:
    matchLabels:
      app: kagent-controller
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: kagent-ui
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443  # LLM API calls
Real-World Examples
Istio Service Mesh Troubleshooting
# Chat with istio-agent
kagent repl
You: I'm seeing 503 errors in my productpage service. Can you help diagnose?
Agent: I'll investigate the 503 errors in the productpage service. Let me check
several things:
1. First, let me check the Envoy proxy status...
[ToolCall: istio_check_proxy_status]
2. Now let me examine the virtual service configurations...
[ToolCall: istio_list_virtualservices]
3. Let me check the destination rules...
[ToolCall: istio_get_destinationrule]
Agent: I found the issue! Your destination rule has a connection pool setting
that's too restrictive:
spec:
  host: productpage
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10  # Too low!
      http:
        http1MaxPendingRequests: 5
        maxRequestsPerConnection: 1
The maxConnections value is 10, but you're receiving 50+ concurrent requests during peak hours. Requests queue on the exhausted connection pool and overflow as 503 errors.
Recommendation: increase it to at least 100 connections:
spec:
  host: productpage
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
Would you like me to apply this change?
Automated Incident Response
# incident-response-workflow.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: incident-responder
  namespace: kagent
spec:
  systemPrompt: |
    You are an incident response agent. When alerted:
    1. Gather information from monitoring systems
    2. Analyze logs and metrics
    3. Identify root cause
    4. Execute remediation steps
    5. Create incident report
    6. Notify on-call team
  modelConfig:
    name: openai-gpt4-turbo
  tools:
    - toolServer:
        name: kubernetes-mcp
    - toolServer:
        name: prometheus-mcp
    - toolServer:
        name: grafana-mcp
    - toolServer:
        name: pagerduty-mcp
  # Multi-step reasoning
  maxIterations: 20
  temperature: 0.3  # Lower temperature for deterministic responses
  # Webhook for Prometheus Alertmanager
  webhooks:
    - name: alertmanager
      endpoint: /api/v1/incidents
      authentication:
        type: bearer
        secretRef:
          name: webhook-token
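To exercise the webhook before wiring up Alertmanager, you can POST a synthetic alert by hand. The sketch below builds a payload in Alertmanager's webhook format (version 4); the endpoint path and bearer token come from the YAML above, and the alert labels are illustrative.

```python
import json
import urllib.request

def alertmanager_payload(alertname, namespace, severity="critical"):
    """Build a minimal Alertmanager-style webhook payload (version 4 format)."""
    return {
        "version": "4",
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": alertname, "namespace": namespace,
                       "severity": severity},
            "annotations": {"summary": f"{alertname} firing in {namespace}"},
        }],
    }

def fire_test_alert(endpoint, token, alertname, namespace):
    """POST a synthetic alert to the agent's webhook endpoint."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(alertmanager_payload(alertname, namespace)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (requires the incident-responder webhook to be reachable):
# fire_test_alert("http://localhost:8080/api/v1/incidents", "dev-token",
#                 "PodCrashLooping", "production")
```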
Performance and Scaling
Agent Resource Optimization
# efficient-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: optimized-agent
  namespace: kagent
spec:
  systemPrompt: "You are a Kubernetes troubleshooting expert."
  modelConfig:
    name: anthropic-claude-haiku  # Faster, cheaper model
  # Limit iterations to control costs
  maxIterations: 8
  # Control token usage
  maxTokens: 2048
  # Cache system prompts (Anthropic feature)
  caching:
    enabled: true
    ttl: 3600
  # Retry configuration
  retry:
    maxAttempts: 3
    backoff: exponential
    initialDelay: 1s
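The `maxIterations` and `maxTokens` caps bound the worst-case spend per conversation: every iteration can emit at most one full completion. A back-of-the-envelope sketch (prices here are placeholders, not current rates; input tokens and cache discounts are ignored):

```python
def worst_case_output_tokens(max_iterations, max_tokens):
    """Upper bound on generated tokens for one conversation:
    every iteration produces a full max_tokens completion."""
    return max_iterations * max_tokens

def estimated_cost_usd(max_iterations, max_tokens, price_per_1k_output):
    """Rough per-conversation cost ceiling (output tokens only)."""
    return worst_case_output_tokens(max_iterations, max_tokens) / 1000 * price_per_1k_output

# For optimized-agent above: 8 iterations x 2048 tokens
print(worst_case_output_tokens(8, 2048))  # 16384
```

Tightening either knob scales the ceiling linearly, which is why cheaper, faster models plus low iteration caps suit high-volume agents.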
Batch Processing
# batch-agent-requests.py
import asyncio
import aiohttp

async def process_agent_request(session, conversation_id, message):
    async with session.post(
        f'http://kagent-api:8080/api/v1/conversations/{conversation_id}/messages',
        json={'message': message}
    ) as resp:
        return await resp.json()

async def batch_process(requests):
    async with aiohttp.ClientSession() as session:
        tasks = [
            process_agent_request(session, req['conv_id'], req['message'])
            for req in requests
        ]
        return await asyncio.gather(*tasks)

# Process 100 requests concurrently
requests = [
    {'conv_id': f'conv-{i}', 'message': f'Query {i}'}
    for i in range(100)
]
results = asyncio.run(batch_process(requests))
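Firing 100 requests at once can overwhelm the API or trip LLM provider rate limits. A semaphore caps in-flight requests while preserving result order; this is a generic sketch (the `fake_request` stand-in is hypothetical) that you could wrap around `process_agent_request` above.

```python
import asyncio

async def bounded_gather(coro_factories, limit=10):
    """Run zero-arg coroutine factories with at most `limit` in flight,
    preserving input order in the results."""
    sem = asyncio.Semaphore(limit)

    async def run(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(run(f) for f in coro_factories))

async def demo():
    # Stand-in for a real call like process_agent_request(session, conv_id, msg)
    async def fake_request(i):
        await asyncio.sleep(0)
        return i * 2

    # Factories defer scheduling until the semaphore admits each request
    return await bounded_gather([lambda i=i: fake_request(i) for i in range(5)],
                                limit=2)

print(asyncio.run(demo()))  # [0, 2, 4, 6, 8]
```

Passing factories rather than coroutine objects matters: a coroutine created eagerly would already be scheduled by `gather`, defeating the limit.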
Testing and Debugging
Unit Test Agents
# test_agent.py
import pytest
from kagent_sdk import Agent, ToolServer, ModelConfig

@pytest.fixture
def mock_kubernetes_mcp():
    return ToolServer(
        name="kubernetes-mcp",
        type="mcp",
        mock_responses={
            "get_pods": {"items": [{"name": "test-pod"}]},
            "get_logs": {"logs": "application started"}
        }
    )

@pytest.mark.asyncio
async def test_agent_troubleshooting(mock_kubernetes_mcp):
    agent = Agent(
        name="test-agent",
        system_prompt="You help troubleshoot pods",
        model_config=ModelConfig(provider="mock"),
        tools=[mock_kubernetes_mcp]
    )
    response = await agent.send_message(
        "Why is my test-pod crashing?"
    )
    assert "test-pod" in response.content
    assert mock_kubernetes_mcp.called_tools == ["get_pods", "get_logs"]
Debug Mode
# Enable debug logging
kubectl set env deployment/kagent-controller \
LOG_LEVEL=debug \
-n kagent
# View detailed logs
kubectl logs -f deployment/kagent-controller -n kagent
# Debug specific conversation
kagent debug conversation conv-123
# Output includes:
# - All LLM requests/responses
# - Tool execution details
# - Timing information
# - Token usage
Migration and Integration
Migrate from LangChain
# Before (LangChain)
from langchain.agents import AgentExecutor
from langchain.tools import Tool

def get_pods():
    # kubectl logic
    pass

tools = [Tool(name="get_pods", func=get_pods)]
agent = AgentExecutor.from_agent_and_tools(tools=tools)

# After (Kagent)
# kagent-migrated-agent.yaml
apiVersion: kagent.io/v1alpha1
kind: Agent
metadata:
  name: migrated-langchain-agent
spec:
  systemPrompt: "Your existing prompt"
  modelConfig:
    name: openai-gpt4
  tools:
    - toolServer:
        name: kubernetes-mcp  # Built-in equivalent
Integrate with CI/CD
# .github/workflows/deploy-agent.yaml
name: Deploy Kagent Agent
on:
  push:
    branches: [main]
    paths:
      - 'agents/**'
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      - name: Deploy agents
        run: |
          kubectl apply -f agents/ -n kagent
      - name: Verify deployment
        run: |
          kubectl wait --for=condition=Ready agent --all -n kagent --timeout=120s
      - name: Run agent tests
        run: |
          ./scripts/test-agents.sh
Community and Resources
Documentation Links
- Official Website: kagent.dev
- Documentation: kagent.dev/docs
- GitHub Repository: github.com/kagent-dev/kagent
- Discord Community: Join Discord
- Examples: github.com/kagent-dev/examples
Getting Help
# Get help via CLI
kagent help
# Check version
kagent version
# Validate agent configuration
kagent validate agent.yaml
# Generate agent template
kagent generate agent --name my-agent --tools kubernetes,helm
Conclusion
Kagent represents a paradigm shift in how we deploy AI agents for cloud-native operations. By leveraging Kubernetes-native patterns, MCP protocols, and production-grade observability, Kagent enables DevOps teams to:
- Automate complex troubleshooting workflows
- Scale AI agents with Kubernetes reliability
- Integrate with existing cloud-native tooling
- Maintain full visibility through OpenTelemetry
- Extend easily with custom MCP servers
Next Steps:
- Clone the Kagent repository
- Follow the Quick Start Guide
- Join the Discord community
- Attend the KubeCon Atlanta Community Party