Run APIs where the internet doesn’t exist: a complete air-gapped Kubernetes guide to API management

If you’ve ever tried deploying an API gateway in an air-gapped Kubernetes cluster, you know the pain: image pull failures, license validation timeouts, and SaaS dashboards that assume you’re always connected. Traditional API management platforms weren’t built for disconnected operations—they expect to phone home.

Here’s how to build a production-grade, zero-egress API management platform using GitOps principles, policy-as-code, and Kubernetes-native patterns that work when the internet doesn’t exist.

What Makes an Environment Truly Air-Gapped?

An air-gapped environment isn’t just “behind a strict firewall.” It’s infrastructure that operates with zero external network connectivity by design. Think defense networks, financial trading floors processing sensitive transactions, healthcare systems under HIPAA constraints, or sovereign clouds bound by data residency laws.

graph TB
    subgraph "Air-Gapped Environment"
        A[Internal Git Repository] -->|Pull Request| B[CI/CD Pipeline]
        B -->|Policy Validation| C[Internal Container Registry]
        C -->|Deploy| D[Traefik Hub Gateway]
        D -->|Route| E[Microservices]
        D -->|Route| F[AI Models]
        G[Operators] -->|No Internet| H[X]
    end
    
    style A fill:#2ecc71
    style D fill:#3498db
    style H fill:#e74c3c

The challenge: your API platform needs to configure itself, enforce policies, collect telemetry, and manage hundreds of services without ever reaching the public internet.

The Architecture: Full Stack Without External Dependencies

Here’s the complete air-gapped architecture showing every component and data flow:

graph TB
    subgraph "Secure Perimeter"
        subgraph "Development Zone"
            DEV[Developer Workstation]
            GIT[Internal GitLab/GitHub]
            CI[Jenkins/GitLab CI]
        end
        
        subgraph "Artifact Management"
            REG[Harbor Registry]
            SIGN[Cosign Signing Service]
            SCAN[Trivy Scanner]
        end
        
        subgraph "Production Kubernetes Cluster"
            subgraph "Control Plane"
                API[K8s API Server]
                ETCD[etcd]
            end
            
            subgraph "Traefik Hub Namespace"
                TH[Traefik Hub Controller]
                GW1[Gateway Instance 1]
                GW2[Gateway Instance 2]
                GW3[Gateway Instance 3]
            end
            
            subgraph "Application Namespaces"
                NS1[Finance APIs]
                NS2[Customer APIs]
                NS3[AI/ML Services]
            end
            
            subgraph "Observability Stack"
                PROM[Prometheus]
                JAEG[Jaeger]
                GRAF[Grafana]
            end
        end
    end
    
    INTERNET[Public Internet]
    
    DEV -->|git push| GIT
    GIT -->|webhook| CI
    CI -->|validate| SCAN
    CI -->|build| REG
    SIGN -->|sign artifacts| REG
    REG -->|pull images| TH
    TH -->|configure| GW1
    TH -->|configure| GW2
    TH -->|configure| GW3
    GW1 -->|route| NS1
    GW2 -->|route| NS2
    GW3 -->|route| NS3
    GW1 -.->|metrics| PROM
    GW2 -.->|traces| JAEG
    PROM -->|visualize| GRAF
    
    INTERNET -.->|X NO CONNECTION| TH
    
    style INTERNET fill:#e74c3c,stroke:#c0392b,color:#fff
    style TH fill:#3498db,stroke:#2980b9,color:#fff
    style GIT fill:#2ecc71,stroke:#27ae60,color:#fff
    style REG fill:#f39c12,stroke:#e67e22,color:#fff

Every component lives inside your perimeter. Let’s build it step by step.

Step 1: Bootstrap Your Internal Infrastructure

First, establish your internal control plane. You need three foundational services before deploying any API gateway:

Internal Container Registry – I recommend Harbor for its vulnerability scanning and signing integration:

# harbor-values.yaml
expose:
  type: clusterIP
  tls:
    enabled: true
    certSource: secret
    secret:
      secretName: harbor-tls

externalURL: https://registry.internal.company.local

persistence:
  enabled: true
  persistentVolumeClaim:
    registry:
      storageClass: "local-path"
      size: 500Gi

trivy:
  enabled: true
  gitHubToken: ""  # No external GitHub access

notary:
  enabled: true  # For image signing
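
Getting the artifacts across the gap is the step most guides skip. A common pattern (a sketch: `skopeo` on a connected staging host and the registry hostname are assumptions, not requirements) is to save each upstream image to an OCI archive, carry it across on approved media, and load it into Harbor. The transfer commands are echoed rather than executed so the plan can be reviewed first:

```shell
#!/bin/bash
# mirror-images.sh -- sketch of moving upstream images into the air gap.
set -euo pipefail

INTERNAL_REGISTRY="registry.internal.company.local"

# Rewrite an upstream reference to its internal equivalent by dropping
# the upstream registry host (docker.io, ghcr.io, ...).
rewrite_image() {
  local upstream="$1"
  echo "${INTERNAL_REGISTRY}/${upstream#*/}"
}

# Print the two-phase transfer plan for one image: archive it on the
# connected host, then load the archive into Harbor inside the gap.
mirror() {
  local upstream="$1" internal archive
  internal="$(rewrite_image "${upstream}")"
  archive="$(basename "${internal%%:*}").tar"
  echo "skopeo copy docker://${upstream} oci-archive:${archive}"
  echo "skopeo copy oci-archive:${archive} docker://${internal}"
}

mirror "docker.io/traefik/traefik-hub:v3.2.0"
```

The first command runs on the connected side; the archive then travels on removable media, and the second command runs inside the perimeter.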

Git Server – GitLab CE works well for air-gapped deployments:

# gitlab-values.yaml
global:
  edition: ce
  hosts:
    domain: internal.company.local
    gitlab:
      name: git.internal.company.local
  
  registry:
    enabled: false  # Use Harbor instead
  
  grafana:
    enabled: false  # Use your own observability stack

gitlab:
  webservice:
    minReplicas: 2
    maxReplicas: 4

postgresql:
  persistence:
    size: 100Gi

redis:
  master:
    persistence:
      size: 10Gi

Artifact Signing Pipeline – Use Cosign for provenance:

#!/bin/bash
# ci-sign-and-push.sh

set -euo pipefail

IMAGE_NAME="${1}"
IMAGE_TAG="${2}"
REGISTRY="registry.internal.company.local"

# Build the image
docker build -t ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} .

# Push to internal registry
docker push ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}

# Sign with Cosign
cosign sign --key cosign.key \
  --tlog-upload=false \
  ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}

# Verify signature immediately (skip the Rekor transparency-log
# lookup, which would require internet access)
cosign verify --key cosign.pub \
  --insecure-ignore-tlog=true \
  ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}

echo "✅ Image signed and verified: ${IMAGE_NAME}:${IMAGE_TAG}"

Step 2: Deploy Traefik Hub with GitOps

Now deploy Traefik Hub entirely from your internal resources:

# traefik-hub-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: traefik-system
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: traefik-hub
  namespace: traefik-system
spec:
  chart: traefik/traefik-hub
  repo: https://registry.internal.company.local/chartrepo/traefik
  targetNamespace: traefik-system
  
  valuesContent: |-
    image:
      registry: registry.internal.company.local
      repository: traefik/traefik-hub
      tag: v3.2.0
    
    hub:
      airgap:
        enabled: true  # Critical for air-gapped mode
        licensePath: /licenses/traefik-hub.lic
      
    deployment:
      replicas: 3
      
    service:
      type: LoadBalancer
      annotations:
        metallb.universe.tf/address-pool: production
    
    # Disable all external integrations
    pilot:
      enabled: false
    
    metrics:
      prometheus:
        enabled: true
        addEntryPointsLabels: true
        addRoutersLabels: true
        addServicesLabels: true
    
    tracing:
      otlp:
        enabled: true
        grpc:
          endpoint: jaeger-collector.observability:4317
          insecure: true
    
    logs:
      access:
        enabled: true
        format: json
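
The HelmChart above points at a chart repo inside your perimeter, which means the chart itself must be mirrored just like the images. A minimal sketch, assuming Helm 3.8+ (OCI registry support) and an illustrative Harbor OCI path; the commands are echoed as a dry run for review:

```shell
#!/bin/bash
# mirror-chart.sh -- sketch of moving the Helm chart across the gap.
# Chart version and internal OCI path are illustrative.
set -euo pipefail

CHART_VERSION="3.2.0"
INTERNAL_OCI="oci://registry.internal.company.local/charts"

# On a connected host: download the chart tarball for transfer.
connected_host_cmd() {
  echo "helm pull traefik/traefik-hub --version ${CHART_VERSION} -d ./transfer"
}

# Inside the air gap: push the tarball to Harbor's OCI endpoint.
airgap_host_cmd() {
  echo "helm push ./transfer/traefik-hub-${CHART_VERSION}.tgz ${INTERNAL_OCI}"
}

connected_host_cmd
airgap_host_cmd
```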

Step 3: Define APIs as Code

This is where GitOps shines. Every API is a declarative Kubernetes resource:

# apis/payment-gateway.yaml
apiVersion: hub.traefik.io/v1alpha1
kind: API
metadata:
  name: payment-gateway
  namespace: finance
  labels:
    team: fintech
    compliance: pci-dss
spec:
  openApiSpec:
    path: /specs/payment-v3.yaml
    url: http://registry.internal.company.local/specs/payment-v3.yaml
  
  service:
    name: payment-backend
    port:
      number: 8080
    
  cors:
    allowOrigins:
      - "https://app.internal.company.local"
    allowMethods:
      - GET
      - POST
    allowHeaders:
      - "Authorization"
      - "Content-Type"
  
  rateLimit:
    limit: 5000
    period: 1m
    strategy: ip
    
  authentication:
    jwt:
      secretName: payment-jwt-secret
      issuer: https://auth.internal.company.local
      audience: payment-api
      
  accessControl:
    policies:
      - name: require-mfa
        rule: "Header(`X-MFA-Verified`, `true`)"
      - name: business-hours-only
        rule: "!HeaderRegexp(`X-Request-Time`, `^(0[0-8]|1[8-9]|2[0-3]):.*`)"
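
That business-hours rule is easy to get backwards, so it is worth checking the regex in isolation: the pattern matches hours 00-08 and 18-23, and the leading `!` negates the match, so only 09:00-17:59 requests pass. A quick `grep -E` sanity check using the same pattern as the manifest above:

```shell
#!/bin/bash
# Check the business-hours pattern from the manifest in isolation.
set -euo pipefail

PATTERN='^(0[0-8]|1[8-9]|2[0-3]):.*'

# True iff the timestamp falls outside business hours (and would be
# blocked, because the rule negates the match).
outside_business_hours() {
  echo "$1" | grep -Eq "${PATTERN}"
}

outside_business_hours "07:30" && echo "07:30 blocked"
outside_business_hours "12:00" || echo "12:00 allowed"
outside_business_hours "18:01" && echo "18:01 blocked"
```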

Step 4: Automate Policy Enforcement

Build a CI pipeline that validates every change before it reaches production:

flowchart TD
    START([Developer Creates PR]) --> LINT{Schema<br/>Validation}
    LINT -->|Pass| SEC{Security<br/>Policy Scan}
    LINT -->|Fail| REJECT1[Reject with<br/>Validation Errors]
    
    SEC -->|Pass| BREAK{Breaking<br/>Change Check}
    SEC -->|Fail| REJECT2[Block: Security<br/>Violation Detected]
    
    BREAK -->|Pass| COMP{Compliance<br/>Rules Check}
    BREAK -->|Warn| WARN1[Warning: API<br/>Version Required]
    
    COMP -->|Pass| REVIEW{Human<br/>Review}
    COMP -->|Fail| REJECT3[Block: Compliance<br/>Violation]
    
    REVIEW -->|Approved| SIGN[Sign Artifact<br/>with Private Key]
    REVIEW -->|Rejected| REJECT4[PR Rejected]
    
    SIGN --> BUILD[Build Container<br/>Bundle]
    BUILD --> PUSH[Push to Internal<br/>Registry]
    PUSH --> TAG[Create Git Tag<br/>v1.x.x]
    TAG --> DEPLOY[Deploy to<br/>Staging]
    DEPLOY --> SMOKE{Smoke<br/>Tests Pass}
    SMOKE -->|Pass| PROD[Promote to<br/>Production]
    SMOKE -->|Fail| ROLLBACK[Automatic<br/>Rollback]
    PROD --> AUDIT[Audit Log<br/>Entry Created]
    AUDIT --> END([Deployment Complete])
    
    style START fill:#2ecc71
    style END fill:#2ecc71
    style REJECT1 fill:#e74c3c,color:#fff
    style REJECT2 fill:#e74c3c,color:#fff
    style REJECT3 fill:#e74c3c,color:#fff
    style REJECT4 fill:#e74c3c,color:#fff
    style ROLLBACK fill:#e67e22,color:#fff
    style SIGN fill:#3498db,color:#fff
    style PROD fill:#27ae60,color:#fff

Here’s the GitLab CI pipeline that enforces these gates:

# .gitlab-ci.yml
stages:
  - validate
  - security
  - build
  - sign
  - deploy

variables:
  REGISTRY: registry.internal.company.local
  KUBECONFIG: /etc/kubernetes/admin.conf

validate-schema:
  stage: validate
  image: ${REGISTRY}/tools/kubectl:1.29
  script:
    - kubectl apply --dry-run=client -f apis/
    - kubectl apply --dry-run=server -f apis/
  only:
    changes:
      - apis/**/*.yaml

security-scan:
  stage: security
  image: ${REGISTRY}/tools/kubesec:latest
  script:
    - kubesec scan apis/*.yaml
    - |
      if grep -r "privileged: true" apis/; then
        echo "❌ Privileged containers not allowed"
        exit 1
      fi
  only:
    changes:
      - apis/**/*.yaml

check-breaking-changes:
  stage: validate
  image: ${REGISTRY}/tools/oasdiff:latest
  script:
    - |
      # Diff each spec against the target branch, not HEAD (which
      # already contains the change); skip files new on this branch.
      for file in apis/*.yaml; do
        git show origin/main:${file} > old.yaml 2>/dev/null || continue
        oasdiff breaking old.yaml ${file}
      done
  allow_failure: true

build-bundle:
  stage: build
  image: ${REGISTRY}/tools/kustomize:latest
  script:
    - kustomize build apis/ > bundle.yaml
  artifacts:
    paths:
      - bundle.yaml
    expire_in: 1 week

sign-artifacts:
  stage: sign
  image: ${REGISTRY}/tools/cosign:latest
  script:
    - cosign sign-blob --key ${COSIGN_PRIVATE_KEY} --tlog-upload=false bundle.yaml > bundle.sig
  artifacts:
    paths:
      - bundle.sig
  only:
    - main
    - /^release-.*$/

deploy-staging:
  stage: deploy
  image: ${REGISTRY}/tools/kubectl:1.29
  script:
    - kubectl config use-context staging
    - kubectl apply -f bundle.yaml
    - ./scripts/smoke-test.sh
  environment:
    name: staging
  only:
    - main

deploy-production:
  stage: deploy
  image: ${REGISTRY}/tools/kubectl:1.29
  script:
    - cosign verify-blob --key ${COSIGN_PUBLIC_KEY} --insecure-ignore-tlog=true --signature bundle.sig bundle.yaml
    - kubectl config use-context production
    - kubectl apply -f bundle.yaml
  environment:
    name: production
  when: manual
  only:
    - main
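
The pipeline calls `./scripts/smoke-test.sh` without showing it. Here is one possible shape (a sketch; the paths and expected status codes are illustrative): probe a few routes and count mismatches, so the exit code doubles as the number of failing checks. The probe loop only runs when `GATEWAY_URL` is set, so the helpers can be exercised without network access:

```shell
#!/bin/bash
# scripts/smoke-test.sh -- sketch of the smoke test the pipeline invokes.
set -euo pipefail

# "path expected-status" pairs to probe after a deploy
CHECKS="
/ping 200
/api/payments/health 200
/api/does-not-exist 404
"

# probe <base-url> <path> <expected>: true iff the HTTP status matches
probe() {
  local got
  got="$(curl -sk -o /dev/null -w '%{http_code}' "$1$2" 2>/dev/null)" || got="000"
  [ "${got}" = "$3" ]
}

run_checks() {
  local base="$1" path expected failures=0
  while read -r path expected; do
    [ -z "${path}" ] && continue
    if probe "${base}" "${path}" "${expected}"; then
      echo "✅ ${path} -> ${expected}"
    else
      echo "❌ ${path} expected ${expected}"
      failures=$((failures + 1))
    fi
  done <<< "${CHECKS}"
  return "${failures}"
}

# Only probe when a gateway URL is configured, so sourcing this file
# never makes network calls on its own.
if [ -n "${GATEWAY_URL:-}" ]; then
  run_checks "${GATEWAY_URL}"
fi
```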

Step 5: Multi-Tenant Isolation

Platform teams serving multiple business units need strict isolation. Here’s how to implement namespace-based multi-tenancy:

graph TB
    subgraph "Kubernetes Cluster - Shared Infrastructure"
        subgraph "Namespace: platform-team"
            PT_TH[Traefik Hub Controller<br/>ClusterRole: admin]
            PT_CRD[API CRDs<br/>Cluster-wide definitions]
        end
        
        subgraph "Namespace: finance-team"
            FIN_GW[Gateway Instance<br/>ServiceAccount: finance-sa]
            FIN_API1[Payment API<br/>Quota: 10k req/min]
            FIN_API2[Trading API<br/>Quota: 50k req/min]
            FIN_SEC[NetworkPolicy<br/>Deny all except approved]
            FIN_QUOTA[ResourceQuota<br/>8 CPU, 16GB RAM]
        end
        
        subgraph "Namespace: healthcare-team"
            HC_GW[Gateway Instance<br/>ServiceAccount: health-sa]
            HC_API1[Patient API<br/>Quota: 5k req/min]
            HC_API2[Records API<br/>Quota: 2k req/min]
            HC_SEC[NetworkPolicy<br/>HIPAA compliant routes]
            HC_QUOTA[ResourceQuota<br/>4 CPU, 8GB RAM]
        end
        
        subgraph "Namespace: ml-team"
            ML_GW[AI Gateway Instance<br/>ServiceAccount: ml-sa]
            ML_LLM1[Local Llama Model<br/>Rate: 100 req/min]
            ML_LLM2[Fine-tuned GPT<br/>Rate: 50 req/min]
            ML_SEC[NetworkPolicy<br/>GPU node affinity]
            ML_QUOTA[ResourceQuota<br/>16 CPU, 64GB RAM, 2 GPU]
        end
    end
    
    PT_TH -.->|Manages| FIN_GW
    PT_TH -.->|Manages| HC_GW
    PT_TH -.->|Manages| ML_GW
    
    FIN_GW -->|Routes| FIN_API1
    FIN_GW -->|Routes| FIN_API2
    HC_GW -->|Routes| HC_API1
    HC_GW -->|Routes| HC_API2
    ML_GW -->|Routes| ML_LLM1
    ML_GW -->|Routes| ML_LLM2
    
    style PT_TH fill:#9b59b6,color:#fff
    style FIN_GW fill:#3498db,color:#fff
    style HC_GW fill:#2ecc71,color:#fff
    style ML_GW fill:#e67e22,color:#fff
    style FIN_SEC fill:#e74c3c,color:#fff
    style HC_SEC fill:#e74c3c,color:#fff
    style ML_SEC fill:#e74c3c,color:#fff

Implement with ResourceQuotas and NetworkPolicies:

# finance-namespace-isolation.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: finance-team
  labels:
    tenant: finance
    compliance: pci-dss
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: finance-compute-quota
  namespace: finance-team
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: finance-api-quota
  namespace: finance-team
spec:
  hard:
    count/apis.hub.traefik.io: "20"
    count/middlewares.hub.traefik.io: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: finance-default-deny
  namespace: finance-team
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
      - namespaceSelector:
          matchLabels:
            name: finance-team
    - to:
      - namespaceSelector:
          matchLabels:
            name: traefik-system
      ports:
        - protocol: TCP
          port: 443
    # Deny all other egress including internet
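
Trust but verify: after applying the policy, confirm from inside the namespace that egress really is blocked. A sketch along these lines (the busybox probe pod is illustrative, and the actual probe is gated behind `RUN_PROBE=yes` so the helper can be tested without a cluster) inverts the exit code, making "connection failed" the passing result:

```shell
#!/bin/bash
# verify-egress-deny.sh -- sketch: prove the default-deny policy holds.
set -euo pipefail

# Succeed exactly when the wrapped command fails -- for egress tests,
# "connection refused / timed out" is the passing result.
expect_fail() {
  if "$@"; then
    echo "❌ unexpected success: $*"
    return 1
  else
    echo "✅ blocked as expected: $*"
    return 0
  fi
}

if [ "${RUN_PROBE:-no}" = "yes" ]; then
  # A pod in finance-team should not reach anything outside the perimeter.
  expect_fail kubectl run egress-probe --rm -i --restart=Never \
    -n finance-team --image=busybox -- \
    wget -q -T 3 -O /dev/null http://example.com
fi
```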

Step 6: AI Gateway for Local LLMs

The AI Gateway feature lets you standardize access to self-hosted LLMs while maintaining air-gap compliance:

sequenceDiagram
    autonumber
    participant Client as Application Client
    participant TH as Traefik Hub<br/>AI Gateway
    participant CG as Content Guard<br/>Filter
    participant Cache as Semantic Cache
    participant Local as Local vLLM<br/>(Priority 1)
    participant Hybrid as Hybrid Cloud Model<br/>(Priority 2)
    participant Audit as Audit Logger
    
    Client->>TH: POST /v1/chat/completions<br/>{model: "auto", prompt: "..."}
    TH->>CG: Scan incoming prompt
    
    alt Blocked Pattern Detected
        CG->>Audit: Log violation attempt
        CG->>Client: 403 Forbidden<br/>"Content policy violation"
    else Pattern Allowed
        CG->>Cache: Check semantic similarity
        
        alt Cache Hit
            Cache->>TH: Return cached response
            TH->>Client: 200 OK (cached)
            TH->>Audit: Log cache hit
        else Cache Miss
            Cache->>TH: No match found
            TH->>Local: Forward to local model
            
            alt Local Model Available
                Local->>TH: Generate response
                TH->>CG: Scan outgoing response
                CG->>Cache: Store response (TTL: 1h)
                CG->>TH: Response approved
                TH->>Client: 200 OK
                TH->>Audit: Log successful request
            else Local Model Unavailable
                Local--xTH: Connection timeout
                TH->>Hybrid: Failover to approved cloud
                Hybrid->>TH: Generate response
                TH->>CG: Scan outgoing response
                CG->>TH: Response approved
                TH->>Client: 200 OK (from hybrid)
                TH->>Audit: Log failover event
            end
        end
    end

Deploy it with this configuration:

# ai-gateway.yaml
apiVersion: hub.traefik.io/v1alpha1
kind: AIGateway
metadata:
  name: sovereign-ai-gateway
  namespace: ml-team
spec:
  # Multiple backend support with priorities
  backends:
    - name: local-llama-3
      url: http://vllm-service.ml-team:8000
      models:
        - llama-3-70b-instruct
        - llama-3-8b-instruct
      priority: 1  # Try local first
      timeout: 30s
      
    - name: local-mistral
      url: http://mistral-service.ml-team:8001
      models:
        - mistral-7b-instruct-v0.2
      priority: 1
      timeout: 30s
      
    - name: approved-cloud-fallback
      url: https://api.approved-cloud.internal
      models:
        - claude-sonnet-4
      priority: 2  # Fallback only
      requiresApproval: true
      headers:
        X-Internal-Routing: "approved-gateway-only"
  
  # Content filtering and guardrails
  contentGuard:
    enabled: true
    scanPrompts: true
    scanResponses: true
    blockPatterns:
      - regex: "(?i)(confidential|internal-only|secret)"
        action: block
        auditLevel: high
      - regex: "(?i)(ssn|credit[\\s-]?card|password)"
        action: block
        auditLevel: critical
    allowPatterns:
      - regex: "(?i)(public|general|approved)"
        action: allow
  
  # Semantic caching for efficiency
  semanticCache:
    enabled: true
    ttl: 3600  # 1 hour
    similarityThreshold: 0.95
    maxCacheSize: 10GB
    evictionPolicy: lru
  
  # Rate limiting per tenant
  rateLimit:
    global:
      limit: 1000
      period: 1h
    perUser:
      limit: 100
      period: 1h
  
  # Observability
  metrics:
    enabled: true
    includeModelName: true
    includeTokenCount: true
    includeLatency: true
    
  tracing:
    enabled: true
    sampleRate: 0.1  # 10% sampling

Access it with the standard OpenAI SDK:

# client-example.py
from openai import OpenAI

# Point to your internal AI Gateway
client = OpenAI(
    base_url="https://ai-gateway.internal.company.local/v1",
    api_key="internal-jwt-token-here"
)

response = client.chat.completions.create(
    model="auto",  # Gateway selects best available
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Analyze this customer feedback."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
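
If you'd rather script against the gateway without Python, the equivalent raw HTTP request looks like this (a sketch; the endpoint and bearer token are the same illustrative values as above, and the `curl` call is left commented so the payload can be inspected before anything is sent):

```shell
#!/bin/bash
# Same request as the Python client, as a raw HTTP call.
set -euo pipefail

build_request() {
  cat <<'JSON'
{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Analyze this customer feedback."}
  ],
  "temperature": 0.7,
  "max_tokens": 500
}
JSON
}

# curl -sk https://ai-gateway.internal.company.local/v1/chat/completions \
#   -H "Authorization: Bearer internal-jwt-token-here" \
#   -H "Content-Type: application/json" \
#   -d "$(build_request)"
build_request
```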

Step 7: Built-in Resilience and Rollback

Production systems fail. Here’s your automated rollback strategy:

stateDiagram-v2
    [*] --> Healthy: Normal Operations
    
    Healthy --> Detecting: Anomaly Detected<br/>(Error rate spike)
    
    Detecting --> Investigating: Automated Health Check<br/>Running
    
    Investigating --> Healthy: False Alarm<br/>(Metrics normalized)
    Investigating --> Degraded: Confirmed Issue<br/>(3 consecutive failures)
    
    Degraded --> RollbackInitiated: Automatic Trigger<br/>(Error rate > 5%)
    Degraded --> ManualIntervention: Manual Override<br/>(Platform team decision)
    
    RollbackInitiated --> FetchingPrevious: Pull previous version<br/>from Git tag
    FetchingPrevious --> ApplyingPrevious: kubectl apply -f<br/>previous-config.yaml
    ApplyingPrevious --> Validating: Run validation suite
    
    Validating --> Healthy: Tests Pass<br/>(Rollback successful)
    Validating --> Failed: Tests Fail<br/>(Rollback failed)
    
    Failed --> ManualIntervention: Escalate to<br/>on-call engineer
    
    ManualIntervention --> Emergency: Apply emergency<br/>maintenance mode
    Emergency --> Healthy: Issue Resolved
    
    note right of Healthy
        - All APIs responding
        - Latency < 200ms
        - Error rate < 0.1%
    end note
    
    note right of Degraded
        - Some APIs slow
        - Latency > 500ms
        - Error rate 1-5%
    end note
    
    note right of Failed
        - Critical failure
        - Manual recovery needed
        - Incident created
    end note

Automate with Prometheus alerts and a rollback script:

# prometheus-rollback-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-gateway-alerts
  namespace: traefik-system
spec:
  groups:
    - name: gateway-health
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
              /
              sum(rate(traefik_service_requests_total[5m]))
            ) > 0.05
          for: 2m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
            runbook: "https://runbooks.internal/gateway-rollback"
        
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (le)
            ) > 0.5
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "High latency detected"
            description: "P95 latency is {{ $value }}s"
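
Before wiring the alert to an automated rollback, it helps to sanity-check the threshold arithmetic offline. This replays the `HighErrorRate` ratio in `awk` with the same 5% cutoff as the PromQL above (the request rates are illustrative numbers; remember the `for: 2m` clause additionally requires the ratio to hold for two minutes before firing):

```shell
#!/bin/bash
# Replay the HighErrorRate ratio: the alert fires when
# rate(5xx) / rate(total) > 0.05, mirroring the PromQL expression.
set -euo pipefail

# error_rate_fires <5xx-per-sec> <total-per-sec>: true iff the alert fires
error_rate_fires() {
  awk -v e="$1" -v t="$2" 'BEGIN { exit !(t > 0 && e / t > 0.05) }'
}

error_rate_fires 6 100 && echo "6 of 100 req/s failing -> alert fires"
error_rate_fires 4 100 || echo "4 of 100 req/s failing -> healthy"
```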

Rollback automation script:

#!/bin/bash
# automated-rollback.sh

set -euo pipefail

NAMESPACE="${1:-traefik-system}"
CURRENT_VERSION=$(kubectl get deployment traefik-hub -n ${NAMESPACE} -o jsonpath='{.spec.template.spec.containers[0].image}' | cut -d: -f2)
PREVIOUS_VERSION=$(git describe --tags --abbrev=0 HEAD~1)

echo "🔍 Current version: ${CURRENT_VERSION}"
echo "⏪ Rolling back to: ${PREVIOUS_VERSION}"

# Fetch previous configuration from Git
git checkout tags/${PREVIOUS_VERSION} -- config/

# Verify signature before applying
cosign verify-blob \
  --key cosign.pub \
  --signature config/bundle.sig \
  config/bundle.yaml

# Apply rollback
kubectl apply -f config/bundle.yaml

# Wait for rollout
kubectl rollout status deployment/traefik-hub -n ${NAMESPACE} --timeout=5m

# Run smoke tests (checked inline: under `set -e`, a bare invocation
# would abort the script before the failure branch could ever run)
if ./scripts/smoke-test.sh; then
  echo "✅ Rollback successful to ${PREVIOUS_VERSION}"
  # Create incident post-mortem
  ./scripts/create-incident.sh "Automated rollback from ${CURRENT_VERSION}"
else
  echo "❌ Rollback failed - manual intervention required"
  kubectl set image deployment/traefik-hub \
    traefik-hub=registry.internal/traefik/traefik-hub:emergency-stable \
    -n ${NAMESPACE}
  exit 1
fi

Observability Without SaaS Dependencies

Export all telemetry to your internal stack:

```yaml
# observability-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-observability
  namespace: traefik-system
data:
  traefik.yaml: |
    metrics:
      prometheus:
        addEntryPointsLabels: true
        addRoutersLabels: true
        addServicesLabels: true
        buckets:
          - 0.1
          - 0.3
          - 1.0
          - 2.5
          - 5.0
          - 10.0
        manualRouting: true
    
    tracing:
      otlp:
        grpc:
          endpoint: jaeger-collector.observability:4317
          insecure: true
          headers:
            X-Internal-Cluster: "production"
        
    accessLog:
      filePath: /var/log/traefik/access.log
      format: json
      fields:
        defaultMode: keep
        names:
          ClientUsername: drop
        headers:
          defaultMode: keep
          names:
            Authorization: redact
            Cookie: redact
    
    log:
      level: INFO
      format: json
      filePath: /var/log/traefik/traefik.log
```

## Production Checklist

Before going live in your air-gapped environment:

**Infrastructure**
- [ ] Internal Harbor registry deployed and accessible
- [ ] GitLab/GitHub internal instance configured
- [ ] Cosign key pairs generated and secured
- [ ] Network policies deny all egress by default
- [ ] Load balancer (MetalLB/Cilium) configured

**Security**
- [ ] All images signed and verified
- [ ] RBAC policies limit namespace access
- [ ] Secret management (Vault/SealedSecrets) deployed
- [ ] Audit logging enabled and shipped to SIEM
- [ ] Vulnerability scanning in CI pipeline

**Observability**
- [ ] Prometheus scraping all gateway metrics
- [ ] Jaeger collecting distributed traces
- [ ] Grafana dashboards imported
- [ ] AlertManager rules configured
- [ ] Log aggregation (Loki/ELK) operational

**CI/CD**
- [ ] GitLab Runners registered and tested
- [ ] Pipeline validates all API definitions
- [ ] Automated smoke tests passing
- [ ] Rollback procedures documented and tested
- [ ] Deployment requires signed artifacts

**Documentation**
- [ ] Runbooks created for common incidents
- [ ] API onboarding guide published
- [ ] Emergency contact list updated
- [ ] Disaster recovery plan tested
- [ ] Change management process defined

## What You've Built

You now have a production-grade, air-gapped API management platform that:

- Operates entirely within your secure perimeter without internet access
- Manages APIs as declarative code with full Git history
- Enforces policies automatically through CI/CD pipelines
- Provides multi-tenant isolation with namespace-scoped quotas
- Routes AI traffic through content-filtered gateways
- Exports comprehensive telemetry to your observability stack
- Rolls back automatically when anomalies are detected
- Maintains a cryptographically-verified chain of custody

Most importantly, you control every component. No SaaS vendor can revoke your license, change pricing, or access your data. Your platform scales horizontally, upgrades predictably, and operates reliably—even when the internet doesn't exist.

---

## 📦 Bonus: Quick Start Repository

A reference repository structure:
```
air-gapped-api-platform/
├── infrastructure/
│   ├── harbor/
│   ├── gitlab/
│   └── cosign/
├── traefik-hub/
│   ├── base/
│   ├── overlays/
│   │   ├── staging/
│   │   └── production/
├── apis/
│   ├── finance/
│   ├── healthcare/
│   └── ml/
├── policies/
│   ├── opa/
│   └── kyverno/
├── observability/
│   ├── prometheus/
│   ├── jaeger/
│   └── grafana/
├── ci/
│   ├── .gitlab-ci.yml
│   └── scripts/
└── docs/
    ├── runbooks/
    └── architecture/
```

Next Steps:

  1. Fork this architecture to your internal Git
  2. Deploy Harbor and establish your registry
  3. Start with one API in staging
  4. Build confidence through testing
  5. Graduate to production with full automation
