Karpenter: Just-in-Time Node Provisioning for Kubernetes Clusters

Managing node infrastructure in Kubernetes can often feel like a delicate balancing act. Traditional cluster autoscalers, while effective, often react to pending pods by provisioning new nodes based on predefined instance types. This reactive approach can lead to inefficiencies, such as over-provisioning or delays in scaling up, especially when workloads have diverse and specific resource requirements. Enter Karpenter, an open-source, high-performance Kubernetes cluster autoscaler built by Amazon Web Services. Karpenter takes a fundamentally different approach: it observes pending pods and directly launches the right compute resources to meet their needs, often in seconds.

Karpenter revolutionizes how you think about Kubernetes node provisioning. Instead of managing node groups or instance types manually, you define your requirements for compute and let Karpenter handle the rest. This “just-in-time” provisioning significantly improves cluster efficiency, reduces operational overhead, and can lead to substantial cost savings by ensuring your cluster always has just enough, but not too much, capacity. Whether you’re running bursty workloads, diverse machine learning jobs, or simply want to optimize your cloud spend, Karpenter offers a powerful and flexible solution.

This guide will walk you through deploying Karpenter on an EKS cluster, configuring its core components, and demonstrating its ability to provision nodes dynamically. By the end, you’ll have a clear understanding of how Karpenter can transform your Kubernetes infrastructure management.

TL;DR: Karpenter Just-in-Time Node Provisioning

Karpenter is an intelligent Kubernetes autoscaler that provisions the right node at the right time for your pending pods, optimizing cost and performance. Here’s a quick summary:

  • Install Karpenter: Use Helm to deploy Karpenter into your EKS cluster.
  • Configure IAM: Create necessary IAM roles and policies for Karpenter to manage EC2 instances.
  • Create a NodePool: Define NodePool and EC2NodeClass resources to specify node requirements (e.g., instance types, zones, capacity types).
  • Deploy a Workload: Deploy a sample application with specific resource requests that trigger Karpenter to provision a new node.
  • Observe Scaling: Watch Karpenter rapidly provision nodes and terminate them when no longer needed.

# Install Karpenter (after configuring IAM and environment variables)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} --namespace karpenter --create-namespace \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IRSA_ARN} \
    --set settings.aws.clusterName=${CLUSTER_NAME} \
    --set settings.aws.defaultInstanceProfile=${KARPENTER_INSTANCE_PROFILE} \
    --set settings.aws.interruptionQueue=${CLUSTER_NAME}

# Example NodePool (abbreviated; see Step 6 for the full manifest)
kubectl apply -f - <<EOF
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        name: default
EOF

Prerequisites

Before you embark on your Karpenter journey, ensure you have the following:

  • Kubernetes Cluster: An Amazon EKS cluster (version 1.21 or higher) is required. Karpenter is tightly integrated with AWS services for node provisioning.
  • AWS CLI: Configured with appropriate permissions to manage EKS, EC2, IAM, and other related services. You can download it from the official AWS CLI documentation.
  • kubectl: The Kubernetes command-line tool, configured to connect to your EKS cluster. Refer to the Kubernetes documentation for installation.
  • helm: The Kubernetes package manager, used for deploying Karpenter. Install it from the Helm website.
  • jq: A lightweight and flexible command-line JSON processor, useful for parsing AWS CLI output.
  • Basic AWS Knowledge: Familiarity with EC2, IAM, VPC, Subnets, and Security Groups.
  • Administrator Permissions: Your AWS user or role needs permissions to create IAM roles, policies, EC2 instances, and manage EKS resources.

Step-by-Step Guide: Deploying and Configuring Karpenter

Step 1: Set Up Environment Variables

First, let's define some environment variables that will be used throughout the deployment process. This makes commands more readable and reusable. Adjust CLUSTER_NAME and AWS_REGION below to match your actual cluster name and AWS region.


export CLUSTER_NAME="karpenter-demo-cluster"
export AWS_REGION="us-east-1"
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export KARPENTER_VERSION="v0.32.0" # Use the latest stable version
export K8S_VERSION="1.28" # Match your EKS cluster version

echo "AWS Account ID: ${ACCOUNT_ID}"
echo "EKS Cluster Name: ${CLUSTER_NAME}"
echo "AWS Region: ${AWS_REGION}"
echo "Karpenter Version: ${KARPENTER_VERSION}"
echo "Kubernetes Version: ${K8S_VERSION}"

Verify: Ensure the environment variables are set correctly by running the echo commands. The output should reflect your cluster details.


AWS Account ID: 123456789012
EKS Cluster Name: karpenter-demo-cluster
AWS Region: us-east-1
Karpenter Version: v0.32.0
Kubernetes Version: 1.28

Step 2: Create IAM Role for Karpenter Controller (IRSA)

Karpenter needs permission to interact with AWS APIs to provision and deprovision nodes. We'll use IAM Roles for Service Accounts (IRSA) to grant these permissions securely. This involves creating an IAM policy, an IAM role, and then associating it with a Kubernetes service account.

First, create an IAM policy for Karpenter. This policy grants permissions to manage EC2 instances, launch templates, security groups, and other resources necessary for node operations. For more on securing your supply chain, consider our article on Securing Container Supply Chains with Sigstore and Kyverno.


# Create Karpenter Controller IAM Policy
# Note: this is a simplified policy for the walkthrough; consult the Karpenter
# documentation for the full recommended controller policy for your version.
aws iam create-policy \
    --policy-name KarpenterControllerPolicy-${CLUSTER_NAME} \
    --policy-document '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["ec2:CreateLaunchTemplate","ec2:CreateFleet","ec2:RunInstances","ec2:CreateTags","ec2:TerminateInstances","ec2:DescribeLaunchTemplates","ec2:DescribeInstances","ec2:DescribeImages","ec2:DescribeSubnets","ec2:DescribeSecurityGroups","ec2:DescribeInstanceTypes","ec2:DescribeInstanceTypeOfferings","ec2:DescribeAvailabilityZones","ec2:DeleteLaunchTemplate","ec2:DeleteTags","ec2:GetLaunchTemplateData","ec2:ModifyLaunchTemplate","ec2:RequestSpotInstances","ec2:CancelSpotInstanceRequests","ec2:DescribeSpotInstanceRequests","ec2:DescribeSpotPriceHistory"],"Resource": "*"},{"Effect": "Allow","Action": "iam:PassRole","Resource": "arn:aws:iam::*:role/KarpenterNodeRole-*"},{"Effect": "Allow","Action": "ssm:GetParameter","Resource": "arn:aws:ssm:*:*:parameter/aws/service/*"}]}'

# Create an IAM role for Karpenter and attach the policy
eksctl create iamserviceaccount \
    --cluster ${CLUSTER_NAME} \
    --namespace karpenter \
    --name karpenter \
    --role-name KarpenterControllerRole-${CLUSTER_NAME} \
    --attach-policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME} \
    --approve \
    --override-existing-serviceaccounts

export KARPENTER_IRSA_ARN="arn:aws:iam::${ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}"
echo "Karpenter IRSA ARN: ${KARPENTER_IRSA_ARN}"

Verify: Check if the IAM role and service account are created successfully.


aws iam get-role --role-name KarpenterControllerRole-${CLUSTER_NAME}
kubectl get sa karpenter -n karpenter -o yaml

The aws iam get-role command should return details about the IAM role, and the kubectl get sa command should show an annotation like eks.amazonaws.com/role-arn: arn:aws:iam::....
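As a more targeted check, the role ARN annotation can be read directly with jsonpath — the backslashes escape the literal dots inside the annotation key:

```shell
# Print just the IRSA annotation on the Karpenter service account
kubectl get sa karpenter -n karpenter \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```

This should print the KarpenterControllerRole ARN and nothing else, which makes it convenient to use in scripts.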

Step 3: Create IAM Role for Karpenter Nodes

Karpenter-provisioned nodes also need an IAM role to join the EKS cluster and interact with AWS services. This role is similar to the one used by managed node groups. It permits nodes to register with the cluster, pull container images from ECR, and send logs to CloudWatch.


# Create an IAM instance profile for Karpenter nodes
aws iam create-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME}

# Create a new IAM role for Karpenter nodes
aws iam create-role --role-name KarpenterNodeRole-${CLUSTER_NAME} \
    --assume-role-policy-document '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "ec2.amazonaws.com"},"Action": "sts:AssumeRole"}]}'

# Attach the AmazonEKSWorkerNodePolicy
aws iam attach-role-policy \
    --role-name KarpenterNodeRole-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

# Attach the AmazonEKS_CNI_Policy
aws iam attach-role-policy \
    --role-name KarpenterNodeRole-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

# Attach the AmazonEC2ContainerRegistryReadOnly policy
aws iam attach-role-policy \
    --role-name KarpenterNodeRole-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

# Add the KarpenterNodeRole to the instance profile
aws iam add-role-to-instance-profile \
    --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
    --role-name KarpenterNodeRole-${CLUSTER_NAME}

export KARPENTER_NODE_ROLE_NAME="KarpenterNodeRole-${CLUSTER_NAME}"
export KARPENTER_INSTANCE_PROFILE="KarpenterNodeInstanceProfile-${CLUSTER_NAME}"
echo "Karpenter Node Role Name: ${KARPENTER_NODE_ROLE_NAME}"
echo "Karpenter Instance Profile: ${KARPENTER_INSTANCE_PROFILE}"

Verify: Confirm the instance profile and role exist.


aws iam get-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME}
aws iam get-role --role-name KarpenterNodeRole-${CLUSTER_NAME}
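To also confirm that all three managed policies actually landed on the node role, list its attachments:

```shell
# Should list AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy,
# and AmazonEC2ContainerRegistryReadOnly
aws iam list-attached-role-policies \
  --role-name KarpenterNodeRole-${CLUSTER_NAME} \
  --query "AttachedPolicies[*].PolicyName" --output text
```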

Step 4: Discover Subnets and Security Groups

Karpenter needs to know which subnets and security groups it can use to launch instances. We'll tag these resources so Karpenter can discover them automatically. This is a crucial step for Karpenter's operational success. For advanced networking configurations, you might explore topics like Cilium WireGuard Encryption.


# Discover VPC ID
export VPC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.resourcesVpcConfig.vpcId" --output text)
echo "VPC ID: ${VPC_ID}"

# Tag Subnets (the eks:cluster-name filter below, also used for security groups,
# assumes your VPC resources carry that tag; adjust it to match how your
# subnets and security groups are actually tagged)
for SUBNET_ID in $(aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" --query "Subnets[*].SubnetId" --output text); do
    aws ec2 create-tags --resources ${SUBNET_ID} --tags Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}
    echo "Tagged subnet: ${SUBNET_ID}"
done

# Tag Security Groups
for SG_ID in $(aws ec2 describe-security-groups --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:eks:cluster-name,Values=${CLUSTER_NAME}" --query "SecurityGroups[*].GroupId" --output text); do
    aws ec2 create-tags --resources ${SG_ID} --tags Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}
    echo "Tagged security group: ${SG_ID}"
done

# Ensure there's a security group for your EKS cluster's control plane.
# This is usually created by EKS and is named like "eks-cluster-sg-<cluster-name>-<random-id>".
# If you have a custom security group for nodes, tag that as well.
# Example for the default EKS-created security group (adjust if needed):
for SG_ID in $(aws ec2 describe-security-groups --filters Name=group-name,Values="*eks-cluster-sg*${CLUSTER_NAME}*" Name=vpc-id,Values=${VPC_ID} --query "SecurityGroups[*].GroupId" --output text); do
    aws ec2 create-tags --resources ${SG_ID} --tags Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}
    echo "Tagged EKS default security group: ${SG_ID}"
done

Verify: Check if the tags are applied correctly.


aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" --query "Subnets[*].SubnetId" --output text
aws ec2 describe-security-groups --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" --query "SecurityGroups[*].GroupId" --output text

You should see a list of subnet and security group IDs.

Step 5: Install Karpenter using Helm

Now that all the prerequisites are in place, we can install Karpenter using Helm. We'll specify the IRSA ARN and the instance profile created earlier.


helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} --namespace karpenter --create-namespace \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IRSA_ARN} \
    --set settings.aws.clusterName=${CLUSTER_NAME} \
    --set settings.aws.defaultInstanceProfile=${KARPENTER_INSTANCE_PROFILE} \
    --set settings.aws.interruptionQueue=${CLUSTER_NAME} \
    --wait

Verify: Check if Karpenter pods are running in the karpenter namespace.


kubectl get pods -n karpenter

You should see output similar to this, indicating the Karpenter controller is running:


NAME                         READY   STATUS    RESTARTS   AGE
karpenter-xxxxxxxxx-xxxxx    1/1     Running   0          2m

Step 6: Create Karpenter NodePool and EC2NodeClass

Karpenter uses NodePools and EC2NodeClasses (introduced in v0.32.0, replacing the older Provisioner and AWSNodeTemplate resources) to define how nodes should be provisioned. A NodePool specifies the requirements for your nodes (e.g., CPU, memory, instance families, capacity types), while an EC2NodeClass defines AWS-specific parameters like AMI family, IAM role, subnets, and security groups. This separation allows for cleaner configuration and reusability.

Create an EC2NodeClass that references the node role and uses the tags for subnet and security group discovery.


apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2 (or AL2023 for Amazon Linux 2023)
  role: ${KARPENTER_NODE_ROLE_NAME} # IAM role for nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
    Environment: Dev
    ManagedBy: Karpenter
  # Optional: Specify AMIs if you don't want to use AL2 default
  # amiSelectorTerms:
  #   - id: ami-0abcdef1234567890 # Replace with a valid AMI ID
  #   - tags:
  #       eks-build: 1.28
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"] # Allow compute, memory, and general purpose instances
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c5", "m5", "r5", "c6i", "m6i", "r6i"] # Specific instance families
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small"] # Exclude very small instances
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"] # Allow both On-Demand and Spot instances
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"] # Or arm64 if you want Graviton instances
      nodeClassRef:
        name: default # Reference the EC2NodeClass created above
  limits:
    cpu: "1000" # Max 1000 CPU cores for this NodePool
    memory: "1000Gi" # Max 1000 GiB memory for this NodePool
  disruption:
    consolidationPolicy: WhenUnderutilized # Consolidate nodes when underutilized
    expireAfter: 720h # Nodes will be deprovisioned after 30 days, regardless of utilization

Save the above YAML to a file (e.g., karpenter-nodepool.yaml) and apply it:


envsubst < karpenter-nodepool.yaml | kubectl apply -f -

Verify: Ensure the NodePool and EC2NodeClass are created.


kubectl get nodepool default
kubectl get ec2nodeclass default

Step 7: Deploy a Sample Application to Trigger Scaling

Now, let's deploy a sample application that requests more resources than are currently available in your cluster (assuming you start with a minimal EKS setup, such as a small system node group just large enough to run the Karpenter controller). This will prompt Karpenter to provision new nodes.


apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflated-nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflated-nginx
  template:
    metadata:
      labels:
        app: inflated-nginx
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: nginx
        image: public.ecr.aws/nginx/nginx:latest
        resources:
          requests:
            cpu: 1
            memory: 2Gi

Save this as inflated-nginx.yaml and apply it:


kubectl apply -f inflated-nginx.yaml

Verify: Observe the pending pods and Karpenter's actions.


kubectl get pods -w # Watch for pods to be scheduled
kubectl get nodes -w # Watch for new nodes to appear
kubectl logs -f -n karpenter $(kubectl get pods -n karpenter -l app.kubernetes.io/instance=karpenter -o name) # Watch Karpenter logs

You should see some inflated-nginx pods in a Pending state, followed by Karpenter logs indicating it's launching new instances. Eventually, new nodes will appear in kubectl get nodes, and the pods will transition to Running. This "just-in-time" provisioning is a core benefit of Karpenter, and it's particularly useful for workloads like LLMs with GPU scheduling.
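You can also filter for just the Karpenter-launched nodes: in the v1beta1 API, Karpenter labels each node with its owning NodePool and capacity type (assuming the NodePool is named default, as in this guide):

```shell
# Show Karpenter-managed nodes with their instance and capacity types
kubectl get nodes -l karpenter.sh/nodepool=default \
  -L node.kubernetes.io/instance-type -L karpenter.sh/capacity-type
```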

Step 8: Observe Node Deprovisioning

Karpenter doesn't just provision nodes; it also deprovisions them when they are no longer needed, driven by the disruption policies defined in your NodePool. To see this in action, scale down your sample application.


kubectl scale deployment inflated-nginx --replicas=0

Verify: Watch for nodes to be terminated.


kubectl get pods -w # Confirm pods are terminating
kubectl get nodes -w # Watch for nodes to disappear
kubectl logs -f -n karpenter $(kubectl get pods -n karpenter -l app.kubernetes.io/instance=karpenter -o name) # Watch Karpenter logs for deprovisioning

After a short period, Karpenter will cordon and drain the empty nodes, then terminate the underlying EC2 instances. In the v1beta1 API this behavior is governed by the NodePool's disruption block (consolidationPolicy, expireAfter); the older Provisioner used ttlSecondsAfterEmpty for the same purpose. This dynamic scaling down is crucial for cost optimization, as discussed in detail in our Karpenter Cost Optimization guide.
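Since v0.32, each Karpenter-launched instance is tracked by a NodeClaim resource, which gives a cleaner view of the scale-down than watching raw nodes. Event reason strings vary slightly by Karpenter version, hence the broad grep below:

```shell
# NodeClaims map one-to-one to Karpenter-launched instances
kubectl get nodeclaims -w

# Disruption decisions (consolidation, expiry) surface as cluster events
kubectl get events -A --sort-by=.lastTimestamp | grep -i -E 'disrupt|consolidat'
```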

Production Considerations

  • Resource Limits: Always define limits in your NodePool (or Provisioner) for CPU and memory to prevent uncontrolled scaling and unexpected cloud bills.
  • Instance Types: Be specific with your instance-family and instance-category requirements. Allowing too many types can lead to Karpenter picking suboptimal instances. Consider using a mix of On-Demand and Spot instances for cost savings, but ensure your workloads are fault-tolerant for Spot interruptions.
  • AMI Management: While Karpenter can discover AMIs, for production, it's often better to specify a hardened or custom AMI through amiSelectorTerms in your EC2NodeClass. This ensures consistency and security.
  • Taints and Tolerations: Use taints in your NodePool template to dedicate nodes to specific workloads (e.g., GPU nodes for ML tasks). Pods will then need corresponding tolerations.
  • Health Checks: Monitor Karpenter's logs and metrics. Integrate with your observability stack. For advanced observability, check out eBPF Observability: Building Custom Metrics with Hubble.
  • Interruption Handling: Karpenter watches EC2 Spot interruption notices and other instance events via an SQS interruption queue (the interruptionQueue Helm setting) and gracefully drains nodes before termination. Ensure your applications are designed to handle node evictions.
  • Security Groups and Networking: Ensure your security groups are properly configured to allow communication between Karpenter-provisioned nodes, the EKS control plane, and any other necessary services. For further security hardening, refer to our Kubernetes Network Policies: Complete Security Hardening Guide.
  • Multiple NodePools: For complex environments, consider using multiple NodePools to provision different types of nodes for different workloads (e.g., one for general-purpose, one for high-CPU, one for GPU).
  • Cost Monitoring: Regularly review your AWS costs to ensure Karpenter is optimizing effectively. Tagging resources created by Karpenter (via EC2NodeClass.spec.tags) is essential for cost allocation.
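To make the taints and multiple-NodePools points above concrete, here is an illustrative sketch of a second NodePool dedicated to GPU workloads — the pool name, taint key, and instance family are examples, not part of this guide's setup:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu                      # hypothetical dedicated pool
spec:
  template:
    spec:
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule     # only pods tolerating this taint schedule here
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5"]         # illustrative GPU instance family
      nodeClassRef:
        name: default
```

Pods intended for this pool would then carry a matching toleration, and typically a nodeSelector on karpenter.sh/nodepool: gpu.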

Troubleshooting

  1. Karpenter pods are stuck in Pending or CrashLoopBackOff.

    Solution:

    This often indicates an issue with the Karpenter controller's IAM role (IRSA) or insufficient resources for the Karpenter pod itself. First, check the pod logs:

    
    kubectl logs -n karpenter -l app.kubernetes.io/instance=karpenter -f
    

    Look for permission errors (e.g., "Access Denied"). Verify the IRSA ARN is correctly annotated on the Karpenter service account and that the IAM policy attached to the role has all necessary permissions (Step 2). Also, ensure your cluster has at least one worker node (even a small one) for Karpenter to run on initially.

  2. Pods are pending, but Karpenter is not provisioning new nodes.

    Solution:

    This is a common issue. Check the Karpenter controller logs for specific reasons:

    
    kubectl logs -n karpenter -l app.kubernetes.io/instance=karpenter -f
    

    Common reasons include:

    • No matching NodePool/EC2NodeClass: The pending pods' resource requests (CPU, memory, GPU) or node selectors/tolerations might not match any defined NodePool requirements.
    • Insufficient AWS capacity: Karpenter might be unable to find available instances in the specified regions/zones or with the requested instance types.
    • IAM permissions: The Karpenter controller role might lack permissions to describe subnets, security groups, or launch instances.
    • Incorrect subnet/security group tags: Ensure the karpenter.sh/discovery: ${CLUSTER_NAME} tags are correctly applied to your subnets and security groups (Step 4).
    • NodePool limits exceeded: The limits defined in your NodePool might be preventing further scaling.
  3. New nodes are provisioned but don't join the cluster (remain NotReady).

    Solution:

    This usually points to an issue with the Karpenter node's IAM role or networking. Check:

    • Node IAM Role: Ensure the KarpenterNodeRole-${CLUSTER_NAME} has the AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly policies attached (Step 3).
    • Security Groups: Verify that the security groups applied to the new nodes allow inbound traffic from the EKS control plane (port 443) and outbound traffic to the control plane and other Kubernetes services.
    • User Data: The node's user data script might fail. You can inspect the user data and cloud-init logs on the EC2 instance itself (e.g., /var/log/cloud-init-output.log).
    • Kubelet logs: SSH into the problematic node and check journalctl -u kubelet for errors.
  4. Karpenter is not deprovisioning empty or underutilized nodes.

    Solution:

    Check the disruption section of your NodePool. Ensure expireAfter and consolidationPolicy are set appropriately. With consolidationPolicy: WhenUnderutilized, Karpenter will try to consolidate nodes if it can reschedule pods onto fewer, larger nodes; with WhenEmpty (plus consolidateAfter), it removes nodes once their last pod leaves. (The older Provisioner resource used ttlSecondsAfterEmpty for this.) Also ensure no pods carry the karpenter.sh/do-not-disrupt annotation and no restrictive PodDisruptionBudgets are preventing pod eviction.

  5. Karpenter is provisioning instance types I don't want.

    Solution:

    Refine the requirements in your NodePool. Use operator: NotIn for instance families or sizes you want to exclude. Be as specific as possible with instance-category, instance-family, instance-size, and cpu/memory requirements. For example, if you only want compute-optimized instances, specify instance-category: ["c"]. Also, check your amiFamily in EC2NodeClass to ensure it aligns with your desired instance types.
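A quick triage pass that covers most of the cases above (assuming the default names used throughout this guide):

```shell
# 1. What is the controller saying?
kubectl logs -n karpenter deployment/karpenter --tail=50

# 2. Do the NodePool and EC2NodeClass exist, and what status do they report?
kubectl describe nodepool default
kubectl describe ec2nodeclass default

# 3. Are the discovery tags in place?
aws ec2 describe-subnets \
  --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
  --query "Subnets[*].SubnetId" --output text
```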

FAQ Section

  1. What is the main difference between Karpenter and Cluster Autoscaler?

    Traditional Cluster Autoscaler (CA) operates by monitoring pending pods and scaling up predefined Auto Scaling Groups (ASGs). It chooses from a fixed set of instance types. Karpenter, on the other hand, directly observes pending pods and provisions the exact right EC2 instance type (from a much wider range) to satisfy the aggregate demands of those pods, often in seconds. This "just-in-time" approach leads to faster scaling, better resource utilization, and often lower costs.
