
Kubernetes Spot ML: Faster, Cheaper Training

Introduction

Training large-scale Machine Learning (ML) models often requires significant computational resources, especially GPUs. However, the cost of on-demand instances can quickly become prohibitive, putting advanced research and development out of reach for many teams. This is where Kubernetes, combined with cloud provider Spot Instances (Preemptible VMs on GCP, Spot VMs on Azure), offers a compelling solution. Spot Instances provide access to unused cloud capacity at a steep discount, often 70-90% off on-demand prices, with the caveat that the provider can reclaim them at short notice (on AWS, typically a two-minute warning).

Leveraging Spot Instances for ML training within Kubernetes allows organizations to drastically reduce infrastructure costs without sacrificing the scalability and flexibility that Kubernetes provides. The ephemeral nature of Spot Instances, while a challenge for long-running, stateful applications, is often acceptable for ML training jobs that are designed to be fault-tolerant and can checkpoint their progress. This guide will walk you through setting up a Kubernetes cluster to effectively utilize Spot Instances for your ML workloads, focusing on robust scheduling, cost optimization, and resilience strategies.

TL;DR

Harness Kubernetes and Spot Instances for massive ML cost savings. Use Karpenter for intelligent node provisioning, tolerations/taints for scheduling, and design fault-tolerant ML jobs with checkpointing. Expect 70-90% cost reduction but prepare for instance preemption. For advanced GPU scheduling, refer to our LLM GPU Scheduling Guide.

# Install Karpenter (example for AWS)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} \
  --namespace karpenter --create-namespace \
  --set serviceAccount.create=false \
  --set serviceAccount.name=karpenter \
  --set settings.aws.clusterName=${CLUSTER_NAME} \
  --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --wait # Wait for the deployment to complete

# Create a Karpenter Provisioner for Spot Instances
kubectl apply -f - <<EOF
apiVersion: karpenter.sh/v1alpha5 # Provisioner is a v1alpha5 resource; Karpenter 0.32+ replaces it with NodePool
kind: Provisioner
metadata:
  name: spot-gpu-provisioner
spec:
  requirements:
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "kubernetes.io/os"
      operator: In
      values: ["linux"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
    - key: "karpenter.sh/instance-category"
      operator: In
      values: ["g"] # For GPU instances
    - key: "karpenter.sh/instance-family"
      operator: In
      values: ["g4dn", "p3", "p4"] # Example GPU instance families
  limits:
    resources:
      cpu: "1000"
      memory: "1000Gi"
      nvidia.com/gpu: "100"
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 60 # Scale down nodes after 60 seconds of no pods
  ttlSecondsUntilExpired: 2592000 # Nodes expire after 30 days
EOF

# Example ML Job with node selector and toleration
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-gpu-training-spot
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
      - key: "karpenter.sh/capacity-type"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot # the nvidia.com/gpu request below forces a GPU-capable node
      containers:
      - name: trainer
        image: your-ml-gpu-image:latest # Replace with your GPU-enabled ML image
        command: ["python", "train.py", "--epochs", "10", "--checkpoint-path", "/mnt/checkpoints"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /mnt/checkpoints
      volumes:
      - name: checkpoint-storage
        persistentVolumeClaim:
          claimName: ml-checkpoints-pvc # Ensure this PVC exists and is backed by resilient storage
EOF

Prerequisites

Before diving in, ensure you have the following:

  • A Kubernetes cluster (version 1.20+ recommended). This guide focuses on AWS, but concepts are transferable.
  • kubectl installed and configured to connect to your cluster.
  • Helm installed (version 3+).
  • AWS CLI installed and configured with appropriate permissions.
  • Basic understanding of Kubernetes concepts: Pods, Deployments, Jobs, Persistent Volumes, and NodeSelectors/Tolerations.
  • Familiarity with cloud provider Spot Instances and their preemption model.
  • An existing GPU-enabled ML training image (e.g., TensorFlow, PyTorch with CUDA). For best practices on running such workloads, see our LLM GPU Scheduling Guide.

Step-by-Step Guide

1. Set up IAM Roles and Policies for Karpenter (AWS Specific)

Karpenter needs specific IAM permissions to launch and manage EC2 instances on your behalf. This involves creating an IAM role for Karpenter and an Instance Profile for the nodes it provisions.

First, define environment variables for your cluster name and AWS region.

export CLUSTER_NAME="your-kubezilla-ml-cluster" # Replace with your cluster name
export AWS_REGION="us-east-1" # Replace with your cluster's region
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

echo "Cluster Name: ${CLUSTER_NAME}"
echo "AWS Region: ${AWS_REGION}"
echo "AWS Account ID: ${ACCOUNT_ID}"

Next, create an IAM policy for Karpenter. This policy grants Karpenter permissions to interact with EC2, IAM, and other AWS services required for node provisioning.

# Create Karpenter IAM policy
cat <<EOF > karpenter-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:RunInstances",
                "ec2:CreateTags",
                "ec2:TerminateInstances",
                "ec2:DeleteLaunchTemplate",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeImages",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ec2:DeleteTags",
                "ec2:AssociateAddress",
                "ec2:DisassociateAddress",
                "ec2:DescribeSpotPriceHistory",
                "ssm:GetParameter"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::${ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
        },
        {
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:aws:eks:${AWS_REGION}:${ACCOUNT_ID}:cluster/${CLUSTER_NAME}"
        }
    ]
}
EOF

aws iam create-policy \
    --policy-name KarpenterPolicy-${CLUSTER_NAME} \
    --policy-document file://karpenter-policy.json

# Create an IAM role for Karpenter and attach the policy
aws iam create-role \
    --role-name KarpenterRole-${CLUSTER_NAME} \
    --assume-role-policy-document "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":\"arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}\"},\"Action\":\"sts:AssumeRoleWithWebIdentity\",\"Condition\":{\"StringEquals\":{\"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:aud\":\"sts.amazonaws.com\",\"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:sub\":\"system:serviceaccount:karpenter:karpenter\"}}}]}"

aws iam attach-role-policy \
    --role-name KarpenterRole-${CLUSTER_NAME} \
    --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/KarpenterPolicy-${CLUSTER_NAME}

The create-role command above references ${OIDC_ID}, your cluster’s OIDC provider ID. Export it before running that command:

export OIDC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)

Finally, create an Instance Profile for the nodes Karpenter will launch. This profile grants the EC2 instances the necessary permissions to join the EKS cluster.

# Create Node IAM role and attach policies
aws iam create-role --role-name KarpenterNodeRole-${CLUSTER_NAME} \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

aws iam attach-role-policy --role-name KarpenterNodeRole-${CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name KarpenterNodeRole-${CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name KarpenterNodeRole-${CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

# Create Instance Profile
aws iam create-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME}
aws iam add-role-to-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME} --role-name KarpenterNodeRole-${CLUSTER_NAME}

2. Install Karpenter

Karpenter is an open-source, high-performance Kubernetes cluster autoscaler built by AWS. Unlike the Kubernetes Cluster Autoscaler, Karpenter directly interfaces with the cloud provider’s API to provision nodes, making it incredibly fast and efficient. It’s particularly adept at leveraging Spot Instances and diverse instance types. For more on cost optimization with Karpenter, see our guide on Karpenter Cost Optimization.

Install Karpenter using Helm:

# Get the latest Karpenter version
export KARPENTER_VERSION="v0.31.0" # A release using the v1alpha5 Provisioner API; check https://karpenter.sh/docs/getting-started/ before upgrading, as 0.32+ migrates to NodePool

# Create a Kubernetes Service Account for Karpenter
kubectl create namespace karpenter
kubectl create serviceaccount karpenter -n karpenter

# Link the Service Account to the IAM role
kubectl annotate serviceaccount karpenter -n karpenter \
    eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/KarpenterRole-${CLUSTER_NAME}

# Install Karpenter Helm chart
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} \
  --namespace karpenter --create-namespace \
  --set serviceAccount.create=false \
  --set serviceAccount.name=karpenter \
  --set settings.aws.clusterName=${CLUSTER_NAME} \
  --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --set settings.aws.interruptionQueueName=${CLUSTER_NAME} \
  --wait
# interruptionQueueName is optional but enables faster Spot interruption handling.

Verify Karpenter deployment:

kubectl get pods -n karpenter

Expected Output:

NAME                          READY   STATUS    RESTARTS   AGE
karpenter-xxxxxxxxx-xxxxx     1/1     Running   0          2m

3. Configure Karpenter Provisioner for Spot GPU Instances

The core of using Spot Instances with Karpenter is defining a `Provisioner`. This resource tells Karpenter *what kind* of nodes to launch based on pod requirements. We’ll create a provisioner specifically for GPU-enabled Spot instances.

Create a Provisioner called `spot-gpu-provisioner`.

apiVersion: karpenter.sh/v1alpha5 # Provisioner is a v1alpha5 resource; Karpenter 0.32+ replaces it with NodePool
kind: Provisioner
metadata:
  name: spot-gpu-provisioner
spec:
  requirements:
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "kubernetes.io/os"
      operator: In
      values: ["linux"]
    - key: "karpenter.sh/capacity-type" # This is the key that tells Karpenter to use Spot
      operator: In
      values: ["spot"]
    - key: "karpenter.sh/instance-category" # Filter for GPU instances
      operator: In
      values: ["g"]
    - key: "karpenter.sh/instance-family" # Specify preferred GPU instance families
      operator: In
      values: ["g4dn", "p3", "p4"] # Adjust based on your region and budget
    - key: "karpenter.sh/instance-cpu-topology-key" # Optional: For specific CPU architectures
      operator: Exists
  limits:
    resources:
      cpu: "1000" # Max CPU Karpenter can provision for this provisioner
      memory: "1000Gi" # Max Memory
      nvidia.com/gpu: "100" # Max GPUs
  providerRef:
    name: default # Refers to the default AWSNodeTemplate created by Karpenter.
                  # For more advanced configurations, you might define a custom AWSNodeTemplate.
  ttlSecondsUntilExpired: 2592000 # Nodes expire after 30 days, forcing a refresh
  consolidation:
    enabled: true # Karpenter consolidates underutilized nodes for cost savings;
                  # mutually exclusive with ttlSecondsAfterEmpty, which is omitted here

Save the manifest above as spot-gpu-provisioner.yaml, then apply it:

kubectl apply -f spot-gpu-provisioner.yaml

Verify the provisioner is created:

kubectl get provisioner

Expected Output:

NAME                    AGE
spot-gpu-provisioner    1m
default                 1m  # Default provisioner might also exist

4. Deploy GPU Operator

To enable Kubernetes to recognize and schedule workloads on GPU resources, you need a device plugin stack. NVIDIA’s GPU Operator is the standard choice: it automates the deployment of all necessary components, including the NVIDIA driver, container toolkit, and device plugin.

Install the NVIDIA GPU Operator using Helm:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    --force-update
helm repo update

helm install --wait --generate-name nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set validator.enabled=true \
  --set nfd.enabled=true # Node Feature Discovery for better node labeling

Verify the GPU Operator components are running (they will be pending until a GPU node is available):

kubectl get pods -n gpu-operator

Expected Output (after GPU nodes are provisioned):

NAME                                             READY   STATUS    RESTARTS   AGE
gpu-operator-cleanup-xxxxx                       0/1     Completed 0          5m
gpu-operator-container-toolkit-daemonset-xxxxx   1/1     Running   0          5m
gpu-operator-driver-daemonset-xxxxx              1/1     Running   0          5m
gpu-operator-device-plugin-daemonset-xxxxx       1/1     Running   0          5m
gpu-operator-nfd-master-xxxxx                    1/1     Running   0          5m
gpu-operator-nfd-worker-daemonset-xxxxx          1/1     Running   0          5m
gpu-operator-validator-xxxxx                     1/1     Running   0          5m

5. Create Persistent Storage for Checkpointing

ML training jobs, especially those running on Spot Instances, must be fault-tolerant. This means they should periodically save their state (checkpoints) to persistent storage, so they can resume from the last saved state if a node is preempted. AWS EFS (Elastic File System) or FSx for Lustre are good choices for shared, high-performance storage.
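Because a preemption can strike mid-write, checkpoints on shared storage should be written atomically: write to a temporary file, then rename it over the final name. A minimal stdlib sketch — the JSON format and file naming are illustrative, not part of any framework:

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, directory: str, epoch: int) -> str:
    """Atomically persist a checkpoint: write to a temp file in the same
    directory, fsync, then rename over the final name. A preemption mid-write
    can therefore never leave a truncated checkpoint visible to a resuming job."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"ckpt-epoch-{epoch:04d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to the (network) filesystem
        os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
    finally:
        if os.path.exists(tmp_path):  # only true if an error occurred before the rename
            os.remove(tmp_path)
    return final_path
```

Real jobs would serialize framework state (e.g. a PyTorch state_dict) instead of a plain dict, but the write-then-rename pattern is the same.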

First, ensure you have an EFS CSI driver installed or a similar solution for shared storage. For AWS, you can install the EFS CSI driver:

# Install EFS CSI Driver (if not already present)
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update

# Default values are sufficient for a basic install; if you use IRSA,
# configure the controller service account per the driver's documentation.
helm upgrade -i aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
    --namespace kube-system

Next, create an EFS file system (if you don’t have one) and then a Kubernetes StorageClass and PersistentVolumeClaim (PVC).

Create EFS File System (Manual or via AWS CLI):

# Find your VPC ID and subnet IDs
VPC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.resourcesVpcConfig.vpcId" --output text)
SUBNET_IDS=$(aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" --query "Subnets[*].SubnetId" --output text)

# Create EFS file system
EFS_ID=$(aws efs create-file-system --performance-mode generalPurpose --query "FileSystemId" --output text)
echo "EFS File System ID: ${EFS_ID}"

# Create mount targets for each subnet
for SUBNET_ID in ${SUBNET_IDS}; do
    aws efs create-mount-target --file-system-id ${EFS_ID} --subnet-id ${SUBNET_ID}
done

# You might need to adjust security groups for EFS access.

Create StorageClass and PVC:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: ${EFS_ID} # Replace with your EFS File System ID
  directoryPerms: "777" # Adjust permissions as needed
  throughputMode: bursting # or provisioned
  #encrypted: "true" # Uncomment if EFS is encrypted
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-checkpoints-pvc
spec:
  accessModes:
    - ReadWriteMany # Essential for shared access if multiple pods need to write
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi # Request sufficient storage for checkpoints

Save the two manifests above as efs-storage.yaml, then apply them, substituting your EFS ID:

# Replace ${EFS_ID} with the actual ID from the previous step
sed "s|\${EFS_ID}|${EFS_ID}|g" efs-storage.yaml | kubectl apply -f -

Verify PVC creation:

kubectl get pvc ml-checkpoints-pvc

Expected Output:

NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ml-checkpoints-pvc     Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   100Gi      RWX            efs-sc         1m

6. Submit an ML Training Job

Now, let’s submit an ML training job that leverages the Spot GPU instances provisioned by Karpenter and uses the persistent storage for checkpointing. The key here is using `nodeSelector` and `tolerations` to ensure the pod lands on a Spot GPU instance.

The `tolerations` ensure that the pod *can* be scheduled on nodes tainted with `karpenter.sh/capacity-type=spot`. The `nodeSelector` *requires* the pod to land on a node carrying those labels (use node affinity instead if you only want a soft preference). Karpenter detects these pending pods and, if no suitable node exists, provisions one.

apiVersion: batch/v1
kind: Job
metadata:
  name: ml-gpu-training-spot
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
      - key: "karpenter.sh/capacity-type"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "nvidia.com/gpu" # Tolerate NVIDIA GPU taint (added by GPU operator)
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot # the nvidia.com/gpu request below forces a GPU-capable node
      containers:
      - name: trainer
        image: your-ml-gpu-image:latest # IMPORTANT: Replace with your actual GPU-enabled ML image
        command: ["python", "train.py", "--epochs", "10", "--checkpoint-path", "/mnt/checkpoints"]
        resources:
          limits:
            nvidia.com/gpu: 1 # Request 1 GPU
            memory: "16Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /mnt/checkpoints
      volumes:
      - name: checkpoint-storage
        persistentVolumeClaim:
          claimName: ml-checkpoints-pvc
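This guide does not prescribe a particular train.py, but whatever you use must implement the resume logic the job relies on. A minimal framework-agnostic sketch, with a JSON dict standing in for real model state (file names here are illustrative):

```python
import glob
import json
import os

def latest_checkpoint(directory: str):
    """Return the newest checkpoint path, or None on a fresh start.
    Zero-padded epoch numbers make lexicographic sort equal numeric sort."""
    candidates = sorted(glob.glob(os.path.join(directory, "ckpt-epoch-*.json")))
    return candidates[-1] if candidates else None

def train(epochs: int, checkpoint_dir: str = "/mnt/checkpoints"):
    """Resume from the latest checkpoint if one exists, then continue
    training and checkpoint after every epoch."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    state = {"epoch": -1, "weights": 0.0}
    ckpt = latest_checkpoint(checkpoint_dir)
    if ckpt is not None:  # a previous (possibly preempted) run left progress behind
        with open(ckpt) as f:
            state = json.load(f)
    for epoch in range(state["epoch"] + 1, epochs):
        state = {"epoch": epoch, "weights": state["weights"] + 0.1}  # stand-in for a real training step
        with open(os.path.join(checkpoint_dir, f"ckpt-epoch-{epoch:04d}.json"), "w") as f:
            json.dump(state, f)
    return state
```

In the Job spec, the checkpoint directory corresponds to --checkpoint-path, i.e. the mounted EFS volume, so a restarted pod on a fresh Spot node picks up where the preempted one left off.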

Save the manifest above as ml-gpu-training-job.yaml, then apply it:

kubectl apply -f ml-gpu-training-job.yaml

Verify the job and node provisioning:

# Watch for new nodes to be provisioned by Karpenter
kubectl get nodes --watch

# Watch for the pod to be scheduled and run
kubectl get pods -l job-name=ml-gpu-training-spot --watch

Expected Output (after a new node comes up):

# kubectl get nodes --watch
NAME                                           STATUS   ROLES    AGE   VERSION
ip-xxx-xxx-xxx-xxx.ec2.internal                Ready    <none>   2m    v1.28.x # This will be your new Spot GPU node
# kubectl get pods -l job-name=ml-gpu-training-spot --watch
NAME                           READY   STATUS    RESTARTS   AGE
ml-gpu-training-spot-xxxxx     0/1     Pending   0          0s
ml-gpu-training-spot-xxxxx     0/1     ContainerCreating   0          10s
ml-gpu-training-spot-xxxxx     1/1     Running   0          45s

You can also describe the Karpenter controller logs to see its actions:

kubectl logs -f -n karpenter $(kubectl get pod -n karpenter -l app.kubernetes.io/name=karpenter -o name)

Production Considerations

When using Spot Instances for ML training in production, several factors need careful consideration to ensure reliability, cost-effectiveness, and maintainability.

  1. Fault Tolerance and Checkpointing: This is paramount. Your ML training code *must* be designed to save its state periodically (e.g., every N epochs or every M minutes) to persistent storage. It should also be able to resume training from the latest checkpoint. Consider libraries like PyTorch’s `torch.save` and `torch.load` or TensorFlow’s `tf.train.Checkpoint`.
  2. Distributed Training: For large models, distributed training is common. While possible on Spot, it adds complexity. Ensure your distributed training framework (e.g., Horovod, PyTorch Distributed, Ray) can handle node failures gracefully. This often involves mechanisms for re-joining the training cluster or restarting workers.
  3. Monitoring and Alerting: Monitor your ML jobs and the underlying Spot instances. Use tools like Prometheus and Grafana to track job progress, GPU utilization, and node preemption events. Karpenter provides metrics that can be scraped. External links for monitoring Kubernetes: Prometheus and Grafana. For advanced eBPF-based observability, check our eBPF Observability with Hubble guide.
  4. Network Performance: ML training, especially distributed, can be network-intensive. Ensure your chosen instance types and network configuration (e.g., EFA on AWS) provide sufficient bandwidth. Also, consider network policies for security and isolation; our Network Policies Security Guide can help.
  5. Image Management: Keep your ML container images optimized and stored in a low-latency registry (e.g., ECR, GCR, ACR). Large images can slow down pod startup on new Spot instances.
  6. Cost Management: Regularly review your cloud bills. While Spot instances are cheap, inefficient use (e.g., over-provisioning GPUs, idle clusters) can still lead to costs. Karpenter’s consolidation helps, but manual review is still valuable.
  7. Security: Implement robust security practices. Use Pod Security Standards, ensure proper IAM roles for your pods, and consider solutions like Sigstore and Kyverno for supply chain security.
  8. Preemption Handling: While Karpenter helps by quickly replacing preempted nodes, your application needs to gracefully handle the preemption signal (usually a `SIGTERM` within the container). Implement a short shutdown hook to save final checkpoints. Cloud providers offer APIs to detect upcoming preemption notices, which can be integrated into your application or a custom controller.
  9. Node Taints and Tolerations: Be precise with your taints and tolerations. For Spot nodes, Karpenter automatically applies `karpenter.sh/capacity-type=spot:NoSchedule`. Your pods should tolerate this. Avoid broad `Exists` tolerations unless truly necessary.
  10. Instance Diversity: In your `Provisioner`, specify a wide range of instance types and families. This increases Karpenter’s chances of finding available Spot capacity and reduces the likelihood of preemption for a specific type.
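Points 1 and 8 above combine into a small pattern: trap SIGTERM (the signal kubelet sends when the pod is evicted), and have the training loop check the flag between steps so it can write a final checkpoint before the grace period expires. A hedged, framework-agnostic sketch; names are illustrative:

```python
import signal

class GracefulShutdown:
    """Record receipt of SIGTERM, the signal kubelet sends when the pod is
    evicted (e.g. the Spot node received its reclamation notice)."""
    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True

def training_loop(steps: int, shutdown: GracefulShutdown, save_checkpoint):
    """Check the shutdown flag between steps; on shutdown, save a final
    checkpoint and return early so the Job can resume on a new node."""
    for step in range(steps):
        # ... one optimizer step would run here ...
        if shutdown.requested:
            save_checkpoint(step)  # flush state before the grace period ends
            return step
    save_checkpoint(steps - 1)
    return steps - 1
```

Pair this with a terminationGracePeriodSeconds on the Job’s pod template long enough to flush the final checkpoint to EFS.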

Troubleshooting

  1. Issue: Pods remain in `Pending` state despite Karpenter provisioner being configured.

    Explanation: This usually means Karpenter can’t provision a node that satisfies the pod’s requirements, or there’s an issue with Karpenter itself. Common causes include insufficient permissions, incorrect provisioner configuration, or lack of available Spot capacity for the requested instance types.

    Solution:

    • Check Karpenter controller logs:
      kubectl logs -f -n karpenter $(kubectl get pod -n karpenter -l app.kubernetes.io/name=karpenter -o name)

      Look for errors related to EC2 API calls, instance type failures, or subnet issues.

    • Verify your `Provisioner` requirements match the pod’s `nodeSelector` and `tolerations`.
    • Ensure the Karpenter IAM role has all necessary EC2 permissions.
    • Check if there’s Spot capacity for your specified `instance-family` in your region/AZ. Try broadening the `instance-family` list in your `Provisioner`.
    • Check if your subnets have sufficient IP addresses.
  2. Issue: GPU Operator pods are stuck in `Pending` or `CrashLoopBackOff`.

    Explanation: The GPU operator components often require a GPU-enabled node to be present before they can fully initialize. If no GPU node is available, they might stay pending. If they crash, it could be a driver issue or misconfiguration.

    Solution:

    • Ensure Karpenter has successfully provisioned a GPU node. Check `kubectl get nodes -L karpenter.sh/instance-category`.
    • If a GPU node exists, check the logs of the failing GPU Operator pod:
      kubectl logs -n gpu-operator <pod-name>
      kubectl describe pod -n gpu-operator <pod-name>
    • Verify the GPU drivers are compatible with the kernel version running on your nodes. The NVIDIA GPU operator usually handles this, but custom AMIs might cause issues.
  3. Issue: ML job fails with “out of memory” or “CUDA out of memory” errors.

    Explanation: This indicates that your container is requesting more GPU memory or system memory than available on the assigned node/GPU, or your code has a memory leak.

    Solution:

    • Increase the `resources.limits.memory` in your pod spec.
    • If it’s GPU memory, try reducing your ML model’s batch size, model size, or using techniques like gradient accumulation.
    • Request a larger GPU instance type if your current one is insufficient. You can add a `requirements` to your Karpenter `Provisioner` for specific GPU memory sizes if needed.
    • Monitor GPU memory usage with tools like `nvidia-smi` (if you can exec into the container) or GPU metrics from your monitoring solution.
  4. Issue: Spot instance preemption interrupts ML training, and job doesn’t resume correctly.

    Explanation: Your ML training code isn’t properly checkpointing or isn’t designed to resume from a checkpoint.

    Solution:

    • Implement Robust Checkpointing: Ensure your ML framework saves checkpoints frequently to the mounted persistent volume.
    • Resume Logic: Modify your training script to look for the latest checkpoint in the mounted volume at startup and continue from that epoch or step, rather than starting from scratch.
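The gradient accumulation suggested for CUDA out-of-memory errors (issue 3 above) trades memory for steps: run several small micro-batches, sum their gradients, and apply one averaged optimizer update. A framework-agnostic sketch with scalars standing in for per-parameter tensors:

```python
def train_with_accumulation(micro_batch_grads, accum_steps, lr=0.1):
    """One optimizer update per `accum_steps` micro-batches: sum the gradients,
    then apply the averaged update. Scalars stand in for real tensors."""
    weight, grad_sum, updates = 0.0, 0.0, 0
    for i, grad in enumerate(micro_batch_grads, start=1):
        grad_sum += grad                             # accumulate, don't update yet
        if i % accum_steps == 0:
            weight -= lr * (grad_sum / accum_steps)  # averaged gradient step
            grad_sum, updates = 0.0, updates + 1
    return weight, updates
```

With PyTorch or TensorFlow the structure is the same, calling the optimizer update only every accum_steps iterations; each micro-batch then needs only a fraction of the GPU memory of the full batch.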
