Introduction
Training large-scale Machine Learning (ML) models often requires significant computational resources, especially GPUs. However, the cost associated with on-demand instances can quickly become prohibitive, making advanced research and development inaccessible for many teams. This is where Kubernetes, combined with cloud provider Spot Instances (or Preemptible VMs on GCP, Spot VMs on Azure), offers a compelling solution. Spot Instances provide access to unused cloud capacity at a steep discount, often 70-90% off on-demand prices, but with the caveat that they can be reclaimed by the cloud provider with short notice.
Leveraging Spot Instances for ML training within Kubernetes allows organizations to drastically reduce infrastructure costs without sacrificing the scalability and flexibility that Kubernetes provides. The ephemeral nature of Spot Instances, while a challenge for long-running, stateful applications, is often acceptable for ML training jobs that are designed to be fault-tolerant and can checkpoint their progress. This guide will walk you through setting up a Kubernetes cluster to effectively utilize Spot Instances for your ML workloads, focusing on robust scheduling, cost optimization, and resilience strategies.
TL;DR
Harness Kubernetes and Spot Instances for massive ML cost savings. Use Karpenter for intelligent node provisioning, tolerations/taints for scheduling, and design fault-tolerant ML jobs with checkpointing. Expect 70-90% cost reduction but prepare for instance preemption. For advanced GPU scheduling, refer to our LLM GPU Scheduling Guide.
# Install Karpenter (example for AWS)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} \
--namespace karpenter --create-namespace \
--set serviceAccount.create=false \
--set serviceAccount.name=karpenter \
--set settings.aws.clusterName=${CLUSTER_NAME} \
--set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
--wait # Wait for the deployment to complete
# Create a Karpenter Provisioner for Spot Instances
kubectl apply -f - <<EOF
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-gpu-provisioner
spec:
  requirements:
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "kubernetes.io/os"
      operator: In
      values: ["linux"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["g"] # For GPU instances
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["g4dn", "p3", "p4d"] # Example GPU instance families
  limits:
    resources:
      cpu: "1000"
      memory: "1000Gi"
      nvidia.com/gpu: "100"
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 60 # Scale down nodes after 60 seconds of no pods
  ttlSecondsUntilExpired: 2592000 # Nodes expire after 30 days
EOF
# Example ML Job with node selector and toleration
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-gpu-training-spot
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "karpenter.sh/capacity-type"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot
        karpenter.k8s.aws/instance-category: g
      containers:
        - name: trainer
          image: your-ml-gpu-image:latest # Replace with your GPU-enabled ML image
          command: ["python", "train.py", "--epochs", "10", "--checkpoint-path", "/mnt/checkpoints"]
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /mnt/checkpoints
      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: ml-checkpoints-pvc # Ensure this PVC exists and is backed by resilient storage
EOF
Prerequisites
Before diving in, ensure you have the following:
- A Kubernetes cluster (version 1.20+ recommended). This guide focuses on AWS, but concepts are transferable.
- `kubectl` installed and configured to connect to your cluster.
- Helm installed (version 3+).
- AWS CLI installed and configured with appropriate permissions.
- Basic understanding of Kubernetes concepts: Pods, Deployments, Jobs, Persistent Volumes, and NodeSelectors/Tolerations.
- Familiarity with cloud provider Spot Instances and their preemption model.
- An existing GPU-enabled ML training image (e.g., TensorFlow, PyTorch with CUDA). For best practices on running such workloads, see our LLM GPU Scheduling Guide.
Step-by-Step Guide
1. Set up IAM Roles and Policies for Karpenter (AWS Specific)
Karpenter needs specific IAM permissions to launch and manage EC2 instances on your behalf. This involves creating an IAM role for Karpenter and an Instance Profile for the nodes it provisions.
First, define environment variables for your cluster name and AWS region.
export CLUSTER_NAME="your-kubezilla-ml-cluster" # Replace with your cluster name
export AWS_REGION="us-east-1" # Replace with your cluster's region
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "Cluster Name: ${CLUSTER_NAME}"
echo "AWS Region: ${AWS_REGION}"
echo "AWS Account ID: ${ACCOUNT_ID}"
Next, create an IAM policy for Karpenter. This policy grants Karpenter permissions to interact with EC2, IAM, and other AWS services required for node provisioning.
# Create Karpenter IAM policy
cat <<EOF > karpenter-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:CreateLaunchTemplate",
"ec2:CreateFleet",
"ec2:RunInstances",
"ec2:CreateTags",
"ec2:TerminateInstances",
"ec2:DeleteLaunchTemplate",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeInstances",
"ec2:DescribeImages",
"ec2:DescribeSubnets",
"ec2:DescribeSecurityGroups",
"ec2:DescribeInstanceTypes",
"ec2:DescribeInstanceTypeOfferings",
"ec2:DescribeAvailabilityZones",
"ec2:DeleteTags",
"ec2:AssociateAddress",
"ec2:DisassociateAddress",
"ec2:DescribeSpotPriceHistory",
"ssm:GetParameter"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "arn:aws:iam::${ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
},
{
"Effect": "Allow",
"Action": "eks:DescribeCluster",
"Resource": "arn:aws:eks:${AWS_REGION}:${ACCOUNT_ID}:cluster/${CLUSTER_NAME}"
}
]
}
EOF
aws iam create-policy \
--policy-name KarpenterPolicy-${CLUSTER_NAME} \
--policy-document file://karpenter-policy.json
# Create an IAM role for Karpenter and attach the policy
aws iam create-role \
--role-name KarpenterRole-${CLUSTER_NAME} \
--assume-role-policy-document "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":\"arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}\"},\"Action\":\"sts:AssumeRoleWithWebIdentity\",\"Condition\":{\"StringEquals\":{\"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:aud\":\"sts.amazonaws.com\",\"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:sub\":\"system:serviceaccount:karpenter:karpenter\"}}}]}"
aws iam attach-role-policy \
--role-name KarpenterRole-${CLUSTER_NAME} \
--policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/KarpenterPolicy-${CLUSTER_NAME}
Before running the `create-role` command above, set `${OIDC_ID}` to your cluster’s OIDC provider ID. You can fetch it using:
export OIDC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
Finally, create an Instance Profile for the nodes Karpenter will launch. This profile grants the EC2 instances the necessary permissions to join the EKS cluster.
# Create Node IAM role and attach policies
aws iam create-role --role-name KarpenterNodeRole-${CLUSTER_NAME} \
--assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name KarpenterNodeRole-${CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name KarpenterNodeRole-${CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name KarpenterNodeRole-${CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
# Create Instance Profile
aws iam create-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME}
aws iam add-role-to-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-${CLUSTER_NAME} --role-name KarpenterNodeRole-${CLUSTER_NAME}
2. Install Karpenter
Karpenter is an open-source, high-performance Kubernetes cluster autoscaler built by AWS. Unlike the Kubernetes Cluster Autoscaler, Karpenter directly interfaces with the cloud provider’s API to provision nodes, making it incredibly fast and efficient. It’s particularly adept at leveraging Spot Instances and diverse instance types. For more on cost optimization with Karpenter, see our guide on Karpenter Cost Optimization.
Install Karpenter using Helm:
# Get the latest Karpenter version
export KARPENTER_VERSION="0.32.0" # Check https://karpenter.sh/docs/getting-started/ for the latest version
# Create a Kubernetes Service Account for Karpenter
kubectl create namespace karpenter
kubectl create serviceaccount karpenter -n karpenter
# Link the Service Account to the IAM role
kubectl annotate serviceaccount karpenter -n karpenter \
eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/KarpenterRole-${CLUSTER_NAME}
# Install Karpenter Helm chart
# interruptionQueueName is optional; it enables faster Spot interruption handling
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} \
--namespace karpenter --create-namespace \
--set serviceAccount.create=false \
--set serviceAccount.name=karpenter \
--set settings.aws.clusterName=${CLUSTER_NAME} \
--set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
--set settings.aws.interruptionQueueName=${CLUSTER_NAME} \
--wait # Wait for the deployment to complete
Verify Karpenter deployment:
kubectl get pods -n karpenter
Expected Output:
NAME READY STATUS RESTARTS AGE
karpenter-xxxxxxxxx-xxxxx 1/1 Running 0 2m
3. Configure Karpenter Provisioner for Spot GPU Instances
The core of using Spot Instances with Karpenter is defining a `Provisioner`. This resource tells Karpenter *what kind* of nodes to launch based on pod requirements. We’ll create a provisioner specifically for GPU-enabled Spot instances.
Create a Provisioner called `spot-gpu-provisioner`.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-gpu-provisioner
spec:
  requirements:
    - key: "karpenter.sh/capacity-type" # This is the key that tells Karpenter to use Spot
      operator: In
      values: ["spot"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "kubernetes.io/os"
      operator: In
      values: ["linux"]
    - key: "karpenter.k8s.aws/instance-category" # Filter for GPU instances
      operator: In
      values: ["g"]
    - key: "karpenter.k8s.aws/instance-family" # Specify preferred GPU instance families
      operator: In
      values: ["g4dn", "p3", "p4d"] # Adjust based on your region and budget
  limits:
    resources:
      cpu: "1000" # Max CPU Karpenter can provision for this provisioner
      memory: "1000Gi" # Max Memory
      nvidia.com/gpu: "100" # Max GPUs
  providerRef:
    name: default # Refers to the default AWSNodeTemplate created by Karpenter.
                  # For more advanced configurations, define a custom AWSNodeTemplate.
  ttlSecondsUntilExpired: 2592000 # Nodes expire after 30 days, forcing a refresh
  consolidation:
    enabled: true # Karpenter consolidates underutilized nodes for cost savings;
                  # note this cannot be combined with ttlSecondsAfterEmpty
Save the manifest above as `spot-gpu-provisioner.yaml` and apply it:
kubectl apply -f spot-gpu-provisioner.yaml
Verify the provisioner is created:
kubectl get provisioner
Expected Output:
NAME AGE
spot-gpu-provisioner 1m
default 1m # Default provisioner might also exist
4. Deploy GPU Operator
To enable Kubernetes to recognize and schedule workloads on GPU resources, you need a GPU operator. NVIDIA’s GPU Operator is the standard for this. It automates the deployment of all necessary components, including GPU drivers, CUDA, and device plugins.
Install the NVIDIA GPU Operator using Helm:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
--force-update
helm repo update
helm install --wait --generate-name nvidia/gpu-operator \
--namespace gpu-operator --create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set validator.enabled=true \
--set nfd.enabled=true # Node Feature Discovery for better node labeling
Verify the GPU Operator components are running (they will be pending until a GPU node is available):
kubectl get pods -n gpu-operator
Expected Output (after GPU nodes are provisioned):
NAME READY STATUS RESTARTS AGE
gpu-operator-cleanup-xxxxx 0/1 Completed 0 5m
gpu-operator-container-toolkit-daemonset-xxxxx 1/1 Running 0 5m
gpu-operator-driver-daemonset-xxxxx 1/1 Running 0 5m
gpu-operator-device-plugin-daemonset-xxxxx 1/1 Running 0 5m
gpu-operator-nfd-master-xxxxx 1/1 Running 0 5m
gpu-operator-nfd-worker-daemonset-xxxxx 1/1 Running 0 5m
gpu-operator-validator-xxxxx 1/1 Running 0 5m
5. Create Persistent Storage for Checkpointing
ML training jobs, especially those running on Spot Instances, must be fault-tolerant. This means they should periodically save their state (checkpoints) to persistent storage, so they can resume from the last saved state if a node is preempted. AWS EFS (Elastic File System) or FSx for Lustre are good choices for shared, high-performance storage.
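The save/resume pattern this relies on can be sketched in a few lines of framework-agnostic Python. In a real job the state would hold model and optimizer weights (e.g. via `torch.save` or `tf.train.Checkpoint`) rather than a small dict, and the path would live on the mounted PVC; the temp directory here is purely a stand-in.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the last saved training state, or a fresh one if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}

def save_checkpoint(path, state):
    """Write atomically: write a temp file, then rename over the old checkpoint,
    so a preemption mid-write never leaves a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems

# In the training job this would be e.g. /mnt/checkpoints/state.json on the PVC.
ckpt = os.path.join(tempfile.mkdtemp(), "state.json")

state = load_checkpoint(ckpt)
for epoch in range(state["epoch"], 5):
    # ... one epoch of training would run here ...
    state["epoch"] = epoch + 1
    save_checkpoint(ckpt, state)  # persist progress after every epoch

print(load_checkpoint(ckpt)["epoch"])  # prints 5; a restarted pod resumes from here
```

Because the loop starts at `state["epoch"]` rather than zero, a pod restarted after preemption repeats only the work since the last checkpoint.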
First, ensure you have an EFS CSI driver installed or a similar solution for shared storage. For AWS, you can install the EFS CSI driver:
# Install EFS CSI Driver (if not already present)
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade -i aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
--namespace kube-system \
--set image.repository=registry.k8s.io/aws-efs-csi-driver/csi-driver \
--set controller.serviceAccount.create=false \
--set controller.serviceAccount.name=efs-csi-controller-sa \
--set node.serviceAccount.create=false \
--set node.serviceAccount.name=efs-csi-node-sa
Next, create an EFS file system (if you don’t have one) and then a Kubernetes StorageClass and PersistentVolumeClaim (PVC).
Create EFS File System (Manual or via AWS CLI):
# Find your VPC ID and subnet IDs
VPC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.resourcesVpcConfig.vpcId" --output text)
SUBNET_IDS=$(aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" --query "Subnets[*].SubnetId" --output text)
# Create EFS file system
EFS_ID=$(aws efs create-file-system --performance-mode generalPurpose --query "FileSystemId" --output text)
echo "EFS File System ID: ${EFS_ID}"
# Create mount targets for each subnet
for SUBNET_ID in ${SUBNET_IDS}; do
aws efs create-mount-target --file-system-id ${EFS_ID} --subnet-id ${SUBNET_ID}
done
# You might need to adjust security groups for EFS access.
Create StorageClass and PVC:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: ${EFS_ID} # Replace with your EFS File System ID
  directoryPerms: "777" # Adjust permissions as needed
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-checkpoints-pvc
spec:
  accessModes:
    - ReadWriteMany # Essential for shared access if multiple pods need to write
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi # Request sufficient storage for checkpoints
Save the manifest above as `efs-storage.yaml`, then apply it with the EFS ID substituted:
# Replace ${EFS_ID} with the actual ID from the previous step
sed "s|\${EFS_ID}|${EFS_ID}|g" efs-storage.yaml | kubectl apply -f -
Verify PVC creation:
kubectl get pvc ml-checkpoints-pvc
Expected Output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ml-checkpoints-pvc Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 100Gi RWX efs-sc 1m
6. Submit an ML Training Job
Now, let’s submit an ML training job that leverages the Spot GPU instances provisioned by Karpenter and uses the persistent storage for checkpointing. The key here is using `nodeSelector` and `tolerations` to ensure the pod lands on a Spot GPU instance.
The `tolerations` ensure that the pod *can* be scheduled on nodes tainted with `karpenter.sh/capacity-type=spot`. The `nodeSelector` *requires* the pod to land on a node with those specific labels (use node affinity with `preferredDuringSchedulingIgnoredDuringExecution` if you only want a soft preference). Karpenter detects these pending pods and, if no suitable node exists, provisions one.
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-gpu-training-spot
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "karpenter.sh/capacity-type"
          operator: "Exists"
          effect: "NoSchedule"
        - key: "nvidia.com/gpu" # Tolerate NVIDIA GPU taint (added by GPU operator)
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot
        karpenter.k8s.aws/instance-category: g # Ensure it's a GPU instance
      containers:
        - name: trainer
          image: your-ml-gpu-image:latest # IMPORTANT: Replace with your actual GPU-enabled ML image
          command: ["python", "train.py", "--epochs", "10", "--checkpoint-path", "/mnt/checkpoints"]
          resources:
            limits:
              nvidia.com/gpu: 1 # Request 1 GPU
              memory: "16Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /mnt/checkpoints
      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: ml-checkpoints-pvc
Save the manifest above as `ml-gpu-training-job.yaml` and apply it:
kubectl apply -f ml-gpu-training-job.yaml
Verify the job and node provisioning:
# Watch for new nodes to be provisioned by Karpenter
kubectl get nodes --watch
# Watch for the pod to be scheduled and run
kubectl get pods -l job-name=ml-gpu-training-spot --watch
Expected Output (after a new node comes up):
# kubectl get nodes --watch
NAME STATUS ROLES AGE VERSION
ip-xxx-xxx-xxx-xxx.ec2.internal Ready <none> 2m v1.28.x # This will be your new Spot GPU node
# kubectl get pods -l job-name=ml-gpu-training-spot --watch
NAME READY STATUS RESTARTS AGE
ml-gpu-training-spot-xxxxx 0/1 Pending 0 0s
ml-gpu-training-spot-xxxxx 0/1 ContainerCreating 0 10s
ml-gpu-training-spot-xxxxx 1/1 Running 0 45s
You can also describe the Karpenter controller logs to see its actions:
kubectl logs -f -n karpenter $(kubectl get pod -n karpenter -l app.kubernetes.io/name=karpenter -o name)
Production Considerations
When using Spot Instances for ML training in production, several factors need careful consideration to ensure reliability, cost-effectiveness, and maintainability.
- Fault Tolerance and Checkpointing: This is paramount. Your ML training code *must* be designed to save its state periodically (e.g., every N epochs or every M minutes) to persistent storage. It should also be able to resume training from the latest checkpoint. Consider libraries like PyTorch’s `torch.save` and `torch.load` or TensorFlow’s `tf.train.Checkpoint`.
- Distributed Training: For large models, distributed training is common. While possible on Spot, it adds complexity. Ensure your distributed training framework (e.g., Horovod, PyTorch Distributed, Ray) can handle node failures gracefully. This often involves mechanisms for re-joining the training cluster or restarting workers.
- Monitoring and Alerting: Monitor your ML jobs and the underlying Spot instances. Use tools like Prometheus and Grafana to track job progress, GPU utilization, and node preemption events. Karpenter provides metrics that can be scraped. External links for monitoring Kubernetes: Prometheus and Grafana. For advanced eBPF-based observability, check our eBPF Observability with Hubble guide.
- Network Performance: ML training, especially distributed, can be network-intensive. Ensure your chosen instance types and network configuration (e.g., EFA on AWS) provide sufficient bandwidth. Also, consider network policies for security and isolation; our Network Policies Security Guide can help.
- Image Management: Keep your ML container images optimized and stored in a low-latency registry (e.g., ECR, GCR, ACR). Large images can slow down pod startup on new Spot instances.
- Cost Management: Regularly review your cloud bills. While Spot instances are cheap, inefficient use (e.g., over-provisioning GPUs, idle clusters) can still lead to costs. Karpenter’s consolidation helps, but manual review is still valuable.
- Security: Implement robust security practices. Use Pod Security Standards, ensure proper IAM roles for your pods, and consider solutions like Sigstore and Kyverno for supply chain security.
- Preemption Handling: While Karpenter helps by quickly replacing preempted nodes, your application needs to gracefully handle the preemption signal (usually a `SIGTERM` within the container). Implement a short shutdown hook to save final checkpoints. Cloud providers offer APIs to detect upcoming preemption notices, which can be integrated into your application or a custom controller.
- Node Taints and Tolerations: Be precise with your taints and tolerations. For Spot nodes, Karpenter automatically applies `karpenter.sh/capacity-type=spot:NoSchedule`. Your pods should tolerate this. Avoid broad `Exists` tolerations unless truly necessary.
- Instance Diversity: In your `Provisioner`, specify a wide range of instance types and families. This increases Karpenter’s chances of finding available Spot capacity and reduces the likelihood of preemption for a specific type.
Troubleshooting
- Issue: Pods remain in `Pending` state despite Karpenter provisioner being configured.
  Explanation: This usually means Karpenter can’t provision a node that satisfies the pod’s requirements, or there’s an issue with Karpenter itself. Common causes include insufficient permissions, incorrect provisioner configuration, or lack of available Spot capacity for the requested instance types.
  Solution:
  - Check Karpenter controller logs:
    kubectl logs -f -n karpenter $(kubectl get pod -n karpenter -l app.kubernetes.io/name=karpenter -o name)
    Look for errors related to EC2 API calls, instance type failures, or subnet issues.
  - Verify your `Provisioner` requirements match the pod’s `nodeSelector` and `tolerations`.
  - Ensure the Karpenter IAM role has all necessary EC2 permissions.
  - Check if there’s Spot capacity for your specified `instance-family` in your region/AZ. Try broadening the `instance-family` list in your `Provisioner`.
  - Check if your subnets have sufficient IP addresses.
- Issue: GPU Operator pods are stuck in `Pending` or `CrashLoopBackOff`.
  Explanation: The GPU Operator components often require a GPU-enabled node to be present before they can fully initialize. If no GPU node is available, they may stay pending. If they crash, it could be a driver issue or misconfiguration.
  Solution:
  - Ensure Karpenter has successfully provisioned a GPU node. Check `kubectl get nodes -L karpenter.k8s.aws/instance-category`.
  - If a GPU node exists, inspect the failing GPU Operator pod:
    kubectl logs -n gpu-operator <pod-name>
    kubectl describe pod -n gpu-operator <pod-name>
  - Verify the GPU drivers are compatible with the kernel version running on your nodes. The NVIDIA GPU Operator usually handles this, but custom AMIs might cause issues.
- Issue: ML job fails with “out of memory” or “CUDA out of memory” errors.
  Explanation: This indicates that your container is requesting more GPU memory or system memory than is available on the assigned node/GPU, or your code has a memory leak.
  Solution:
  - Increase the `resources.limits.memory` in your pod spec.
  - If it’s GPU memory, reduce your ML model’s batch size or model size, or use techniques like gradient accumulation.
  - Request a larger GPU instance type if your current one is insufficient. You can add a `requirements` entry to your Karpenter `Provisioner` for specific GPU memory sizes if needed.
  - Monitor GPU memory usage with tools like `nvidia-smi` (if you can exec into the container) or GPU metrics from your monitoring solution.
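As a concrete illustration of the gradient-accumulation suggestion, the toy model below shows why averaging micro-batch gradients reproduces the full-batch gradient while only ever materializing one micro-batch at a time (exact when micro-batches are equal size). The 1-D linear model and data are purely illustrative.

```python
def grad(w, batch):
    """Gradient of mean squared error for a toy 1-D linear model y = w * x:
    d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Average micro-batch gradients instead of computing one large-batch gradient.
    Peak memory scales with the micro-batch, not the full batch, which is the
    point when the limit is GPU memory."""
    micro = [batch[i:i + micro_batch_size] for i in range(0, len(batch), micro_batch_size)]
    return sum(grad(w, m) for m in micro) / len(micro)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
full = grad(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=2)
assert abs(full - accum) < 1e-9  # same update, smaller memory footprint
```

In PyTorch this corresponds to calling `backward()` on each micro-batch (which accumulates into `.grad`) and stepping the optimizer only once per effective batch.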
- Issue: Spot instance preemption interrupts ML training, and the job doesn’t resume correctly.
  Explanation: Your ML training code isn’t properly checkpointing or isn’t designed to resume from a checkpoint.
  Solution:
  - Implement Robust Checkpointing: Ensure your ML framework saves checkpoints frequently to the mounted persistent volume.
  - Resume Logic: Modify your training