Velero Backup & DR for Kubernetes: Your Guide

Introduction

In the dynamic world of Kubernetes, applications are ephemeral, and infrastructure can be volatile. While Kubernetes excels at maintaining desired states and self-healing, it doesn’t inherently protect your precious application data and configurations from accidental deletions, cluster failures, or catastrophic disasters. Imagine losing weeks of development work due to a misconfigured deployment, or facing hours of downtime because a critical PersistentVolume got corrupted. This is where a robust backup and disaster recovery (DR) strategy becomes not just a best practice, but an absolute necessity.

Velero, an open-source tool developed by VMware, steps in to fill this critical gap. It provides a simple yet powerful way to back up and restore your Kubernetes cluster resources and persistent volumes. Whether you need to migrate applications between clusters, recover from a bad deployment, or perform a full cluster restore after a disaster, Velero offers the flexibility and reliability required. This guide will walk you through setting up Velero, performing backups, and executing restores, ensuring your Kubernetes environments are resilient and your data is safe.

TL;DR: Velero Backup & Disaster Recovery

Velero enables robust backup and restore for Kubernetes cluster resources and persistent volumes, crucial for disaster recovery, migrations, and application recovery.

  • Install Velero: Use Helm or the Velero CLI to deploy Velero to your cluster, configuring it with an object storage provider (AWS S3, GCP GCS, Azure Blob Storage).
  • Backup Resources: Create on-demand backups or schedule them. Velero backs up YAML manifests and can snapshot PVs.
  • Restore Resources: Restore entire clusters, specific namespaces, or individual resources from a backup.
  • Key Commands:
    • velero install [options] – Install Velero
    • velero backup create my-backup --include-namespaces my-app-ns – Create a backup
    • velero restore create --from-backup my-backup – Restore from a backup
    • velero schedule create daily-backup --schedule "0 7 * * *" --include-namespaces my-app-ns – Schedule daily backups

Prerequisites

Before diving into Velero, ensure you have the following:

  • Kubernetes Cluster: A running Kubernetes cluster (v1.16 or later).
  • kubectl: Configured to interact with your cluster. You can find installation instructions on the official Kubernetes documentation.
  • Cloud Provider Credentials: Access to an object storage service like AWS S3, GCP GCS, or Azure Blob Storage. Velero stores backups in these buckets. This guide will primarily use AWS S3 as an example, but the concepts apply universally.
  • velero CLI: The Velero command-line interface installed on your local machine.
  • Helm (Optional but Recommended): For easier Velero installation and management. Download it from the Helm website.

Step-by-Step Guide

1. Install the Velero CLI

The Velero CLI is essential for interacting with your Velero installation, creating backups, and initiating restores. Download the appropriate binary for your operating system from the Velero GitHub releases page. It’s good practice to choose a version that matches the Velero server version you plan to install.

# Download the latest Velero CLI (adjust version as needed)
VELERO_VERSION="1.13.0" # Check for the latest stable release
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)

# Map uname machine names to the architecture names used in release assets
case "$ARCH" in
  x86_64)  ARCH="amd64" ;;
  aarch64) ARCH="arm64" ;;
esac

wget https://github.com/vmware-tanzu/velero/releases/download/v${VELERO_VERSION}/velero-v${VELERO_VERSION}-${OS}-${ARCH}.tar.gz

# Extract and move to your PATH
tar -xvf velero-v${VELERO_VERSION}-${OS}-${ARCH}.tar.gz
sudo mv velero-v${VELERO_VERSION}-${OS}-${ARCH}/velero /usr/local/bin/
rm -rf velero-v${VELERO_VERSION}-${OS}-${ARCH} velero-v${VELERO_VERSION}-${OS}-${ARCH}.tar.gz

# Verify installation
velero version --client

Expected Output:

Client:
        Version: v1.13.0
        Git SHA: 60a4f669a7c64d8527a206b4b449b775432d94c9
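
The download steps above can also be wrapped in a small helper that derives the release asset URL from uname output. This is only a sketch: the function name velero_asset_url is hypothetical (it is not part of Velero), and the URL pattern is the one used in the wget command above.

```shell
# Hypothetical helper: build the release tarball URL for a given
# Velero version, OS name (as printed by `uname -s`), and machine
# type (as printed by `uname -m`).
velero_asset_url() {
  version="$1"
  os=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$3" in
    x86_64)        arch="amd64" ;;
    aarch64|arm64) arch="arm64" ;;
    *)             arch="$3" ;;
  esac
  printf 'https://github.com/vmware-tanzu/velero/releases/download/v%s/velero-v%s-%s-%s.tar.gz\n' \
    "$version" "$version" "$os" "$arch"
}
```

It could then be used as: wget "$(velero_asset_url 1.13.0 "$(uname -s)" "$(uname -m)")"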

2. Prepare Object Storage Credentials

Velero needs access to an object storage bucket to store your backup data. This involves creating a bucket and generating credentials with appropriate permissions. For AWS S3, this means creating an IAM user with programmatic access and attaching a policy that grants read/write permissions to your designated S3 bucket. Ensure these credentials are secure and follow the principle of least privilege.

First, create an S3 bucket. For example, my-velero-backups-12345.

Next, create an IAM policy (e.g., velero-policy.json) allowing Velero to manage objects in this bucket and to create and restore EBS volume snapshots:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:DescribeSnapshots",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:CreateSnapshot",
                "ec2:DeleteSnapshot"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::my-velero-backups-12345/*",
                "arn:aws:s3:::my-velero-backups-12345"
            ]
        }
    ]
}

Apply this policy to a new IAM user and generate an Access Key ID and Secret Access Key. Save these in a file named credentials-velero:

[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Verify:
Ensure your AWS CLI (if installed) can list objects in the bucket using these credentials. This step is crucial for Velero’s successful operation.

# Test with AWS CLI (if installed)
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE \
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
aws s3 ls s3://my-velero-backups-12345
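
Before passing the file to Velero, it can help to confirm that it contains both keys and is not readable by other users. The following is a minimal sketch; check_credentials_file is a hypothetical helper name, and the check only looks for non-empty values of the two key names used in the example file above.

```shell
# Hypothetical pre-flight check for the credentials file passed to
# `velero install --secret-file`. Returns non-zero if the file is
# unreadable or one of the two expected keys is missing or empty.
check_credentials_file() {
  file="$1"
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
  for key in aws_access_key_id aws_secret_access_key; do
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]+" "$file" \
      || { echo "missing $key in $file" >&2; return 1; }
  done
  # Keep the secret readable and writable by the current user only.
  chmod 600 "$file"
}
```

For example: check_credentials_file ./credentials-velero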

3. Install Velero Server in Your Cluster

Now, deploy the Velero server components into your Kubernetes cluster. This involves creating a dedicated namespace, a service account, and the Velero deployment itself. We’ll use the velero install command, which streamlines much of this process. Remember to specify your cloud provider and bucket details. If you’re using Helm, the process is similar but involves a values file.

For AWS, the command looks like this:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.9.0 \
    --bucket my-velero-backups-12345 \
    --secret-file ./credentials-velero \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1 \
    --namespace velero

Explanation of parameters:

  • --provider aws: Specifies the cloud provider.
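  • --plugins velero/velero-plugin-for-aws:v1.9.0: Installs the necessary plugin for AWS. The plugin has its own version numbering, so pick the plugin release that is compatible with your Velero version (v1.9.x pairs with Velero v1.13.x).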
  • --bucket my-velero-backups-12345: Your S3 bucket name.
  • --secret-file ./credentials-velero: Path to your AWS credentials file. Velero will create a Kubernetes secret from this.
  • --backup-location-config region=us-east-1: The region where your S3 bucket is located.
  • --snapshot-location-config region=us-east-1: The region for EBS volume snapshots. This is critical for backing up PersistentVolumes.
  • --namespace velero: The Kubernetes namespace where Velero components will be installed.

Verify:
Check if Velero pods are running in the velero namespace. All pods should be in a Running state.

kubectl get pods -n velero

Expected Output:

NAME                      READY   STATUS    RESTARTS   AGE
velero-7b8c8d8b4f-abcde   1/1     Running   0          2m
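
A Running pod does not by itself show that Velero can reach the bucket. The BackupStorageLocation resource reports a phase (the PHASE column of velero backup-location get), which can be checked as sketched below. The helper name bsl_available is hypothetical; the location name default and namespace velero match the install command above.

```shell
# Hypothetical readiness check: succeed only when the named
# BackupStorageLocation reports the "Available" phase.
bsl_available() {
  name="${1:-default}"
  namespace="${2:-velero}"
  phase=$(kubectl get backupstoragelocation "$name" -n "$namespace" \
    -o jsonpath='{.status.phase}')
  [ "$phase" = "Available" ]
}
```

For example: bsl_available default velero || echo "backup location is not available yet"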

4. Create Your First Backup

With Velero installed, you can now create your first backup. Velero backups are comprehensive, capturing both Kubernetes resource definitions (e.g., Deployments, Services, ConfigMaps, Secrets) and, optionally, PersistentVolumes via snapshots. You can back up an entire cluster, specific namespaces, or even individual resource types. For complex applications, backing up specific namespaces is often preferred. For a deeper dive into securing your cluster resources, consider our Kubernetes Network Policies: Complete Security Hardening Guide.

Let’s first deploy a sample application to back up. Create a namespace and a simple Nginx deployment:

# app.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app-ns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: my-app-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: my-app-ns
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP

Apply the manifest and check that the resources are running:

kubectl apply -f app.yaml
kubectl get all -n my-app-ns

Expected Output:

NAME                                 READY   STATUS    RESTARTS   AGE
pod/nginx-deployment-7848d4b868-abcde   1/1     Running   0          1m

NAME                   TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
service/nginx-service   ClusterIP   10.96.100.100   <none>        80/TCP    1m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment   1/1     1            1           1m

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-deployment-7848d4b868   1         1         1       1m

Now, create a backup of the my-app-ns namespace:

velero backup create my-first-backup --include-namespaces my-app-ns --wait

The --wait flag will make the CLI wait until the backup completes. This can take some time depending on the size of your data and the number of resources.

Verify:
Check the status of your backup and list its contents.

velero backup get
velero backup describe my-first-backup

Expected Output (velero backup get):

NAME                STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
my-first-backup     Completed   0        0          2023-10-27 10:30:00 +0000 UTC   29d       default            <none>

Expected Output (velero backup describe my-first-backup – partial):

Name:         my-first-backup
Namespace:    velero
Labels:       velero.io/backup-name=my-first-backup
              velero.io/backup-uid=a1b2c3d4-e5f6-7890-1234-567890abcdef
Annotations:  

Phase:  Completed

Errors:    0
Warnings:  0

Cluster-scoped resources:
  Included:        
  Excluded:        
  Label selector:  

Namespaces:
  Included:  my-app-ns
  Excluded:  

Resources:
  Included:        *
  Excluded:        
  Label selector:  
  ... (truncated for brevity)
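
In scripts, it is worth checking the final phase explicitly rather than relying on the CLI's exit status, since a backup can also end as PartiallyFailed or Failed. A minimal sketch, assuming Velero runs in the velero namespace; backup_namespace is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: create a namespace backup, wait for it, and
# fail unless the Backup object's phase is "Completed".
backup_namespace() {
  backup_name="$1"
  namespace="$2"
  velero backup create "$backup_name" \
    --include-namespaces "$namespace" --wait || return 1
  phase=$(kubectl get backups.velero.io "$backup_name" -n velero \
    -o jsonpath='{.status.phase}')
  if [ "$phase" != "Completed" ]; then
    echo "backup $backup_name ended in phase: $phase" >&2
    return 1
  fi
}
```

For example: backup_namespace my-first-backup my-app-ns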

5. Simulate Disaster & Delete Resources

To test the restore functionality, we need to simulate a disaster by deleting the application and its namespace. This mimics accidental deletion or a cluster failure. For advanced traffic management and resilience, you might explore tools like Istio Ambient Mesh, which can help mitigate certain types of failures, but Velero remains critical for data persistence.

kubectl delete namespace my-app-ns

Verify:
Ensure the namespace and all its resources are gone.

kubectl get all -n my-app-ns
kubectl get namespace my-app-ns

Expected Output:

No resources found in my-app-ns namespace.
Error from server (NotFound): namespaces "my-app-ns" not found

6. Restore from Backup

Now, bring your application back from the backup. Velero can restore an entire cluster, specific namespaces, or even individual resources. For this example, we’ll restore the entire my-app-ns namespace.

velero restore create --from-backup my-first-backup --wait

The --wait flag will make the CLI wait until the restore operation completes.

Verify:
Check the status of the restore and confirm that your application resources are back in the cluster.

velero restore get
velero restore describe my-first-backup-20231027103500 # Replace with your actual restore name
kubectl get all -n my-app-ns

Expected Output (velero restore get):

NAME                               BACKUP              STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   SELECTOR
my-first-backup-20231027103500     my-first-backup     Completed   2023-10-27 10:35:00 +0000 UTC   2023-10-27 10:35:15 +0000 UTC   0        0          <none>

Expected Output (kubectl get all -n my-app-ns):

NAME                                 READY   STATUS    RESTARTS   AGE
pod/nginx-deployment-7848d4b868-abcde   1/1     Running   0          30s

NAME                   TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
service/nginx-service   ClusterIP   10.96.100.100   <none>        80/TCP    30s

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment   1/1     1            1           30s

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-deployment-7848d4b868   1         1         1       30s
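
A restore can also be rehearsed without deleting anything by restoring into a differently named namespace with the --namespace-mappings flag. The sketch below assumes the backup contains the source namespace; restore_into_copy and the target namespace name are hypothetical.

```shell
# Hypothetical helper: restore one namespace from a backup into a
# differently named namespace, leaving any live copy untouched.
restore_into_copy() {
  backup_name="$1"
  source_ns="$2"
  target_ns="$3"
  velero restore create \
    --from-backup "$backup_name" \
    --include-namespaces "$source_ns" \
    --namespace-mappings "${source_ns}:${target_ns}" \
    --wait
}
```

For example: restore_into_copy my-first-backup my-app-ns my-app-ns-restored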

7. Schedule Automatic Backups

Manual backups are good for testing, but production environments need scheduled backups. Velero lets you define backup schedules using cron expressions, so your data is protected regularly without manual intervention.

velero schedule create daily-backup \
    --schedule "0 7 * * *" \
    --include-namespaces my-app-ns \
    --ttl 720h0m0s # Retain backups for 30 days (720 hours)

This command creates a schedule named daily-backup that will run every day at 7:00 AM UTC, backing up the my-app-ns namespace, and retaining each backup for 30 days. Velero will automatically prune older backups.

Verify:
Check the created schedule.

velero schedule get

Expected Output:

NAME           STATUS    CREATED                         SCHEDULE    BACKUP TTL   LAST BACKUP   SELECTOR
daily-backup   Enabled   2023-10-27 10:45:00 +0000 UTC   0 7 * * *   720h0m0s     <never>       <none>

You can also check the backups created by the schedule:

velero backup get --selector velero.io/schedule-name=daily-backup

This command will show backups created by the daily-backup schedule once it starts running.
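
The --ttl value used above is a duration in hours, so a retention period expressed in days has to be multiplied by 24 (30 days = 720h). A small sketch of that arithmetic follows; ttl_for_days is a hypothetical helper, not a Velero command.

```shell
# Hypothetical helper: convert a retention period in days into the
# hour-based duration string used with --ttl in this guide.
ttl_for_days() {
  days="$1"
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  printf '%dh0m0s\n' "$(($days * 24))"
}
```

For example: velero schedule create daily-backup --schedule "0 7 * * *" --include-namespaces my-app-ns --ttl "$(ttl_for_days 30)"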

Production Considerations

  • Backup Strategy: Define clear RPO (Recovery Point Objective) and RTO (Recovery Time Objective) for your applications. This will dictate backup frequency and retention policies.
  • Storage Backend: Use highly available and durable object storage (e.g., S3, GCS, Azure Blob Storage) with appropriate redundancy.
  • Volume Snapshots: Ensure your cloud provider’s volume snapshot capability is properly configured and has sufficient permissions for Velero. Without it, PersistentVolumes will only be backed up as YAML definitions, not their actual data.
  • Security:
    • IAM Permissions: Grant Velero the absolute minimum necessary permissions to your object storage and cloud provider APIs. Regularly review these permissions.
    • Secrets Management: Store your cloud credentials securely. Velero creates a Kubernetes Secret, but ensure the source file is protected.
    • Network Policies: Restrict network access to the Velero pod and its associated components using Kubernetes Network Policies. This prevents unauthorized access to your backup operations.
  • Monitoring and Alerting: Set up monitoring for Velero backup and restore jobs. Alert on failures, warnings, or overdue backups. Tools like Prometheus and Grafana can be integrated. For advanced observability, consider leveraging eBPF Observability: Building Custom Metrics with Hubble.
  • Testing Restores: Regularly test your restore process in a separate, non-production environment. A backup is only as good as its restore. This includes full cluster restores and specific application restores.
  • Backup Hooks: For stateful applications, use Velero’s backup hooks to quiesce applications before snapshotting and unquiesce them afterward. This ensures data consistency.
  • Resource Exclusion: Exclude non-essential resources from backups (e.g., monitoring stacks, ephemeral caches) to reduce backup size and restore time.
  • Cross-Cluster/Cross-Region Recovery: Velero can be used for migrating applications between clusters or for disaster recovery to a different region. Ensure your backup locations are accessible from your recovery cluster.
  • Velero Version Compatibility: Always check the compatibility matrix between your Kubernetes version and Velero version on the official Velero documentation.
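
The backup hooks mentioned in the list above are configured through pod annotations. The sketch below freezes a filesystem before the backup and unfreezes it afterwards. It assumes the target pods have a container that ships /sbin/fsfreeze and has the privileges to run it; the helper name annotate_freeze_hooks and all argument values are hypothetical.

```shell
# Hypothetical helper: annotate matching pods so Velero runs fsfreeze
# in the given container before and after backing up the pod.
annotate_freeze_hooks() {
  namespace="$1"
  selector="$2"
  container="$3"
  mount_path="$4"
  kubectl annotate pod -n "$namespace" -l "$selector" --overwrite \
    "pre.hook.backup.velero.io/container=${container}" \
    "pre.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--freeze\", \"${mount_path}\"]" \
    "post.hook.backup.velero.io/container=${container}" \
    "post.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--unfreeze\", \"${mount_path}\"]"
}
```

Annotations applied to running pods are lost when the pods are recreated, so for anything long-lived the same annotations are better placed in the pod template of the Deployment or StatefulSet.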

Troubleshooting

Here are common issues you might encounter with Velero and their solutions:

1. Velero Pod Not Running / CrashLoopBackOff

Issue: The velero pod is not in a Running state or is crashing repeatedly.

Solution:
Check the pod logs for errors. Common causes include incorrect credentials, invalid bucket names, or missing permissions.

kubectl logs -n velero deploy/velero

Look for messages indicating access denied, bucket not found, or issues with AWS/GCP/Azure API calls. Verify your credentials-velero file and IAM policy.

2. Backup Fails with “Failed to upload backup”

Issue: Backup creation fails with an error indicating an upload failure.

Solution:
This typically points to issues with the object storage configuration or permissions.

  • Double-check your S3 bucket name and region in the velero install command.
  • Ensure the IAM user associated with the credentials has s3:PutObject, s3:GetObject, and s3:ListBucket permissions for the specified bucket.
  • Verify network connectivity from your Kubernetes nodes to the S3 endpoint.

velero backup describe <backup-name>

This command can provide more details on the specific upload error.

3. PersistentVolume Snapshots Not Created

Issue: Velero completes a backup, but there are no corresponding volume snapshots in your cloud provider, or the restore fails to provision volumes.

Solution:

  • Ensure the Velero IAM user (or service account) has the necessary permissions for creating and managing volume snapshots (e.g., ec2:CreateSnapshot, ec2:DeleteSnapshot for AWS EBS).
  • Verify that the --snapshot-location-config region=<region> was correctly specified during Velero installation and matches the region of your PersistentVolumes.
  • Check if your StorageClass supports volume provisioning with a CSI driver that Velero’s plugin can interact with. For example, for AWS EBS, the AWS EBS CSI driver must be installed and configured.

velero backup describe <backup-name>

Look for warnings or errors related to volume snapshots in the backup description.

4. Restore Fails / Resources Not Reappearing

Issue: A restore command completes, but the expected Kubernetes resources are not back in the cluster.

Solution:

  • Check the restore details and logs:
    velero restore describe <restore-name>
    velero restore logs <restore-name>
    kubectl logs -n velero deployment/velero

    Look for errors related to resource creation, conflicts, or missing dependencies.

  • Resource Conflicts: If you’re restoring to a cluster where the original resources still exist (e.g., after a partial deletion), Velero skips resources that are already present by default. Either delete the conflicting resources first, or pass --existing-resource-policy=update to have Velero attempt to update them.
  • Namespace Missing: If you restored a namespace-scoped backup to a cluster where the namespace didn’t exist, Velero should create it. If not, there might be a permission issue for namespace creation.
  • Custom Resources (CRDs): If your application uses CRDs, the CRDs themselves must be present in the cluster before the Custom Resources that depend on them can be created. Velero can back up CRDs and restores them ahead of custom resources by default, so if custom resources fail to restore, check that their CRDs were included in the backup.

5. Scheduled Backup Not Running

Issue: Your Velero schedule is created, but no backups are being generated at the specified times.

Solution:

  • Check the schedule status:
    velero schedule get

    Ensure it shows Enabled and the LAST BACKUP field (if any) is updating.

  • Verify the cron expression is correct, and remember that Velero evaluates it in UTC. Use an online cron expression validator if unsure.
  • Check the Velero server logs for any errors related to the schedule controller:
    kubectl logs -n velero deploy/velero

    Look for messages indicating issues with processing schedules.

  • Ensure the Velero pod is healthy and not restarting.
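
A quick local sanity check for the cron expression is to count its fields, since a standard cron expression has exactly five. The sketch below does only that and does not validate the field values; cron_has_five_fields is a hypothetical helper, and it would reject descriptor forms such as "@every 24h" if your Velero version accepts them.

```shell
# Hypothetical sanity check: succeed when the expression has exactly
# five whitespace-separated fields. Field values are not validated.
cron_has_five_fields() {
  # Run in a subshell so `set` does not affect the caller.
  # `set -f` stops the shell from expanding "*" into file names.
  ( set -f; set -- $1; [ "$#" -eq 5 ] )
}
```

For example: cron_has_five_fields "0 7 * * *" || echo "schedule should have five fields"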

6. Too Many Backups / Storage Full

Issue: Velero is creating too many backups, consuming excessive storage, or not deleting expired backups.

Solution:

  • TTL (Time-To-Live): Ensure your backups and schedules have an appropriate --ttl set. Velero relies on this to prune old backups. If --ttl is omitted, Velero applies its default of 30 days (720h).
    velero backup create my-backup --ttl 720h0m0s
    velero schedule create daily-backup --schedule "0 0 * * *" --ttl 168h0m0s # 7 days
  • Garbage Collection: Velero has internal garbage collection. Ensure it’s working. If backups are stuck in Deleting state, check Velero server logs for errors during deletion.
  • Manual Deletion: If necessary, manually delete backups:
    velero backup delete <backup-name>

    This will delete the backup record in Kubernetes, the corresponding data in object storage, and any volume snapshots associated with the backup.

FAQ Section

1. What is the difference between a Velero backup and a cloud provider snapshot?

A Velero backup captures two main things: the YAML definitions of your Kubernetes resources (Deployments, Services, ConfigMaps, etc.) and, optionally, snapshots of your PersistentVolumes. Cloud provider snapshots (like AWS EBS snapshots or GCP Persistent Disk snapshots) only capture the state of a specific disk volume. Velero orchestrates these snapshots and links them to your Kubernetes resource backups, providing a complete application-centric recovery point.

2. Can Velero back up my entire Kubernetes cluster?

Yes, Velero can back up nearly all resources in your cluster. By default, velero backup create <name> will back up all namespaces and cluster-scoped resources. However, it’s often more practical to back up specific applications or namespaces. Velero does not back up the underlying worker node operating systems or the Kubernetes control plane components themselves, but rather the resources managed by the control plane.

3. How does Velero handle application state during backup?

For stateless applications, a backup of their YAML manifests is usually sufficient. For stateful applications with PersistentVolumes, Velero uses cloud provider APIs to create volume snapshots. To ensure data consistency for actively writing applications, Velero supports backup hooks. These hooks allow you to run commands (e.g., to quiesce a database) inside a pod before the snapshot is taken and then unquiesce it afterward.

4. Can I use Velero to migrate applications between different Kubernetes clusters or cloud providers?

Yes, this is one of Velero’s main use cases. You can back up an application from one cluster (e.g., an on-premises cluster or a specific cloud environment) and restore it to another. The target cluster needs Velero installed and configured with access to the same backup storage location, and any necessary StorageClasses or CSI drivers must be available. Kubernetes resource manifests move between providers without trouble, but native volume snapshots are tied to the provider that created them: an EBS snapshot cannot be restored as a GCP Persistent Disk. For a cross-provider move such as AWS EKS to GCP GKE, back up volume data with Velero’s File System Backup instead of snapshots.
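
When a second cluster points at the same bucket for a migration or DR test, it is safer to make the backup storage location read-only on that cluster so that restores cannot create or delete backup objects. The sketch below patches the accessMode field of the BackupStorageLocation; set_bsl_access_mode is a hypothetical helper, and the default location name and velero namespace are assumptions carried over from the install step.

```shell
# Hypothetical helper: switch a BackupStorageLocation between
# ReadOnly and ReadWrite on the cluster you are restoring into.
set_bsl_access_mode() {
  mode="$1"               # ReadOnly or ReadWrite
  name="${2:-default}"
  namespace="${3:-velero}"
  case "$mode" in
    ReadOnly|ReadWrite) ;;
    *) echo "mode must be ReadOnly or ReadWrite" >&2; return 1 ;;
  esac
  kubectl patch backupstoragelocation "$name" -n "$namespace" \
    --type merge --patch "{\"spec\":{\"accessMode\":\"${mode}\"}}"
}
```

For example, run set_bsl_access_mode ReadOnly before the restore and set_bsl_access_mode ReadWrite once the cluster should start taking its own backups.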

5. Are there any limitations or resources Velero cannot back up?

While Velero is comprehensive, there are a few considerations:

  • Local PersistentVolumes: Native snapshots rely on cloud provider snapshot APIs, which local PersistentVolumes do not have. Their data can be backed up with Velero’s File System Backup (run by the node agent) instead. Note that hostPath volumes are not supported by File System Backup.
  • Dynamic IP Addresses: External IP addresses assigned by cloud load balancers or ingress controllers might change upon restore unless specifically configured to be static. For advanced networking and traffic management, consider solutions like the Kubernetes Gateway API.
  • Cluster-specific Configurations: While Velero backs up cluster-scoped resources, certain very low-level cluster configurations (like CNI plugin configurations, or specific node labels/taints) are outside its scope. For managing the entire cluster lifecycle, tools like Cluster API are more appropriate.

Cleanup Commands

To remove the sample application, Velero installation, and associated backups:

# 1. Delete the sample application namespace
kubectl delete namespace my-app-ns

# 2. Delete all Velero backups (this also deletes data from your S3 bucket)
velero backup delete --all

# 3. Delete all Velero schedules
velero schedule delete --all

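# 4. Uninstall Velero from the cluster
# Wait until `velero backup get` shows no backups first: deletions are
# processed by the Velero server and stop once it is removed.
# If installed with velero install (also removes the CRDs and cluster role binding):
velero uninstall

# If installed with Helm:
helm uninstall velero -n velero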

# 5. Remove the local credentials file
rm ./credentials-velero

# 6. (Optional) Delete the S3 bucket and IAM user/policy in AWS if no longer needed
# This needs to be done manually via AWS console or AWS CLI
# aws s3 rb s3://my-velero-backups-12345 --force
# aws iam delete-access-key --access-key-id AKIAIOSFODNN7EXAMPLE --user-name velero-user
# aws iam detach-user-policy --user-name velero-user --policy-arn arn:aws:iam::123456789012:policy/VeleroBackupPolicy
# aws iam delete-user --user-name velero-user
# aws iam delete-policy --policy-
