Kubeflow Cheatsheet

What is Kubeflow?

Kubeflow is an open-source machine-learning platform that makes deploying ML workflows on Kubernetes simple, portable, and scalable.

Core Components

1. Kubeflow Pipelines

  • Build and deploy portable, scalable ML workflows
  • Based on Docker containers

2. Notebooks

  • JupyterLab environments for data science
  • Pre-configured with ML frameworks

3. Training Operators

  • TFJob (TensorFlow)
  • PyTorchJob (PyTorch)
  • MXJob (MXNet)
  • XGBoostJob (XGBoost)

4. KServe (formerly KFServing)

  • Model serving on Kubernetes
  • Serverless inference

5. Katib

  • Hyperparameter tuning
  • Neural architecture search

6. Central Dashboard

  • Unified UI for all Kubeflow components

Installation

Prerequisites

# Kubernetes cluster (v1.25+)
# kubectl configured
# kustomize (v5.0.0+)

Install Kubeflow (via manifests)

# Clone the manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install Kubeflow
while ! kustomize build example | kubectl apply -f -; do echo "Retrying..."; sleep 10; done

Install using MiniKF (local development)

# Using Vagrant
vagrant init arrikto/minikf
vagrant up

Check Installation

# Check all pods in kubeflow namespace
kubectl get pods -n kubeflow

# Check all services
kubectl get svc -n kubeflow

# Port forward to access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Kubeflow Pipelines

Pipeline SDK Installation

pip install kfp

Create a Simple Pipeline

from kfp import dsl
from kfp import compiler

@dsl.component
def add_numbers(a: int, b: int) -> int:
    return a + b

@dsl.pipeline(
    name='Addition Pipeline',
    description='A simple pipeline that adds two numbers'
)
def add_pipeline(a: int = 1, b: int = 2):
    add_task = add_numbers(a=a, b=b)
    
# Compile the pipeline
compiler.Compiler().compile(add_pipeline, 'pipeline.yaml')
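Because a lightweight component's body is plain Python, the logic can be sanity-checked locally before compiling. A minimal sketch, redefining the function without the decorator (the decorated object is a component, not a plain callable):

```python
def add_numbers(a: int, b: int) -> int:
    # Same body as the @dsl.component version above; running it as a
    # plain function catches type and logic errors before a cluster run.
    return a + b

result = add_numbers(1, 2)  # the pipeline's default inputs
print(result)
```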

Pipeline Commands

# Upload a pipeline (kfp 1.x CLI syntax; the kfp 2.x CLI renamed this to `kfp pipeline create`)
kfp pipeline upload -p pipeline_name pipeline.yaml

# Create a run (kfp 2.x: `kfp run create`)
kfp run submit -e experiment_name -p pipeline_name -r run_name

# List pipelines
kfp pipeline list

# List runs
kfp run list

# Get run details
kfp run get <run-id>

# Delete a pipeline
kfp pipeline delete <pipeline-id>

Pipeline Components

# Lightweight Python component (packages_to_install makes pandas
# available inside the base image at runtime)
@dsl.component(base_image='python:3.9', packages_to_install=['pandas'])
def preprocess_data(input_path: str, output_path: str):
    import pandas as pd
    df = pd.read_csv(input_path)
    # Processing logic
    df.to_csv(output_path, index=False)

# Container-based component
@dsl.container_component
def custom_container_op():
    return dsl.ContainerSpec(
        image='gcr.io/my-project/my-image:latest',
        command=['python', 'train.py'],
        args=['--epochs', '10']
    )

Jupyter Notebooks

Access Notebooks

# Port forward to notebook service
kubectl port-forward -n kubeflow svc/jupyter-web-app-service 8080:80

# Access at: http://localhost:8080

Create Notebook via kubectl

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook
  namespace: kubeflow-user
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: jupyter/tensorflow-notebook:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"

# Save the manifest above as notebook.yaml, then apply it:
kubectl apply -f notebook.yaml

Training Operators

TensorFlow Training (TFJob)

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:latest
            command:
            - python
            - /var/tf_mnist/mnist_with_summaries.py
    PS:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:latest
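
Distributed training usually needs explicit resources. A hedged fragment showing how a GPU request and a restart policy would typically be added to the Worker spec above (values are illustrative):

```yaml
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:latest
            resources:
              limits:
                nvidia.com/gpu: 1  # schedules each worker on a GPU node
```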

PyTorch Training (PyTorchJob)

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-images-public/pytorch-operator:latest
            command:
            - python
            - /opt/pytorch-mnist/mnist.py
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-images-public/pytorch-operator:latest

Check Training Jobs

# Get TFJobs
kubectl get tfjob -n kubeflow

# Get PyTorchJobs
kubectl get pytorchjob -n kubeflow

# Describe a job
kubectl describe tfjob mnist-training -n kubeflow

# Check logs
kubectl logs -n kubeflow <pod-name>

KServe (Model Serving)

Deploy a Model

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow-user
spec:
  predictor:
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/iris"

# Save the manifest above as inference-service.yaml, then apply it:
kubectl apply -f inference-service.yaml
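
In serverless mode KServe scales on request load. A hedged fragment showing commonly used knobs on the predictor above (scale-to-zero and resource requests; values are illustrative):

```yaml
spec:
  predictor:
    minReplicas: 0   # allow scale-to-zero when idle
    maxReplicas: 3
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/iris"
      resources:
        requests:
          cpu: "500m"
          memory: 1Gi
```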

Test Inference

# Get the service URL
kubectl get inferenceservice sklearn-iris -n kubeflow-user

# Send a prediction request
curl -v -H "Content-Type: application/json" \
  http://sklearn-iris.kubeflow-user.example.com/v1/models/sklearn-iris:predict \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
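
The same request can be issued from Python. A minimal sketch that builds the v1 protocol payload; the URL below is the example hostname from the curl call and must be replaced with your cluster's, so the HTTP call itself is left commented out:

```python
import json

# KServe's v1 protocol expects {"instances": [...]}, one entry per sample
# (here: the four iris features).
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}
body = json.dumps(payload)

url = ("http://sklearn-iris.kubeflow-user.example.com"
       "/v1/models/sklearn-iris:predict")

# With the `requests` package installed, the actual call would be:
# resp = requests.post(url, data=body,
#                      headers={"Content-Type": "application/json"})
# print(resp.json())
print(body)
```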

Common KServe Operations

# List inference services
kubectl get inferenceservices -n kubeflow-user

# Describe inference service
kubectl describe inferenceservice sklearn-iris -n kubeflow-user

# Delete inference service
kubectl delete inferenceservice sklearn-iris -n kubeflow-user

# Check service logs
kubectl logs -n kubeflow-user -l serving.kserve.io/inferenceservice=sklearn-iris

Katib (Hyperparameter Tuning)

Create an Experiment

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-experiment
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:latest
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"

Katib Commands

# Get experiments
kubectl get experiments -n kubeflow

# Describe experiment
kubectl describe experiment random-experiment -n kubeflow

# Get trials for an experiment
kubectl get trials -n kubeflow -l katib.kubeflow.org/experiment=random-experiment

# Delete experiment
kubectl delete experiment random-experiment -n kubeflow

Common kubectl Commands for Kubeflow

Namespace Operations

# List all kubeflow namespaces
kubectl get ns | grep kubeflow

# Create a user namespace
kubectl create namespace kubeflow-user-example

Pod Operations

# Get all pods in kubeflow namespace
kubectl get pods -n kubeflow

# Get pods with labels
kubectl get pods -n kubeflow -l app=ml-pipeline

# Check pod logs
kubectl logs -n kubeflow <pod-name>

# Follow logs
kubectl logs -n kubeflow <pod-name> -f

# Execute command in pod
kubectl exec -it -n kubeflow <pod-name> -- /bin/bash

Service Operations

# List services
kubectl get svc -n kubeflow

# Port forward to a service
kubectl port-forward -n kubeflow svc/<service-name> 8080:80

# Get service endpoints
kubectl get endpoints -n kubeflow <service-name>

ConfigMap and Secrets

# List configmaps
kubectl get configmap -n kubeflow

# List secrets
kubectl get secrets -n kubeflow

# Describe a secret
kubectl describe secret -n kubeflow <secret-name>

Troubleshooting

Check Component Status

# Check all kubeflow deployments
kubectl get deployments -n kubeflow

# Check statefulsets
kubectl get statefulsets -n kubeflow

# Check persistent volume claims
kubectl get pvc -n kubeflow

# Check events
kubectl get events -n kubeflow --sort-by='.lastTimestamp'

Pipeline Debugging

# Pipeline runs execute as Argo Workflows in a standard install
# (a pipelineruns resource exists only with the Tekton backend)

# Check argo workflows
kubectl get workflows -n kubeflow

# Describe workflow
kubectl describe workflow <workflow-name> -n kubeflow

# Check pipeline logs
kubectl logs -n kubeflow -l workflows.argoproj.io/workflow=<workflow-name>

Common Issues

Issue: Pods stuck in Pending

# Check pod events
kubectl describe pod <pod-name> -n kubeflow

# Check node resources
kubectl top nodes

# Check PVC status
kubectl get pvc -n kubeflow

Issue: Pipeline fails to run

# Check ml-pipeline service
kubectl get svc ml-pipeline -n kubeflow

# Check ml-pipeline-ui logs
kubectl logs -n kubeflow -l app=ml-pipeline-ui

# Restart ml-pipeline
kubectl rollout restart deployment/ml-pipeline -n kubeflow

Issue: Cannot access dashboard

# Check istio-ingressgateway
kubectl get svc istio-ingressgateway -n istio-system

# Check authentication (Dex runs in the auth namespace)
kubectl get pods -n auth

# Fallback: port forward directly to the Pipelines UI
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

Useful Resources

  • Official Docs: https://www.kubeflow.org/docs/
  • GitHub: https://github.com/kubeflow/kubeflow
  • Pipelines SDK: https://kubeflow-pipelines.readthedocs.io/
  • KServe: https://kserve.github.io/website/
  • Katib: https://www.kubeflow.org/docs/components/katib/

Environment Variables

# Set Kubeflow namespace
export NAMESPACE=kubeflow

# Set pipeline endpoint
export PIPELINE_HOST=http://localhost:8080

# Set KServe domain
export KSERVE_DOMAIN=example.com
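
These variables can then drive the SDK. A minimal sketch reading them with stdlib defaults; the kfp.Client call is commented out since it needs a reachable endpoint:

```python
import os

# Fall back to the defaults exported above if the variables are unset.
namespace = os.environ.get("NAMESPACE", "kubeflow")
pipeline_host = os.environ.get("PIPELINE_HOST", "http://localhost:8080")

print(f"namespace={namespace} pipeline_host={pipeline_host}")

# With the kfp SDK installed, these would typically configure the client:
# import kfp
# client = kfp.Client(host=pipeline_host)
```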

Docker Integration

Build Custom Notebook Image

FROM jupyter/tensorflow-notebook:latest

USER root
RUN apt-get update && apt-get install -y vim

USER jovyan
RUN pip install --no-cache-dir \
    kfp \
    kubeflow-katib \
    kubernetes

WORKDIR /home/jovyan

Build and Push

docker build -t my-registry/custom-notebook:latest .
docker push my-registry/custom-notebook:latest
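
Once pushed, the custom image can replace the stock one in a Notebook resource (my-registry is the placeholder registry from the build step above):

```yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: custom-notebook
  namespace: kubeflow-user
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: my-registry/custom-notebook:latest
```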

Best Practices

  1. Use namespaces for multi-tenancy and isolation
  2. Version your pipelines and container images
  3. Set resource limits for notebooks and training jobs
  4. Use persistent volumes for data storage
  5. Implement monitoring with Prometheus and Grafana
  6. Use secrets for sensitive data (API keys, credentials)
  7. Enable authentication and RBAC
  8. Regular backups of pipeline metadata and artifacts
  9. Test pipelines in development namespace first
  10. Document pipelines and experiments
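
For items 3 and 6 above, resource limits and secret references are both set on the container spec. A generic hedged fragment (the image, secret, and key names are illustrative):

```yaml
containers:
- name: training
  image: my-registry/train:latest
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: my-credentials
        key: api-key
```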

Quick Start Example

# 1. Install Kubeflow
kustomize build example | kubectl apply -f -

# 2. Access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# 3. Create a simple pipeline
cat > simple_pipeline.py << 'EOF'
from kfp import dsl, compiler

@dsl.component
def hello_world(name: str) -> str:
    return f"Hello, {name}!"

@dsl.pipeline(name='Hello World Pipeline')
def hello_pipeline(name: str = "Kubeflow"):
    hello_task = hello_world(name=name)

compiler.Compiler().compile(hello_pipeline, 'hello_pipeline.yaml')
EOF

# 4. Compile and upload
python simple_pipeline.py
kfp pipeline upload -p "Hello World" hello_pipeline.yaml

# 5. Run pipeline
kfp run submit -e default -p "Hello World" -r "test-run"