What is Kubeflow?
Kubeflow is an open-source machine-learning platform that makes deploying ML workflows on Kubernetes simple, portable, and scalable.
Core Components
1. Kubeflow Pipelines
- Build and deploy portable, scalable ML workflows
- Based on Docker containers
2. Notebooks
- JupyterLab environments for data science
- Pre-configured with ML frameworks
3. Training Operators
- TFJob (TensorFlow)
- PyTorchJob (PyTorch)
- MXNetJob (MXNet)
- XGBoostJob (XGBoost)
4. KServe (formerly KFServing)
- Model serving on Kubernetes
- Serverless inference
5. Katib
- Hyperparameter tuning
- Neural architecture search
6. Central Dashboard
- Unified UI for all Kubeflow components
Installation
Prerequisites
# Kubernetes cluster (v1.25+)
# kubectl configured
# kustomize (v5.0.0+)
Install Kubeflow (via manifests)
# Clone the manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Install Kubeflow
while ! kustomize build example | kubectl apply -f -; do echo "Retrying..."; sleep 10; done
Install using MiniKF (local development)
# Using Vagrant
vagrant init arrikto/minikf
vagrant up
Check Installation
# Check all pods in kubeflow namespace
kubectl get pods -n kubeflow
# Check all services
kubectl get svc -n kubeflow
# Port forward to access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Kubeflow Pipelines
Pipeline SDK Installation
pip install kfp
Create a Simple Pipeline
from kfp import dsl
from kfp import compiler
@dsl.component
def add_numbers(a: int, b: int) -> int:
    return a + b

@dsl.pipeline(
    name='Addition Pipeline',
    description='A simple pipeline that adds two numbers'
)
def add_pipeline(a: int = 1, b: int = 2):
    add_task = add_numbers(a=a, b=b)

# Compile the pipeline
compiler.Compiler().compile(add_pipeline, 'pipeline.yaml')
Pipeline Commands
# Upload a pipeline
kfp pipeline upload -p pipeline_name pipeline.yaml
# Create a run
kfp run submit -e experiment_name -p pipeline_name -r run_name
# List pipelines
kfp pipeline list
# List runs
kfp run list
# Get run details
kfp run get <run-id>
# Delete a pipeline
kfp pipeline delete <pipeline-id>
Pipeline Components
# Lightweight Python component (dependencies must be declared so they
# are installed into the base image at runtime)
@dsl.component(base_image='python:3.9', packages_to_install=['pandas'])
def preprocess_data(input_path: str, output_path: str):
    import pandas as pd
    df = pd.read_csv(input_path)
    # Processing logic
    df.to_csv(output_path, index=False)

# Container-based component
@dsl.container_component
def custom_container_op():
    return dsl.ContainerSpec(
        image='gcr.io/my-project/my-image:latest',
        command=['python', 'train.py'],
        args=['--epochs', '10']
    )
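The body of a lightweight component like preprocess_data above is ordinary Python that runs inside the container. As a dependency-free illustration of the same read-transform-write shape, here is a sketch using only the stdlib csv module; the cleaning rule (drop rows with empty cells) is purely hypothetical.

```python
import csv
import os
import tempfile

def preprocess_csv(input_path: str, output_path: str) -> int:
    """Read a CSV, drop rows with any empty cell, write the result.

    A stdlib stand-in for the pandas logic inside preprocess_data;
    the cleaning rule is illustrative only. Returns rows kept.
    """
    with open(input_path, newline='') as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    kept = [r for r in body if all(cell.strip() for cell in r)]
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(kept)
    return len(kept)

if __name__ == '__main__':
    # Quick local check before baking the logic into a component
    src = os.path.join(tempfile.gettempdir(), 'in.csv')
    dst = os.path.join(tempfile.gettempdir(), 'out.csv')
    with open(src, 'w', newline='') as f:
        f.write('a,b\n1,2\n3,\n')
    print(preprocess_csv(src, dst))  # → 1
```

Testing the transformation locally like this is cheaper than debugging it through pipeline runs.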
Jupyter Notebooks
Access Notebooks
# Port forward to notebook service
kubectl port-forward -n kubeflow svc/jupyter-web-app-service 8080:80
# Access at: http://localhost:8080
Create Notebook via kubectl
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook
  namespace: kubeflow-user
spec:
  template:
    spec:
      containers:
        - name: notebook
          image: jupyter/tensorflow-notebook:latest
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
kubectl apply -f notebook.yaml
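Since kubectl also accepts JSON manifests, the Notebook spec above can be generated programmatically when you need many per-user notebooks. A minimal stdlib sketch (the helper name and defaults are this example's, not part of any Kubeflow API):

```python
import json

def notebook_manifest(name: str, namespace: str, image: str,
                      memory: str = "2Gi", cpu: str = "1") -> dict:
    """Build a Kubeflow Notebook manifest as a plain dict,
    mirroring the YAML example above."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Notebook",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "notebook",
                        "image": image,
                        "resources": {"requests": {"memory": memory,
                                                   "cpu": cpu}},
                    }]
                }
            }
        },
    }

if __name__ == '__main__':
    manifest = notebook_manifest("my-notebook", "kubeflow-user",
                                 "jupyter/tensorflow-notebook:latest")
    with open("notebook.json", "w") as f:
        json.dump(manifest, f, indent=2)
    # then: kubectl apply -f notebook.json
```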
Training Operators
TensorFlow Training (TFJob)
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:latest
              command:
                - python
                - /var/tf_mnist/mnist_with_summaries.py
    PS:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:latest
PyTorch Training (PyTorchJob)
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-images-public/pytorch-operator:latest
              command:
                - python
                - /opt/pytorch-mnist/mnist.py
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-images-public/pytorch-operator:latest
Check Training Jobs
# Get TFJobs
kubectl get tfjob -n kubeflow
# Get PyTorchJobs
kubectl get pytorchjob -n kubeflow
# Describe a job
kubectl describe tfjob mnist-training -n kubeflow
# Check logs
kubectl logs -n kubeflow <pod-name>
KServe (Model Serving)
Deploy a Model
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kubeflow-user
spec:
  predictor:
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/iris"
kubectl apply -f inference-service.yaml
Test Inference
# Get the service URL
kubectl get inferenceservice sklearn-iris -n kubeflow-user
# Send a prediction request
curl -v -H "Content-Type: application/json" \
http://sklearn-iris.kubeflow-user.example.com/v1/models/sklearn-iris:predict \
-d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
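The request body follows the KServe v1 (TensorFlow Serving-style) protocol: a JSON object with an "instances" list. The curl call above can be reproduced from Python with the stdlib; this sketch assumes the InferenceService hostname resolves from where you run it (e.g. via DNS or /etc/hosts).

```python
import json
from urllib import request

def build_predict_payload(instances) -> bytes:
    """KServe v1 protocol: a JSON object with an 'instances' list."""
    return json.dumps({"instances": instances}).encode()

def predict(url: str, instances):
    """POST a v1 prediction request and return the parsed JSON response."""
    req = request.Request(
        url,
        data=build_predict_payload(instances),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Same input as the curl command above:
# predict("http://sklearn-iris.kubeflow-user.example.com"
#         "/v1/models/sklearn-iris:predict",
#         [[6.8, 2.8, 4.8, 1.4]])
```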
Common KServe Operations
# List inference services
kubectl get inferenceservices -n kubeflow-user
# Describe inference service
kubectl describe inferenceservice sklearn-iris -n kubeflow-user
# Delete inference service
kubectl delete inferenceservice sklearn-iris -n kubeflow-user
# Check service logs
kubectl logs -n kubeflow-user -l serving.kserve.io/inferenceservice=sklearn-iris
Katib (Hyperparameter Tuning)
Create an Experiment
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-experiment
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
  trialTemplate:
    primaryContainerName: training-container
    # Map experiment parameters to the names substituted in trialSpec
    trialParameters:
      - name: learningRate
        description: Learning rate for the model
        reference: lr
      - name: numberLayers
        description: Number of layers
        reference: num-layers
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:latest
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
            restartPolicy: Never
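Katib's random algorithm samples each trial independently from the feasible spaces declared above. A stdlib sketch of what one suggestion loop amounts to, using the same bounds; this is illustrative, not Katib's actual implementation:

```python
import random

def suggest_trials(max_trials: int, seed: int = 0):
    """Sample trial parameters the way a random-search suggester would,
    using the feasibleSpace bounds from the Experiment above."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trials):
        trials.append({
            "lr": rng.uniform(0.01, 0.03),    # double parameter
            "num-layers": rng.randint(2, 5),  # int parameter
        })
    return trials

# maxTrialCount: 12 from the Experiment spec
for t in suggest_trials(max_trials=12):
    assert 0.01 <= t["lr"] <= 0.03
    assert 2 <= t["num-layers"] <= 5
```

In the real system each sampled dict becomes a Trial: Katib substitutes the values into the `${trialParameters.*}` placeholders and launches the resulting Job.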
Katib Commands
# Get experiments
kubectl get experiments -n kubeflow
# Describe experiment
kubectl describe experiment random-experiment -n kubeflow
# Get trials for an experiment
kubectl get trials -n kubeflow -l experiment=random-experiment
# Delete experiment
kubectl delete experiment random-experiment -n kubeflow
Common kubectl Commands for Kubeflow
Namespace Operations
# List all kubeflow namespaces
kubectl get ns | grep kubeflow
# Create a user namespace
kubectl create namespace kubeflow-user-example
Pod Operations
# Get all pods in kubeflow namespace
kubectl get pods -n kubeflow
# Get pods with labels
kubectl get pods -n kubeflow -l app=ml-pipeline
# Check pod logs
kubectl logs -n kubeflow <pod-name>
# Follow logs
kubectl logs -n kubeflow <pod-name> -f
# Execute command in pod
kubectl exec -it -n kubeflow <pod-name> -- /bin/bash
Service Operations
# List services
kubectl get svc -n kubeflow
# Port forward to a service
kubectl port-forward -n kubeflow svc/<service-name> 8080:80
# Get service endpoints
kubectl get endpoints -n kubeflow <service-name>
ConfigMap and Secrets
# List configmaps
kubectl get configmap -n kubeflow
# List secrets
kubectl get secrets -n kubeflow
# Describe a secret
kubectl describe secret -n kubeflow <secret-name>
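Values in a Secret's .data map (what `kubectl get secret -o yaml` prints) are base64-encoded, not encrypted. A tiny stdlib helper to decode one:

```python
import base64

def decode_secret_value(encoded: str) -> str:
    """Decode one base64-encoded value from a Secret's .data map."""
    return base64.b64decode(encoded).decode("utf-8")

print(decode_secret_value("cGFzc3dvcmQ="))  # → password
```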
Troubleshooting
Check Component Status
# Check all kubeflow deployments
kubectl get deployments -n kubeflow
# Check statefulsets
kubectl get statefulsets -n kubeflow
# Check persistent volume claims
kubectl get pvc -n kubeflow
# Check events
kubectl get events -n kubeflow --sort-by='.lastTimestamp'
Pipeline Debugging
# Get pipeline runs (Tekton-backed installs only; Argo-backed
# installs use the workflows below)
kubectl get pipelineruns -n kubeflow
# Check argo workflows
kubectl get workflows -n kubeflow
# Describe workflow
kubectl describe workflow <workflow-name> -n kubeflow
# Check pipeline logs
kubectl logs -n kubeflow -l workflows.argoproj.io/workflow=<workflow-name>
Common Issues
Issue: Pods stuck in Pending
# Check pod events
kubectl describe pod <pod-name> -n kubeflow
# Check node resources
kubectl top nodes
# Check PVC status
kubectl get pvc -n kubeflow
Issue: Pipeline fails to run
# Check ml-pipeline service
kubectl get svc ml-pipeline -n kubeflow
# Check ml-pipeline-ui logs
kubectl logs -n kubeflow -l app=ml-pipeline-ui
# Restart ml-pipeline
kubectl rollout restart deployment/ml-pipeline -n kubeflow
Issue: Cannot access dashboard
# Check istio-ingressgateway
kubectl get svc istio-ingressgateway -n istio-system
# Check authentication
kubectl get pods -n kubeflow | grep auth
# Workaround: port forward directly to the Pipelines UI
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
Useful Resources
- Official Docs: https://www.kubeflow.org/docs/
- GitHub: https://github.com/kubeflow/kubeflow
- Pipelines SDK: https://kubeflow-pipelines.readthedocs.io/
- KServe: https://kserve.github.io/website/
- Katib: https://www.kubeflow.org/docs/components/katib/
Environment Variables
# Set Kubeflow namespace
export NAMESPACE=kubeflow
# Set pipeline endpoint
export PIPELINE_HOST=http://localhost:8080
# Set KServe domain
export KSERVE_DOMAIN=example.com
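Scripts can pick these variables up with the same fallbacks. As a small sketch, the helper below also builds the external hostname KServe assigns to an InferenceService under the configured domain (default pattern: <name>.<namespace>.<domain>); the helper itself is this example's, not a KServe API:

```python
import os

# Read the variables exported above, falling back to the same defaults
NAMESPACE = os.environ.get("NAMESPACE", "kubeflow")
PIPELINE_HOST = os.environ.get("PIPELINE_HOST", "http://localhost:8080")
KSERVE_DOMAIN = os.environ.get("KSERVE_DOMAIN", "example.com")

def inference_host(service: str, namespace: str) -> str:
    """Build the external hostname for an InferenceService
    (KServe's default pattern: <name>.<namespace>.<domain>)."""
    return f"{service}.{namespace}.{KSERVE_DOMAIN}"

print(inference_host("sklearn-iris", "kubeflow-user"))
```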
Docker Integration
Build Custom Notebook Image
FROM jupyter/tensorflow-notebook:latest
USER root
RUN apt-get update && apt-get install -y vim
USER jovyan
RUN pip install --no-cache-dir \
      kfp \
      kubeflow-katib \
      kubernetes
WORKDIR /home/jovyan
Build and Push
docker build -t my-registry/custom-notebook:latest .
docker push my-registry/custom-notebook:latest
Best Practices
- Use namespaces for multi-tenancy and isolation
- Version your pipelines and container images
- Set resource limits for notebooks and training jobs
- Use persistent volumes for data storage
- Implement monitoring with Prometheus and Grafana
- Use secrets for sensitive data (API keys, credentials)
- Enable authentication and RBAC
- Regular backups of pipeline metadata and artifacts
- Test pipelines in development namespace first
- Document pipelines and experiments
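The "set resource limits" practice above can be enforced mechanically before manifests reach the cluster. A hedged sketch of a CI-style check over a parsed pod-template manifest (the function and sample spec are this example's own):

```python
def missing_limits(manifest: dict) -> list:
    """Return names of containers in a pod-template manifest that
    lack resources.limits — a simple pre-apply guard."""
    containers = (manifest.get("spec", {})
                          .get("template", {})
                          .get("spec", {})
                          .get("containers", []))
    return [c["name"] for c in containers
            if not c.get("resources", {}).get("limits")]

spec = {"spec": {"template": {"spec": {"containers": [
    {"name": "ok", "resources": {"limits": {"cpu": "1"}}},
    {"name": "bad"},
]}}}}
print(missing_limits(spec))  # → ['bad']
```

Failing CI when this list is non-empty keeps unbounded notebooks and training jobs out of shared clusters.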
Quick Start Example
# 1. Install Kubeflow (run from the root of the cloned manifests repo)
kustomize build example | kubectl apply -f -
# 2. Access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# 3. Create a simple pipeline
cat > simple_pipeline.py << 'EOF'
from kfp import dsl, compiler

@dsl.component
def hello_world(name: str) -> str:
    return f"Hello, {name}!"

@dsl.pipeline(name='Hello World Pipeline')
def hello_pipeline(name: str = "Kubeflow"):
    hello_task = hello_world(name=name)

compiler.Compiler().compile(hello_pipeline, 'hello_pipeline.yaml')
EOF
# 4. Compile and upload
python simple_pipeline.py
kfp pipeline upload -p "Hello World" hello_pipeline.yaml
# 5. Run pipeline
kfp run submit -e default -p "Hello World" -r "test-run"