Training Models
Learn how to train ML models with the Kubeflow SDK.
Overview
The Kubeflow SDK makes it easy to run training jobs on Kubernetes. You can:
Use any framework - PyTorch, TensorFlow, JAX, or custom code
Scale horizontally - Distribute training across multiple nodes
Use GPUs - Request GPU resources for your training jobs
Track progress - Monitor logs and job status in real time
How It Works
You write a Python training function (or use a container)
The SDK packages your code and submits it to Kubernetes
Kubeflow Trainer runs your code on the cluster
You monitor progress and retrieve results
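The steps above can be sketched roughly as follows. This is a minimal illustration, not the SDK's canonical example: `train_fn` is a hypothetical toy training function, `submit` is a hypothetical helper, and the `CustomTrainer` arguments (`func`, `num_nodes`, `resources_per_node`) and the `"gpu"` resource key are assumptions about the installed SDK version, so check them against its API reference.

```python
def train_fn(learning_rate: float = 0.1, steps: int = 50) -> float:
    """Toy training loop: minimise (w - 3)^2 with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        # Gradient of (w - 3)^2 with respect to w is 2 * (w - 3).
        w -= learning_rate * 2.0 * (w - 3.0)
    print(f"final w = {w:.4f}")
    return w


def submit(func, num_nodes: int = 2, gpus_per_node: int = 1):
    """Hypothetical helper: submit `func` as a training job, return what train() returns."""
    # Imported lazily so train_fn can be tried locally without the SDK installed.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    client = TrainerClient()  # assumes a reachable cluster running Kubeflow Trainer
    return client.train(
        trainer=CustomTrainer(
            func=func,
            num_nodes=num_nodes,
            # Assumed resource-key spelling; adjust to what your SDK version expects.
            resources_per_node={"gpu": gpus_per_node},
        )
    )
```

Calling `train_fn()` locally is a cheap way to check the function before submitting it; `submit(train_fn)` needs access to a cluster.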
Three Ways to Train
Choose the approach that fits your workflow:
| Approach | Best For | Example |
|---|---|---|
| Custom Function | Quick experiments, Jupyter notebooks | |
| Custom Container | Production, reproducible builds | |
| Built-in Trainer | LLM fine-tuning, standard workflows | |
Guides
Package your Python code and run it on Kubernetes.
Scale training across multiple GPUs and nodes.
Understand pre-configured environments for PyTorch, TensorFlow, etc.
Download datasets and pre-trained models before training starts.
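Common Patterns
The snippets in this section assume that `client` is an SDK client instance (for example, a `TrainerClient`) and that `job_name` is the name of a job you have already submitted.
Get logs while training: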
for line in client.get_job_logs(job_name, follow=True):
    print(line)
Wait for completion with timeout:
client.wait_for_job_status(job_name, timeout=3600)  # 1 hour max
List all your training jobs:
jobs = client.list_jobs()
for job in jobs:
    print(f"{job.name}: {job.status}")
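When many jobs exist, grouping them by status can be easier to read than a flat listing. The sketch below uses a hypothetical helper, `group_jobs_by_status`, and relies only on the `name` and `status` attributes used in the listing snippet; the actual status values depend on the SDK.

```python
from collections import defaultdict


def group_jobs_by_status(jobs):
    """Return a dict mapping each status (as a string) to a list of job names."""
    grouped = defaultdict(list)
    for job in jobs:
        # str() keeps the keys printable even if status is not a plain string.
        grouped[str(job.status)].append(job.name)
    return dict(grouped)
```

For example, `group_jobs_by_status(client.list_jobs())` returns one entry per status, each holding the names of the jobs in that state.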
Delete a job:
client.delete_job(job_name)
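The patterns above can be combined into one helper. This is a sketch under stated assumptions: `run_to_completion` is a hypothetical name, it calls only the client methods shown in this section, and it assumes that an exceeded timeout surfaces as Python's built-in `TimeoutError`, which you should verify for your SDK version.

```python
def run_to_completion(client, job_name, timeout=3600, cleanup=False):
    """Stream logs, wait for the job, and optionally delete it afterwards.

    Returns True if the wait finished in time, False if it timed out.
    """
    try:
        # Follow the logs until the stream ends.
        for line in client.get_job_logs(job_name, follow=True):
            print(line)
        client.wait_for_job_status(job_name, timeout=timeout)
        return True
    except TimeoutError:
        # Assumption: the SDK signals an exceeded timeout with TimeoutError.
        print(f"{job_name} did not finish within {timeout} seconds")
        return False
    finally:
        # Runs on success, timeout, and any other error.
        if cleanup:
            client.delete_job(job_name)
```

Leaving `cleanup=False` keeps the job around after the call, which is useful when you still need to inspect it.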