Training Models

Learn how to train ML models with the Kubeflow SDK.

Overview

The Kubeflow SDK makes it easy to run training jobs on Kubernetes.

Choose the approach that fits your workflow:

Approach	Best For	Example
Custom Function	Quick experiments, Jupyter notebooks	`CustomTrainer(func=train_fn)`
Custom Container	Production, reproducible builds	`CustomTrainerContainer(image="my-image")`
Built-in Trainer	LLM fine-tuning, standard workflows	`BuiltinTrainer(...)`
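As a rough illustration of the Custom Function row, the sketch below defines a self-contained training function and wraps its submission in a helper. It assumes the SDK exposes `TrainerClient` and `CustomTrainer` from `kubeflow.trainer` and that `client.train(trainer=...)` returns the job name. `train_fn`, `submit`, and the `num_nodes` argument are illustrative, so check them against the SDK version you have installed.

```python
def train_fn():
    """Toy training loop: fit y = 2x with plain gradient descent.

    Imports live inside the function on the assumption that the SDK ships
    only this function's source to the cluster.
    """
    import random

    random.seed(0)
    data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]
    w = 0.0
    for _ in range(200):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= 0.5 * grad
    print(f"learned weight: {w:.3f}")
    return w


def submit(num_nodes=1):
    """Submit train_fn as a training job (needs the SDK and cluster access)."""
    # Assumed import path and signatures; not verified against a specific release.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    client = TrainerClient()
    return client.train(trainer=CustomTrainer(func=train_fn, num_nodes=num_nodes))
```

Calling `train_fn()` locally first is a cheap way to catch errors before calling `submit()`.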

Custom Training Functions

Package your Python code and run it on Kubernetes.

Distributed Training

Scale training across multiple GPUs and nodes.

Training Runtimes

Understand pre-configured environments for PyTorch, TensorFlow, etc.

Data and Model Initializers

Download datasets and pre-trained models before training starts.

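Stream logs while a job is running (in the snippets below, `client` is an SDK client instance and `job_name` is the name of a previously submitted job):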

for line in client.get_job_logs(job_name, follow=True):
    print(line)

Wait for completion with timeout:

client.wait_for_job_status(job_name, timeout=3600)  # 1 hour max
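To make the timeout behavior concrete, here is a minimal sketch of status polling with a deadline. It is not the SDK's implementation: `wait_for_status`, `get_status`, `polling_interval`, and the injectable `clock` and `sleep` parameters are hypothetical names, and the status strings in the usage note are placeholders.

```python
import time


def wait_for_status(get_status, expected, timeout=3600.0, polling_interval=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns a value in `expected`.

    Raises TimeoutError if the deadline passes before that happens.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in expected:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout} seconds")
        # Pause between checks so the status source is not queried in a tight loop.
        sleep(polling_interval)
```

A call such as `wait_for_status(lambda: fetch_status(job_name), {"Complete"}, timeout=3600)` would mirror the one-hour limit above, where `fetch_status` is whatever function reports the job's current status. A monotonic clock is used so that system clock adjustments cannot shorten or extend the deadline.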

List all your training jobs:

jobs = client.list_jobs()
for job in jobs:
    print(f"{job.name}: {job.status}")

Delete a job:

client.delete_job(job_name)
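Listing and deleting compose naturally into a cleanup routine. The sketch below relies only on what the snippets above show: `list_jobs()` returning objects with `name` and `status`, and `delete_job(name)`. The helper `delete_finished_jobs`, the `FakeJob` and `FakeClient` stand-ins, and the status strings `"Complete"` and `"Failed"` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeJob:
    name: str
    status: str


class FakeClient:
    """In-memory stand-in so the sketch runs without a cluster."""

    def __init__(self, jobs):
        self._jobs = {job.name: job for job in jobs}

    def list_jobs(self):
        return list(self._jobs.values())

    def delete_job(self, name):
        del self._jobs[name]


def delete_finished_jobs(client, finished=("Complete", "Failed")):
    """Delete every job whose status is in `finished`; return the deleted names."""
    deleted = []
    # list_jobs() returns a snapshot, so deleting while iterating is safe here.
    for job in client.list_jobs():
        if job.status in finished:
            client.delete_job(job.name)
            deleted.append(job.name)
    return deleted


fake = FakeClient([FakeJob("a", "Complete"), FakeJob("b", "Running"), FakeJob("c", "Failed")])
print(delete_finished_jobs(fake))
print([job.name for job in fake.list_jobs()])
```

Passing a real client in place of `fake` should work as long as its jobs expose the same two attributes. Confirm the exact status values your SDK version reports before relying on the default `finished` tuple.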