Training Runtimes¶
Runtimes are pre-configured environments for running your training jobs.
What is a Runtime?¶
A runtime provides:
Container image with ML frameworks pre-installed
Distributed training setup (environment variables, process management)
Framework-specific optimizations
Think of runtimes as “batteries included” environments for specific frameworks.
Listing Available Runtimes¶
See what runtimes are available on your cluster:
from kubeflow.trainer import TrainerClient
client = TrainerClient()
runtimes = client.list_runtimes()
for runtime in runtimes:
print(f"Name: {runtime.name}")
Common Runtimes¶
Runtime |
Description |
Use Case |
|---|---|---|
|
PyTorch with distributed training support |
Most PyTorch training jobs |
|
TensorFlow with MultiWorkerMirroredStrategy |
TensorFlow training jobs |
|
MPI-based distributed training |
Custom distributed frameworks |
Using a Runtime¶
Specify a runtime when creating a training job:
client.train(
runtime="torch-distributed",
trainer=CustomTrainer(func=my_train_function)
)
Checking Runtime Details¶
Inspect what’s installed in a runtime:
runtime = client.get_runtime("torch-distributed")
# Print installed packages
client.get_runtime_packages(runtime)
This is useful for debugging import errors or version conflicts.
Default Runtime¶
If you don’t specify a runtime, the SDK uses torch-distributed by default:
# These are equivalent
client.train(trainer=CustomTrainer(func=train))
client.train(runtime="torch-distributed", trainer=CustomTrainer(func=train))
Using Custom Containers¶
If the built-in runtimes don’t have what you need, use a custom container instead:
from kubeflow.trainer.types import CustomTrainerContainer
client.train(
trainer=CustomTrainerContainer(
image="my-registry/my-training-image:latest",
command=["python", "train.py"],
)
)
This gives you full control over the environment.