Training Runtimes

Runtimes are pre-configured environments for running your training jobs.

What is a Runtime?

A runtime provides:

  • Container image with ML frameworks pre-installed

  • Distributed training setup (environment variables, process management)

  • Framework-specific optimizations

Think of a runtime as a “batteries-included” environment for a specific framework.
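To make the distributed training setup more concrete, here is a minimal sketch of how a training function might read the process layout from environment variables. It assumes the torchrun convention (RANK, LOCAL_RANK, WORLD_SIZE); whether a given runtime exports exactly these names is an assumption, and read_dist_env is a hypothetical helper, not part of the SDK.

```python
import os


def read_dist_env(env=None):
    """Return the process layout described by torchrun-style variables.

    Falls back to single-process values when a variable is missing, so
    the same function also runs outside a cluster.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    layout = read_dist_env()
    print(f"process {layout['rank']} of {layout['world_size']}")
```

Passing a plain dict instead of os.environ keeps the helper easy to test locally.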

Listing Available Runtimes

See what runtimes are available on your cluster:

from kubeflow.trainer import TrainerClient

client = TrainerClient()
runtimes = client.list_runtimes()

for runtime in runtimes:
    print(f"Name: {runtime.name}")
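If the cluster exposes many runtimes, it can help to narrow the list by name. The sketch below relies only on the name attribute used above; filter_runtimes is a hypothetical helper, and the stand-in objects replace a live client.list_runtimes() call so the example runs without a cluster.

```python
from types import SimpleNamespace


def filter_runtimes(runtimes, keyword):
    """Return the sorted names of runtimes whose name contains `keyword`."""
    keyword = keyword.lower()
    return sorted(r.name for r in runtimes if keyword in r.name.lower())


# Stand-ins for the objects returned by client.list_runtimes().
fake_runtimes = [
    SimpleNamespace(name="torch-distributed"),
    SimpleNamespace(name="mpi"),
]

print(filter_runtimes(fake_runtimes, "torch"))
```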

Common Runtimes

Runtime                 Description                                  Use Case
----------------------  -------------------------------------------  -----------------------------
torch-distributed       PyTorch with distributed training support    Most PyTorch training jobs
tensorflow-distributed  TensorFlow with MultiWorkerMirroredStrategy  TensorFlow training jobs
mpi                     MPI-based distributed training               Custom distributed frameworks

Using a Runtime

Specify a runtime when creating a training job:

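from kubeflow.trainer import CustomTrainer

# my_train_function is your own training function, defined elsewhere.
client.train(
    runtime="torch-distributed",
    trainer=CustomTrainer(func=my_train_function)
)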

Checking Runtime Details

Inspect what’s installed in a runtime:

runtime = client.get_runtime("torch-distributed")

# Print installed packages
client.get_runtime_packages(runtime)

This is useful for debugging import errors or version conflicts.
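When chasing a version conflict, it can also help to log package versions from inside the training function itself and compare them with the runtime's package list. This is a minimal sketch using only the Python standard library; report_versions is a hypothetical helper, and the package names are examples.

```python
from importlib import metadata


def report_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions(["torch", "numpy"]).items():
        print(f"{name}: {version or 'not installed'}")
```

A None entry points to a missing package (a likely cause of an import error), while an unexpected version string points to a conflict.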

Default Runtime

If you don’t specify a runtime, the SDK uses torch-distributed by default:

# These are equivalent
client.train(trainer=CustomTrainer(func=train))
client.train(runtime="torch-distributed", trainer=CustomTrainer(func=train))

Using Custom Containers

If the built-in runtimes don’t have what you need, use a custom container instead:

from kubeflow.trainer.types import CustomTrainerContainer

client.train(
    trainer=CustomTrainerContainer(
        image="my-registry/my-training-image:latest",
        command=["python", "train.py"],
    )
)

This gives you full control over the environment.