API Reference

TrainerClient

class kubeflow.trainer.TrainerClient(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)[source]

Bases: object

__init__(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)[source]

Initialize a Kubeflow Trainer client.

Parameters:

backend_config (KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None) – Backend configuration. One of KubernetesBackendConfig, LocalProcessBackendConfig, or ContainerBackendConfig. If None, a default KubernetesBackendConfig is used.

Raises:

ValueError – Invalid backend configuration.
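
As a minimal sketch of selecting a backend explicitly, the helper below builds a client for a given Kubernetes namespace. The helper name `make_client` is hypothetical, and the import is deferred so the snippet can be loaded without the SDK installed. It assumes the remaining KubernetesBackendConfig fields have usable defaults, which seems likely because the client can construct this config on its own.

```python
def make_client(namespace: str):
    """Return a TrainerClient bound to the given Kubernetes namespace."""
    # Imported lazily so that defining this helper does not require the SDK.
    from kubeflow.trainer import KubernetesBackendConfig, TrainerClient

    # `namespace` is one of the KubernetesBackendConfig fields; the other
    # fields (config_file, context, client_configuration) are left unset,
    # on the assumption that they default to sensible values.
    config = KubernetesBackendConfig(namespace=namespace)
    return TrainerClient(backend_config=config)
```

Calling the helper requires a reachable cluster and a valid kubeconfig or in-cluster credentials.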

list_runtimes() list[Runtime][source]

List the available runtimes.

Returns:

A list of available training runtimes. If no runtimes exist, an empty list is returned.

Raises:
get_runtime(name: str) Runtime[source]

Get the runtime object.

Parameters:

name (str) – Name of the runtime.

Returns:

A runtime object.

Raises:
get_runtime_packages(runtime: Runtime)[source]

Print the installed Python packages for the given runtime. If the runtime has GPUs, it also prints the GPUs available on a single training node.

Parameters:

runtime (Runtime) – Reference to one of the existing runtimes.

Raises:
train(runtime: str | Runtime | None = None, initializer: Initializer | None = None, trainer: CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None = None, options: list | None = None) str[source]

Create a TrainJob. You can configure the TrainJob using one of these trainers:

  • CustomTrainer: Runs training with a user-defined function that fully encapsulates the training process.

  • CustomTrainerContainer: Runs training with a user-defined image that fully encapsulates the training process.

  • BuiltinTrainer: Uses a predefined trainer with built-in post-training logic, requiring only parameter configuration.

Parameters:
  • runtime (str | Runtime | None) – Optional reference to one of the existing runtimes. It can accept the runtime name or Runtime object from the get_runtime() API. Defaults to the torch-distributed runtime if not provided.

  • initializer (Initializer | None) – Optional configuration for the dataset and model initializers.

  • trainer (CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None) – Optional configuration for a CustomTrainer, CustomTrainerContainer, or BuiltinTrainer. If not specified, the TrainJob will use the runtime’s default values.

  • options (list | None) – Optional list of configuration options to apply to the TrainJob. Options can be imported from kubeflow.trainer.options.

Returns:

The unique name of the TrainJob that has been generated.
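
To make the CustomTrainer path concrete, here is a hedged sketch. `train_fn` is a toy stand-in for real training code and `submit` is a hypothetical helper, neither is part of the SDK. Keeping imports inside the function body is an assumption that follows from the function having to be self-contained when it runs on the training nodes.

```python
def train_fn(learning_rate: float = 0.1, steps: int = 50) -> float:
    """Toy training loop: fit w in y = w * x to data generated with w = 3."""
    # Imports live inside the function so it carries everything it needs.
    import random

    rng = random.Random(0)
    w = 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x
        grad = 2.0 * (w * x - y) * x  # gradient of the squared error
        w -= learning_rate * grad
    print(f"learned w = {w:.4f}")
    return w


def submit(client, func, func_args=None, num_nodes=1):
    """Create a TrainJob from a function and return its generated name."""
    # Lazy import: only needed when a job is actually submitted.
    from kubeflow.trainer import CustomTrainer

    trainer = CustomTrainer(func=func, func_args=func_args, num_nodes=num_nodes)
    # No runtime is passed, so the documented default runtime applies.
    return client.train(trainer=trainer)
```

The function can be exercised locally first, for example `train_fn(learning_rate=0.1, steps=500)`, before handing it to `submit` with a real client.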

Raises:
list_jobs(runtime: Runtime | None = None) list[TrainJob][source]

List the created TrainJobs. If a runtime is specified, only TrainJobs associated with that runtime are returned.

Parameters:

runtime (Runtime | None) – Reference to one of the existing runtimes.

Returns:

List of created TrainJobs. If no TrainJobs exist, an empty list is returned.

Raises:
get_job(name: str) TrainJob[source]

Get the TrainJob object.

Parameters:

name (str) – Name of the TrainJob.

Returns:

A TrainJob object.

Raises:
get_job_logs(name: str, step: str = 'node-0', follow: bool | None = False) Iterator[str][source]

Get logs from a specific step of a TrainJob.

You can watch the logs in real time as follows:

```python
from kubeflow.trainer import TrainerClient

for logline in TrainerClient().get_job_logs(name="s8d44aa4fb6d", follow=True):
    print(logline)
```

Parameters:
  • name (str) – Name of the TrainJob.

  • step (str) – Step of the TrainJob to collect logs from, like dataset-initializer or node-0.

  • follow (bool | None) – Whether to stream logs in realtime as they are produced.

Returns:

Iterator of log lines.

Raises:
get_job_events(name: str) list[Event][source]

Get events for a TrainJob.

This provides additional clarity about the state of the TrainJob when logs alone are not sufficient. Events include information about pod state changes, errors, and other significant occurrences.

Parameters:

name (str) – Name of the TrainJob.

Returns:

A list of Event objects associated with the TrainJob.

Raises:
wait_for_job_status(name: str, status: set[str] = {'Complete'}, timeout: int = 600, polling_interval: int = 2, callbacks: list[Callable[[TrainJob], None]] | None = None) TrainJob[source]

Wait for a TrainJob to reach a desired status.

Parameters:
  • name (str) – Name of the TrainJob.

  • status (set[str]) – Expected statuses. Must be a subset of Created, Running, Complete, and Failed statuses.

  • timeout (int) – Maximum number of seconds to wait for the TrainJob to reach one of the expected statuses.

  • polling_interval (int) – The polling interval in seconds to check TrainJob status.

  • callbacks (list[Callable[[TrainJob], None]] | None) – Optional list of callback functions to be invoked after each polling interval. Each callback should accept a single argument: the TrainJob object.

Returns:

The TrainJob object once it has reached one of the expected statuses.

Raises:
  • ValueError – The input values are incorrect.

  • RuntimeError – Failed to get the TrainJob, or the TrainJob reached the Failed status unexpectedly.

  • TimeoutError – Timed out waiting for the TrainJob to reach one of the expected statuses.
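
The methods above are typically chained: submit a job, wait for it, then read its logs. The sketch below shows one way to do that using only the method names and parameters documented here. The helper `run_job` is hypothetical, and deleting the job on timeout is a design choice of this sketch, not SDK behavior.

```python
def run_job(client, trainer=None, timeout=600):
    """Submit a TrainJob, wait for completion, and return (job, polled, logs)."""
    polled = []

    def record(job):
        # Each callback receives the TrainJob object on every polling round.
        polled.append(job)

    job_name = client.train(trainer=trainer)
    try:
        job = client.wait_for_job_status(
            name=job_name,
            status={"Complete"},
            timeout=timeout,
            polling_interval=2,
            callbacks=[record],
        )
    except TimeoutError:
        # Sketch-specific choice: clean up jobs that did not finish in time.
        client.delete_job(name=job_name)
        raise
    # Collect the log lines of the first training node into a list.
    logs = list(client.get_job_logs(name=job_name, step="node-0"))
    return job, polled, logs
```

A RuntimeError is deliberately left uncaught so that a failed job stays available for inspection with get_job_logs() and get_job_events().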

delete_job(name: str)[source]

Delete the TrainJob.

Parameters:

name (str) – Name of the TrainJob.
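
For cleanup, list_jobs() and delete_job() can be combined. This is a minimal sketch with a hypothetical helper name; it assumes each TrainJob object exposes its name as a `name` attribute, which is not stated in this reference.

```python
def delete_all_jobs(client, runtime=None):
    """Delete every TrainJob returned by list_jobs() and return their names."""
    # Assumption: TrainJob objects carry the job name in a `name` attribute.
    names = [job.name for job in client.list_jobs(runtime=runtime)]
    for name in names:
        client.delete_job(name=name)
    return names
```

Passing a Runtime object restricts the cleanup to jobs associated with that runtime.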

Raises:

Trainers

class kubeflow.trainer.CustomTrainer(func: Callable, func_args: dict | None = None, image: str | None = None, packages_to_install: list[str] | None = None, pip_index_urls: list[str] = <factory>, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) None[source]

Bases: object

Custom Trainer configuration. Configure the self-contained function that encapsulates the entire model training process.

Parameters:
  • func (Callable) – The function that encapsulates the entire model training process.

  • func_args (Optional[dict]) – The arguments to pass to the function.

  • image (Optional[str]) – The optional container image to use in TrainJob.

  • packages_to_install (Optional[list[str]]) – A list of Python packages to install before running the function.

  • pip_index_urls (list[str]) – The PyPI URLs from which to install Python packages. The first URL will be the index-url, and remaining ones are extra-index-urls.

  • num_nodes (Optional[int]) – The number of nodes to use for training.

  • resources_per_node (Optional[dict]) –

    The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`.

    If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request a GPU slice instead. For example, `resources_per_node = {"mig-1g.5gb": 1}` requests one GPU slice with 5 GB of memory.

  • env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
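
The pip_index_urls rule (first URL is the index-url, the rest are extra-index-urls) maps naturally onto pip's `--index-url` and `--extra-index-url` flags. The function below is an illustrative sketch of that mapping, not the command the SDK actually generates; the function name and the second URL are hypothetical.

```python
def pip_install_args(packages, pip_index_urls):
    """Build a pip command line from packages and an ordered list of index URLs."""
    args = ["pip", "install"]
    if pip_index_urls:
        # The first URL becomes the primary index.
        args += ["--index-url", pip_index_urls[0]]
        # Every remaining URL is added as an extra index.
        for url in pip_index_urls[1:]:
            args += ["--extra-index-url", url]
    return args + list(packages)


command = pip_install_args(
    ["numpy"],
    ["https://pypi.org/simple", "https://example.com/simple"],
)
print(" ".join(command))
```

Order matters: swapping the two URLs would make the hypothetical mirror the primary index.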

func: Callable
func_args: dict | None = None
image: str | None = None
packages_to_install: list[str] | None = None
pip_index_urls: list[str]
num_nodes: int | None = None
resources_per_node: dict | None = None
env: dict[str, str] | None = None
class kubeflow.trainer.CustomTrainerContainer(image: str, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) None[source]

Bases: object

Custom Trainer Container configuration. Configure the container image that encapsulates the entire model training process.

Parameters:
  • image (str) – The container image that encapsulates the entire model training process.

  • num_nodes (Optional[int]) – The number of nodes to use for training.

  • resources_per_node (Optional[dict]) –

    The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`.

    If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request a GPU slice instead. For example, `resources_per_node = {"mig-1g.5gb": 1}` requests one GPU slice with 5 GB of memory.

  • env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.

image: str
num_nodes: int | None = None
resources_per_node: dict | None = None
env: dict[str, str] | None = None
class kubeflow.trainer.BuiltinTrainer(config: TorchTuneConfig) None[source]

Bases: object

Builtin Trainer configuration. Configure the builtin trainer that already includes the fine-tuning logic, requiring only parameter adjustments.

Parameters:

config (TorchTuneConfig) – The configuration for the builtin trainer.

config: TorchTuneConfig

Initializers

class kubeflow.trainer.Initializer(dataset: HuggingFaceDatasetInitializer | S3DatasetInitializer | DataCacheInitializer | None = None, model: HuggingFaceModelInitializer | S3ModelInitializer | None = None) None[source]

Bases: object

Initializer defines the configuration for dataset and pre-trained model initialization.

Parameters:
  • dataset (Optional[Union[HuggingFaceDatasetInitializer, S3DatasetInitializer, DataCacheInitializer]]) – The configuration for one of the supported dataset initializers.

  • model (Optional[Union[HuggingFaceModelInitializer, S3ModelInitializer]]) – The configuration for one of the supported model initializers.

dataset: HuggingFaceDatasetInitializer | S3DatasetInitializer | DataCacheInitializer | None = None
model: HuggingFaceModelInitializer | S3ModelInitializer | None = None
class kubeflow.trainer.HuggingFaceDatasetInitializer(storage_uri: str, ignore_patterns: list[str] | None = None, access_token: str | None = None) None[source]

Bases: BaseInitializer

Configuration for downloading datasets from HuggingFace Hub.

Parameters:
  • storage_uri (str) – The HuggingFace Hub dataset identifier in the format ‘hf://username/repo_name’.

  • ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.

  • access_token (Optional[str]) – HuggingFace Hub access token for private datasets.
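
The documented storage_uri format can be checked before constructing the initializer. The parser below is a minimal sketch of such a check under the 'hf://username/repo_name' format stated above; the function name is hypothetical, and the validation performed by the class itself may be stricter or looser.

```python
def parse_hf_uri(storage_uri: str):
    """Split 'hf://username/repo_name' into (username, repo_name)."""
    prefix = "hf://"
    if not storage_uri.startswith(prefix):
        raise ValueError(f"expected a URI starting with {prefix!r}: {storage_uri!r}")
    # Everything after the scheme is the repository identifier.
    repo_id = storage_uri[len(prefix):]
    username, sep, repo_name = repo_id.partition("/")
    if not sep or not username or not repo_name:
        raise ValueError(f"expected 'hf://username/repo_name': {storage_uri!r}")
    return username, repo_name


print(parse_hf_uri("hf://username/repo_name"))
```

Failing early with a ValueError keeps a malformed URI from surfacing only after the TrainJob has been created.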

ignore_patterns: list[str] | None = None
access_token: str | None = None
__post_init__()[source]

Validate HuggingFaceDatasetInitializer parameters.

class kubeflow.trainer.S3DatasetInitializer(storage_uri: str, ignore_patterns: list[str] | None = None, endpoint: str | None = None, access_key_id: str | None = None, secret_access_key: str | None = None, region: str | None = None, role_arn: str | None = None) None[source]

Bases: BaseInitializer

Configuration for downloading datasets from S3-compatible storage.

Parameters:
  • storage_uri (str) – The S3 URI for the dataset in the format ‘s3://bucket-name/path/to/dataset’.

  • ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.

  • endpoint (Optional[str]) – Custom S3 endpoint URL.

  • access_key_id (Optional[str]) – Access key for authentication.

  • secret_access_key (Optional[str]) – Secret key for authentication.

  • region (Optional[str]) – Region used in instantiating the client.

  • role_arn (Optional[str]) – The ARN of the role you want to assume.

ignore_patterns: list[str] | None = None
endpoint: str | None = None
access_key_id: str | None = None
secret_access_key: str | None = None
region: str | None = None
role_arn: str | None = None
__post_init__()[source]

Validate S3DatasetInitializer parameters.

class kubeflow.trainer.DataCacheInitializer(storage_uri: str, metadata_loc: str, num_data_nodes: int, head_cpu: str | None = None, head_mem: str | None = None, worker_cpu: str | None = None, worker_mem: str | None = None, iam_role: str | None = None) None[source]

Bases: BaseInitializer

Configuration for distributed data caching system for training workloads.

Parameters:
  • storage_uri (str) – The URI for the cached data in the format ‘cache://<SCHEMA_NAME>/<TABLE_NAME>’. This specifies the location where the data cache will be stored and accessed.

  • metadata_loc (str) – The metadata file path of an iceberg table.

  • num_data_nodes (int) – The number of data nodes in the distributed cache system. Must be greater than 1.

  • head_cpu (Optional[str]) – The CPU resources to allocate for the cache head node.

  • head_mem (Optional[str]) – The memory resources to allocate for the cache head node.

  • worker_cpu (Optional[str]) – The CPU resources to allocate for each cache worker node.

  • worker_mem (Optional[str]) – The memory resources to allocate for each cache worker node.

  • iam_role (Optional[str]) – The IAM role to use for accessing metadata_loc file.

metadata_loc: str
num_data_nodes: int
head_cpu: str | None = None
head_mem: str | None = None
worker_cpu: str | None = None
worker_mem: str | None = None
iam_role: str | None = None
__post_init__()[source]

Validate DataCacheInitializer parameters.

class kubeflow.trainer.HuggingFaceModelInitializer(storage_uri: str, ignore_patterns: list[str] | None = <factory>, access_token: str | None = None) None[source]

Bases: BaseInitializer

Configuration for downloading models from HuggingFace Hub.

Parameters:
  • storage_uri (str) – The HuggingFace Hub model identifier in the format ‘hf://username/repo_name’.

  • ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.

  • access_token (Optional[str]) – HuggingFace Hub access token.

ignore_patterns: list[str] | None
access_token: str | None = None
__post_init__()[source]

Validate HuggingFaceModelInitializer parameters.

class kubeflow.trainer.S3ModelInitializer(storage_uri: str, ignore_patterns: list[str] | None = <factory>, endpoint: str | None = None, access_key_id: str | None = None, secret_access_key: str | None = None, region: str | None = None, role_arn: str | None = None) None[source]

Bases: BaseInitializer

Configuration for downloading models from S3-compatible storage.

Parameters:
  • storage_uri (str) – The S3 URI for the model in the format ‘s3://bucket-name/path/to/model’.

  • ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download. Defaults to [‘*.msgpack’, ‘*.h5’, ‘*.bin’, ‘*.pt’, ‘*.pth’].

  • endpoint (Optional[str]) – Custom S3 endpoint URL.

  • access_key_id (Optional[str]) – Access key for authentication.

  • secret_access_key (Optional[str]) – Secret key for authentication.

  • region (Optional[str]) – Region used in instantiating the client.

  • role_arn (Optional[str]) – The ARN of the role you want to assume.
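
To see what the default ignore_patterns imply, the sketch below filters a list of file names with shell-style glob matching from the standard library. Treating the patterns as `fnmatch`-style globs is an assumption, the file names are made up, and the initializer's own matching logic may differ.

```python
from fnmatch import fnmatchcase

# Default patterns as documented for ignore_patterns.
DEFAULT_IGNORE_PATTERNS = ["*.msgpack", "*.h5", "*.bin", "*.pt", "*.pth"]


def files_to_download(filenames, ignore_patterns=None):
    """Return the file names that match none of the ignore patterns."""
    patterns = DEFAULT_IGNORE_PATTERNS if ignore_patterns is None else ignore_patterns
    return [
        name
        for name in filenames
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


candidates = ["config.json", "model.safetensors", "pytorch_model.bin", "weights.pt"]
print(files_to_download(candidates))
```

With the defaults, weight files in formats such as `.bin` and `.pt` are skipped; passing an empty list would keep every file.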

ignore_patterns: list[str] | None
endpoint: str | None = None
access_key_id: str | None = None
secret_access_key: str | None = None
region: str | None = None
role_arn: str | None = None
__post_init__()[source]

Validate S3ModelInitializer parameters.

Backend Configurations

class kubeflow.trainer.KubernetesBackendConfig(**data: Any) None[source]

Bases: BaseModel

namespace: str | None
config_file: str | None
context: str | None
client_configuration: Configuration | None
class Config[source]

Bases: object

arbitrary_types_allowed = True
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kubeflow.trainer.LocalProcessBackendConfig(**data: Any) None[source]

Bases: BaseModel

cleanup_venv: bool
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kubeflow.trainer.ContainerBackendConfig(**data: Any) None[source]

Bases: BaseModel

pull_policy: str
auto_remove: bool
container_host: str | None
container_runtime: Literal['docker', 'podman'] | None
runtime_source: TrainingRuntimeSource
dataset_initializer_image: str
model_initializer_image: str
initializer_timeout: int
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.