API Reference¶

TrainerClient¶

class kubeflow.trainer.TrainerClient(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)[source]¶

Bases: object

__init__(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)[source]¶

Initialize a Kubeflow Trainer client.

Parameters:

backend_config (KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None) – Backend configuration. One of KubernetesBackendConfig, LocalProcessBackendConfig, or ContainerBackendConfig; if None, the default KubernetesBackendConfig is used.

Raises:

ValueError – Invalid backend configuration.

list_runtimes() → list[Runtime][source]¶

List the available runtimes.

Returns:

A list of available training runtimes. If no runtimes exist, an empty list is returned.

Raises:

TimeoutError – Timeout to list runtimes.
RuntimeError – Failed to list runtimes.

get_runtime(name: str) → Runtime[source]¶

Get the runtime object.

Parameters:

name (str) – Name of the runtime.

Returns:

A runtime object.

Raises:

TimeoutError – Timeout to get a runtime.
RuntimeError – Failed to get a runtime.
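
When a runtime name comes from user input, it can be convenient to turn the documented `TimeoutError` and `RuntimeError` into a `None` result. The following is a minimal sketch, assuming `client` is a `TrainerClient` (or any object with the same `get_runtime` method); the helper name `get_runtime_or_none` is hypothetical and not part of the SDK.

```python
def get_runtime_or_none(client, name):
    """Return the named runtime, or None if the lookup times out or fails."""
    try:
        return client.get_runtime(name)
    except (TimeoutError, RuntimeError) as exc:
        # Both exception types are documented for get_runtime().
        print(f"get_runtime({name!r}) failed: {exc}")
        return None
```

A caller can then fall back to `client.list_runtimes()` to show the user which runtimes are actually available.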

get_runtime_packages(runtime: Runtime)[source]¶

Print the installed Python packages for the given runtime. If the runtime has GPUs, it also prints the GPUs available on a single training node.

Parameters:

runtime (Runtime) – Reference to one of the existing runtimes.

Raises:

ValueError – Input arguments are invalid.
RuntimeError – Failed to get Runtime.

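train(runtime: str | Runtime | None = None, initializer: Initializer | None = None, trainer: CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None = None, options: list | None = None) → str[source]¶

Create a TrainJob. You can configure the TrainJob using one of these trainers:

CustomTrainer: Runs training with a user-defined function that fully encapsulates the training process.
CustomTrainerContainer: Runs training with a user-defined image that fully encapsulates the training process.
BuiltinTrainer: Uses a predefined trainer with built-in post-training logic, requiring only parameter configuration.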

Parameters:

runtime (str | Runtime | None) – Optional reference to one of the existing runtimes. It can accept the runtime name or Runtime object from the get_runtime() API. Defaults to the torch-distributed runtime if not provided.
initializer (Initializer | None) – Optional configuration for the dataset and model initializers.
trainer (CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None) – Optional configuration for a CustomTrainer, CustomTrainerContainer, or BuiltinTrainer. If not specified, the TrainJob will use the runtime’s default values.
options (list | None) – Optional list of configuration options to apply to the TrainJob. Options can be imported from kubeflow.trainer.options.

Returns:

The unique name of the TrainJob that has been generated.

Raises:

ValueError – Input arguments are invalid.
TimeoutError – Timeout to create TrainJobs.
RuntimeError – Failed to create TrainJobs.
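
The parameters above can be combined as in the following sketch. It assumes the method is called as `client.train(...)`, as in the Kubeflow SDK, and that `client` is a `TrainerClient`; the helper `submit_job`, the placeholder `train_fn`, and the `example` function are hypothetical names used only for illustration.

```python
def train_fn():
    """Placeholder training function; a real one would run the training loop."""
    print("training step placeholder")


def submit_job(client, trainer, runtime_name="torch-distributed"):
    """Resolve a runtime by name and create a TrainJob with the given trainer.

    Returns the generated TrainJob name.
    """
    runtime = client.get_runtime(runtime_name)
    return client.train(runtime=runtime, trainer=trainer)


def example():
    # Not called here: requires the Kubeflow SDK and a reachable backend.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    job_name = submit_job(TrainerClient(), CustomTrainer(func=train_fn, num_nodes=2))
    print(f"created TrainJob {job_name}")
```

Passing the runtime name directly as `runtime="torch-distributed"` should work as well, since `runtime` accepts either a name or a `Runtime` object.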

list_jobs(runtime: Runtime | None = None) → list[TrainJob][source]¶

List the created TrainJobs. If a runtime is specified, only TrainJobs associated with that runtime are returned.

Parameters:

runtime (Runtime | None) – Reference to one of the existing runtimes.

Returns:

List of created TrainJobs. If no TrainJobs exist, an empty list is returned.

Raises:

TimeoutError – Timeout to list TrainJobs.
RuntimeError – Failed to list TrainJobs.
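
A small sketch of how the returned list might be summarised. It assumes each `TrainJob` exposes `name` and `status` attributes, which this section does not document; `job_names_by_status` is a hypothetical helper.

```python
from collections import defaultdict


def job_names_by_status(client, runtime=None):
    """Group TrainJob names by their status, e.g. {"Running": ["job-a"]}."""
    grouped = defaultdict(list)
    for job in client.list_jobs(runtime=runtime):
        # Assumption: TrainJob has `status` and `name` attributes.
        grouped[job.status].append(job.name)
    return dict(grouped)
```

Because `list_jobs()` returns an empty list when no TrainJobs exist, the helper returns an empty dict in that case.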

get_job(name: str) → TrainJob[source]¶

Get the TrainJob object.

Parameters:

name (str) – Name of the TrainJob.

Returns:

A TrainJob object.

Raises:

TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.

get_job_logs(name: str, step: str = 'node-0', follow: bool | None = False) → Iterator[str][source]¶

Get logs from a specific step of a TrainJob.

You can watch the logs in real time as follows:

```python
from kubeflow.trainer import TrainerClient

for logline in TrainerClient().get_job_logs(name="s8d44aa4fb6d", follow=True):
    print(logline)
```

Parameters:

name (str) – Name of the TrainJob.
step (str) – Step of the TrainJob to collect logs from, like dataset-initializer or node-0.
follow (bool | None) – Whether to stream logs in real time as they are produced.

Returns:

Iterator of log lines.

Raises:

TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.
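
Since `get_job_logs()` returns an iterator, the log does not have to be held in memory as a whole. The sketch below keeps only the last few lines of a step; `tail_job_logs` is a hypothetical helper, and `client` is assumed to be a `TrainerClient`.

```python
from collections import deque


def tail_job_logs(client, name, step="node-0", max_lines=20):
    """Return the last `max_lines` log lines of a TrainJob step as a list."""
    lines = client.get_job_logs(name=name, step=step, follow=False)
    # A bounded deque discards older lines while the iterator is consumed.
    return list(deque(lines, maxlen=max_lines))
```

With `follow=False` the iterator ends once the existing log lines are exhausted, so the call returns instead of streaming indefinitely.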

get_job_events(name: str) → list[Event][source]¶

Get events for a TrainJob.

This provides additional clarity about the state of the TrainJob when logs alone are not sufficient. Events include information about pod state changes, errors, and other significant occurrences.

Parameters:

name (str) – Name of the TrainJob.

Returns:

A list of Event objects associated with the TrainJob.

Raises:

TimeoutError – Timeout to get TrainJob events.
RuntimeError – Failed to get TrainJob events.

wait_for_job_status(name: str, status: set[str] = {'Complete'}, timeout: int = 600, polling_interval: int = 2, callbacks: list[Callable[[TrainJob], None]] | None = None) → TrainJob[source]¶

Wait for a TrainJob to reach a desired status.

Parameters:

name (str) – Name of the TrainJob.
status (set[str]) – Expected statuses. Must be a subset of Created, Running, Complete, and Failed statuses.
timeout (int) – Maximum number of seconds to wait for the TrainJob to reach one of the expected statuses.
polling_interval (int) – The polling interval in seconds to check TrainJob status.
callbacks (list[Callable[[TrainJob], None]] | None) – Optional list of callback functions to be invoked after each polling interval. Each callback should accept a single argument: the TrainJob object.

Returns:

A TrainJob object that reaches the desired status.

Raises:

ValueError – The input values are incorrect.
RuntimeError – Failed to get TrainJob or TrainJob reaches unexpected Failed status.
TimeoutError – Timeout to wait for TrainJob status.
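
The sketch below shows one way to combine `callbacks` with the documented exceptions: it counts polling callbacks and, when the wait fails, prints the TrainJob events for diagnosis. `wait_for_completion` is a hypothetical helper, and `client` is assumed to be a `TrainerClient`.

```python
def wait_for_completion(client, name, timeout=600):
    """Wait for a TrainJob to complete.

    Returns (outcome, polls) where outcome is "complete", "timeout", or
    "failed", and polls is the number of times the callback was invoked.
    """
    polled_jobs = []

    def record(job):
        # Called by wait_for_job_status() with the current TrainJob object.
        polled_jobs.append(job)

    try:
        client.wait_for_job_status(
            name=name,
            status={"Complete"},
            timeout=timeout,
            polling_interval=2,
            callbacks=[record],
        )
        outcome = "complete"
    except TimeoutError:
        outcome = "timeout"
    except RuntimeError:
        outcome = "failed"
        # Events can explain a failure when logs alone are not sufficient.
        for event in client.get_job_events(name):
            print(event)
    return outcome, len(polled_jobs)
```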

delete_job(name: str)[source]¶

Delete the TrainJob.

Parameters:

name (str) – Name of the TrainJob.

Raises:

TimeoutError – Timeout to delete TrainJob.
RuntimeError – Failed to delete TrainJob.
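
For short-lived experiments it can help to guarantee cleanup. A minimal sketch using a context manager, assuming `client` is a `TrainerClient`; `temporary_job` is a hypothetical helper, not part of the SDK.

```python
from contextlib import contextmanager


@contextmanager
def temporary_job(client, name):
    """Yield the TrainJob name and delete the TrainJob on exit."""
    try:
        yield name
    finally:
        try:
            client.delete_job(name)
        except (TimeoutError, RuntimeError) as exc:
            # Report the documented failures without masking the body's error.
            print(f"could not delete TrainJob {name!r}: {exc}")
```

Typical use would be `with temporary_job(client, job_name): ...`, so the TrainJob is removed even if the body raises.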

Trainers¶

class kubeflow.trainer.CustomTrainer(func: Callable, func_args: dict | None = None, image: str | None = None, packages_to_install: list[str] | None = None, pip_index_urls: list[str] = <factory>, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) → None[source]¶

Bases: object

Custom Trainer configuration. Configure the self-contained function that encapsulates the entire model training process.

Parameters:

func (Callable) – The function that encapsulates the entire model training process.
func_args (Optional[dict]) – The arguments to pass to the function.
image (Optional[str]) – The optional container image to use in TrainJob.
packages_to_install (Optional[list[str]]) – A list of Python packages to install before running the function.
pip_index_urls (list[str]) – The PyPI URLs from which to install Python packages. The first URL will be the index-url, and remaining ones are extra-index-urls.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) – The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`. If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request a slice instead: `resources_per_node = {"mig-1g.5gb": 1}` requests one GPU slice with 5 GB of memory.
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
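
Because `func` must encapsulate the entire training process, it helps to write it so that it depends on nothing outside its own body. The sketch below is a toy example under that assumption: imports are placed inside the function (a convention assumed here, not stated above), and `trainer_kwargs` only collects documented `CustomTrainer` parameters. `func_args` is left out because this section does not specify how the arguments are delivered to the function.

```python
def train_fn(learning_rate=0.1, steps=50):
    """Toy training loop: minimise (w - 3)^2 with plain gradient descent."""
    import math  # imported inside the function so it stays self-contained

    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= learning_rate * grad
    loss = math.pow(w - 3.0, 2)
    print(f"final w={w:.4f} loss={loss:.2e}")
    return w


# Keyword arguments that could be passed as CustomTrainer(**trainer_kwargs).
trainer_kwargs = {
    "func": train_fn,
    "num_nodes": 2,
    "resources_per_node": {"cpu": 2, "memory": "4G"},
    "env": {"LOG_LEVEL": "info"},  # hypothetical variable name
}
```

Running `train_fn()` locally before submitting is a cheap way to catch errors that would otherwise only surface inside the training nodes.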

func: Callable¶

func_args: dict | None = None¶

image: str | None = None¶

packages_to_install: list[str] | None = None¶

pip_index_urls: list[str]¶

num_nodes: int | None = None¶

resources_per_node: dict | None = None¶

env: dict[str, str] | None = None¶

class kubeflow.trainer.CustomTrainerContainer(image: str, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) → None[source]¶

Bases: object

Custom Trainer Container configuration. Configure the container image that encapsulates the entire model training process.

Parameters:

image (str) – The container image that encapsulates the entire model training process.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) – The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`. If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request a slice instead: `resources_per_node = {"mig-1g.5gb": 1}` requests one GPU slice with 5 GB of memory.
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
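
The two `resources_per_node` shapes described above can be produced by a small helper. This is a sketch only: `node_resources` is a hypothetical function, the rule that whole GPUs and a MIG slice are mutually exclusive is this sketch's own policy, and the image name is a placeholder.

```python
def node_resources(cpu, memory, gpus=0, mig_profile=None):
    """Build a resources_per_node dict in the shape shown above.

    Pass either `gpus` (whole GPUs) or `mig_profile` (one MIG slice).
    """
    if gpus and mig_profile:
        raise ValueError("request whole GPUs or a MIG slice, not both")
    resources = {"cpu": cpu, "memory": memory}
    if gpus:
        resources["gpu"] = gpus
    if mig_profile:
        resources[mig_profile] = 1
    return resources


# Keyword arguments that could be passed as CustomTrainerContainer(**container_kwargs).
container_kwargs = {
    "image": "registry.example.com/team/train:latest",  # placeholder image
    "num_nodes": 2,
    "resources_per_node": node_resources(cpu=5, memory="10G", gpus=4),
    "env": {"LOG_LEVEL": "info"},  # hypothetical variable name
}
```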

image: str¶

num_nodes: int | None = None¶

resources_per_node: dict | None = None¶

env: dict[str, str] | None = None¶

class kubeflow.trainer.BuiltinTrainer(config: TorchTuneConfig) → None[source]¶

Bases: object

Builtin Trainer configuration. Configure the builtin trainer that already includes the fine-tuning logic, requiring only parameter adjustments.

Parameters:

config (TorchTuneConfig) – The configuration for the builtin trainer.

config: TorchTuneConfig¶

Initializers¶

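class kubeflow.trainer.Initializer(dataset: HuggingFaceDatasetInitializer | S3DatasetInitializer | DataCacheInitializer | None = None, model: HuggingFaceModelInitializer | S3ModelInitializer | None = None) → None[source]¶

Bases: object

Initializer defines configurations for dataset and pre-trained model initialization.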

Parameters:

dataset (Optional[Union[HuggingFaceDatasetInitializer, S3DatasetInitializer, DataCacheInitializer]]) – The configuration for one of the supported dataset initializers.
model (Optional[Union[HuggingFaceModelInitializer, S3ModelInitializer]]) – The configuration for one of the supported model initializers.

dataset: HuggingFaceDatasetInitializer | S3DatasetInitializer | DataCacheInitializer | None = None¶

model: HuggingFaceModelInitializer | S3ModelInitializer | None = None¶
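
Which initializer class fits a given `storage_uri` follows from the URI scheme (`hf://` or `s3://`). The sketch below makes that mapping explicit. It is an illustration, not the SDK's own logic: `initializer_class_name` and `build_initializer` are hypothetical helpers, the classes are assumed to be importable from `kubeflow.trainer`, and `DataCacheInitializer` is left out because it needs arguments beyond `storage_uri`.

```python
from urllib.parse import urlparse

# URI scheme -> initializer class name, per the formats in this reference.
DATASET_INITIALIZERS = {"hf": "HuggingFaceDatasetInitializer", "s3": "S3DatasetInitializer"}
MODEL_INITIALIZERS = {"hf": "HuggingFaceModelInitializer", "s3": "S3ModelInitializer"}


def initializer_class_name(storage_uri, kind="dataset"):
    """Return the initializer class name matching the URI scheme."""
    table = DATASET_INITIALIZERS if kind == "dataset" else MODEL_INITIALIZERS
    scheme = urlparse(storage_uri).scheme
    if scheme not in table:
        raise ValueError(f"unsupported {kind} URI scheme: {scheme!r}")
    return table[scheme]


def build_initializer(dataset_uri, model_uri):
    # Not called here: requires the Kubeflow SDK to be installed.
    import kubeflow.trainer as kt

    dataset_cls = getattr(kt, initializer_class_name(dataset_uri, "dataset"))
    model_cls = getattr(kt, initializer_class_name(model_uri, "model"))
    return kt.Initializer(
        dataset=dataset_cls(storage_uri=dataset_uri),
        model=model_cls(storage_uri=model_uri),
    )
```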

class kubeflow.trainer.HuggingFaceDatasetInitializer(storage_uri: str, ignore_patterns: list[str] | None = None, access_token: str | None = None) → None[source]¶

Bases: BaseInitializer

Configuration for downloading datasets from HuggingFace Hub.

Parameters:

storage_uri (str) – The HuggingFace Hub dataset identifier in the format ‘hf://username/repo_name’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.
access_token (Optional[str]) – HuggingFace Hub access token for private datasets.

ignore_patterns: list[str] | None = None¶

access_token: str | None = None¶

__post_init__()[source]¶

Validate HuggingFaceDatasetInitializer parameters.

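class kubeflow.trainer.S3DatasetInitializer(storage_uri: str, ignore_patterns: list[str] | None = None, endpoint: str | None = None, access_key_id: str | None = None, secret_access_key: str | None = None, region: str | None = None, role_arn: str | None = None) → None[source]¶

Bases: BaseInitializer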

Configuration for downloading datasets from S3-compatible storage.

Parameters:

storage_uri (str) – The S3 URI for the dataset in the format ‘s3://bucket-name/path/to/dataset’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.
endpoint (Optional[str]) – Custom S3 endpoint URL.
access_key_id (Optional[str]) – Access key for authentication.
secret_access_key (Optional[str]) – Secret key for authentication.
region (Optional[str]) – Region used in instantiating the client.
role_arn (Optional[str]) – The ARN of the role you want to assume.

ignore_patterns: list[str] | None = None¶

endpoint: str | None = None¶

access_key_id: str | None = None¶

secret_access_key: str | None = None¶

region: str | None = None¶

role_arn: str | None = None¶

__post_init__()[source]¶

Validate S3DatasetInitializer parameters.

class kubeflow.trainer.DataCacheInitializer(storage_uri: str, metadata_loc: str, num_data_nodes: int, head_cpu: str | None = None, head_mem: str | None = None, worker_cpu: str | None = None, worker_mem: str | None = None, iam_role: str | None = None) → None[source]¶

Bases: BaseInitializer

Configuration for distributed data caching system for training workloads.

Parameters:

storage_uri (str) – The URI for the cached data in the format ‘cache://<SCHEMA_NAME>/<TABLE_NAME>’. This specifies the location where the data cache will be stored and accessed.
metadata_loc (str) – The metadata file path of an iceberg table.
num_data_nodes (int) – The number of data nodes in the distributed cache system. Must be greater than 1.
head_cpu (Optional[str]) – The CPU resources to allocate for the cache head node.
head_mem (Optional[str]) – The memory resources to allocate for the cache head node.
worker_cpu (Optional[str]) – The CPU resources to allocate for each cache worker node.
worker_mem (Optional[str]) – The memory resources to allocate for each cache worker node.
iam_role (Optional[str]) – The IAM role to use for accessing the metadata_loc file.

metadata_loc: str¶

num_data_nodes: int¶

head_cpu: str | None = None¶

head_mem: str | None = None¶

worker_cpu: str | None = None¶

worker_mem: str | None = None¶

iam_role: str | None = None¶

__post_init__()[source]¶

Validate DataCacheInitializer parameters.
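
The two constraints stated for `DataCacheInitializer` (the `cache://<SCHEMA_NAME>/<TABLE_NAME>` format and more than one data node) can be checked before constructing the object. The function below is a client-side sketch with a hypothetical name; it is not the library's `__post_init__` logic, whose exact checks are not described here.

```python
from urllib.parse import urlparse


def check_data_cache_args(storage_uri, num_data_nodes):
    """Validate the arguments and return (schema_name, table_name)."""
    parsed = urlparse(storage_uri)
    schema_name = parsed.netloc
    table_name = parsed.path.strip("/")
    if parsed.scheme != "cache" or not schema_name or not table_name or "/" in table_name:
        raise ValueError("storage_uri must look like cache://<SCHEMA_NAME>/<TABLE_NAME>")
    if num_data_nodes <= 1:
        raise ValueError("num_data_nodes must be greater than 1")
    return schema_name, table_name
```

For example, `check_data_cache_args("cache://sales/orders", 2)` returns `("sales", "orders")`, while a single data node or an `s3://` URI raises `ValueError`.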

class kubeflow.trainer.HuggingFaceModelInitializer(storage_uri: str, ignore_patterns: list[str] | None = <factory>, access_token: str | None = None) → None[source]¶

Bases: BaseInitializer

Configuration for downloading models from HuggingFace Hub.

Parameters:

storage_uri (str) – The HuggingFace Hub model identifier in the format ‘hf://username/repo_name’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.
access_token (Optional[str]) – HuggingFace Hub access token.

ignore_patterns: list[str] | None¶

access_token: str | None = None¶

__post_init__()[source]¶

Validate HuggingFaceModelInitializer parameters.

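class kubeflow.trainer.S3ModelInitializer(storage_uri: str, ignore_patterns: list[str] | None = <factory>, endpoint: str | None = None, access_key_id: str | None = None, secret_access_key: str | None = None, region: str | None = None, role_arn: str | None = None) → None[source]¶

Bases: BaseInitializer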

Configuration for downloading models from S3-compatible storage.

Parameters:

storage_uri (str) – The S3 URI for the model in the format ‘s3://bucket-name/path/to/model’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download. Defaults to [‘*.msgpack’, ‘*.h5’, ‘*.bin’, ‘*.pt’, ‘*.pth’].
endpoint (Optional[str]) – Custom S3 endpoint URL.
access_key_id (Optional[str]) – Access key for authentication.
secret_access_key (Optional[str]) – Secret key for authentication.
region (Optional[str]) – Region used in instantiating the client.
role_arn (Optional[str]) – The ARN of the role you want to assume.

ignore_patterns: list[str] | None¶

endpoint: str | None = None¶

access_key_id: str | None = None¶

secret_access_key: str | None = None¶

region: str | None = None¶

role_arn: str | None = None¶

__post_init__()[source]¶

Validate S3ModelInitializer parameters.

Backend Configurations¶

class kubeflow.trainer.KubernetesBackendConfig(**data: Any) → None[source]¶

Bases: BaseModel

namespace: str | None¶

config_file: str | None¶

context: str | None¶

client_configuration: Configuration | None¶

class Config[source]¶

Bases: object

arbitrary_types_allowed = True¶

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kubeflow.trainer.LocalProcessBackendConfig(**data: Any) → None[source]¶

Bases: BaseModel

cleanup_venv: bool¶

model_config: ClassVar[ConfigDict] = {}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kubeflow.trainer.ContainerBackendConfig(**data: Any) → None[source]¶

Bases: BaseModel

pull_policy: str¶

auto_remove: bool¶

container_host: str | None¶

container_runtime: Literal['docker', 'podman'] | None¶

runtime_source: TrainingRuntimeSource¶

dataset_initializer_image: str¶

model_initializer_image: str¶

initializer_timeout: int¶

model_config: ClassVar[ConfigDict] = {}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
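
The three backend configurations can be selected at runtime and passed to `TrainerClient(backend_config=...)`. The sketch below picks a backend from an environment-style mapping. The `TRAINER_*` variable names and both helper functions are hypothetical; only the config class names and the fields `namespace`, `cleanup_venv`, and `container_runtime` come from this reference.

```python
def backend_settings(env):
    """Return (backend_kind, config_kwargs) derived from an env-style mapping."""
    kind = env.get("TRAINER_BACKEND", "kubernetes")
    if kind == "kubernetes":
        return kind, {"namespace": env.get("TRAINER_NAMESPACE")}
    if kind == "local":
        # Keep the virtual environment only when explicitly requested.
        return kind, {"cleanup_venv": env.get("TRAINER_KEEP_VENV") != "1"}
    if kind == "container":
        runtime = env.get("TRAINER_CONTAINER_RUNTIME")
        if runtime not in (None, "docker", "podman"):
            raise ValueError(f"unsupported container runtime: {runtime!r}")
        return kind, {"container_runtime": runtime}
    raise ValueError(f"unknown backend: {kind!r}")


def make_client(env):
    # Not called here: requires the Kubeflow SDK to be installed.
    from kubeflow.trainer import (
        ContainerBackendConfig,
        KubernetesBackendConfig,
        LocalProcessBackendConfig,
        TrainerClient,
    )

    config_classes = {
        "kubernetes": KubernetesBackendConfig,
        "local": LocalProcessBackendConfig,
        "container": ContainerBackendConfig,
    }
    kind, kwargs = backend_settings(env)
    return TrainerClient(backend_config=config_classes[kind](**kwargs))
```

Calling `make_client(os.environ)` would then let the same script run against a cluster, a local process, or a container runtime without code changes.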