API Reference
TrainerClient
- class kubeflow.trainer.TrainerClient(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)
Bases: object
- __init__(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)
Initialize a Kubeflow Trainer client.
- Parameters:
backend_config (KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None) – Backend configuration: a KubernetesBackendConfig, LocalProcessBackendConfig, or ContainerBackendConfig instance, or None to use the default backend configuration. Defaults to KubernetesBackendConfig.
- Raises:
ValueError – Invalid backend configuration.
- list_runtimes() -> list[Runtime]
List the available runtimes.
- Returns:
A list of available training runtimes. If no runtimes exist, an empty list is returned.
- Raises:
TimeoutError – Timeout to list runtimes.
RuntimeError – Failed to list runtimes.
- get_runtime(name: str) -> Runtime
Get the runtime object.
- Parameters:
name (str) – Name of the runtime.
- Returns:
A runtime object.
- Raises:
TimeoutError – Timeout to get a runtime.
RuntimeError – Failed to get a runtime.
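The documented exceptions can be handled at the call site. The sketch below wraps get_runtime() in a small helper that returns None instead of raising. The helper name `find_runtime` and the `FakeClient` stand-in are hypothetical and exist only so the example runs without a cluster. How a real backend reports a missing runtime is an assumption here.

```python
def find_runtime(client, name):
    """Return the runtime called `name`, or None if the lookup fails."""
    try:
        return client.get_runtime(name)
    except (TimeoutError, RuntimeError) as err:
        # Both exception types are listed in the get_runtime() reference.
        print(f"Could not get runtime {name!r}: {err}")
        return None


class FakeClient:
    """Hypothetical stand-in for TrainerClient; it returns names, not Runtime objects."""

    def list_runtimes(self):
        return ["torch-distributed"]

    def get_runtime(self, name):
        if name not in self.list_runtimes():
            raise RuntimeError(f"runtime {name} not found")
        return name


found = find_runtime(FakeClient(), "torch-distributed")
missing = find_runtime(FakeClient(), "does-not-exist")
```

With a real client, pass `TrainerClient()` in place of `FakeClient()`.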
- get_runtime_packages(runtime: Runtime)
Print the installed Python packages for the given runtime. If the runtime has GPUs, it also prints the GPUs available on the single training node.
- Parameters:
runtime (Runtime) – Reference to one of the existing runtimes.
- Raises:
ValueError – Input arguments are invalid.
RuntimeError – Failed to get Runtime.
- train(runtime: str | Runtime | None = None, initializer: Initializer | None = None, trainer: CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None = None, options: list | None = None) -> str
Create a TrainJob. You can configure the TrainJob using one of these trainers:
- CustomTrainer: Runs training with a user-defined function that fully encapsulates the
training process.
- CustomTrainerContainer: Runs training with a user-defined image that fully encapsulates
the training process.
- BuiltinTrainer: Uses a predefined trainer with built-in post-training logic, requiring
only parameter configuration.
- Parameters:
runtime (str | Runtime | None) – Optional reference to one of the existing runtimes. Accepts either the runtime name or a Runtime object from the get_runtime() API. Defaults to the torch-distributed runtime if not provided.
initializer (Initializer | None) – Optional configuration for the dataset and model initializers.
trainer (CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None) – Optional configuration for a CustomTrainer, CustomTrainerContainer, or BuiltinTrainer. If not specified, the TrainJob will use the runtime’s default values.
options (list | None) – Optional list of configuration options to apply to the TrainJob. Options can be imported from kubeflow.trainer.options.
- Returns:
The unique name of the TrainJob that has been generated.
- Raises:
ValueError – Input arguments are invalid.
TimeoutError – Timeout to create TrainJobs.
RuntimeError – Failed to create TrainJobs.
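A minimal sketch of creating a TrainJob from a user-defined function follows. The training function, its arguments, and the resource values are hypothetical. Imports are kept inside the function body on the assumption that the function is shipped to the training nodes on its own. The SDK import is deferred into `submit()` so the rest of the sketch runs without the SDK installed.

```python
def train_fn(lr: float = 0.01, epochs: int = 1) -> float:
    """Toy training function: everything it needs lives inside its body."""
    import math

    loss = 1.0
    for _ in range(epochs):
        loss *= math.exp(-lr)  # pretend each epoch shrinks the loss
    return loss


def submit(client=None) -> str:
    """Create a TrainJob from train_fn and return its generated name."""
    # Imported lazily; requires the Kubeflow SDK and a reachable backend.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    client = client or TrainerClient()
    trainer = CustomTrainer(
        func=train_fn,
        func_args={"lr": 0.05, "epochs": 3},
        num_nodes=2,
        resources_per_node={"cpu": 2, "memory": "4G"},
    )
    # runtime is omitted, so the documented default (torch-distributed) applies.
    return client.train(trainer=trainer)


# Local check of the training function itself, without a cluster.
local_loss = train_fn(lr=0.05, epochs=3)
```

Calling `submit()` returns the TrainJob name, which can then be passed to get_job(), get_job_logs(), or wait_for_job_status().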
- list_jobs(runtime: Runtime | None = None) -> list[TrainJob]
List the created TrainJobs. If a runtime is specified, only TrainJobs associated with that runtime are returned.
- Parameters:
runtime (Runtime | None) – Reference to one of the existing runtimes.
- Returns:
List of created TrainJobs. If no TrainJobs exist, an empty list is returned.
- Raises:
TimeoutError – Timeout to list TrainJobs.
RuntimeError – Failed to list TrainJobs.
- get_job(name: str) -> TrainJob
Get the TrainJob object.
- Parameters:
name (str) – Name of the TrainJob.
- Returns:
A TrainJob object.
- Raises:
TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.
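- get_job_logs(name: str, step: str = 'node-0', follow: bool | None = False) -> Iterator[str]
Get logs from a specific step of a TrainJob.
You can watch the logs in real time as follows:
```python
from kubeflow.trainer import TrainerClient

# follow=True streams log lines as they are produced.
for logline in TrainerClient().get_job_logs(name="s8d44aa4fb6d", follow=True):
    print(logline)
```
- Parameters:
name (str) – Name of the TrainJob.
step (str) – Step of the TrainJob to get logs from. Defaults to 'node-0'.
follow (bool | None) – Whether to stream the logs in real time. Defaults to False.
- Returns: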
Iterator of log lines.
- Raises:
TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.
- get_job_events(name: str) -> list[Event]
Get events for a TrainJob.
This provides additional clarity about the state of the TrainJob when logs alone are not sufficient. Events include information about pod state changes, errors, and other significant occurrences.
- Parameters:
name (str) – Name of the TrainJob.
- Returns:
A list of Event objects associated with the TrainJob.
- Raises:
TimeoutError – Timeout to get TrainJob events.
RuntimeError – Failed to get TrainJob events.
- wait_for_job_status(name: str, status: set[str] = {'Complete'}, timeout: int = 600, polling_interval: int = 2, callbacks: list[Callable[[TrainJob], None]] | None = None) -> TrainJob
Wait for a TrainJob to reach a desired status.
- Parameters:
name (str) – Name of the TrainJob.
status (set[str]) – Expected statuses. Must be a subset of the Created, Running, Complete, and Failed statuses.
timeout (int) – Maximum number of seconds to wait for the TrainJob to reach one of the expected statuses.
polling_interval (int) – The polling interval in seconds to check the TrainJob status.
callbacks (list[Callable[[TrainJob], None]] | None) – Optional list of callback functions to be invoked after each polling interval. Each callback should accept a single argument: the TrainJob object.
- Returns:
A TrainJob object that reaches the desired status.
- Raises:
ValueError – The input values are incorrect.
RuntimeError – Failed to get TrainJob or TrainJob reaches unexpected Failed status.
TimeoutError – Timeout to wait for TrainJob status.
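The sketch below combines the status constraint and the callback contract described above. The helper names (`check_statuses`, `record_status`, `wait_for_completion`) are hypothetical, and reading a `status` attribute from the TrainJob is an assumption, since this reference does not list the TrainJob fields. The callback is exercised offline with a stand-in object.

```python
from types import SimpleNamespace

# The four statuses named in the parameter description.
ALLOWED_STATUSES = {"Created", "Running", "Complete", "Failed"}


def check_statuses(status: set) -> set:
    """Raise ValueError unless `status` is a subset of the documented statuses."""
    unknown = set(status) - ALLOWED_STATUSES
    if unknown:
        raise ValueError(f"Unsupported statuses: {sorted(unknown)}")
    return set(status)


seen = []


def record_status(job) -> None:
    """Callback: remember what was observed after each polling interval."""
    # Assumption: the TrainJob object exposes a `status` attribute.
    seen.append(getattr(job, "status", None))


def wait_for_completion(client, name: str):
    """Wait up to 5 minutes, polling every 5 seconds, for Complete or Failed."""
    return client.wait_for_job_status(
        name=name,
        status=check_statuses({"Complete", "Failed"}),
        timeout=300,
        polling_interval=5,
        callbacks=[record_status],
    )


# Offline check of the callback with a stand-in instead of a real TrainJob.
record_status(SimpleNamespace(status="Running"))
```

Validating the set up front is optional, since the method itself is documented to raise ValueError for incorrect input values.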
- delete_job(name: str)
Delete the TrainJob.
- Parameters:
name (str) – Name of the TrainJob.
- Raises:
TimeoutError – Timeout to delete TrainJob.
RuntimeError – Failed to delete TrainJob.
Trainers
- class kubeflow.trainer.CustomTrainer(func: Callable, func_args: dict | None = None, image: str | None = None, packages_to_install: list[str] | None = None, pip_index_urls: list[str] = <factory>, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) -> None
Bases: object
Custom Trainer configuration. Configure the self-contained function that encapsulates the entire model training process.
- Parameters:
func (Callable) – The function that encapsulates the entire model training process.
func_args (Optional[dict]) – The arguments to pass to the function.
image (Optional[str]) – The optional container image to use in TrainJob.
packages_to_install (Optional[list[str]]) – A list of Python packages to install before running the function.
pip_index_urls (list[str]) – The PyPI URLs from which to install Python packages. The first URL will be the index-url, and remaining ones are extra-index-urls.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) –
- The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`.
- If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request one GPU slice of 5 GB as follows: `resources_per_node = {"mig-1g.5gb": 1}`.
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
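The ordering rule for pip_index_urls can be illustrated with a short sketch that maps such a list to pip's `--index-url` and `--extra-index-url` flags. This illustrates the documented rule and is not the SDK's own implementation. The helper name and the second URL are hypothetical.

```python
def pip_index_flags(pip_index_urls: list) -> list:
    """First URL becomes --index-url; every later URL becomes --extra-index-url."""
    if not pip_index_urls:
        return []
    flags = ["--index-url", pip_index_urls[0]]
    for url in pip_index_urls[1:]:
        flags += ["--extra-index-url", url]
    return flags


flags = pip_index_flags(
    [
        "https://pypi.org/simple",
        "https://example.com/simple",  # hypothetical private index
    ]
)
```

Because the first entry is the primary index, list the index that should take priority first.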
- class kubeflow.trainer.CustomTrainerContainer(image: str, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) -> None
Bases: object
Custom Trainer Container configuration. Configure the container image that encapsulates the entire model training process.
- Parameters:
image (str) – The container image that encapsulates the entire model training process.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) –
- The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`.
- If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request one GPU slice of 5 GB as follows: `resources_per_node = {"mig-1g.5gb": 1}`.
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
- class kubeflow.trainer.BuiltinTrainer(config: TorchTuneConfig) -> None
Bases: object
Builtin Trainer configuration. Configure the builtin trainer that already includes the fine-tuning logic, requiring only parameter adjustments.
- Parameters:
config (TorchTuneConfig) – The configuration for the builtin trainer.
- config: TorchTuneConfig
Initializers
- class kubeflow.trainer.Initializer(dataset: HuggingFaceDatasetInitializer | S3DatasetInitializer | DataCacheInitializer | None = None, model: HuggingFaceModelInitializer | S3ModelInitializer | None = None) -> None
Bases: object
Initializer defines configurations for dataset and pre-trained model initialization.
- Parameters:
dataset (Optional[Union[HuggingFaceDatasetInitializer, S3DatasetInitializer, DataCacheInitializer]]) – The configuration for one of the supported dataset initializers.
model (Optional[Union[HuggingFaceModelInitializer, S3ModelInitializer]]) – The configuration for one of the supported model initializers.
- dataset: HuggingFaceDatasetInitializer | S3DatasetInitializer | DataCacheInitializer | None = None
- model: HuggingFaceModelInitializer | S3ModelInitializer | None = None
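As a minimal sketch, an Initializer can pair one dataset initializer with one model initializer. The repository names below are hypothetical placeholders in the documented `hf://username/repo_name` format. The SDK import is deferred into the function so the module itself loads without the SDK installed.

```python
# Hypothetical repositories in the documented hf://username/repo_name format.
DATASET_URI = "hf://username/dataset_name"
MODEL_URI = "hf://username/model_name"


def build_initializer(access_token=None):
    """Build an Initializer that pulls both the dataset and the model from HuggingFace Hub."""
    # Imported lazily; requires the Kubeflow SDK.
    from kubeflow.trainer import (
        HuggingFaceDatasetInitializer,
        HuggingFaceModelInitializer,
        Initializer,
    )

    return Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri=DATASET_URI,
            access_token=access_token,
        ),
        model=HuggingFaceModelInitializer(
            storage_uri=MODEL_URI,
            access_token=access_token,
        ),
    )
```

The resulting object is what the `initializer` argument of TrainerClient.train() expects.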
- class kubeflow.trainer.HuggingFaceDatasetInitializer(storage_uri: str, ignore_patterns: list[str] | None = None, access_token: str | None = None) -> None
Bases: BaseInitializer
Configuration for downloading datasets from HuggingFace Hub.
- Parameters:
storage_uri (str) – The HuggingFace Hub dataset identifier in the format ‘hf://username/repo_name’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.
access_token (Optional[str]) – HuggingFace Hub access token for private datasets.
- class kubeflow.trainer.S3DatasetInitializer(storage_uri: str, ignore_patterns: list[str] | None = None, endpoint: str | None = None, access_key_id: str | None = None, secret_access_key: str | None = None, region: str | None = None, role_arn: str | None = None) -> None
Bases: BaseInitializer
Configuration for downloading datasets from S3-compatible storage.
- Parameters:
storage_uri (str) – The S3 URI for the dataset in the format ‘s3://bucket-name/path/to/dataset’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.
endpoint (Optional[str]) – Custom S3 endpoint URL.
access_key_id (Optional[str]) – Access key for authentication.
secret_access_key (Optional[str]) – Secret key for authentication.
region (Optional[str]) – Region used in instantiating the client.
role_arn (Optional[str]) – The ARN of the role you want to assume.
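The storage_uri formats documented for the dataset initializers share one shape: a scheme, a first segment (user or bucket), and the remaining path. The sketch below checks a URI against that shape before it is handed to an initializer. The helper `split_storage_uri` is hypothetical, and the SDK may validate URIs differently.

```python
def split_storage_uri(storage_uri: str) -> tuple:
    """Split 'scheme://first/rest' into (scheme, first, rest)."""
    scheme, sep, remainder = storage_uri.partition("://")
    if not sep or not scheme:
        raise ValueError(f"Missing scheme in {storage_uri!r}")
    first, _, rest = remainder.partition("/")
    if not first:
        raise ValueError(f"Missing bucket or user in {storage_uri!r}")
    return scheme, first, rest


s3_parts = split_storage_uri("s3://bucket-name/path/to/dataset")
hf_parts = split_storage_uri("hf://username/repo_name")
```

For S3 the first segment is the bucket name. For HuggingFace Hub it is the username.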
- class kubeflow.trainer.DataCacheInitializer(storage_uri: str, metadata_loc: str, num_data_nodes: int, head_cpu: str | None = None, head_mem: str | None = None, worker_cpu: str | None = None, worker_mem: str | None = None, iam_role: str | None = None) -> None
Bases: BaseInitializer
Configuration for a distributed data caching system for training workloads.
- Parameters:
storage_uri (str) – The URI for the cached data in the format ‘cache://<SCHEMA_NAME>/<TABLE_NAME>’. This specifies the location where the data cache will be stored and accessed.
metadata_loc (str) – The metadata file path of an Iceberg table.
num_data_nodes (int) – The number of data nodes in the distributed cache system. Must be greater than 1.
head_cpu (Optional[str]) – The CPU resources to allocate for the cache head node.
head_mem (Optional[str]) – The memory resources to allocate for the cache head node.
worker_cpu (Optional[str]) – The CPU resources to allocate for each cache worker node.
worker_mem (Optional[str]) – The memory resources to allocate for each cache worker node.
iam_role (Optional[str]) – The IAM role to use for accessing the metadata_loc file.
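Two of the constraints above can be checked before the initializer is constructed: the `cache://<SCHEMA_NAME>/<TABLE_NAME>` format and the requirement that num_data_nodes be greater than 1. The sketch below does so with a hypothetical helper and hypothetical schema and table names. It is not the SDK's own validation.

```python
def check_cache_config(storage_uri: str, num_data_nodes: int) -> tuple:
    """Return (schema_name, table_name) if both documented constraints hold."""
    prefix = "cache://"
    if not storage_uri.startswith(prefix):
        raise ValueError(f"storage_uri must start with {prefix!r}")
    parts = storage_uri[len(prefix):].split("/")
    # Exactly two non-empty segments: schema name and table name.
    if len(parts) != 2 or not all(parts):
        raise ValueError("storage_uri must look like cache://<SCHEMA_NAME>/<TABLE_NAME>")
    if num_data_nodes <= 1:
        raise ValueError("num_data_nodes must be greater than 1")
    return parts[0], parts[1]


# Hypothetical schema and table names.
schema_name, table_name = check_cache_config("cache://sales/orders", num_data_nodes=2)
```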
- class kubeflow.trainer.HuggingFaceModelInitializer(storage_uri: str, ignore_patterns: list[str] | None = <factory>, access_token: str | None = None) -> None
Bases: BaseInitializer
Configuration for downloading models from HuggingFace Hub.
- Parameters:
storage_uri (str) – The HuggingFace Hub model identifier in the format ‘hf://username/repo_name’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download.
access_token (Optional[str]) – HuggingFace Hub access token.
- class kubeflow.trainer.S3ModelInitializer(storage_uri: str, ignore_patterns: list[str] | None = <factory>, endpoint: str | None = None, access_key_id: str | None = None, secret_access_key: str | None = None, region: str | None = None, role_arn: str | None = None) -> None
Bases: BaseInitializer
Configuration for downloading models from S3-compatible storage.
- Parameters:
storage_uri (str) – The S3 URI for the model in the format ‘s3://bucket-name/path/to/model’.
ignore_patterns (Optional[list[str]]) – List of file patterns to ignore during download. Defaults to [‘*.msgpack’, ‘*.h5’, ‘*.bin’, ‘*.pt’, ‘*.pth’].
endpoint (Optional[str]) – Custom S3 endpoint URL.
access_key_id (Optional[str]) – Access key for authentication.
secret_access_key (Optional[str]) – Secret key for authentication.
region (Optional[str]) – Region used in instantiating the client.
role_arn (Optional[str]) – The ARN of the role you want to assume.
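To see what the default ignore_patterns exclude, the sketch below applies them to a hypothetical file listing using shell-style matching from Python's `fnmatch` module. That the initializer matches patterns exactly this way is an assumption. The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Default ignore_patterns documented for S3ModelInitializer.
DEFAULT_IGNORE_PATTERNS = ["*.msgpack", "*.h5", "*.bin", "*.pt", "*.pth"]


def files_to_download(filenames, ignore_patterns=DEFAULT_IGNORE_PATTERNS):
    """Keep only the files that match none of the ignore patterns."""
    return [
        name
        for name in filenames
        if not any(fnmatchcase(name, pattern) for pattern in ignore_patterns)
    ]


# Hypothetical contents of a model directory.
kept = files_to_download(
    ["config.json", "model.safetensors", "pytorch_model.bin", "weights.pt", "tf_model.h5"]
)
```

Passing a custom ignore_patterns list replaces these defaults, so include any of them that should still be skipped.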