Quickstart

This guide walks you through training your first model with the Kubeflow SDK.

Prerequisites

Before you begin, make sure you have:

  1. Python 3.10 or higher installed

  2. The Kubeflow SDK installed (see Installation)

  3. Access to a Kubernetes cluster with Kubeflow Trainer installed

    Note

    Don’t have a cluster? You can use the local backend to test on your laptop first. See Local Development (No Kubernetes) below.

Step 1: Write Your Training Function

Write an ordinary Python function that trains your model. No SDK-specific code is required; use PyTorch, TensorFlow, or any other framework you prefer:

def train_mnist():
    """Train a simple model on MNIST."""
    import torch
    import torch.nn as nn
    from torchvision import datasets, transforms

    # Load data
    transform = transforms.ToTensor()
    train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=64)

    # Simple model
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 10)
    )

    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    # Train
    for epoch in range(5):
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            if batch_idx % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
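Note that train_mnist imports its dependencies inside the function body. The sketch below illustrates why that pattern helps, under the assumption that the worker receives only the function's source text and not the module that defined it. The helper run_in_fresh_namespace and both source strings are hypothetical and exist only for this illustration; they are not part of the Kubeflow SDK.

```python
import textwrap

# Stand-in for what a remote worker might receive: only the function source.
SELF_CONTAINED = textwrap.dedent('''
    def train():
        import math  # imported inside the function, so it travels with it
        return math.sqrt(16.0)
''')

DEPENDS_ON_MODULE_IMPORT = textwrap.dedent('''
    def train():
        return math.sqrt(16.0)  # relies on an import made somewhere else
''')


def run_in_fresh_namespace(source: str, func_name: str = "train"):
    """Execute function source in an empty namespace, then call the function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace[func_name]()


print(run_in_fresh_namespace(SELF_CONTAINED))

try:
    run_in_fresh_namespace(DEPENDS_ON_MODULE_IMPORT)
except NameError as err:
    # The module-level import never reached the fresh namespace.
    print(f"Failed as expected: {err}")
```

Keeping imports and helper definitions inside the training function avoids this class of failure regardless of how the function is shipped.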

Step 2: Submit the Training Job

Use the TrainerClient to submit your function to Kubernetes:

from kubeflow.trainer import TrainerClient
from kubeflow.trainer.types import CustomTrainer

# Create client (connects to your Kubernetes cluster)
client = TrainerClient()

# Submit the training job
job_name = client.train(
    trainer=CustomTrainer(func=train_mnist)
)

print(f"Training job started: {job_name}")

That’s it! Your training function has been submitted to Kubernetes and starts running once the cluster schedules it.

Step 3: Monitor Progress

Watch the logs in real time:

# Stream logs as they happen
for line in client.get_job_logs(job_name, follow=True):
    print(line)

Or check the job status:

job = client.get_job(job_name)
print(f"Status: {job.status}")
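Because train_mnist prints its progress in a fixed format ("Epoch ..., Batch ..., Loss: ..."), streamed log lines can be turned into structured records. The following is a minimal sketch; LossRecord and parse_loss_lines are hypothetical helpers introduced here, not SDK functions, and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class LossRecord(NamedTuple):
    epoch: int
    batch: int
    loss: float


# Matches lines such as "Epoch 0, Batch 100, Loss: 0.4521"
_LOSS_LINE = re.compile(r"Epoch (\d+), Batch (\d+), Loss: (\d+(?:\.\d+)?)")


def parse_loss_lines(lines: Iterable[str]) -> Iterator[LossRecord]:
    """Yield a LossRecord for every log line that reports a loss."""
    for line in lines:
        match = _LOSS_LINE.search(line)
        if match:
            epoch, batch, loss = match.groups()
            yield LossRecord(int(epoch), int(batch), float(loss))


# Illustrative input; in practice this would be the streamed job logs.
sample_logs = [
    "Downloading MNIST...",  # unrelated line, skipped by the parser
    "Epoch 0, Batch 0, Loss: 2.3012",
    "Epoch 0, Batch 100, Loss: 0.4521",
]
records = list(parse_loss_lines(sample_logs))
for record in records:
    print(record)
```

Any iterable of strings works as input, so the log iterator from Step 3 could be passed in directly. The sketch uses search rather than match so that any prefix added to a log line does not prevent parsing.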

Step 4: Wait for Completion

Wait for the job to finish:

# Blocks until complete (or timeout)
client.wait_for_job_status(job_name)
print("Training complete!")
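To make "blocks until complete (or timeout)" concrete, here is a minimal sketch of a polling loop of the kind such a wait call typically performs. It is not the SDK's implementation: the function wait_for_status, its parameters, and the status strings ("Created", "Running", "Complete", "Failed") are assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    wanted: Iterable[str] = ("Complete",),
    failed: Iterable[str] = ("Failed",),
    timeout: float = 600.0,
    polling_interval: float = 2.0,
) -> str:
    """Poll get_status() until it returns one of the wanted statuses.

    Raises RuntimeError on a failed status and TimeoutError once the
    timeout has elapsed without reaching a wanted status.
    """
    wanted, failed = set(wanted), set(failed)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in wanted:
            return status
        if status in failed:
            raise RuntimeError(f"job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(polling_interval)


# Simulated status sequence standing in for repeated status lookups.
statuses = iter(["Created", "Running", "Complete"])
final = wait_for_status(lambda: next(statuses), polling_interval=0.0)
print(f"Final status: {final}")
```

With a real client, get_status could be a small function that looks up the job and returns its status. Treating failure and timeout as distinct exceptions lets calling code decide whether to retry, fetch logs, or stop.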

Local Development (No Kubernetes)

Want to test without a Kubernetes cluster? Use the local backend:

from kubeflow.trainer import TrainerClient
from kubeflow.trainer.backends.localprocess import LocalProcessBackendConfig
from kubeflow.trainer.types import CustomTrainer

# Run locally instead of on Kubernetes
client = TrainerClient(backend_config=LocalProcessBackendConfig())

job_name = client.train(trainer=CustomTrainer(func=train_mnist))

This runs your training function as a local process, which is convenient for development and debugging.

What’s Next?

  • Training Models - Learn about distributed training, GPUs, and more

  • ../tune/index - Automatically tune hyperparameters