API Reference

Reference documentation for the Spark client and related APIs.

Spark Client

SparkClient for Kubeflow SDK.

class kubeflow.spark.api.spark_client.SparkClient(backend_config: KubernetesBackendConfig | None = None)

Bases: object

Stateless Spark client for Kubeflow.

__init__(backend_config: KubernetesBackendConfig | None = None)

Initialize SparkClient.

connect(base_url: str | None = None, token: str | None = None, num_executors: int | None = None, resources_per_executor: dict[str, str] | None = None, spark_conf: dict[str, str] | None = None, driver: Driver | None = None, executor: Executor | None = None, options: list | None = None, timeout: int = 300, connect_timeout: int = 120) -> SparkSession

Connect to or create a SparkConnect session (KEP-107 lines 298-347).

This method supports two modes, selected by the parameters passed:

  • Connect mode: when base_url is provided, connects to an existing Spark Connect server.

  • Create mode: when base_url is not provided, creates a new Spark Connect session.
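The mode selection described above can be sketched as a small standalone helper. This is an illustration only, not the SDK's implementation: the names `select_mode` and `create_only_args_set` are hypothetical, and this reference does not say how the real client treats create-only arguments passed together with base_url.

```python
from typing import Any, Optional

# Parameters documented as "create mode only" for SparkClient.connect().
CREATE_ONLY_PARAMS = (
    "num_executors",
    "resources_per_executor",
    "spark_conf",
    "driver",
    "executor",
    "options",
    "connect_timeout",
)


def select_mode(base_url: Optional[str]) -> str:
    """Return "connect" when a server URL is given, otherwise "create"."""
    return "connect" if base_url is not None else "create"


def create_only_args_set(base_url: Optional[str], **kwargs: Any) -> list:
    """List create-only arguments that were set although base_url selects connect mode.

    Hypothetical pre-flight check; the real client may ignore or reject them.
    """
    if select_mode(base_url) == "create":
        return []
    return [name for name in CREATE_ONLY_PARAMS if kwargs.get(name) is not None]
```

A caller could run such a check before calling connect() to catch arguments that probably have no effect in connect mode.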

Parameters:
  • base_url (str | None) – Optional URL of an existing Spark Connect server (e.g., "sc://server:15002"). If provided, connects to that server. If None, creates a new session.

  • token (str | None) – Optional authentication token for existing server.

  • num_executors (int | None) – Number of executor instances (create mode only).

  • resources_per_executor (dict[str, str] | None) – Resource requirements per executor as a dict. Format: {"cpu": "5", "memory": "10Gi"} (create mode only).

  • spark_conf (dict[str, str] | None) – Spark configuration dictionary (create mode only).

  • driver (Driver | None) – Driver configuration object (create mode only).

  • executor (Executor | None) – Executor configuration object (create mode only).

  • options (list | None) – List of configuration options (create mode only). Use the Name option to set a custom session name.

  • timeout (int) – Timeout in seconds to wait for the session to become ready.

  • connect_timeout (int) – Timeout in seconds for SparkSession.getOrCreate() (create mode only).
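The memory value in the resources_per_executor example ("10Gi") looks like Kubernetes quantity notation. Assuming that is the intended format, the sketch below converts such a string to bytes, which can help when sanity-checking a configuration before submitting it. `memory_to_bytes` is a hypothetical helper, not part of the SDK, and it handles only plain integers with the binary suffixes Ki, Mi, Gi and Ti.

```python
import re

# Binary (power-of-two) suffixes used by Kubernetes memory quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as "10Gi" or "512Mi" to a number of bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already expressed in bytes.
    return int(number) * _BINARY_SUFFIXES.get(suffix, 1)


resources = {"cpu": "5", "memory": "10Gi"}
executor_memory = memory_to_bytes(resources["memory"])
```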

Returns:

SparkSession connected to Spark (self-managing).

Examples

    # Connect to an existing server
    spark = client.connect(base_url="sc://server:15002")

    # Create with simple parameters
    spark = client.connect(
        num_executors=5,
        resources_per_executor={"cpu": "5", "memory": "10Gi"},
        spark_conf={"spark.sql.adaptive.enabled": "true"},
    )

    # Create with a custom name
    from kubeflow.spark.types.options import Name
    spark = client.connect(options=[Name("my-session")])

    # Create with advanced configuration
    spark = client.connect(
        driver=Driver(resources={"cpu": "2", "memory": "4Gi"}),
        executor=Executor(
            num_instances=5,
            resources_per_executor={"cpu": "4", "memory": "8Gi"},
        ),
    )

    # Minimal: use all defaults (auto-generated name)
    spark = client.connect()

Note

The server port defaults to 15002 (Spark Connect gRPC). The major.minor version of the client's PySpark should match that of the server's Spark; see constants and pyproject.toml [spark].
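The two points in this note, the default port and the major.minor version match, can be checked with a few lines of plain Python. The helper names below (`connect_url`, `major_minor`, `versions_compatible`) are hypothetical and exist only for this sketch. The version strings are assumed to start with dotted integers, as in "3.5.1".

```python
DEFAULT_CONNECT_PORT = 15002  # default Spark Connect gRPC port from the note


def connect_url(host: str, port: int = DEFAULT_CONNECT_PORT) -> str:
    """Build a Spark Connect URL of the form sc://host:port."""
    return f"sc://{host}:{port}"


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as "3.5.1"."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(client_version: str, server_version: str) -> bool:
    """True when both versions share the same major.minor pair."""
    return major_minor(client_version) == major_minor(server_version)
```

In practice the client side would likely come from `pyspark.__version__` and the server side from the `version` attribute of the returned SparkSession, though where the SDK performs this comparison, if at all, is not stated here.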

list_sessions() -> list[SparkConnectInfo]

List all SparkConnect sessions.

get_session(name: str) -> SparkConnectInfo

Get session info by name.

delete_session(name: str) -> None

Delete a SparkConnect session.
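A common use of list_sessions() together with delete_session() is cleaning up sessions by name. The sketch below shows one possible pattern against a stand-in client, so it runs without a cluster. It assumes that each SparkConnectInfo exposes a `name` attribute, which this reference does not confirm. `FakeInfo`, `FakeClient` and `delete_sessions_with_prefix` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FakeInfo:
    """Stand-in for SparkConnectInfo; the `name` field is an assumption."""
    name: str


class FakeClient:
    """In-memory stand-in exposing the two SparkClient methods used below."""

    def __init__(self, names):
        self._names = list(names)

    def list_sessions(self):
        # Return a fresh list so callers can delete while iterating.
        return [FakeInfo(n) for n in self._names]

    def delete_session(self, name: str) -> None:
        self._names.remove(name)


def delete_sessions_with_prefix(client, prefix: str) -> list:
    """Delete every session whose name starts with `prefix`; return the names deleted."""
    deleted = []
    for info in client.list_sessions():
        if info.name.startswith(prefix):
            client.delete_session(info.name)
            deleted.append(info.name)
    return deleted


client = FakeClient(["tmp-a", "prod", "tmp-b"])
removed = delete_sessions_with_prefix(client, "tmp-")
```

The helper relies only on the two method names documented here, so a real SparkClient could be passed in place of the fake, provided the `name` assumption holds.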

get_session_logs(name: str, follow: bool = False) -> Iterator[str]

Get logs from a session.
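Because get_session_logs() returns an iterator of strings, log lines can be processed lazily. The sketch below filters lines from any iterator and stops after a fixed number of matches, which may matter if follow=True yields an open-ended stream (an assumption based on the parameter name, not stated in this reference). `matching_lines` is a hypothetical helper, and the sample list stands in for real log output.

```python
from itertools import islice
from typing import Iterable, Iterator


def matching_lines(lines: Iterable[str], needle: str, limit: int) -> Iterator[str]:
    """Lazily yield at most `limit` lines that contain `needle`."""
    return islice((line for line in lines if needle in line), limit)


# Stand-in for client.get_session_logs("my-session"); not real log output.
sample_logs = iter(["INFO start", "ERROR boom", "INFO ok", "ERROR again"])
first_error = list(matching_lines(sample_logs, "ERROR", 1))
```

With limit=1 the helper stops pulling from the source iterator as soon as the first match is found, so later lines remain unread.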