API Reference
Reference documentation for the Spark client and related APIs.
Spark Client
SparkClient for Kubeflow SDK.
- class kubeflow.spark.api.spark_client.SparkClient(backend_config: KubernetesBackendConfig | None = None)
Bases: object
Stateless Spark client for Kubeflow.
- __init__(backend_config: KubernetesBackendConfig | None = None)
Initialize SparkClient.
- connect(base_url: str | None = None, token: str | None = None, num_executors: int | None = None, resources_per_executor: dict[str, str] | None = None, spark_conf: dict[str, str] | None = None, driver: Driver | None = None, executor: Executor | None = None, options: list | None = None, timeout: int = 300, connect_timeout: int = 120) -> SparkSession
Connect to or create a Spark Connect session (KEP-107 lines 298-347).
This method supports two modes, selected by the parameters:
- Connect mode: when base_url is provided, connects to an existing Spark Connect server.
- Create mode: when base_url is not provided, creates a new Spark Connect session.
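The mode selection described above can be sketched as a small standalone function. This is an illustration only, not the SDK's implementation: `select_mode` and `CREATE_ONLY` are hypothetical names, and reporting create-only arguments supplied alongside base_url is an assumption about what a caller might want to check, not documented behavior of connect().

```python
from __future__ import annotations

from typing import Any

# Parameters the reference marks as "create mode only".
CREATE_ONLY = (
    "num_executors",
    "resources_per_executor",
    "spark_conf",
    "driver",
    "executor",
    "options",
    "connect_timeout",
)


def select_mode(base_url: str | None = None, **kwargs: Any) -> tuple[str, list[str]]:
    """Return the mode implied by the arguments.

    Hypothetical helper, not part of the Kubeflow SDK. The second element
    lists create-only arguments that were supplied together with base_url.
    """
    if base_url is None:
        # No server URL given: a new Spark Connect session would be created.
        return "create", []
    supplied = [name for name in CREATE_ONLY if kwargs.get(name) is not None]
    return "connect", supplied
```

A caller could use the second return value to warn about arguments that have no documented effect in connect mode before calling connect().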
- Parameters:
  - base_url (str | None) – Optional URL of an existing Spark Connect server (e.g., "sc://server:15002"). If provided, connects to the existing server. If None, creates a new session.
  - token (str | None) – Optional authentication token for an existing server.
  - num_executors (int | None) – Number of executor instances (create mode only).
  - resources_per_executor (dict[str, str] | None) – Resource requirements per executor as a dict. Format: {"cpu": "5", "memory": "10Gi"} (create mode only).
  - spark_conf (dict[str, str] | None) – Spark configuration dictionary (create mode only).
  - driver (Driver | None) – Driver configuration object (create mode only).
  - executor (Executor | None) – Executor configuration object (create mode only).
  - options (list | None) – List of configuration options (create mode only). Use the Name option for a custom session name.
  - timeout (int) – Timeout in seconds to wait for the session to become ready.
  - connect_timeout (int) – Timeout in seconds for SparkSession.getOrCreate() (create mode only).
- Returns:
SparkSession connected to Spark (self-managing).
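Because resources_per_executor is typed dict[str, str], both values are strings, including the CPU count. A minimal sketch of building such a dict follows. `make_resources` is a hypothetical helper, not part of the Kubeflow SDK, and the "Gi" suffix simply follows the format shown in the parameter description.

```python
from __future__ import annotations


def make_resources(cpu: int, memory_gib: int) -> dict[str, str]:
    """Build a resources dict in the documented {"cpu": ..., "memory": ...} form.

    Hypothetical helper for illustration only.
    """
    if cpu <= 0 or memory_gib <= 0:
        raise ValueError("cpu and memory_gib must be positive")
    # Values are converted to strings to match the dict[str, str] annotation.
    return {"cpu": str(cpu), "memory": f"{memory_gib}Gi"}


resources = make_resources(cpu=5, memory_gib=10)
print(resources)  # {'cpu': '5', 'memory': '10Gi'}
```

The result could then be passed as resources_per_executor=resources in create mode.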
Examples
    # Connect to an existing server
    spark = client.connect(base_url="sc://server:15002")

    # Create with simple parameters
    spark = client.connect(
        num_executors=5,
        resources_per_executor={"cpu": "5", "memory": "10Gi"},
        spark_conf={"spark.sql.adaptive.enabled": "true"},
    )

    # Create with a custom name
    from kubeflow.spark.types.options import Name
    spark = client.connect(options=[Name("my-session")])

    # Create with advanced configuration
    spark = client.connect(
        driver=Driver(resources={"cpu": "2", "memory": "4Gi"}),
        executor=Executor(
            num_instances=5,
            resources_per_executor={"cpu": "4", "memory": "8Gi"},
        ),
    )

    # Minimal - use all defaults (auto-generated name)
    spark = client.connect()
Note
The server port defaults to 15002 (Spark Connect gRPC). The client's PySpark version and the server's Spark version should match at the major.minor level; see constants and pyproject.toml [spark].
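The two points in this note can be illustrated with a short standalone sketch. The helper names (`connect_url`, `major_minor`, `versions_compatible`) are hypothetical and not part of the Kubeflow SDK, and the version check assumes plain numeric major and minor components such as "3.5.1".

```python
from __future__ import annotations

DEFAULT_CONNECT_PORT = 15002  # Spark Connect gRPC port named in the note


def connect_url(host: str, port: int = DEFAULT_CONNECT_PORT) -> str:
    """Build an sc:// URL for a Spark Connect server (hypothetical helper)."""
    return f"sc://{host}:{port}"


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as "3.5.1"."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_compatible(client_version: str, server_version: str) -> bool:
    """Return True when both versions share the same major.minor."""
    return major_minor(client_version) == major_minor(server_version)
```

For example, connect_url("server") yields "sc://server:15002", the form used for base_url in the examples above, and versions_compatible("3.5.1", "3.5.0") is True because only the patch component differs.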