How to control the number of connections from Spark. – Yugabyte

Environment

YugabyteDB
YCQL
Spark

Issue

It has been observed that a Spark application running on a Kubernetes environment created unmanageable connections if there are many executors pod running in parallel. This causes high resource utilization.

Users may experience performance issues when running Spark applications that read or write data to YugabyteDB's YCQL API in a Kubernetes environment.

This issue is often caused by the way Spark interacts with Kubernetes.

When running Spark on Kubernetes, Spark creates a Kubernetes executor pod for each task that it executes. These pods can be distributed across multiple nodes in the Kubernetes cluster. Each executor pod then connects to the YugabyteDB cluster to execute the task.

Resolution

To address this issue, users can adjust two Spark configuration properties: spark.cassandra.connection.localConnectionsPerExecutor and spark.cassandra.connection.remoteConnectionsPerExecutor.

Overview

Both of these properties control the maximum number of connections to a YCQL node that Spark will create per executor.

localConnectionsPerExecutor controls the number of local connections set on each executor. Te setting defaults to the number of available CPU cores on the local node if not specified and not in a Spark Env.

remoteConnectionsPerExecutor controls the minimum number of remote connections per Host set on each Executor. The default value is estimated automatically based on the total number of executors in the cluster.

Steps

To optimize performance and prevent Spark from creating too many connections to YugabyteDB in a Kubernetes environment, users can calculate the values based on number of executors and expected number of connections and adjust the below properties as follows:

Set the spark.cassandra.connection.localConnectionsPerExecutor property to a desired value.
Set the spark.cassandra.connection.remoteConnectionsPerExecutor property to a desired value.

Note that these values are just starting points, and users should experiment with different settings to find the optimal configuration for their specific Kubernetes deployment.

Reference: spark-cassandra-connector/reference.md at master · datastax/spark-cassandra-connector · GitHub

Environment

Issue

Resolution

Overview

Steps

Related articles