Apache Spark Configuration Precedence
Apache Spark exposes a large number of configuration properties, and which ones matter depends on what we are trying to achieve. Most configurations fall into one of the following categories; a few examples are listed under each.
- Application Properties
spark.app.name
spark.master
spark.driver.memory
spark.executor.memory
- Runtime Environment
spark.driver.extraJavaOptions
spark.executor.extraJavaOptions
- Spark SQL
spark.sql.autoBroadcastJoinThreshold
spark.sql.broadcastTimeout
spark.sql.shuffle.partitions
- Execution Behavior
spark.executor.cores
There are many other categories, such as Memory Management, Spark UI, Spark Streaming, Networking, and Security, each with its own set of configuration properties.
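Many of the Spark SQL properties listed above are also runtime-modifiable, meaning they can be changed on an already active session. A minimal sketch (the application name and the values shown are purely illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-conf-demo").getOrCreate()
// Runtime-modifiable SQL properties can be changed on the live session
spark.conf.set("spark.sql.shuffle.partitions", "400")
spark.conf.set("spark.sql.broadcastTimeout", "600")
// Read the effective value back
println(spark.conf.get("spark.sql.shuffle.partitions"))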
Spark provides several ways to configure the system and the application:
- Environment variables (conf/spark-env.sh)
- The SparkConf object, or Java system properties
- Configuration files, e.g. conf/spark-defaults.conf
Logging can be configured through log4j.properties.
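For example, a minimal log4j.properties that quiets console output could look like the sketch below, based on the log4j.properties.template shipped with Spark; newer Spark releases (3.3 and later) use a log4j2.properties file instead, so the exact template for your version may differ:
# Log everything at WARN level to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n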
Environment Variables
We can configure certain Spark settings through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed. Example:
JAVA_HOME
PYSPARK_PYTHON
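A minimal sketch of conf/spark-env.sh setting these two variables (the paths are illustrative and depend on your machine):
#!/usr/bin/env bash
# Java installation used by Spark daemons and by spark-submit
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk      # illustrative path
# Python binary used by PySpark on the driver and executors
export PYSPARK_PYTHON=/usr/bin/python3             # illustrative path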
SparkConf
Spark properties control most application settings and can be set per application through a SparkConf object. They govern how the Spark application runs and how it uses the cluster.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Build a SparkConf with the desired application settings
val conf = new SparkConf().setMaster("yarn").setAppName("myApp").set("<conf-name>", "<conf-value>")
// Create the SparkSession from that configuration
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
Dynamically Loading Spark Properties
Sometimes we want to avoid hard-coding certain configurations in a SparkConf, for example when the same application must run with a different number of executors or different amounts of memory in different scenarios. In such cases it is better to supply those values at runtime through command-line arguments.
The spark-submit tool supports two ways to load configurations dynamically. The first is command-line options such as --master and --num-executors, as shown in the snippet below. Second, spark-submit accepts any Spark property through the --conf/-c flag. Running ./bin/spark-submit --help shows the full list of these options.
./bin/spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--num-executors <value> \
--driver-memory <value> \
--executor-memory <value> \
--executor-cores <number of cores> \
--jars <comma separated dependencies> \
--class <main-class> \
<application-jar> \
[application-arguments]
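A concrete invocation might look like the following; the class name, jar, paths, and resource sizes are purely illustrative:
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.sql.shuffle.partitions=400 \
--num-executors 10 \
--driver-memory 4g \
--executor-memory 8g \
--executor-cores 4 \
--class com.example.MyApp \
myapp.jar input/path output/path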
spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key (config name) and a value (config value) separated by whitespace.
The spark-defaults.conf file is used to set default configurations that apply to all applications.
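A minimal sketch of conf/spark-defaults.conf, with key and value separated by whitespace (the values are illustrative, not recommendations):
# Cluster manager and default resource sizing for all applications
spark.master                    yarn
spark.driver.memory             2g
spark.executor.memory           4g
# Default number of shuffle partitions for Spark SQL
spark.sql.shuffle.partitions    200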
Precedence
Any values specified as flags in spark-submit or in the properties file will be passed on to the application and merged with those specified through SparkConf.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
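For example (with purely illustrative memory sizes): if spark-defaults.conf sets spark.executor.memory to 2g, the job is submitted with --conf spark.executor.memory=4g, and the application itself calls set("spark.executor.memory", "8g") on its SparkConf, the executors are requested with 8g, because the value set directly on the SparkConf wins.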
As a developer, it is best practice to leave spark-defaults.conf and the environment-variable files to the administrators. The preferred way to configure a Spark application is to pass the settings to spark-submit at runtime.
We can also verify that the configuration values are set correctly by checking the “Environment” tab of the application's Spark UI. Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line appear there; for everything else, we can assume the default value is used.
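The same check can be done programmatically from the running application; as on the Environment tab, only explicitly set properties are listed, and unset properties simply fall back to their defaults:
// Print every explicitly set Spark property of the running application
spark.sparkContext.getConf.getAll
  .sortBy(_._1)
  .foreach { case (key, value) => println(s"$key = $value") }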
References
- https://spark.apache.org/
- Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia.