Quick tips on tweaking Spark configuration and performance on EMR

Steve Ng
Apr 19, 2019 · 2 min read

Tweaking Spark configuration can be a daunting task for a beginner, as the required knowledge ranges from understanding what the Spark driver and executors do to determining their resource needs.

Tip 1: Cache and cache

The easiest way to improve performance is to identify places to .cache. In the example below, we sort a DataFrame and take the first row:

val sortedDf = df.orderBy(...)
if (sortedDf.isEmpty) None else Some(sortedDf.first())

This evaluates the DataFrame twice if it is not empty, because both isEmpty and first trigger the sort. A beginner to Spark may not notice this; it happened to us, and the duplicated work flowed into EMR until we noticed it in the Spark stages.
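One way to avoid the double evaluation, sketched below, is to cache the sorted DataFrame before calling the two actions. Here df is whatever DataFrame you are sorting, and "someColumn" is only an illustrative sort key:

// Cache the sorted DataFrame so the sort is computed once and reused by both actions.
val sortedDf = df.orderBy("someColumn").cache()
val firstRow = if (sortedDf.isEmpty) None else Some(sortedDf.first())
sortedDf.unpersist() // optionally free the cached data once you are done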

Tip 2: Scaling up the master node will not help in YARN cluster mode

EMR runs on YARN; quoting from the AWS documentation:

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks.

YARN supports two deploy modes, cluster and client. If you are running in cluster mode with --deploy-mode cluster, the Spark driver runs on a core node alongside the executors. Thus, there is no need for a huge master instance type.

However, if you are running in client mode and performing spark-submit from the master node, the Spark driver will run on the master node. In this case, depending on your Spark job's needs, a higher-end master instance type might be required.
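For reference, the two submissions look roughly like this (the class name and jar are placeholders):

# cluster mode: driver runs on a core node, managed by YARN
spark-submit --master yarn --deploy-mode cluster --class com.example.MyJob my-job.jar

# client mode: driver runs on the node you submit from (here, the master node)
spark-submit --master yarn --deploy-mode client --class com.example.MyJob my-job.jar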

Tip 3: Using maximizeResourceAllocation as a quick fix for OOM issues

By default, EMR adjusts Spark-related configuration such as spark.executor.memory and spark.executor.cores based on the core and task instance types in the cluster. Executor memory is usually set to a fairly low number in order to accommodate more executors, which can lead to OOM errors in your Spark job.

There are many articles on the internet about tweaking executor and driver memory. Finding the correct values is never easy, and it generally takes a considerable amount of effort to understand and debug the needs of your Spark job.

A quicker option is to enable maximizeResourceAllocation for your Spark cluster. Enabling this configuration ramps up the driver and executor memory and cores at the expense of parallelism. The side effect might be an under-utilised cluster: for example, each executor might be bumped to 14 GB of memory when your Spark job only needs 1 GB at most.
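For reference, this is roughly the configuration entry you would pass to EMR (for example through the Configurations field in the console or the --configurations option of aws emr create-cluster) to enable it:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]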

As a side note: if you are combining tip 2 with this, you may need to manually tweak driver memory, as EMR will set the driver memory based on the smaller instance type of the master and core nodes.
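One way to do that, sketched below, is to override spark.driver.memory through the spark-defaults classification; the 10g value here is purely illustrative:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "10g"
    }
  }
]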
